# Contents

R version: R version 4.2.2 (2022-10-31)

Bioconductor version: 3.16

Package: 1.22.0

# 1 Abstract

Gene set enrichment analysis is a popular approach for prioritising the biological processes perturbed in genomic datasets. The Bioconductor project hosts over 80 software packages capable of gene set analysis. Most of these packages search for enriched signatures amongst differentially regulated genes to reveal higher level biological themes that may be missed when focusing only on evidence from individual genes. With so many different methods on offer, choosing the best algorithm and visualization approach can be challenging. The EGSEA package solves this problem by combining results from up to 12 prominent gene set testing algorithms to obtain a consensus ranking of biologically relevant results. This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data using experiments that profile 3 distinct cell populations important for studying the origins of breast cancer. Following data normalization and set-up of an appropriate linear model for differential expression analysis, EGSEA builds gene signature specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA is then configured and the ensemble enrichment analysis run, returning an object that can be queried using several S4 methods for ranking gene sets and visualizing results via heatmaps, KEGG pathway views, GO graphs, scatter plots and bar plots. Finally, an HTML report that combines these displays can fast-track the sharing of results with collaborators, and thus expedite downstream biological validation. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for both human and mouse data.

# 2 Introduction

Gene set enrichment analysis allows researchers to efficiently extract biological insights from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of gene set enrichment (GSE) analysis methods released through the Bioconductor project (Huber et al. 2015) together with a steady increase in the number of gene set collections available through online databases such as MSigDB (Subramanian et al. 2005), GeneSetDB (Araki et al. 2012) and KEGG (Kanehisa and Goto 2000).

In an effort to unify these computational methods and knowledge-bases, the EGSEA R/Bioconductor package was developed. EGSEA, which stands for Ensemble of Gene Set Enrichment Analyses (Alhamdoosh et al. 2017) combines the results from multiple algorithms to arrive at a consensus gene set ranking to identify biological themes and pathways perturbed in an experiment. EGSEA calculates seven statistics to combine the individual gene set statistics of base GSE methods to rank biologically relevant gene sets. The current version of the EGSEA package (Alhamdoosh, Ng, and Ritchie 2017) utilizes the analysis results of up to twelve prominent GSE algorithms that include: ora (Tavazoie et al. 1999), globaltest (Goeman et al. 2004), plage (Tomfohr, Lu, and Kepler 2005), safe (Barry, Nobel, and Wright 2005), zscore (Lee et al. 2008), gage (Luo et al. 2009), ssgsea (Barbie et al. 2009), padog (Tarca et al. 2012), gsva (Hänzelmann, Castelo, and Guinney 2013), camera (Wu and Smyth 2012), roast (Wu et al. 2010) and fry (Wu et al. 2010). The ora, gage, camera and gsva methods depend on a competitive null hypothesis which assumes the genes in a set do not have a stronger association with the experimental condition compared to randomly chosen genes outside the set. The remaining eight methods are based on a self-contained null hypothesis that only considers genes within a set and again assumes that they have no association with the experimental condition.

EGSEA provides access to a diverse range of gene signature collections through the EGSEAdata package that includes more than 25,000 gene sets for human and mouse organised according to their database sources (Table 1). For example, MSigDB (Subramanian et al. 2005) includes a number of collections (Hallmark (h) and c1-c7) that explore different biological themes ranging from very broad (h, c2, c5) through to more specialised ones focusing on cancer (c4, c6) and immunology (c7). The other main sources are GeneSetDB (Araki et al. 2012) and KEGG (Kanehisa and Goto 2000) which have similar collections focusing on different biological characteristics (Table 1). The choice of collection/s in any given analysis should of course be guided by the biological question of interest. The MSigDB c2 and c5 collections are the most widely used in our own analysis practice, spanning a wide range of biological processes and can often reveal new biological insights when applied to a given dataset.

|————–|———————–|——————————————————–| | Database | Collection | Description | |————–|———————–|——————————————————–| | MSigDB | h Hallmarks | Gene sets representing well-defined biological states. | | | c1 Positional | Gene sets by chromosome and cytogenetic band. | | | c2 Curated | Gene sets obtained from a variety of sources, | | | | including online pathway databases | | | | and the biomedical literature. | | | c3 Motif | Gene sets of potential targets regulated by | | | | transcription factors or microRNAs. | | | c4 Computational | Gene sets defined computationally by mining | | | | large collections of cancer-oriented microarray data. | | | c5 GO | Gene sets annotated by Gene Ontology (GO) terms. | | | c6 Oncogenic | Gene sets of the major cellular pathways | | | | disrupted in cancer. | | | c7 Immunologic | Gene sets representing different cell types | | | | and stimulations relevant to the immune system. | |————–|———————–|——————————————————–| | KEGG | Signalling | | | | Disease | Gene sets obtained from the KEGG database. | | | Metabolic | | |————–|———————–|——————————————————–| | GeneSetDB | Pathway | | | | Disease | | | | Drug | Gene sets obtained from various online databases. | | | Regulation | | | | GO Term | | |————–|———————–|——————————————————–|

The purpose of this article is to demonstrate the gene set testing workflow available in EGSEA on both RNA-seq and microarray data. Each analysis involves four major steps that are summarized in Figure 1: (1) selecting appropriate gene set collections for analysis and building an index that maps between the members of each set and the expression matrix; (2) choosing the base GSE methods to combine and the ranking options; (3) running the EGSEA test and (4) reporting results in various ways to share with collaborators. The EGSEA functions involved in each of these steps are introduced with code examples to demonstrate to how they can be deployed as part of a limma differential expression analysis to help with the interpretation of results.

# 3 Gene expression profiling of the mouse mammary gland

The first experiment analysed in this workflow is an RNA-seq dataset from Sheridan et al. (2015) (Sheridan et al. 2015) that consists of 3 cell populations (Basal, Luminal Progenitor (LP) and Mature Luminal (ML)) sorted from the mammary glands of female virgin mice. Triplicate RNA samples from each population were obtained in 3 batches and sequenced on an Illumina HiSeq 2000 using a 100 base-pair single-ended protocol. Raw sequence reads from the fastq files were aligned to the mouse reference genome (mm10) using the Rsubread package (Liao, Smyth, and Shi 2013). Next, gene-level counts were obtained using featureCounts (Liao, Smyth, and Shi 2014) based on Rsubread’s built-in mm10 RefSeq-based annotation.
The raw data along with further information on experimental design and sample preparation can be downloaded from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) using GEO Series accession number GSE63310 and will be preprocessed according to the RNA-seq workflow published by Law et al. (2016) (Law et al. 2016).

The second experiment analysed in this workflow comes from Lim et al. (2010) (Lim et al. 2010) and is the microarray equivalent of the RNA-seq dataset mentioned above. The same 3 populations (Basal (also referred to as MaSC-enriched), LP and ML) were sorted from mouse mammary glands via flow cytometry. Total RNA from 5 replicates of each cell population were hybridised onto 3 Illumina MouseWG-6 v2 BeadChips. The intensity files and chip annotation file available in Illumina’s proprietary formats (IDAT and BGX respectively) can be downloaded from http://bioinf.wehi.edu.au/EGSEA/arraydata.zip. The raw data from this experiment is also available from GEO under Series accession number GSE19446.

## 3.1 Analysis of RNA-seq data with EGSEA

Our RNA-seq analysis follows on directly from the workflow of Law et al. (2016) which performs a differential gene expression analysis on this dataset using the Bioconductor packages edgeR (Robinson, McCarthy, and Smyth 2010), limma (Ritchie et al. 2015) and Glimma (Su et al. 2017) with gene annotation from Mus.musculus (Bioconductor Core Team 2015). The limma package offers a well-developed suite of statistical methods for dealing with differential expression for both microarray and RNA-seq datasets and will be used in the analyses of both datasets presented in this workflow.

### 3.1.1 Reading, preprocessing and normalisation of RNA-seq data

To get started with this analysis, download the R data file from http://bioinf.wehi.edu.au/EGSEA/mam.rnaseq.rdata. The code below loads the preprocessed count matrix from Law et al. (2016), performs TMM normalisation (Robinson and Oshlack 2010) on the raw counts, and calculates voom weights for use in comparisons of gene expression between Basal and LP, Basal and ML, and LP and ML populations.

## [1] "samples" "counts"  "genes"
## [1] 14165     9
##   Basal LP ML L006 L008
## 1     0  1  0    0    0
## 2     0  0  1    0    0
## 3     1  0  0    0    0
## 4     1  0  0    1    0
## 5     0  0  1    1    0
## 6     0  1  0    1    0
##        Contrasts
## Levels  BasalvsLP BasalvsML LPvsML
##   Basal         1         1      0
##   LP           -1         0      1
##   ML            0        -1     -1
##   L006          0         0      0
##   L008          0         0      0

The voom function (Law et al. 2014) from the limma package converts counts to log-counts-per-million (log-cpm) and calculates observation-level precision weights. The voom object (v) contains normalized log-cpm values and gene information used by all of the methods in the EGSEA analysis below. The precision weights stored within v are also used by the camera, roast and fry gene set testing methods.

v = voom(x, design, plot=FALSE)
names(v)
## [1] "genes"   "targets" "E"       "weights" "design"

For further information on preprocessing, see Law et al. (2016) as a detailed explanation of these steps is beyond the scope of this article.

### 3.1.2 Gene set testing

The EGSEA algorithm makes use of the voom object (v), a design matrix (design) and an optional contrasts matrix (contr.matrix). The design matrix describes how the samples in the experiment relate to the coefficients estimated by the linear model (Smyth 2004). The contrasts matrix then compares two or more of these coefficients to allow relative assessment of differential expression. Base methods that utilize linear models such as those from limma and GSVA (gsva, plage, zscore and ssgsea) make use of the design and contrasts matrices directly. For methods that do not support linear models, these two matrices are used to extract the group information for each comparison.

#### 3.1.2.1 1. Exploring, selecting and indexing gene set collections

The package EGSEAdata includes more than 25,000 gene sets organized in collections depending on their database sources. Summary information about the gene set collections available in EGSEAdata can be displayed as follows:

library(EGSEAdata)
egsea.data("mouse")
## The following databases are available in EGSEAdata for Mus musculus:
##
## Database name: KEGG Pathways
## Version: NA
## Data source: gage::kegg.gsets()
## Supported species: human, mouse, rat
## Gene set collections: Signaling, Metabolism, Disease
## Related data objects: kegg.pathways
## Number of gene sets in each collection for  Mus musculus :
##  Signaling: 132
##  Metabolism: 89
##  Disease: 67
##
## Database name: Molecular Signatures Database (MSigDB)
## Version: 5.2
## Supported species: human, mouse
## Gene set collections: h, c1, c2, c3, c4, c5, c6, c7
## Related data objects: msigdb, Mm.H, Mm.c2, Mm.c3, Mm.c4, Mm.c5, Mm.c6, Mm.c7
## Number of gene sets in each collection for  Mus musculus :
##  h Hallmark Signatures: 50
##  c2 Curated Gene Sets: 4729
##  c3 Motif Gene Sets: 836
##  c4 Computational Gene Sets: 858
##  c5 GO Gene Sets: 6166
##  c6 Oncogenic Signatures: 189
##  c7 Immunologic Signatures: 4872
##
## Database name: GeneSetDB Database
## Version: NA
## Data source: http://www.genesetdb.auckland.ac.nz/
## Supported species: human, mouse, rat
## Gene set collections: gsdbdis, gsdbgo, gsdbdrug, gsdbpath, gsdbreg
## Related data objects: gsetdb.human, gsetdb.mouse, gsetdb.rat
## Number of gene sets in each collection for  Mus musculus :
##  GeneSetDB Drug/Chemical: 6019
##  GeneSetDB Disease/Phenotype: 5077
##  GeneSetDB Gene Ontology: 2202
##  GeneSetDB Pathway: 1444
##  GeneSetDB Gene Regulation: 201
##
## Type ?<data object name> to get a specific information
##              about it, e.g., ?kegg.pathways.

As the output above suggests, users can obtain help on any of the collections using the standard R help (?) command, for instance ?Mm.c2 will return more information on the mouse version of the c2 collection from MSigDB. The above information can be returned as a list:

info = egsea.data("mouse", returnInfo = TRUE)
names(info)
## [1] "kegg"   "msigdb" "gsetdb"
info$msigdb$info$collections ## [1] "h" "c1" "c2" "c3" "c4" "c5" "c6" "c7" To highlight the capabilities of the EGSEA package, the KEGG pathways, c2 (curated gene sets) and c5 (Gene Ontology gene sets) collections from the MSigDB database are selected. Next, an index is built for each gene set collection using the EGSEA indexing functions to link the genes in the different gene set collections to the rows of our RNA-seq gene expression matrix. Indexes for the c2 and c5 collections from MSigDB and for the KEGG pathways are built using the buildIdx function which relies on Entrez gene IDs as its key. In the EGSEAdata gene set collections, Entrez IDs are used as they are widely adopted by the different source databases and tend to be more consistent and robust since there is one identifier per gene in a gene set. It is also relatively easy to convert other gene IDs into Entrez IDs. library(EGSEA) gs.annots = buildIdx(entrezIDs=v$genes$ENTREZID, species="mouse", msigdb.gsets=c("c2", "c5"), go.part = TRUE) ## [1] "Loading MSigDB Gene Sets ... " ## [1] "Loaded gene sets for the collection c2 ..." ## [1] "Indexed the collection c2 ..." ## [1] "Created annotation for the collection c2 ..." ## [1] "Loaded gene sets for the collection c5 ..." ## [1] "Indexed the collection c5 ..." ## [1] "Created annotation for the collection c5 ..." ## MSigDB c5 gene set collection has been partitioned into ## c5BP, c5CC, c5MF ## [1] "Building KEGG pathways annotation object ... " names(gs.annots) ## [1] "c2" "c5BP" "c5CC" "c5MF" "kegg" To obtain additional information on the gene set collection indexes, including the total number of gene sets, the version number and date of last revision, the methods summary, show and getSetByName (or getSetByID) can be invoked on an object of class GSCollectionIndex, which stores all of the relevant gene set information, as follows: class(gs.annots$c2)
## [1] "GSCollectionIndex"
## attr(,"package")
## [1] "EGSEA"
summary(gs.annots$c2) ## c2 Curated Gene Sets (c2): 4726 gene sets - Version: 5.2, Update date: 07 March 2017 show(gs.annots$c2)
## An object of class "GSCollectionIndex"
##  Number of gene sets: 4726
##  Annotation columns: ID, GeneSet, BroadUrl, Description, PubMedID, NumGenes, Contributor
##  Total number of indexing genes: 14165
##  Species: Mus musculus
##  Collection name: c2 Curated Gene Sets
##  Collection uniqe label: c2
##  Database version: 5.2
##  Database update date: 07 March 2017
s = getSetByName(gs.annots$c2, "SMID_BREAST_CANCER_LUMINAL_A_DN") ## ID: M13072 ## GeneSet: SMID_BREAST_CANCER_LUMINAL_A_DN ## BroadUrl: http://www.broadinstitute.org/gsea/msigdb/cards/SMID_BREAST_CANCER_LUMINAL_A_DN.html ## Description: Genes down-regulated in the luminal A subtype of breast cancer. ## PubMedID: 18451135 ## NumGenes: 23/24 ## Contributor: Jessica Robertson class(s) ## [1] "list" names(s) ## [1] "SMID_BREAST_CANCER_LUMINAL_A_DN" names(s$SMID_BREAST_CANCER_LUMINAL_A_DN)
## [1] "ID"          "GeneSet"     "BroadUrl"    "Description" "PubMedID"
## [6] "NumGenes"    "Contributor" "idx"         "original"

Objects of class GSCollectionIndex store for each gene set the Entrez gene IDs in the slot original, the indexes in the slot idx and additional annotation for each set in the slot anno.

slotNames(gs.annots$c2) ## [1] "original" "idx" "anno" "featureIDs" "species" ## [6] "name" "label" "version" "date" Other EGSEA functions such as buildCustomIdx, buildGMTIdx, buildKEGGIdx, buildMSigDBIdx and buildGeneSetDBIdx can be also used to build gene set collection indexes. The functions buildCustomIdx and buildGMTIdx were written to allow users to run EGSEA on gene set collections that may have been curated within a lab or downloaded from public databases and allow use of gene identifiers other than Entrez IDs. Example databases include, ENCODE Gene Set Hub (available from https://sourceforge.net/projects/encodegenesethub/), which is a growing resource of gene sets derived from high quality ENCODE profiling experiments encompassing hundreds of DNase hypersensitivity, histone modification and transcription factor binding experiments (Ziemann et al. 2017). Other resources include PathwayCommons (http://www.pathwaycommons.org/) (Cerami et al. 2011) or the KEGGREST (Tenenbaum 2017) package that provides access to up-to-date KEGG pathways across many species. #### 3.1.2.2 2. Configuring EGSEA Before an EGSEA test is carried out, a few parameters need to be specified. First, a mapping between Entrez IDs and Gene Symbols is created for use by the visualization procedures. This mapping can be extracted from the genes data.frame of the voom object as follows: colnames(v$genes)
## [1] "ENTREZID" "SYMBOL"   "CHR"
symbolsMap = v$genes[, c(1, 2)] colnames(symbolsMap) = c("FeatureID", "Symbols") symbolsMap[, "Symbols"] = as.character(symbolsMap[, "Symbols"]) Another important parameter in EGSEA is the list of base GSE methods (baseMethods in the code below), which determines the individual algorithms that are used in the ensemble testing. The supported base methods can be listed using the function egsea.base as follows: egsea.base() ## [1] "camera" "roast" "safe" "gage" "padog" ## [6] "plage" "zscore" "gsva" "ssgsea" "globaltest" ## [11] "ora" "fry" The plage, zscore and ssgsea algorithms are available in the GSVA package and camera, fry and roast are implemented in the limma package (Ritchie et al. 2015). The ora method is implemented using the phyper function from the stats package (R Core Team 2017), which estimates the hypergeometric distribution for a $$2 \times 2$$ contingency table. The remaining algorithms are implemented in Bioconductor packages of the same name. A wrapper function is provided for each individual GSE method to utilize this existing R code and create a universal interface for all methods. Eleven base methods are selected for our EGSEA analysis: camera, safe, gage, padog, plage, zscore, gsva, ssgsea, globaltest, ora and fry. Fry is a fast approximation of roast that assumes equal gene-wise variances across samples to produce similar $$p$$-values to a roast analysis run with an infinite number of rotations, and is selected here to save time. baseMethods = egsea.base()[-2] baseMethods ## [1] "camera" "safe" "gage" "padog" "plage" ## [6] "zscore" "gsva" "ssgsea" "globaltest" "ora" ## [11] "fry" Although, different combinations of base methods might produce different results, it has been found via simulation that including more methods gives better performance (Alhamdoosh et al. 2017). Since each base method generates different $$p$$-values, EGSEA supports six different methods from the metap package (Dewey 2017) for combining individual $$p$$-values (Wilkinson (Wilkinson 1954) is default), which can be listed as follows: egsea.combine() ## [1] "fisher" "wilkinson" "average" "logitp" "sump" "sumz" ## [7] "votep" "median" Finally, the sorting of EGSEA results plays an essential role in identifying relevant gene sets. Any of EGSEA’s combined scores or the rankings from individual base methods can be used for sorting the results. egsea.sort() ## [1] "p.value" "p.adj" "vote.rank" "avg.rank" ## [5] "med.rank" "min.pvalue" "min.rank" "avg.logfc" ## [9] "avg.logfc.dir" "direction" "significance" "camera" ## [13] "roast" "safe" "gage" "padog" ## [17] "plage" "zscore" "gsva" "ssgsea" ## [21] "globaltest" "ora" "fry" Although p.adj is the default option for sorting EGSEA results for convenience, we recommend the use of either med.rank or vote.rank because they efficiently utilize the rankings of individual methods and tend to produce fewer false positives (Alhamdoosh et al. 2017). #### 3.1.2.3 3. Ensemble testing with EGSEA Next, the EGSEA analysis is performed using the egsea function that takes a voom object, a contrasts matrix, collections of gene sets and other run parameters as follows: gsa = egsea(voom.results=v, contrasts=contr.matrix, gs.annots=gs.annots, symbolsMap=symbolsMap, baseGSEAs=baseMethods, sort.by="med.rank", num.threads = 16, report = FALSE) ## EGSEA analysis has started ## ##------ Wed Feb 1 13:27:11 2023 ------## ## Log fold changes are estimated using limma package ... ## limma DE analysis is carried out ... ## EGSEA is running on the provided data and c2 collection ##  ## EGSEA is running on the provided data and c5BP collection ##  ## EGSEA is running on the provided data and c5CC collection ##  ## EGSEA is running on the provided data and c5MF collection ##  ## EGSEA is running on the provided data and kegg collection ##  ## ##------ Wed Feb 1 13:28:39 2023 ------## ## EGSEA analysis took 88.27 seconds. ## EGSEA analysis has completed In situations where the design matrix includes an intercept, a vector of integers that specify the columns of the design matrix to test using EGSEA can be passed to the contrasts argument. If this parameter is NULL, all pairwise comparisons based on v$targets\$group are created, assuming that group is the primary factor in the design matrix. Likewise, all the coefficients of the primary factor are used if the design matrix has an intercept.

EGSEA is implemented with parallel computing features enabled using the parallel package (R Core Team 2017) at both the method-level and experimental contrast-level. The running time of the EGSEA test depends on the base methods selected and whether report generation is enabled or not. The latter significantly increases the run time, particularly if the argument display.top is assigned a large value ($$>$$ 20) and/or a large number of gene set collections are selected. EGSEA reporting functionality generates set-level plots for the top gene sets as well as collection-level plots.

The EGSEA package also has a function named egsea.cnt, that can perform the EGSEA test using an RNA-seq count matrix rather than a voom object, a function named egsea.ora, that can perform over-representation analysis with EGSEA reporting capabilities using only a vector of gene IDs, and the egsea.ma function that can perform EGSEA testing using a microarray expression matrix as shown later in the workflow.

##### 3.1.2.3.1 Classes used to manage the results

The output of the functions egsea, egsea.cnt, egsea.ora and egsea.ma is an S4 object of class EGSEAResults. Several S4 methods can be invoked to query this object. For example, an overview of the EGSEA analysis can be displayed using the show method as follows:

show(gsa)
## An object of class "EGSEAResults"
##  Total number of genes: 14165
##  Total number of samples: 9
##  Contrasts: BasalvsLP, BasalvsML, LPvsML
##  Base GSE methods: camera (limma:3.54.1), safe (safe:3.38.0), gage (gage:2.48.0), padog (PADOG:1.40.0), plage (GSVA:1.46.0), zscore (GSVA:1.46.0), gsva (GSVA:1.46.0), ssgsea (GSVA:1.46.0), globaltest (globaltest:5.52.0), ora (stats:4.2.2), fry (limma:3.54.1)
##  P-values combining method: wilkinson
##  Sorting statistic: med.rank
##  Organism: Mus musculus
##  HTML report generated: No
##  Tested gene set collections:
##      c2 Curated Gene Sets (c2): 4726 gene sets - Version: 5.2, Update date: 07 March 2017
##      c5 GO Gene Sets (BP) (c5BP): 4653 gene sets - Version: 5.2, Update date: 07 March 2017
##      c5 GO Gene Sets (CC) (c5CC): 584 gene sets - Version: 5.2, Update date: 07 March 2017
##      c5 GO Gene Sets (MF) (c5MF): 928 gene sets - Version: 5.2, Update date: 07 March 2017
##      KEGG Pathways (kegg): 287 gene sets - Version: NA, Update date: 07 March 2017
##  EGSEA version: 1.26.0
## Use summary(object) and topSets(object, ...) to explore this object.

This command displays the number of genes and samples that were included in the analysis, the experimental contrasts, base GSE methods, the method used to combine the $$p$$-values derived from different GSE algorithms, the sorting statistic used and the size of each gene set collection. Note that the gene set collections are identified using the labels that appear in parentheses (e.g. c2) in the output of show.

#### 3.1.2.4 4. Reporting EGSEA results

##### 3.1.2.4.1 Getting top ranked gene sets

A summary of the top 10 gene sets in each collection for each contrast in addition to the EGSEA comparative analysis can be displayed using the S4 method summary as follows:

summary(gsa)
## **** Top 10 gene sets in the c2 Curated Gene Sets collection ****
## ** Contrast BasalvsLP **
## LIM_MAMMARY_STEM_CELL_DN | LIM_MAMMARY_LUMINAL_PROGENITOR_UP
## MONTERO_THYROID_CANCER_POOR_SURVIVAL_UP | SMID_BREAST_CANCER_LUMINAL_A_DN
## NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP | REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS
## REACTOME_TRANSFERRIN_ENDOCYTOSIS_AND_RECYCLING | FARMER_BREAST_CANCER_CLUSTER_2
## KEGG_EPITHELIAL_CELL_SIGNALING_IN_HELICOBACTER_PYLORI_INFECTION | LANDIS_BREAST_CANCER_PROGRESSION_UP
##
## ** Contrast BasalvsML **
## LIM_MAMMARY_STEM_CELL_DN | LIM_MAMMARY_STEM_CELL_UP
## LIM_MAMMARY_LUMINAL_MATURE_DN | PAPASPYRIDONOS_UNSTABLE_ATEROSCLEROTIC_PLAQUE_DN
## NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP | LIM_MAMMARY_LUMINAL_MATURE_UP
## YAGUE_PRETUMOR_DRUG_RESISTANCE_DN | BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_DN
##
## ** Contrast LPvsML **
## LIM_MAMMARY_LUMINAL_MATURE_UP | LIM_MAMMARY_LUMINAL_MATURE_DN
## PHONG_TNF_RESPONSE_VIA_P38_PARTIAL | WOTTON_RUNX_TARGETS_UP
## WANG_MLL_TARGETS | PHONG_TNF_TARGETS_DN
## REACTOME_PEPTIDE_LIGAND_BINDING_RECEPTORS | CHIANG_LIVER_CANCER_SUBCLASS_CTNNB1_DN
## GERHOLD_RESPONSE_TO_TZD_DN | DURAND_STROMA_S_UP
##
## ** Comparison analysis **
## LIM_MAMMARY_LUMINAL_MATURE_DN | LIM_MAMMARY_STEM_CELL_DN
## NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP | LIM_MAMMARY_LUMINAL_MATURE_UP
## COLDREN_GEFITINIB_RESISTANCE_DN | LIM_MAMMARY_STEM_CELL_UP
## CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_UP | LIM_MAMMARY_LUMINAL_PROGENITOR_UP
## BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_DN | MIKKELSEN_IPS_WITH_HCP_H3K27ME3
##
## **** Top 10 gene sets in the c5 GO Gene Sets (BP) collection ****
## ** Contrast BasalvsLP **
## GO_SYNAPSE_ORGANIZATION | GO_IRON_ION_TRANSPORT
## GO_FERRIC_IRON_TRANSPORT | GO_TRIVALENT_INORGANIC_CATION_TRANSPORT
## GO_NEURON_PROJECTION_GUIDANCE | GO_MESONEPHROS_DEVELOPMENT
##
## ** Contrast BasalvsML **
## GO_FERRIC_IRON_TRANSPORT | GO_TRIVALENT_INORGANIC_CATION_TRANSPORT
## GO_IRON_ION_TRANSPORT | GO_NEURON_PROJECTION_GUIDANCE
## GO_GLIAL_CELL_MIGRATION | GO_SPINAL_CORD_DEVELOPMENT
## GO_REGULATION_OF_SYNAPSE_ORGANIZATION | GO_ACTION_POTENTIAL
## GO_MESONEPHROS_DEVELOPMENT | GO_NEGATIVE_REGULATION_OF_SMOOTH_MUSCLE_CELL_MIGRATION
##
## ** Contrast LPvsML **
## GO_NEGATIVE_REGULATION_OF_NECROTIC_CELL_DEATH | GO_PARTURITION
## GO_RESPONSE_TO_VITAMIN_D | GO_GPI_ANCHOR_METABOLIC_PROCESS
## GO_REGULATION_OF_BLOOD_PRESSURE | GO_DETECTION_OF_MOLECULE_OF_BACTERIAL_ORIGIN
## GO_INTRACILIARY_TRANSPORT | GO_CELLULAR_RESPONSE_TO_VITAMIN
##
## ** Comparison analysis **
## GO_IRON_ION_TRANSPORT | GO_FERRIC_IRON_TRANSPORT
## GO_TRIVALENT_INORGANIC_CATION_TRANSPORT | GO_NEURON_PROJECTION_GUIDANCE
## GO_MESONEPHROS_DEVELOPMENT | GO_SYNAPSE_ORGANIZATION
##
## **** Top 10 gene sets in the c5 GO Gene Sets (CC) collection ****
## ** Contrast BasalvsLP **
## GO_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX | GO_VACUOLAR_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX
## GO_MICROTUBULE_END | GO_MICROTUBULE_PLUS_END
## GO_NEUROMUSCULAR_JUNCTION | GO_INTERMEDIATE_FILAMENT
##
## ** Contrast BasalvsML **
## GO_FILOPODIUM_MEMBRANE | GO_LATE_ENDOSOME_MEMBRANE
## GO_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX | GO_NEUROMUSCULAR_JUNCTION
## GO_COATED_MEMBRANE | GO_ACTIN_FILAMENT_BUNDLE
##
## ** Contrast LPvsML **
## GO_CILIARY_TRANSITION_ZONE | GO_TCTN_B9D_COMPLEX
## GO_NUCLEAR_NUCLEOSOME | GO_INTRINSIC_COMPONENT_OF_ORGANELLE_MEMBRANE
## GO_ENDOPLASMIC_RETICULUM_QUALITY_CONTROL_COMPARTMENT | GO_KERATIN_FILAMENT
## GO_PROTEASOME_COMPLEX | GO_PERIKARYON
## GO_CILIARY_BASAL_BODY | GO_PROTEASOME_CORE_COMPLEX
##
## ** Comparison analysis **
## GO_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX | GO_ACTIN_FILAMENT_BUNDLE
## GO_CONTRACTILE_FIBER | GO_INTERMEDIATE_FILAMENT
## GO_LATE_ENDOSOME_MEMBRANE | GO_CLATHRIN_VESICLE_COAT
## GO_ENDOPLASMIC_RETICULUM_QUALITY_CONTROL_COMPARTMENT | GO_MICROTUBULE_END
##
## **** Top 10 gene sets in the c5 GO Gene Sets (MF) collection ****
## ** Contrast BasalvsLP **
## GO_HYDROGEN_EXPORTING_ATPASE_ACTIVITY | GO_SIGNALING_PATTERN_RECOGNITION_RECEPTOR_ACTIVITY
## GO_LIPID_TRANSPORTER_ACTIVITY | GO_TRIGLYCERIDE_LIPASE_ACTIVITY
## GO_AMINE_BINDING | GO_STRUCTURAL_CONSTITUENT_OF_MUSCLE
## GO_NEUROPEPTIDE_RECEPTOR_ACTIVITY | GO_WIDE_PORE_CHANNEL_ACTIVITY
## GO_CATION_TRANSPORTING_ATPASE_ACTIVITY | GO_LIPASE_ACTIVITY
##
## ** Contrast BasalvsML **
## GO_G_PROTEIN_COUPLED_RECEPTOR_ACTIVITY | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_KINASE_ACTIVITY
## GO_STRUCTURAL_CONSTITUENT_OF_MUSCLE | GO_VOLTAGE_GATED_SODIUM_CHANNEL_ACTIVITY
## GO_CORECEPTOR_ACTIVITY | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_TYROSINE_KINASE_ACTIVITY
## GO_LIPID_TRANSPORTER_ACTIVITY | GO_SULFOTRANSFERASE_ACTIVITY
## GO_CATION_TRANSPORTING_ATPASE_ACTIVITY | GO_PEPTIDE_RECEPTOR_ACTIVITY
##
## ** Contrast LPvsML **
## GO_MANNOSE_BINDING | GO_PHOSPHORIC_DIESTER_HYDROLASE_ACTIVITY
## GO_BETA_1_3_GALACTOSYLTRANSFERASE_ACTIVITY | GO_COMPLEMENT_BINDING
## GO_LIGASE_ACTIVITY_FORMING_CARBON_NITROGEN_BONDS | GO_CARBOHYDRATE_PHOSPHATASE_ACTIVITY
## GO_LIPASE_ACTIVITY | GO_PEPTIDE_RECEPTOR_ACTIVITY
##
## ** Comparison analysis **
## GO_STRUCTURAL_CONSTITUENT_OF_MUSCLE | GO_LIPID_TRANSPORTER_ACTIVITY
## GO_CATION_TRANSPORTING_ATPASE_ACTIVITY | GO_CHEMOREPELLENT_ACTIVITY
## GO_HEPARAN_SULFATE_PROTEOGLYCAN_BINDING | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_TYROSINE_KINASE_ACTIVITY
## GO_LIPASE_ACTIVITY | GO_PEPTIDE_RECEPTOR_ACTIVITY
## GO_CORECEPTOR_ACTIVITY | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_KINASE_ACTIVITY
##
## **** Top 10 gene sets in the KEGG Pathways collection ****
## ** Contrast BasalvsLP **
## Collecting duct acid secretion | alpha-Linolenic acid metabolism
## Axon guidance | Synaptic vesicle cycle
## Hepatitis C | Vascular smooth muscle contraction
## Rheumatoid arthritis | cGMP-PKG signaling pathway
## Progesterone-mediated oocyte maturation | Arrhythmogenic right ventricular cardiomyopathy (ARVC)
##
## ** Contrast BasalvsML **
## Collecting duct acid secretion | Synaptic vesicle cycle
## Other glycan degradation | Arrhythmogenic right ventricular cardiomyopathy (ARVC)
## Glycerophospholipid metabolism | Lysosome
## Vascular smooth muscle contraction | Axon guidance
## Protein digestion and absorption | Oxytocin signaling pathway
##
## ** Contrast LPvsML **
## Glycosylphosphatidylinositol(GPI)-anchor biosynthesis | Drug metabolism - cytochrome P450
## Histidine metabolism | PI3K-Akt signaling pathway
## Sulfur metabolism | Proteasome
## Renin-angiotensin system | Nitrogen metabolism
## Tyrosine metabolism | Systemic lupus erythematosus
##
## ** Comparison analysis **
## Collecting duct acid secretion | Synaptic vesicle cycle
## Axon guidance | Vascular smooth muscle contraction
## Arrhythmogenic right ventricular cardiomyopathy (ARVC) | Oxytocin signaling pathway
## Lysosome | Adrenergic signaling in cardiomyocytes
## Linoleic acid metabolism | cGMP-PKG signaling pathway

EGSEA’s comparative analysis allows researchers to estimate the significance of a gene set across multiple experimental contrasts. This analysis helps in the identification of biological processes that are perturbed in multiple experimental conditions simultaneously. This experiment is the RNA-seq equivalent of Lim et al. (2010) (Lim et al. 2010), who used Illumina microarrays to study the same cell populations (see later), so it is reassuring to observe the LIM gene signatures derived from this experiment amongst the top ranked c2 gene signatures in both the individual contrasts and comparative results.

Another way of exploring the EGSEA results is to retrieve the top ranked $$N$$ sets in each collection and contrast using the method topSets. For example, the top 10 gene sets in the c2 collection for the comparative analysis can be retrieved as follows:

topSets(gsa, gs.label="c2", contrast = "comparison", names.only=TRUE)
## Extracting the top gene sets of the collection
## c2 Curated Gene Sets for the contrast comparison
##  Sorted by med.rank
##  [1] "LIM_MAMMARY_LUMINAL_MATURE_DN"
##  [2] "LIM_MAMMARY_STEM_CELL_DN"
##  [3] "NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP"
##  [4] "LIM_MAMMARY_LUMINAL_MATURE_UP"
##  [5] "COLDREN_GEFITINIB_RESISTANCE_DN"
##  [6] "LIM_MAMMARY_STEM_CELL_UP"
##  [7] "CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_UP"
##  [8] "LIM_MAMMARY_LUMINAL_PROGENITOR_UP"
##  [9] "BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_DN"
## [10] "MIKKELSEN_IPS_WITH_HCP_H3K27ME3"

The gene sets are ordered based on their med.rank score as selected when egsea was invoked above. When the argument names.only is set to FALSE, additional information is displayed for each gene set including gene set annotation, the EGSEA scores and the individual rankings by each base method. As expected, gene sets retrieved by EGSEA included the LIM gene sets (Lim et al. 2010) that were derived from microarray profiles of analagous mammary cell populations (sets 1, 2, 4, 6 and 8) as well as those derived from populations with similar origin (sets 7 and 9) and behaviour or characteristics (sets 5 and 10).

Next, topSets can be used to search for gene sets of interest based on different EGSEA scores as well as the rankings of individual methods. For example, the ranking of the six LIM gene sets from the c2 collection can be displayed based on the med.rank as follows:

t = topSets(gsa, contrast = "comparison",
names.only=FALSE, number = Inf, verbose = FALSE)
t[grep("LIM_", rownames(t)), c("p.adj", "Rank", "med.rank", "vote.rank")]
##                                           p.adj Rank med.rank vote.rank
## LIM_MAMMARY_LUMINAL_MATURE_DN      1.646053e-29    1       36         5
## LIM_MAMMARY_STEM_CELL_DN           6.082053e-43    2       37         5
## LIM_MAMMARY_LUMINAL_MATURE_UP      2.469061e-22    4       92         5
## LIM_MAMMARY_STEM_CELL_UP          3.154132e-103    6      134         5
## LIM_MAMMARY_LUMINAL_PROGENITOR_UP  3.871536e-30    8      180         5
## LIM_MAMMARY_LUMINAL_PROGENITOR_DN  2.033005e-06  178      636       115

While five of the LIM gene sets are ranked in the top 10 by EGSEA, the values shown in the median rank (med.rank) column indicate that individual methods can assign much lower ranks to these sets. EGSEA’s prioritisation of these gene sets demonstrates the benefit of an ensemble approach.

Similarly, we can find the top 10 pathways in the KEGG collection from the ensemble analysis for the Basal versus LP contrast and the comparative analysis as follows:

topSets(gsa, gs.label="kegg", contrast="BasalvsLP", sort.by="med.rank")
## Extracting the top gene sets of the collection
## KEGG Pathways for the contrast BasalvsLP
##  Sorted by med.rank
##  [1] "Collecting duct acid secretion"
##  [2] "alpha-Linolenic acid metabolism"
##  [3] "Axon guidance"
##  [4] "Synaptic vesicle cycle"
##  [5] "Hepatitis C"
##  [6] "Vascular smooth muscle contraction"
##  [7] "Rheumatoid arthritis"
##  [8] "cGMP-PKG signaling pathway"
##  [9] "Progesterone-mediated oocyte maturation"
## [10] "Arrhythmogenic right ventricular cardiomyopathy (ARVC)"
topSets(gsa, gs.label="kegg", contrast="comparison", sort.by="med.rank")
## Extracting the top gene sets of the collection
## KEGG Pathways for the contrast comparison
##  Sorted by med.rank
##  [1] "Collecting duct acid secretion"
##  [2] "Synaptic vesicle cycle"
##  [3] "Axon guidance"
##  [4] "Vascular smooth muscle contraction"
##  [5] "Arrhythmogenic right ventricular cardiomyopathy (ARVC)"
##  [6] "Oxytocin signaling pathway"
##  [7] "Lysosome"
##  [8] "Adrenergic signaling in cardiomyocytes"
##  [9] "Linoleic acid metabolism"
## [10] "cGMP-PKG signaling pathway"

EGSEA highlights many pathways with known importance in the mammary gland such as those associated with distinct roles in lactation like basal cell contraction (Vascular smooth muscle contraction and Oxytocin signalling pathway) and milk production and secretion from luminal lineage cells (Collecting duct acid secretion, Synaptic vesicle cycle and Lysosome).

##### 3.1.2.4.2 Visualizing results at the gene set level

Graphical representation of gene expression patterns within and between gene sets is an essential part of communicating the results of an analysis to collaborators and other researchers. EGSEA enables users to explore the elements of a gene set via a heatmap using the plotHeatmap method. Figure 2 shows examples for the LIM MAMMARY STEM CELL UP and LIM MAMMARY STEM CELL DN signatures which can be visualized across all contrasts using the code below.

plotHeatmap(gsa, gene.set="LIM_MAMMARY_STEM_CELL_UP", gs.label="c2",
contrast = "comparison", file.name = "hm_cmp_LIM_MAMMARY_STEM_CELL_UP", format="png")
## Generating heatmap for LIM_MAMMARY_STEM_CELL_UP from the collection
## c2 Curated Gene Sets and for the contrast comparison
plotHeatmap(gsa, gene.set="LIM_MAMMARY_STEM_CELL_DN", gs.label="c2",
contrast = "comparison", file.name = "hm_cmp_LIM_MAMMARY_STEM_CELL_DN", format="png")
## Generating heatmap for LIM_MAMMARY_STEM_CELL_DN from the collection
## c2 Curated Gene Sets and for the contrast comparison

When using plotHeatmap, the gene.set value must match the name returned from the topSets method. The rows of the heatmap represent the genes in the set and the columns represent the experimental contrasts. The heatmap colour-scale ranges from down-regulated (blue) to up-regulated (red) while the row labels (Gene symbols) are coloured in green when the genes are statistically significant in the DE analysis (i.e. FDR $$\leq$$ 0.05 in at least one contrast). Heatmaps can be generated for individual comparisons by changing the contrast argument of plotHeatmap. The plotHeatmap method also generates a CSV file that includes the DE analysis results from limma::topTable for all expressed genes in the selected gene set and for each contrast (in the case of contrast = "comparison"). This file can be used to create customised plots using other R/Bioconductor packages.

In addition to heatmaps, pathway maps can be generated for the KEGG gene sets using the plotPathway method which uses functionality from the pathview package (Luo and Brouwer 2013). For example, the first KEGG signalling pathway retrieved for the contrast BasalvsLP is Vascular smooth muscle contraction and can be visualized as follows:

plotPathway(gsa, gene.set = "Vascular smooth muscle contraction",
contrast = "BasalvsLP", gs.label = "kegg",
file.name = "Vascular_smooth_muscle_contraction")