1 Abstract

Gene set enrichment analysis encompasses a broad range of methods for prioritising the biological processes perturbed in genomics data. Indeed, the Bioconductor project hosts more than 80 software packages that identify with the task of gene set and pathway level analysis. The most common application of gene set enrichment analysis, is to help in the interpretation of the long lists of genes that arise from a differential expression analysis by overlaying externally curated gene signature collections to obtain higher level biological themes in a dataset. A key challenge in such an analysis is to choose an appropriate testing method and then visualize the results for enriched gene sets.

This workflow article demonstrates use of the EGSEA package on both RNA-seq and microarray data sets obtained from mouse mammary gland epithelial cell populations. Starting with normalized data, EGSEA builds gene signature specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA then uses an ensemble approach for gene set testing that combines results from up to 12 prominent algorithms available in Bioconductor in order to arrive at a consensus ranking. It returns an object that can be queried using several S4 methods for ranking gene sets and visualizing results in various ways, such as heatmaps, KEGG pathway views, GO graphs and bar plots. Finally, an HTML report that combines these displays provides a convenient means of sharing results with collaborators, allowing them to explore their biological relevance. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for mouse and human data.

2 Introduction

Gene set enrichment analysis allows researchers to efficiently extract biological insights from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of gene set enrichment (GSE) analysis methods released through the Bioconductor project (Huber et al. 2015) together with a steady increase in the number of gene set collections available through online databases such as MSigDB (Subramanian et al. 2005), GeneSetDB (Araki et al. 2012) and KEGG (Kanehisa and Goto 2000).

In an effort to unify these computational methods and knowledge bases, the EGSEA R/Bioconductor package was developed. EGSEA, which stands for Ensemble of Gene Set Enrichment Analysis (Alhamdoosh et al. 2017) combines the results from multiple algorithms to arrive at a consensus gene set ranking to identify biological themes and pathways perturbed in an experiment. EGSEA calculates seven statistics to combine the individual gene set statistics of base GSE methods, and to rank and hence identify biologically relevant gene sets. The current version of the EGSEA package (Alhamdoosh, Ng, and Ritchie 2017) utilizes the analysis results of twelve prominent GSE algorithms in the literature. These methods include: ora (Tavazoie et al. 1999), globaltest (Goeman et al. 2004), plage (Tomfohr, Lu, and Kepler 2005), safe (Barry, Nobel, and Wright 2005), zscore (E. Lee et al. 2008), gage (Luo et al. 2009), ssgsea (Barbie et al. 2009), padog (Tarca et al. 2012), gsva (Hänzelmann, Castelo, and Guinney 2013), camera (Wu and Smyth 2012), roast (Wu et al. 2010) and fry (Wu et al. 2010). The ora, gage, camera and gsva methods depend on a competitive null hypothesis while the remaining eight methods are based on a self-contained null hypothesis. Competitive tests assume the genes in a set do not have a stronger association with the experimental condition compared to randomly chosen genes outside the set, while a self-contained null hypothesis assumes the genes in a set do not have any association with the condition while ignoring genes outside the set.

The plage, zscore and ssgsea algorithms are implemented in the GSVA package and camera, fry and roast are implemented in the limma package (Ritchie et al. 2015). The ora method is implemented using the phyper function from the stats package, which estimates the hypergeometric distribution for a $2 \times 2$ contingency table. The remaining algorithms are implemented in Bioconductor packages of the same name. EGSEA is implemented with parallel computing features enabled using the parallel package. There are two levels of parallelism in EGSEA: (1) parallelism at the method-level and (2) parallelism at the experimental contrast level. A wrapper function is provided for each individual GSE method to utilize existing R packages and create a universal interface for all methods.

EGSEA provides access to a diverse range of gene signature collections through the EGSEAdata package that includes more than 25,000 gene sets for human and mouse organised according to their database sources (Table 1). For example, MSigDB (Subramanian et al. 2005) includes a number of collections (Hallmark (h) and c1-c7) that explore different biological themes ranging from very broad (h, c2, c5) through to more specialised ones focusing on cancer (c4, c6) and immunology (c7). The other main sources of GeneSetDB (Araki et al. 2012) and KEGG (Kanehisa and Goto 2000) have similar collections focusing on different biological characteristics. Which collections are selected in any given analysis should always be guided by the biological question of interest. The MSigDB c2 and c5 collections are the most widely used in our own analysis practice, spanning a wide range of biological processes that can often reveal new biological insights when applied to a given data set.

The purpose of this article is to demonstrate the gene set testing workflow available in EGSEA on both an RNA-seq and a microarray experiment. Each analysis involves four major steps: (1) selecting appropriate gene set collections for analysis and building an index that maps between the members of each set and the expression matrix; (2) choosing base GSE methods to combine along with ranking options; (3) running the EGSEA test and (4) reporting the results in various ways to share with collaborators (see Figure 1 for a summary). The EGSEA functions involved in each of these steps are introduced with code to demonstrate to Bioinformaticians and Computational Biologists how they can be deployed to help with the interpretation of gene expression data.

The main steps in an EGSEA analysis and the functions that perform each task.

3 Gene expression profiling of the mouse mammary gland

The first experiment analysed in this workflow is an RNA-seq data set from Sheridan et al. (2015) (Sheridan et al. 2015) that consists of 3 cell populations (Basal, Luminal Progenitor (LP) and Mature Luminal (ML)) sorted from the mammary glands of female virgin mice. Triplicate RNA samples from each population were obtained in 3 batches and sequenced on an Illumina HiSeq 2000 using a 100 base-pair single-ended protocol. Raw sequence reads from the fastq files were aligned to the mouse reference genome (mm10) using the Rsubread package (Liao, Smyth, and Shi 2013). Next, gene-level counts were obtained using featureCounts (Liao, Smyth, and Shi 2014) based on Rsubread’s built-in mm10 RefSeq-based annotation.
The raw data along with further information on experimental design and sample preparation can be downloaded from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) using GEO Series accession number GSE63310 and will be preprocessed according to the RNA-seq workflow published by Law et al. (2016) (Law et al. 2016).

The second experiment analysed in this workflow comes from Lim et al. (2010) (Lim et al. 2010) and is the microarray equivalent of the RNA-seq data set mentioned above. The same 3 populations (Basal (also referred to as MaSC-enriched), LP and ML) were sorted from mouse mammary glands via flow cytometry. Total RNA from 5 replicates of each cell population were hybridised onto 3 Illumina MouseWG-6 v2 BeadChips. The intensity files and chip annotation file available in Illumina’s proprietary formats (IDAT and BGX respectively) can be downloaded from http://bioinf.wehi.edu.au/EGSEA/arraydata.zip. The raw data from this experiment is also available from GEO under Series accession number GSE19446.

3.1 Analysis of RNA-seq data with EGSEA

Our RNA-seq analysis follows on directly from the workflow of Law et al. (2016) (Law et al. 2016) which performs a differential gene expression analysis on this data set using the Bioconductor packages (Huber et al. 2015) edgeR (Robinson, McCarthy, and Smyth 2010), limma (Ritchie et al. 2015) and Glimma (Su et al. 2017) with gene annotation from Mus.musculus (Bioconductor Core Team 2015). The limma package offers a well-developed suite of statistical methods for dealing with differential expression for both microarray and RNA-seq data sets and will be used in the analyses of both datasets presented in this workflow.

3.1.1 Reading, preprocessing and normalisation of RNA-seq data

To get started with this analysis, download the R data file mam.rnaseq.rdata from http://bioinf.wehi.edu.au/EGSEA/mam.rnaseq.rdata. The code below loads the preprocessed count matrix from Law et al. (2016) (Law et al. 2016), performs TMM normalisation (Robinson and Oshlack 2010) on the raw counts, and calculates voom weights for use in comparisons of Basal versus LP, Basal versus ML, and LP versus ML populations.

## [1] "samples" "counts"  "genes"

## [1] 14165     9

##   Basal LP ML L006 L008
## 1     0  1  0    0    0
## 2     0  0  1    0    0
## 3     1  0  0    0    0
## 4     1  0  0    1    0
## 5     0  0  1    1    0
## 6     0  1  0    1    0

##        Contrasts
## Levels  BasalvsLP BasalvsML LPvsML
##   Basal         1         1      0
##   LP           -1         0      1
##   ML            0        -1     -1
##   L006          0         0      0
##   L008          0         0      0

The voom function (Law et al. 2014) from the limma package converts counts to log-counts-per-million (log-cpm) and calculates observation-level precision weights. The voom object (v) contains log-cpm and gene information used by all of the methods in the EGSEA analysis below. The precision weights stored within v are also used by the camera, roast and fry methods.

v = voom(x, design, plot=FALSE)
names(v)

## [1] "genes"   "targets" "E"       "weights" "design"

For further information on preprocessing, see Law et al. (2016) (Law et al. 2016) as a detailed explanation of these steps is beyond the scope of this article.

3.1.2 Gene set testing

3.1.2.1 1. Exploring, selecting and indexing gene set collections

The package EGSEAdata includes more than 25,000 gene sets organized in collections depending on their database sources. Summary information about the gene set collections available in EGSEAdata can be displayed as follows:

library(EGSEAdata)
egsea.data("mouse")

## The following databases are available in EGSEAdata for Mus musculus:
## 
## Database name: KEGG Pathways
## Version: NA
## Download/update date: 07 March 2017
## Data source: gage::kegg.gsets()
## Supported species: human, mouse, rat
## Gene set collections: Signaling, Metabolism, Disease
## Related data objects: kegg.pathways
## Number of gene sets in each collection for  Mus musculus : 
##  Signaling: 132
##  Metabolism: 89
##  Disease: 67
## 
## Database name: Molecular Signatures Database (MSigDB)
## Version: 5.2
## Download/update date: 07 March 2017
## Data source: http://software.broadinstitute.org/gsea
## Supported species: human, mouse
## Gene set collections: h, c1, c2, c3, c4, c5, c6, c7
## Related data objects: msigdb, Mm.H, Mm.c2, Mm.c3, Mm.c4, Mm.c5, Mm.c6, Mm.c7
## Number of gene sets in each collection for  Mus musculus : 
##  h Hallmark Signatures: 50
##  c2 Curated Gene Sets: 4729
##  c3 Motif Gene Sets: 836
##  c4 Computational Gene Sets: 858
##  c5 GO Gene Sets: 6166
##  c6 Oncogenic Signatures: 189
##  c7 Immunologic Signatures: 4872
## 
## Database name: GeneSetDB Database
## Version: NA
## Download/update date: 15 January 2016
## Data source: http://www.genesetdb.auckland.ac.nz/
## Supported species: human, mouse, rat
## Gene set collections: gsdbdis, gsdbgo, gsdbdrug, gsdbpath, gsdbreg
## Related data objects: gsetdb.human, gsetdb.mouse, gsetdb.rat
## Number of gene sets in each collection for  Mus musculus : 
##  GeneSetDB Drug/Chemical: 6019
##  GeneSetDB Disease/Phenotype: 5077
##  GeneSetDB Gene Ontology: 2202
##  GeneSetDB Pathway: 1444
##  GeneSetDB Gene Regulation: 201
## 
## Type ?<data object name> to get a specific information 
##              about it, e.g., ?kegg.pathways.

As the output above suggests, users can obtain help on any of the collections using the R help (?) command, for instance ?Mm.c2 will return more information on the mouse version of the c2 collection from MSigDB. The above information can be returned as a list:

info = egsea.data("mouse", returnInfo = TRUE)
names(info)

## [1] "kegg"   "msigdb" "gsetdb"

info$msigdb$info$collections

## [1] "h"  "c1" "c2" "c3" "c4" "c5" "c6" "c7"

To highlight the capabilities of the EGSEA package, the KEGG pathways (Kanehisa and Goto 2000), c2 (curated gene sets) and c5 (Gene Ontology gene sets) collections from the MSigDB database (Subramanian et al. 2005) are selected.

Next, an index is built for each gene set collection using the EGSEA indexing functions to link the genes in the different gene set collections to the rows of our RNA-seq data. Indexes for the c2 and c5 collections from MSigDB and for the KEGG pathways are built using the buildIdx function.

library(EGSEA)

gs.annots = buildIdx(entrezIDs=v$genes$ENTREZID, species="mouse", 
           msigdb.gsets=c("c2", "c5"), go.part = TRUE)

## [1] "Loading MSigDB Gene Sets ... "
## [1] "Loaded gene sets for the collection c2 ..."
## [1] "Indexed the collection c2 ..."
## [1] "Created annotation for the collection c2 ..."
## [1] "Loaded gene sets for the collection c5 ..."
## [1] "Indexed the collection c5 ..."
## [1] "Created annotation for the collection c5 ..."
## MSigDB c5 gene set collection has been partitioned into 
## c5BP, c5CC, c5MF
## [1] "Building KEGG pathways annotation object ... "

names(gs.annots)

## [1] "c2"   "c5BP" "c5CC" "c5MF" "kegg"

To obtain additional information on the gene set collection indexes, including the total number of gene sets, the version number and date of last revision, the methods summary, show, getSetByName and getSetByID can be invoked on an object of class GSCollectionIndex, which stores all of the relevant gene set information, as follows:

class(gs.annots$c2)

## [1] "GSCollectionIndex"
## attr(,"package")
## [1] "EGSEA"

summary(gs.annots$c2)

## c2 Curated Gene Sets (c2): 4726 gene sets - Version: 5.2, Update date: 07 March 2017

show(gs.annots$c2)

## An object of class "GSCollectionIndex"
##  Number of gene sets: 4726
##  Annotation columns: ID, GeneSet, BroadUrl, Description, PubMedID, NumGenes, Contributor
##  Total number of indexing genes: 14165
##  Species: Mus musculus
##  Collection name: c2 Curated Gene Sets
##  Collection uniqe label: c2
##  Database version: 5.2
##  Database update date: 07 March 2017

s = getSetByName(gs.annots$c2, "SMID_BREAST_CANCER_LUMINAL_A_DN")

## ID: M13072
## GeneSet: SMID_BREAST_CANCER_LUMINAL_A_DN
## BroadUrl: http://www.broadinstitute.org/gsea/msigdb/cards/SMID_BREAST_CANCER_LUMINAL_A_DN.html
## Description: Genes down-regulated in the luminal A subtype of breast cancer.
## PubMedID: 18451135
## NumGenes: 23/24
## Contributor: Jessica Robertson

class(s)

## [1] "list"

names(s)

## [1] "SMID_BREAST_CANCER_LUMINAL_A_DN"

names(s$SMID_BREAST_CANCER_LUMINAL_A_DN)

## [1] "ID"          "GeneSet"     "BroadUrl"    "Description" "PubMedID"   
## [6] "NumGenes"    "Contributor"

Objects of class GSCollectionIndex store for each gene set the Entrez gene IDs in the slot original, the indexes in the slot idx and additional annotation for each set in the slot anno

slotNames(gs.annots$c2)

## [1] "original"   "idx"        "anno"       "featureIDs" "species"   
## [6] "name"       "label"      "version"    "date"

Other EGSEA functions such as buildCustomIdx, buildGMTIdx, buildKEGGIdx, buildMSigDBIdx and buildGeneSetDBIdx can be also used to build gene set collection indexes.

3.1.2.2 2. Configuring EGSEA

Before an EGSEA test is carried out, a few parameters need to be specified. First, a mapping between Entrez IDs and Gene Symbols is created for use by the visualization procedures. This mapping can be extracted from the genes slot of the voom object as follows:

colnames(v$genes)

## [1] "ENTREZID" "SYMBOL"   "CHR"

symbolsMap = v$genes[, c(1, 2)]
colnames(symbolsMap) = c("FeatureID", "Symbols")
symbolsMap[, "Symbols"] = as.character(symbolsMap[, "Symbols"])

Another important parameter in EGSEA is the list of base GSE methods (baseMethods in the code below), which determines the individual algorithms that are used in the ensemble testing. The supported base methods can be listed using the function egsea.base as follows:

egsea.base()

##  [1] "camera"     "roast"      "safe"       "gage"       "padog"     
##  [6] "plage"      "zscore"     "gsva"       "ssgsea"     "globaltest"
## [11] "ora"        "fry"

Eleven base methods are selected for the EGSEA analysis: camera (Wu and Smyth 2012), safe (Barry, Nobel, and Wright 2005), gage (Luo et al. 2009), padog (Tarca et al. 2012), plage (Tomfohr, Lu, and Kepler 2005), zscore (E. Lee et al. 2008), gsva (Hänzelmann, Castelo, and Guinney 2013), ssgsea (Barbie et al. 2009), globaltest (Goeman et al. 2004), ora (Tavazoie et al. 1999) and fry (Wu et al. 2010). Fry is a fast approximation that assumes equal gene-wise variances across samples, producing similar $p$-values to a roast analysis run with an infinite number of rotations.

baseMethods = egsea.base()[-2]
baseMethods

##  [1] "camera"     "safe"       "gage"       "padog"      "plage"     
##  [6] "zscore"     "gsva"       "ssgsea"     "globaltest" "ora"       
## [11] "fry"

Although, different combinations of base methods might produce different results, it has been found via simulation that including more methods gives better performance (Alhamdoosh et al. 2017).

Since each base method generates different $p$-values, EGSEA supports six different methods for combining individual $p$-values (Wilkinson (Wilkinson 1954) is default), which can be listed as follows:

egsea.combine()

## [1] "fisher"    "wilkinson" "average"   "logitp"    "sump"      "sumz"     
## [7] "votep"     "median"

Finally, the sorting of EGSEA results plays an essential role in identifying relevant gene sets. Any of EGSEA scores or the base methods’ rankings can be used for sorting EGSEA results as follows:

egsea.sort()

##  [1] "p.value"       "p.adj"         "vote.rank"     "avg.rank"     
##  [5] "med.rank"      "min.pvalue"    "min.rank"      "avg.logfc"    
##  [9] "avg.logfc.dir" "direction"     "significance"  "camera"       
## [13] "roast"         "safe"          "gage"          "padog"        
## [17] "plage"         "zscore"        "gsva"          "ssgsea"       
## [21] "globaltest"    "ora"           "fry"

Although p.adj is the default option for sorting EGSEA results for convenience, we recommend the use of either med.rank or vote.rank because they efficiently utilize the rankings of individual methods and tend to produce less false positives as observed in simulations (Alhamdoosh et al. 2017).

3.1.2.3 3. Ensemble testing with EGSEA

Next, the EGSEA analysis is performed using the egsea function that takes a voom object, contrast matrix, collections of gene sets and other run parameters as follows:

gsa = egsea(voom.results=v, contrasts=contr.matrix,  
         gs.annots=gs.annots, symbolsMap=symbolsMap,
         baseGSEAs=baseMethods, sort.by="med.rank",
         num.threads = 8, report = FALSE)

## EGSEA analysis has started

## ##------ Tue Jul 25 12:10:19 2017 ------##

## Log fold changes are estimated using limma package ...

## limma DE analysis is carried out ...

## Number of used cores has changed to 2
## in order to avoid CPU overloading.

## EGSEA is running on the provided data and c2 collection

##

## EGSEA is running on the provided data and c5BP collection

##

## EGSEA is running on the provided data and c5CC collection

##

## EGSEA is running on the provided data and c5MF collection

##

## EGSEA is running on the provided data and kegg collection

##

## ##------ Tue Jul 25 12:25:35 2017 ------##

## EGSEA analysis took 916.552 seconds.

## EGSEA analysis has completed

The running time of the EGSEA test depends on the base methods selected and whether report generation is enabled or not. The latter significantly increases the run time, particularly if the argument display.top is assigned a large value ($>$ 20) and/or a large number of gene set collections are selected. EGSEA reporting functionality generates set-level plots for every gene set and collection-level plots, which is a time consuming process.

The EGSEA package also has a function named egsea.cnt, that can perform the EGSEA test using an RNA-seq count matrix rather than a voom object, a function named egsea.ora, that can perform over-representation analysis with EGSEA reporting capabilities using only a vector of gene IDs, and the egsea.ma function that can perform EGSEA testing using a microarray expression matrix as shown later in the workflow.

3.1.2.3.1 Classes used to manage the results

The output of the functions egsea, egsea.cnt, egsea.ma and egsea.ora is an object of the S4 class EGSEAResults. Several S4 methods can be invoked to query this object. For example, an overview of the EGSEA analysis can be displayed using the show method as follows:

show(gsa)

## An object of class "EGSEAResults"
##  Total number of genes: 14165
##  Total number of samples: 9
##  Contrasts: BasalvsLP, BasalvsML, LPvsML
##  Base GSE methods: camera (limma:3.33.5), safe (safe:3.17.0), gage (gage:2.27.1), padog (PADOG:1.19.0), plage (GSVA:1.25.4), zscore (GSVA:1.25.4), gsva (GSVA:1.25.4), ssgsea (GSVA:1.25.4), globaltest (globaltest:5.31.0), ora (stats:3.4.1), fry (limma:3.33.5)
##  P-values combining method: wilkinson
##  Sorting statistic: med.rank
##  Organism: Mus musculus
##  HTML report generated: No
##  Tested gene set collections: 
##      c2 Curated Gene Sets (c2): 4726 gene sets - Version: 5.2, Update date: 07 March 2017
##      c5 GO Gene Sets (BP) (c5BP): 4653 gene sets - Version: 5.2, Update date: 07 March 2017
##      c5 GO Gene Sets (CC) (c5CC): 584 gene sets - Version: 5.2, Update date: 07 March 2017
##      c5 GO Gene Sets (MF) (c5MF): 928 gene sets - Version: 5.2, Update date: 07 March 2017
##      KEGG Pathways (kegg): 287 gene sets - Version: NA, Update date: 07 March 2017
##  EGSEA version: 1.5.2
##  EGSEAdata version: 1.5.0
## Use summary(object) and topSets(object, ...) to explore this object.

This command displays the number of genes and samples that were included in the analysis, the experimental contrasts, base GSE methods, $p$-value combining method, sorting statistic used and the size of each gene set collection. Note that the gene set collections are identified using the labels that appear in parentheses (e.g. c2) in the output of show.

3.1.2.4 4. Reporting EGSEA results

3.1.2.4.1 Getting top ranked gene sets

A summary of the top 10 gene sets in each collection for each contrast in addition to the EGSEA comparative analysis can be displayed using the S4 method summary as follows:

summary(gsa)

## **** Top 10 gene sets in the c2 Curated Gene Sets collection **** 
## ** Contrast BasalvsLP **
## LIM_MAMMARY_STEM_CELL_DN | LIM_MAMMARY_LUMINAL_PROGENITOR_UP
## MONTERO_THYROID_CANCER_POOR_SURVIVAL_UP | SMID_BREAST_CANCER_LUMINAL_A_DN
## NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP | REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS
## REACTOME_TRANSFERRIN_ENDOCYTOSIS_AND_RECYCLING | FARMER_BREAST_CANCER_CLUSTER_2
## KEGG_EPITHELIAL_CELL_SIGNALING_IN_HELICOBACTER_PYLORI_INFECTION | LANDIS_BREAST_CANCER_PROGRESSION_UP
## 
## ** Contrast BasalvsML **
## LIM_MAMMARY_STEM_CELL_DN | LIM_MAMMARY_STEM_CELL_UP
## LIM_MAMMARY_LUMINAL_MATURE_DN | PAPASPYRIDONOS_UNSTABLE_ATEROSCLEROTIC_PLAQUE_DN
## NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP | LIM_MAMMARY_LUMINAL_MATURE_UP
## CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_UP | RICKMAN_HEAD_AND_NECK_CANCER_A
## YAGUE_PRETUMOR_DRUG_RESISTANCE_DN | BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_DN
## 
## ** Contrast LPvsML **
## LIM_MAMMARY_LUMINAL_MATURE_UP | LIM_MAMMARY_LUMINAL_MATURE_DN
## PHONG_TNF_RESPONSE_VIA_P38_PARTIAL | WOTTON_RUNX_TARGETS_UP
## WANG_MLL_TARGETS | PHONG_TNF_TARGETS_DN
## REACTOME_PEPTIDE_LIGAND_BINDING_RECEPTORS | CHIANG_LIVER_CANCER_SUBCLASS_CTNNB1_DN
## GERHOLD_RESPONSE_TO_TZD_DN | DURAND_STROMA_S_UP
## 
## ** Comparison analysis ** 
## LIM_MAMMARY_LUMINAL_MATURE_DN | LIM_MAMMARY_STEM_CELL_DN
## NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP | LIM_MAMMARY_LUMINAL_MATURE_UP
## COLDREN_GEFITINIB_RESISTANCE_DN | LIM_MAMMARY_STEM_CELL_UP
## CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_UP | LIM_MAMMARY_LUMINAL_PROGENITOR_UP
## BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_DN | MIKKELSEN_IPS_WITH_HCP_H3K27ME3
## 
## **** Top 10 gene sets in the c5 GO Gene Sets (BP) collection **** 
## ** Contrast BasalvsLP **
## GO_SYNAPSE_ORGANIZATION | GO_IRON_ION_TRANSPORT
## GO_CALCIUM_INDEPENDENT_CELL_CELL_ADHESION_VIA_PLASMA_MEMBRANE_CELL_ADHESION_MOLECULES | GO_PH_REDUCTION
## GO_HOMOPHILIC_CELL_ADHESION_VIA_PLASMA_MEMBRANE_ADHESION_MOLECULES | GO_VACUOLAR_ACIDIFICATION
## GO_FERRIC_IRON_TRANSPORT | GO_TRIVALENT_INORGANIC_CATION_TRANSPORT
## GO_NEURON_PROJECTION_GUIDANCE | GO_MESONEPHROS_DEVELOPMENT
## 
## ** Contrast BasalvsML **
## GO_FERRIC_IRON_TRANSPORT | GO_TRIVALENT_INORGANIC_CATION_TRANSPORT
## GO_IRON_ION_TRANSPORT | GO_NEURON_PROJECTION_GUIDANCE
## GO_GLIAL_CELL_MIGRATION | GO_SPINAL_CORD_DEVELOPMENT
## GO_REGULATION_OF_SYNAPSE_ORGANIZATION | GO_ACTION_POTENTIAL
## GO_MESONEPHROS_DEVELOPMENT | GO_NEGATIVE_REGULATION_OF_SMOOTH_MUSCLE_CELL_MIGRATION
## 
## ** Contrast LPvsML **
## GO_NEGATIVE_REGULATION_OF_NECROTIC_CELL_DEATH | GO_PARTURITION
## GO_RESPONSE_TO_VITAMIN_D | GO_GPI_ANCHOR_METABOLIC_PROCESS
## GO_REGULATION_OF_BLOOD_PRESSURE | GO_DETECTION_OF_MOLECULE_OF_BACTERIAL_ORIGIN
## GO_CELL_SUBSTRATE_ADHESION | GO_PROTEIN_TRANSPORT_ALONG_MICROTUBULE
## GO_INTRACILIARY_TRANSPORT | GO_CELLULAR_RESPONSE_TO_VITAMIN
## 
## ** Comparison analysis ** 
## GO_IRON_ION_TRANSPORT | GO_FERRIC_IRON_TRANSPORT
## GO_TRIVALENT_INORGANIC_CATION_TRANSPORT | GO_NEURON_PROJECTION_GUIDANCE
## GO_MESONEPHROS_DEVELOPMENT | GO_SYNAPSE_ORGANIZATION
## GO_REGULATION_OF_SYNAPSE_ORGANIZATION | GO_MEMBRANE_DEPOLARIZATION_DURING_CARDIAC_MUSCLE_CELL_ACTION_POTENTIAL
## GO_HOMOPHILIC_CELL_ADHESION_VIA_PLASMA_MEMBRANE_ADHESION_MOLECULES | GO_NEGATIVE_REGULATION_OF_SMOOTH_MUSCLE_CELL_MIGRATION
## 
## **** Top 10 gene sets in the c5 GO Gene Sets (CC) collection **** 
## ** Contrast BasalvsLP **
## GO_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX | GO_VACUOLAR_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX
## GO_MICROTUBULE_END | GO_MICROTUBULE_PLUS_END
## GO_ACTIN_FILAMENT_BUNDLE | GO_CELL_CELL_ADHERENS_JUNCTION
## GO_NEUROMUSCULAR_JUNCTION | GO_AP_TYPE_MEMBRANE_COAT_ADAPTOR_COMPLEX
## GO_INTERMEDIATE_FILAMENT | GO_CONDENSED_NUCLEAR_CHROMOSOME_CENTROMERIC_REGION
## 
## ** Contrast BasalvsML **
## GO_FILOPODIUM_MEMBRANE | GO_LATE_ENDOSOME_MEMBRANE
## GO_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX | GO_NEUROMUSCULAR_JUNCTION
## GO_COATED_MEMBRANE | GO_ACTIN_FILAMENT_BUNDLE
## GO_CLATHRIN_COAT | GO_AP_TYPE_MEMBRANE_COAT_ADAPTOR_COMPLEX
## GO_CLATHRIN_ADAPTOR_COMPLEX | GO_CONTRACTILE_FIBER
## 
## ** Contrast LPvsML **
## GO_CILIARY_TRANSITION_ZONE | GO_TCTN_B9D_COMPLEX
## GO_NUCLEAR_NUCLEOSOME | GO_INTRINSIC_COMPONENT_OF_ORGANELLE_MEMBRANE
## GO_ENDOPLASMIC_RETICULUM_QUALITY_CONTROL_COMPARTMENT | GO_KERATIN_FILAMENT
## GO_PROTEASOME_COMPLEX | GO_CILIARY_BASAL_BODY
## GO_PROTEASOME_CORE_COMPLEX | GO_CORNIFIED_ENVELOPE
## 
## ** Comparison analysis ** 
## GO_PROTON_TRANSPORTING_V_TYPE_ATPASE_COMPLEX | GO_ACTIN_FILAMENT_BUNDLE
## GO_NEUROMUSCULAR_JUNCTION | GO_AP_TYPE_MEMBRANE_COAT_ADAPTOR_COMPLEX
## GO_CONTRACTILE_FIBER | GO_INTERMEDIATE_FILAMENT
## GO_LATE_ENDOSOME_MEMBRANE | GO_CLATHRIN_VESICLE_COAT
## GO_ENDOPLASMIC_RETICULUM_QUALITY_CONTROL_COMPARTMENT | GO_MICROTUBULE_END
## 
## **** Top 10 gene sets in the c5 GO Gene Sets (MF) collection **** 
## ** Contrast BasalvsLP **
## GO_HYDROGEN_EXPORTING_ATPASE_ACTIVITY | GO_SIGNALING_PATTERN_RECOGNITION_RECEPTOR_ACTIVITY
## GO_LIPID_TRANSPORTER_ACTIVITY | GO_TRIGLYCERIDE_LIPASE_ACTIVITY
## GO_AMINE_BINDING | GO_STRUCTURAL_CONSTITUENT_OF_MUSCLE
## GO_NEUROPEPTIDE_RECEPTOR_ACTIVITY | GO_WIDE_PORE_CHANNEL_ACTIVITY
## GO_CATION_TRANSPORTING_ATPASE_ACTIVITY | GO_LIPASE_ACTIVITY
## 
## ** Contrast BasalvsML **
## GO_G_PROTEIN_COUPLED_RECEPTOR_ACTIVITY | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_KINASE_ACTIVITY
## GO_STRUCTURAL_CONSTITUENT_OF_MUSCLE | GO_VOLTAGE_GATED_SODIUM_CHANNEL_ACTIVITY
## GO_CORECEPTOR_ACTIVITY | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_TYROSINE_KINASE_ACTIVITY
## GO_LIPID_TRANSPORTER_ACTIVITY | GO_SULFOTRANSFERASE_ACTIVITY
## GO_CATION_TRANSPORTING_ATPASE_ACTIVITY | GO_PEPTIDE_RECEPTOR_ACTIVITY
## 
## ** Contrast LPvsML **
## GO_MANNOSE_BINDING | GO_PHOSPHORIC_DIESTER_HYDROLASE_ACTIVITY
## GO_BETA_1_3_GALACTOSYLTRANSFERASE_ACTIVITY | GO_COMPLEMENT_BINDING
## GO_ALDEHYDE_DEHYDROGENASE_NAD_ACTIVITY | GO_MANNOSIDASE_ACTIVITY
## GO_LIGASE_ACTIVITY_FORMING_CARBON_NITROGEN_BONDS | GO_CARBOHYDRATE_PHOSPHATASE_ACTIVITY
## GO_LIPASE_ACTIVITY | GO_PEPTIDE_RECEPTOR_ACTIVITY
## 
## ** Comparison analysis ** 
## GO_STRUCTURAL_CONSTITUENT_OF_MUSCLE | GO_LIPID_TRANSPORTER_ACTIVITY
## GO_CATION_TRANSPORTING_ATPASE_ACTIVITY | GO_CHEMOREPELLENT_ACTIVITY
## GO_HEPARAN_SULFATE_PROTEOGLYCAN_BINDING | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_TYROSINE_KINASE_ACTIVITY
## GO_LIPASE_ACTIVITY | GO_PEPTIDE_RECEPTOR_ACTIVITY
## GO_CORECEPTOR_ACTIVITY | GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_KINASE_ACTIVITY
## 
## **** Top 10 gene sets in the KEGG Pathways collection **** 
## ** Contrast BasalvsLP **
## Collecting duct acid secretion | alpha-Linolenic acid metabolism
## Synaptic vesicle cycle | Hepatitis C
## Vascular smooth muscle contraction | Rheumatoid arthritis
## cGMP-PKG signaling pathway | Axon guidance
## Progesterone-mediated oocyte maturation | Arrhythmogenic right ventricular cardiomyopathy (ARVC)
## 
## ** Contrast BasalvsML **
## Collecting duct acid secretion | Synaptic vesicle cycle
## Other glycan degradation | Axon guidance
## Arrhythmogenic right ventricular cardiomyopathy (ARVC) | Glycerophospholipid metabolism
## Lysosome | Vascular smooth muscle contraction
## Protein digestion and absorption | Oxytocin signaling pathway
## 
## ** Contrast LPvsML **
## Glycosylphosphatidylinositol(GPI)-anchor biosynthesis | Histidine metabolism
## Drug metabolism - cytochrome P450 | PI3K-Akt signaling pathway
## Proteasome | Sulfur metabolism
## Renin-angiotensin system | Nitrogen metabolism
## Tyrosine metabolism | Systemic lupus erythematosus
## 
## ** Comparison analysis ** 
## Collecting duct acid secretion | Synaptic vesicle cycle
## Vascular smooth muscle contraction | Axon guidance
## Arrhythmogenic right ventricular cardiomyopathy (ARVC) | Oxytocin signaling pathway
## Lysosome | Adrenergic signaling in cardiomyocytes
## Linoleic acid metabolism | cGMP-PKG signaling pathway

The comparative analysis allows researchers to estimate the significance of a gene set across multiple experimental contrasts. This analysis helps in the identification of biological processes that are perturbed by multiple experimental conditions simultaneously. This experiment is the RNA-seq equivalent of Lim et al. (2010) (Lim et al. 2010), who used Illumina microarrays to study the same cell populations (see later), so it makes sense to observe the LIM gene signatures amongst the top ranked c2 gene signatures in both the individual contrasts and comparative results.

Another way of exploring the EGSEA results is to retrieve the top ranked $N$ sets in each collection and contrast using the method topSets. For example, the top 10 gene sets in the c2 collection for the comparative analysis can be retrieved as follows:

topSets(gsa, gs.label="c2", contrast = "comparison", names.only=TRUE)

## Extracting the top gene sets of the collection 
## c2 Curated Gene Sets for the contrast comparison
##  Sorted by med.rank

##  [1] "LIM_MAMMARY_LUMINAL_MATURE_DN"                  
##  [2] "LIM_MAMMARY_STEM_CELL_DN"                       
##  [3] "NAKAYAMA_SOFT_TISSUE_TUMORS_PCA2_UP"            
##  [4] "LIM_MAMMARY_LUMINAL_MATURE_UP"                  
##  [5] "COLDREN_GEFITINIB_RESISTANCE_DN"                
##  [6] "LIM_MAMMARY_STEM_CELL_UP"                       
##  [7] "CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_UP"
##  [8] "LIM_MAMMARY_LUMINAL_PROGENITOR_UP"              
##  [9] "BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_DN"  
## [10] "MIKKELSEN_IPS_WITH_HCP_H3K27ME3"

The gene sets are ordered based on the med.rank score since it has been selected when egsea was invoked above. When the argument names.only is set to FALSE, additional information is displayed for each gene set including gene set annotation, the EGSEA scores and the individual rankings by each base method.

As expected, gene sets retrieved by EGSEA included the LIM gene sets (Lim et al. 2010) that were derived from microarray profiles of analagous mammary cell populations (sets 1, 2, 4, 6 and 8) as well as those derived from populations with similar origin (sets 7 and 9) and behaviour or characteristics (sets 5 and 10).

Next, topSets can be used to search for the ranking of gene sets of interest based on different EGSEA scores as well as the rankings of individual methods. For example, the ranking of the six LIM gene sets from the c2 collection can be displayed based on the med.rank as follows:

t = topSets(gsa, contrast = "comparison",
             names.only=FALSE, number = Inf, verbose = FALSE)
t[grep("LIM_", rownames(t)), c("p.adj", "Rank", "med.rank", "vote.rank")]

##                                           p.adj Rank med.rank vote.rank
## LIM_MAMMARY_LUMINAL_MATURE_DN      1.646053e-29    1       36         5
## LIM_MAMMARY_STEM_CELL_DN           6.082053e-43    2       37         5
## LIM_MAMMARY_LUMINAL_MATURE_UP      2.469061e-22    4       92         5
## LIM_MAMMARY_STEM_CELL_UP          3.154132e-103    6      134         5
## LIM_MAMMARY_LUMINAL_PROGENITOR_UP  3.871536e-30    8      180         5
## LIM_MAMMARY_LUMINAL_PROGENITOR_DN  2.033005e-06  179      636       115

Five of the LIM gene sets are ranked in the top 10 by EGSEA. The values shown in the median rank (med.rank) column indicate that individual methods can assign much lower ranks to these sets. Prioritisation of these gene sets demonstrates the benefit of an ensemble approach. Similarly, we can find the top 10 pathways in the KEGG collection from the ensemble analysis for the Basal versus LP contrast and the comparative analysis as follows:

topSets(gsa, gs.label="kegg", contrast="BasalvsLP", sort.by="med.rank")

## Extracting the top gene sets of the collection 
## KEGG Pathways for the contrast BasalvsLP
##  Sorted by med.rank

##  [1] "Collecting duct acid secretion"                        
##  [2] "alpha-Linolenic acid metabolism"                       
##  [3] "Synaptic vesicle cycle"                                
##  [4] "Hepatitis C"                                           
##  [5] "Vascular smooth muscle contraction"                    
##  [6] "Rheumatoid arthritis"                                  
##  [7] "cGMP-PKG signaling pathway"                            
##  [8] "Axon guidance"                                         
##  [9] "Progesterone-mediated oocyte maturation"               
## [10] "Arrhythmogenic right ventricular cardiomyopathy (ARVC)"

topSets(gsa, gs.label="kegg", contrast="comparison", sort.by="med.rank")

## Extracting the top gene sets of the collection 
## KEGG Pathways for the contrast comparison
##  Sorted by med.rank

##  [1] "Collecting duct acid secretion"                        
##  [2] "Synaptic vesicle cycle"                                
##  [3] "Vascular smooth muscle contraction"                    
##  [4] "Axon guidance"                                         
##  [5] "Arrhythmogenic right ventricular cardiomyopathy (ARVC)"
##  [6] "Oxytocin signaling pathway"                            
##  [7] "Lysosome"                                              
##  [8] "Adrenergic signaling in cardiomyocytes"                
##  [9] "Linoleic acid metabolism"                              
## [10] "cGMP-PKG signaling pathway"

EGSEA highlights many pathways with known importance in the mammary gland such as those associated with distinct roles in lactation like basal cell contraction (Vascular smooth muscle contraction and Oxytocin signalling pathway) and milk production and secretion from luminal lineage cells (Collecting duct acid secretion, Synaptic vesicle cycle and Lysosome).

3.1.2.4.2 Visualizing results at the gene set level

Graphical representation of gene expression patterns within and between gene sets is an essential part of communicating results of an analysis to collaborators and other researchers. EGSEA enables users to explore the elements of a gene set via a heatmap using the plotHeatmap method. For example, the log-fold-changes of the genes of LIM MAMMARY STEM CELL UP and LIM MAMMARY STEM CELL DN can be visualized across all contrasts using heatmaps (Figure 2) generated by the code below.

plotHeatmap(gsa, gene.set="LIM_MAMMARY_STEM_CELL_UP", gs.label="c2",
         contrast = "comparison", file.name = "hm_cmp_LIM_MAMMARY_STEM_CELL_UP", format="png")

## Generating heatmap for LIM_MAMMARY_STEM_CELL_UP from the collection 
## c2 Curated Gene Sets and for the contrast comparison

plotHeatmap(gsa, gene.set="LIM_MAMMARY_STEM_CELL_DN", gs.label="c2",
         contrast = "comparison", file.name = "hm_cmp_LIM_MAMMARY_STEM_CELL_DN", format="png")

## Generating heatmap for LIM_MAMMARY_STEM_CELL_DN from the collection 
## c2 Curated Gene Sets and for the contrast comparison

Heatmap of log-fold-changes for genes in the LIM MAMMARY STEM CELL UP gene set across three experimental comparisons.

When using plotHeatmap, gene.set value must match the name returned from the topSets method. The rows of the heatmap represent the genes that are mapped to the set and the columns represent the experimental contrasts. The heatmap colour-scale ranges from down-regulated (blue) to up-regulated (red) while the row labels are coloured in green when the genes are statistically significant in the DE analysis (i.e. FDR $\leq$ 0.05 in at least one contrast, Figure 2). Heatmaps can be generated for individual comparisons by changing the contrast argument of plotHeatmap. The plotHeatmap method also generates a CSV file that includes the DE analysis results from the limma::topTable for all expressed genes in the selected gene set and for each contrast (in the case of contrast = "comparison"). This file can be used to create customised plots using other R/Bioconductor packages.

In addition to heatmaps, pathway maps can be generated for the KEGG gene sets using the plotPathway method using functionality from the pathview package (Luo and Brouwer 2013). For example, the first KEGG signalling pathway retrieved for the contrast BasalvsLP is Vascular smooth muscle contraction and can be visualized as follows:

plotPathway(gsa, gene.set = "Vascular smooth muscle contraction", 
             contrast = "BasalvsLP", gs.label = "kegg", 
             file.name = "Vascular_smooth_muscle_contraction")

Pathway map for Vascular smooth muscle contraction (KEGG pathway ID mmu04270) with log-fold-changes from the Basal vs LP contrast.

Pathway components are coloured based on the gene-specific log-fold-changes as calculated in the limma DE analysis (Figure 3). Similarly, a comparative map can be generated for a given pathway across all contrasts.

plotPathway(gsa, gene.set = "Vascular smooth muscle contraction", 
             contrast = "comparison", gs.label = "kegg", 
             file.name = "Vascular_smooth_muscle_contraction_cmp")

Pathway map for Vascular smooth muscle contraction (KEGG pathway ID mmu04270) with log-fold-changes across three experimental contrasts shown for each gene in the same order left to right that they appear in the contrast matrix (i.e. Basal vs LP, Basal vs ML and LP vs ML).

The comparative pathway map shows the log-fold-changes for each gene in each contrast by dividing the gene nodes on the map into multiple columns, one for each contrast (Figure 4).

3.1.2.4.3 Visualizing results at the experiment level

Since EGSEA combines the results from multiple gene set testing methods, it can be interesting to compare how different base methods rank a given gene set collection for a selected contrast. The command below generates a multi-dimensional scaling (MDS) plot for the ranking of gene sets across all the base methods used (Figure 5). Methods that rank gene sets similarly will appear closer together in this plot and from these plots we see that certain methods tend to cluster together across different gene set collections (Figure 5).

plotMethods(gsa, gs.label = "c2", contrast = "BasalvsLP", 
         file.name = "mds_c2_BasalvsLP", format="png")

## Generating methods plot for the collection 
## c2 Curated Gene Sets and for the contrast BasalvsLP

## [1] "png " "  2 "

plotMethods(gsa, gs.label = "c5BP", contrast = "BasalvsLP", 
         file.name = "mds_c5_BasalvsLP", format="png")

## Generating methods plot for the collection 
## c5 GO Gene Sets (BP) and for the contrast BasalvsLP

## [1] "png " "  2 "

Multi-dimensional scaling (MDS) plot for the ranking of the c2 and c5 gene sets on the Basal vs LP contrast for multiple gene set testing methods.

The significance of each gene set in a given collection for a selected contrast can be visualized using EGSEA’s summary plot. The summary plot visualizes the gene sets as bubbles based on the $-\log_{10}{(p\mbox{-}value)}$ (X-axis) and the average absolute log fold-change of the set genes (Y-axis). The sets that appear towards the top-right corner of the plots are assumed to be biologically relevant. EGSEA generates two types of summary plots: the directional summary plot, which colours the bubbles based on the regulation direction of the gene set (the direction of the majority of genes), and the ranking summary plot, which colours the bubbles based on the gene set ranking in a given collection (according to the sort.by argument). The bubble size is based on the EGSEA Significance Score in the former plot and based on the gene set size in the latter. For example, the summary plots of the KEGG pathways for the LP vs ML contrast show few significant pathways (Figure 6). The blue colour labels on the directional plot represents gene sets that do not appear in the top $10$ gene sets that are selected based on the sort.by argument while their EGSEA Significance Scores are among the top $5$ in the entire collection. The gene set IDs and more information about each set can be found in the EGSEA HTML report generated later.

plotSummary(gsa, gs.label = 3, contrast = 3, 
         file.name = "summary_kegg_LPvsML", format="png")

## Generating Summary plots for the collection 
## c5 GO Gene Sets (CC) and for the contrast LPvsML

Summary plots (directional (left) and ranking (right)) of the significance of all gene sets in the KEGG collection for the LP vs ML contrast.

By default, plotSummary uses a gene set’s p.adj score for the X-axis. This behaviour can be easily modified by assigning any of the available sort.by scores into the parameter x.axis, for example, med.rank can be used to create an EGSEA summary plot (Figure 7, left panel) as follows:

plotSummary(gsa, gs.label = 1, contrast = 3, 
         file.name = "summary_c2_LPvsML", 
         x.axis = "med.rank", format="png")

## Generating Summary plots for the collection 
## c2 Curated Gene Sets and for the contrast LPvsML

The summary plot tends to become very cluttered when the size of the gene set collection is very large (Figure 7, left panel). Therefore, the parameter x.cutoff of plotSummary can be used to visualize just the significant gene sets rather than plotting the entire gene set collection, for example (Figure 7, right panel):

plotSummary(gsa, gs.label = 1, contrast = 3, 
         file.name = "summary_sig_c2_LPvsML", 
         x.axis = "med.rank", x.cutoff=300, format="png")

## Generating Summary plots for the collection 
## c2 Curated Gene Sets and for the contrast LPvsML

Summary plots (directional (left) and filtered directional (right)) of the significance of selected gene sets in the c2 collection for the LP vs ML contrast.

Comparative summary plots can be also generated to compare the significance of gene sets between two contrasts, for example, the comparison between Basal vs LP and Basal vs ML (Figure 8) shows that most of the KEGG pathways are regulated in the same direction with a few pathways that are regulated in opposite directions (purple coloured bubbles in Figure 8, left panel). Such figures can be generated using the plotSummary method as follows:

plotSummary(gsa, gs.label = "kegg", contrast = c(1,2), 
         file.name = "summary_kegg_1vs2", format="png")

## Generating Summary plots for the collection 
## KEGG Pathways and for the comparison BasalvsLP vs BasalvsML

Summary plots (directional (left) and ranking (right)) of the significance of all gene sets in the KEGG collection for the comparison of the contrasts: Basal vs LP and Basal vs ML.

The plotSummary method has two useful parameters: (i) use.names that can be used to display gene set names instead of gene set IDs and (ii) interactive that can be used to generate an interactive HTML plot.

The c5 collection of MSigDB and the Gene Ontology collection of GeneSetDB contain Gene Ontology (GO) terms. These collections are meant to be non-redundant, containing only a small subset of the entire GO and visualizing how these terms are related to each other can be informative. EGSEA utilizes functionality from the topGO package (Alexa and Rahnenfuhrer 2016) to generate GO graphs for the significant biological processes (BPs), cellular compartments (CCs) and molecular functions (MFs). The plotGOGraph method can generate such a display (Figure 9) as follows:

plotGOGraph(gsa, gs.label="c5BP", contrast = 1, file.name="BasalvsLP-c5BP-top-", format="png")

## Generating GO Graphs for the collection c5 GO Gene Sets (BP)
##  and for the contrast BasalvsLP based on the med.rank

## 
## Building most specific GOs .....

## Loading required package: org.Mm.eg.db

##  ( 10451 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 14454 GO terms and 34612 relations. )

## 
## Annotating nodes ...............

##  ( 13025 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 3619 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 4090 GO terms and 5185 relations. )

## 
## Annotating nodes ...............

##  ( 12984 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 1505 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 1769 GO terms and 3525 relations. )

## 
## Annotating nodes ...............

##  ( 13178 genes annotated to the GO terms. )

## Loading required package: Rgraphviz

## Loading required package: grid

## 
## Attaching package: 'grid'

## The following object is masked from 'package:topGO':
## 
##     depth

## 
## Attaching package: 'Rgraphviz'

## The following objects are masked from 'package:IRanges':
## 
##     from, to

## The following objects are masked from 'package:S4Vectors':
## 
##     from, to

plotGOGraph(gsa, gs.label="c5CC", contrast = 1, file.name="BasalvsLP-c5CC-top-", format="png")

## Generating GO Graphs for the collection c5 GO Gene Sets (CC)
##  and for the contrast BasalvsLP based on the med.rank

## 
## Building most specific GOs .....

##  ( 10451 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 14454 GO terms and 34612 relations. )

## 
## Annotating nodes ...............

##  ( 13025 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 3619 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 4090 GO terms and 5185 relations. )

## 
## Annotating nodes ...............

##  ( 12984 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 1505 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 1769 GO terms and 3525 relations. )

## 
## Annotating nodes ...............

##  ( 13178 genes annotated to the GO terms. )

## png 
##   2

GO graphs of the top significant GO terms from the c5 gene set collection (cellular compartment (left) and biological process (right)) for the contrast Basal vs LP.

The GO graphs are coloured based on the values of the argument sort.by, which in this instance was taken as med.rank by default since this was selected when EGSEA was invoked. The top five most significant GO terms are highlighted by default in each GO category (MF, CC or BP). More terms can be displayed by changing the value of the parameter noSig. However, this might generate very complicated and unresolved graphs. The colour of the nodes vary between red (most significant) and yellow (least significant). The values of the sort.by scoring function are scaled between 0 and 1 to generate these graphs.

Another way of visualizing results at the experiment level is via the summary bar plot. The method plotBars can be used to generate a bar plot for the top $N$ gene sets in an individual collection for a particular contrast or from a comparative analysis across multiple contrasts. By default, the $-\log_{10}(p.adj)$ values are used to plot the bars, the top 20 gene sets are displayed and the gene sets are selected and ordered based on the parameter sort.by when the EGSEA analysis was run. The parameters bar.vals, number and sort.by of plotBars can be assigned to customize the bar plot. For example, the top 20 gene sets of the comparative analysis carried out on the c2 collection of MSigDB can be visualized using a bar plot (Figure 10) as follows:

plotBars(gsa, gs.label = "c2", contrast = "comparison", file.name="comparison-c2-bars", format="png")

## Generating a bar plot for the collection c2 Curated Gene Sets 
##  and the contrast comparison

## png 
##   2

The colour of each bar is based on the regulation direction of the gene sets, i.e., red for up-regulated, blue for down-regulated and purple for neutral regulation (in the case of comparative analysis of experimental contrasts that have show opposite behaviours) (Figure 11).

$Bar plot of the -\log_{10}(p-value) of the top 20 gene sets from the comparative analysis of the c2 collection.$

Bar plot of the $-\log_{10}(p-value)$ of the top 20 gene sets from the comparative analysis of the c2 collection.

When multiple conditions are studied in the experiment, the summary heatmap in EGSEA is a desirable way of visualization. The method plotSummaryHeatmaps generates a heatmap of the top $N$ gene sets in the comparative analysis across all experimental conditions. By default, 20 gene sets are selected based on the parameter sort.by when the analysis was invoked and the heat map values are the average log-fold changes at the set level for the genes that are regulated in the same direction as the set regulation direction, i.e., avg.logfc.dir. The parameters number, sort.by and hm.vals of the plotSummaryHeatmaps can be used to customize the summary heatmap. Additionally, the parameter show.vals can be used to display the values of a specific EGSEA score on the heatmap cells. Next, an example of the summary heatmap is generated for the MSigDB c2 collection:

plotSummaryHeatmap(gsa, gs.label="c2", hm.vals = "avg.logfc.dir",
         file.name="summary_heatmaps_c2", format="png")

## Generating summary heatmap for the collection c2 Curated Gene Sets
## sort.by: med.rank, hm.vals: avg.logfc.dir, show.vals:

plotSummaryHeatmap(gsa, gs.label="kegg", hm.vals = "avg.logfc.dir",
         file.name="summary_heatmaps_kegg", format="png")

## Generating summary heatmap for the collection KEGG Pathways
## sort.by: med.rank, hm.vals: avg.logfc.dir, show.vals:

Summary heat maps for the top 20 gene sets from the c2 (left) and KEGG collections obtained from the EGSEA comparative analysis.

The heatmap view at both the gene set and summary level and the summary level bar plots are useful summaries to include in publications to highlight the results from gene set testing. The top differentially expressed genes from each contrast can be accessed from the EGSEAResults object using the limmaTopTable method.

t = limmaTopTable(gsa, contrast=1)
head(t)

##        ENTREZID   SYMBOL CHR     logFC  AveExpr         t      P.Value
## 19253     19253   Ptpn18   1 -5.632846 4.129760 -34.54340 5.871553e-10
## 16324     16324    Inhbb   1 -4.792255 6.462247 -33.22206 7.993374e-10
## 53624     53624    Cldn7  11 -5.514605 6.296762 -40.24283 1.752279e-10
## 218518   218518 Marveld2  13 -5.139809 4.931334 -34.78540 5.556030e-10
## 12759     12759      Clu  14 -5.442877 8.857907 -40.97191 1.519884e-10
## 70350     70350    Basp1  15 -6.073297 5.248349 -34.29589 6.215256e-10
##           adj.P.Val        B
## 19253  9.623639e-07 13.21229
## 16324  9.623639e-07 13.34734
## 53624  9.623639e-07 14.52532
## 218518 9.623639e-07 13.46618
## 12759  9.623639e-07 14.74741
## 70350  9.623639e-07 13.34300

3.1.2.4.4 Creating an HTML report of the results

To generate an EGSEA HTML report for this dataset, you can either set report=TRUE when you invoke egsea or use the S4 method generateReport as follows:

generateReport(gsa, number = 20, report.dir="./mam-rnaseq-egsea-report")

## EGSEA HTML report is being generated ...

## ##------ Tue Jul 25 12:27:10 2017 ------##

## Number of used cores has changed to 2
## in order to avoid CPU overloading.

## Report pages and figures are being generated for the c2 collection ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

## Report pages and figures are being generated for the c5BP collection ...

##    GO graphs are being generated for top-ranked GO terms
## based on med.rank ...

## 
## Building most specific GOs .....

##  ( 10451 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 14454 GO terms and 34612 relations. )

## 
## Annotating nodes ...............

##  ( 13025 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 3619 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 4090 GO terms and 5185 relations. )

## 
## Annotating nodes ...............

##  ( 12984 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 1505 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 1769 GO terms and 3525 relations. )

## 
## Annotating nodes ...............

##  ( 13178 genes annotated to the GO terms. )

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    GO graphs are being generated for top-ranked COMPARISON
## GO terms based on med.rank ...

## Report pages and figures are being generated for the c5CC collection ...

##    GO graphs are being generated for top-ranked GO terms
## based on med.rank ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    GO graphs are being generated for top-ranked COMPARISON
## GO terms based on med.rank ...

## Report pages and figures are being generated for the c5MF collection ...

##    GO graphs are being generated for top-ranked GO terms
## based on med.rank ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    GO graphs are being generated for top-ranked COMPARISON
## GO terms based on med.rank ...

## Report pages and figures are being generated for the kegg collection ...

##    Pathway maps are being generated for top-ranked 
##  pathways based on logFC ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    Pathway maps are being generated for top-ranked comparative
## pathways based on logFC ...

## ##------ Tue Jul 25 12:37:49 2017 ------##

## EGSEA report generation took 638.799 seconds.

## EGSEA report has been generated.

The EGSEA report generated for this dataset is available online at http://bioinf.wehi.edu.au/folders/EGSEA/mam-rnaseq-egsea-report/index.html (Figure 12). The HTML report is a convenient means of organising all of the results generated up to now, from the individual tables to the gene set level heatmaps and pathway maps to the summary level plots. It can easily be shared with collaborators to allow them to explore their results more fully. Interactive tables of results via the DT package (https://CRAN.R-project.org/package=DT) and summary plots from plotly (https://CRAN.R-project.org/package=plotly) are integrated into the report using htmlwidgets (https://CRAN.R-project.org/package=htmlwidgets) and can be added when interactive = TRUE in the command above. This setting significantly increases both the run time and size of the final report due to the large number of gene sets in most collections.

EGSEA HTML report main page.

This example completes our overview of EGSEA’s gene set testing and plotting capabilities for RNA-seq data. The same functionality can also be used for gene set testing of microarray data. Readers can further refer to the vignette and/or the reference manual of the EGSEA package available at the Bioconductor website for more technical details on each of the above methods and classes.

3.2 Analysis of microarray data with EGSEA

The second data set analysed in this workflow comes from Lim et al. (2010) (Lim et al. 2010) and is the microarray equivalent of the RNA-seq data analysed above. Support for microarray data is a new feature in EGSEA, and in this example, we show an express route for analysis according to the steps shown in Figure 1, from selecting gene sets and building indexes, to configuring EGSEA, testing and reporting the results. But first the data must be appropriately preprocessed to get it in shape for an EGSEA analysis and to do this we make use of functions available in limma.

3.2.1 Reading, preprocessing and Normalisation of microarray data

To analyse this data set, we begin by unzipping the files downloaded from http://bioinf.wehi.edu.au/EGSEA/arraydata.zip into the current working directory. Illumina BeadArray data can be read in directly using the readIDAT and readBGX functions from the illuminaio package (Smith et al. 2013). A more convenient way is via the read.idat function in limma which uses these illuminaio functions and outputs the data as an EListRaw object for further processing.

library(limma)
url = "http://bioinf.wehi.edu.au/EGSEA/arraydata.zip"
utils::download.file(url, destfile="arraydata.zip", mode="wb")
utils::unzip("arraydata.zip", exdir = ".")
targets = read.delim("targets.txt", header=TRUE, sep=" ")
data = read.idat(as.character(targets$File), 
                   bgxfile="GPL6887_MouseWG-6_V2_0_R0_11278593_A.bgx",
                   annotation=c("Entrez_Gene_ID","Symbol", "Chromosome"))

## Reading manifest file GPL6887_MouseWG-6_V2_0_R0_11278593_A.bgx ... Done
##   4481850214_B_Grn.idat ... Done
##   4481850214_C_Grn.idat ... Done
##   4481850214_D_Grn.idat ... Done
##   4481850214_F_Grn.idat ... Done
##   4481850187_A_Grn.idat ... Done
##   4481850187_B_Grn.idat ... Done
##   4481850187_D_Grn.idat ... Done
##   4481850187_E_Grn.idat ... Done
##   4481850187_F_Grn.idat ... Done
##   4466975058_A_Grn.idat ... Done
##   4466975058_B_Grn.idat ... Done
##   4466975058_C_Grn.idat ... Done
##   4466975058_D_Grn.idat ... Done
##   4466975058_E_Grn.idat ... Done
##   4466975058_F_Grn.idat ... Done
## Finished reading data.

data$other$Detection = detectionPValues(data)
data$targets = targets
colnames(data) = targets$Sample

Next the neqc function in limma is used to carry out normexp background correction and quantile normalisation on the raw intensity values using negative control probes (Shi, Oshlack, and Smyth 2010) followed by $\log_2$-transformation of the normalised intensity values and removal of the control probes.

data = neqc(data)

We then filter out probes that are consistently non-expressed or lowly expressed throughout all samples as they are uninformative in downstream analysis. We require probes to have a detection $p$-values of less than 0.05 in at least 5 samples (the number of samples within each group) which retains 21,643 probes for further analysis.

table(targets$Celltype)

## 
## Basal    LP    ML 
##     5     5     5

keep.exprs = rowSums(data$other$Detection<0.05)>=5
table(keep.exprs)

## keep.exprs
## FALSE  TRUE 
## 23638 21643

data = data[keep.exprs,]
dim(data)

## [1] 21643    15

3.2.2 Preparing microarray data for EGSEA testing

As before, we need to set up an appropriate linear model (Smyth 2004) and contrasts matrix to look for differences between the Basal and LP, Basal and ML and LP and ML populations. A batch term is included in the linear model to account for differences in expression that are attributable to the day the experiment was run.

head(data$genes)

##       Probe_Id Array_Address_Id Entrez_Gene_ID        Symbol Chromosome
## 3 ILMN_1219601          2030280           <NA> C920011N12Rik           
## 4 ILMN_1252621          1980164         101142 2700050P07Rik          6
## 6 ILMN_3162407          6220026           <NA>         Zfp36           
## 7 ILMN_2514723          2030072           <NA> 1110067B18Rik           
## 8 ILMN_2692952          6040743         329831 4833436C18Rik          4
## 9 ILMN_1257952          7160091           <NA> B930060K05Rik

sum(is.na(data$genes$Entrez_Gene_ID))

## [1] 11535

data1 = data[!is.na(data$genes$Entrez_Gene_ID), ]
dim(data1)

## [1] 10108    15

expr = data1$E
group = as.factor(data1$targets$Celltype)
probe.annot = data1$genes[, 2:4]
head(probe.annot)

##    Array_Address_Id Entrez_Gene_ID        Symbol
## 4           1980164         101142 2700050P07Rik
## 8           6040743         329831 4833436C18Rik
## 15          7380500         230936         Phf13
## 16           130070         230936         Phf13
## 22          5260661         384763        Zfp667
## 32          5360474          67073        Pi4k2b

head(data1$targets)

##                      File Sample Celltype Time Experiment
## 2-2 4481850214_B_Grn.idat    2-2       ML  At1          1
## 3-3 4481850214_C_Grn.idat    3-3       LP  At1          1
## 4-4 4481850214_D_Grn.idat    4-4    Basal  At1          1
## 6-7 4481850214_F_Grn.idat    6-7       ML  At2          1
## 7-8 4481850187_A_Grn.idat    7-8       LP  At2          1
## 8-9 4481850187_B_Grn.idat    8-9    Basal  At2          1

experiment = as.character(data1$targets$Experiment)
design = model.matrix(~0 + group + experiment)
colnames(design) = gsub("group", "", colnames(design))
design

##    Basal LP ML experiment2
## 1      0  0  1           0
## 2      0  1  0           0
## 3      1  0  0           0
## 4      0  0  1           0
## 5      0  1  0           0
## 6      1  0  0           0
## 7      0  0  1           0
## 8      0  1  0           0
## 9      1  0  0           0
## 10     0  0  1           1
## 11     0  1  0           1
## 12     1  0  0           1
## 13     1  0  0           1
## 14     0  0  1           1
## 15     0  1  0           1
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
## 
## attr(,"contrasts")$experiment
## [1] "contr.treatment"

contr.matrix = makeContrasts(
         BasalvsLP = Basal-LP,
         BasalvsML = Basal-ML,
         LPvsML = LP-ML,
         levels = colnames(design))
contr.matrix

##              Contrasts
## Levels        BasalvsLP BasalvsML LPvsML
##   Basal               1         1      0
##   LP                 -1         0      1
##   ML                  0        -1     -1
##   experiment2         0         0      0

3.2.2.1 1. Creating gene set collection indexes

As before, we will extract the mouse c2, c5 and KEGG gene signature collections from the EGSEAdata package and build indexes that link between the genes in each signature and the rows of our expression matrix.

library(EGSEA)
library(EGSEAdata)
gs.annots = buildIdx(entrezIDs=unique(probe.annot[, 2]), 
             species="mouse", 
             msigdb.gsets=c("c2", "c5"), go.part = TRUE)

## [1] "Loading MSigDB Gene Sets ... "
## [1] "Loaded gene sets for the collection c2 ..."
## [1] "Indexed the collection c2 ..."
## [1] "Created annotation for the collection c2 ..."
## [1] "Loaded gene sets for the collection c5 ..."
## [1] "Indexed the collection c5 ..."
## [1] "Created annotation for the collection c5 ..."
## MSigDB c5 gene set collection has been partitioned into 
## c5BP, c5CC, c5MF
## [1] "Building KEGG pathways annotation object ... "

names(gs.annots)

## [1] "c2"   "c5BP" "c5CC" "c5MF" "kegg"

3.2.2.2 2. Configuring and 3. Testing with EGSEA

The same 11 base methods used previously in the RNA-seq analysis were selected for the ensemble testing of the microarray data using the function egsea.ma. Gene sets were again prioritised by their median rank across the 11 methods.

baseMethods = egsea.base()[-2]
baseMethods

##  [1] "camera"     "safe"       "gage"       "padog"      "plage"     
##  [6] "zscore"     "gsva"       "ssgsea"     "globaltest" "ora"       
## [11] "fry"

gsam = egsea.ma(expr=expr, group=group, 
            probe.annot = probe.annot,
            design = design,
         contrasts=contr.matrix,  
         gs.annots=gs.annots,
         baseGSEAs=baseMethods, sort.by="med.rank",
         num.threads = 8, report = FALSE)

## EGSEA analysis has started

## ##------ Tue Jul 25 12:39:28 2017 ------##

## Log fold changes are estimated using limma package ...

## limma DE analysis is carried out ...

## Number of used cores has changed to 2
## in order to avoid CPU overloading.

## EGSEA is running on the provided data and c2 collection

##

## EGSEA is running on the provided data and c5BP collection

##

## EGSEA is running on the provided data and c5CC collection

##

## EGSEA is running on the provided data and c5MF collection

##

## EGSEA is running on the provided data and kegg collection

##

## ##------ Tue Jul 25 12:50:12 2017 ------##

## EGSEA analysis took 644.247 seconds.

## EGSEA analysis has completed

3.2.2.3 4. Reporting EGSEA results

An HTML report that includes each of the gene set level and summary level plots shown individually for the RNA-seq analysis was generated. We complete our analysis by displaying the top ranked sets for the c2 collection from a comparative analysis across all contrasts.

generateReport(gsam, number = 20, report.dir="./mam-ma-egsea-report")

## EGSEA HTML report is being generated ...

## ##------ Tue Jul 25 12:50:12 2017 ------##

## Number of used cores has changed to 2
## in order to avoid CPU overloading.

## Report pages and figures are being generated for the c2 collection ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

## Report pages and figures are being generated for the c5BP collection ...

##    GO graphs are being generated for top-ranked GO terms
## based on med.rank ...

## 
## Building most specific GOs .....

##  ( 8516 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 12560 GO terms and 29943 relations. )

## 
## Annotating nodes ...............

##  ( 6823 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 2883 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 3366 GO terms and 4286 relations. )

## 
## Annotating nodes ...............

##  ( 6803 genes annotated to the GO terms. )

## 
## Building most specific GOs .....

##  ( 1292 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 1563 GO terms and 3144 relations. )

## 
## Annotating nodes ...............

##  ( 6908 genes annotated to the GO terms. )

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    GO graphs are being generated for top-ranked COMPARISON
## GO terms based on med.rank ...

## Report pages and figures are being generated for the c5CC collection ...

##    GO graphs are being generated for top-ranked GO terms
## based on med.rank ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    GO graphs are being generated for top-ranked COMPARISON
## GO terms based on med.rank ...

## Report pages and figures are being generated for the c5MF collection ...

##    GO graphs are being generated for top-ranked GO terms
## based on med.rank ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    GO graphs are being generated for top-ranked COMPARISON
## GO terms based on med.rank ...

## Report pages and figures are being generated for the kegg collection ...

##    Pathway maps are being generated for top-ranked 
##  pathways based on logFC ...

##    Heat maps are being generated for top-ranked gene sets 
## based on logFC ...

##    Summary plots are being generated ...

##    Comparison summary plots are being generated  ...

##    There are more than 2 contrasts!

##    Pathway maps are being generated for top-ranked comparative
## pathways based on logFC ...

## ##------ Tue Jul 25 12:58:18 2017 ------##

## EGSEA report generation took 485.601 seconds.

## EGSEA report has been generated.

topSets(gsam, gs.label="c2", contrast = "comparison", names.only=TRUE)

## Extracting the top gene sets of the collection 
## c2 Curated Gene Sets for the contrast comparison
##  Sorted by med.rank

##  [1] "LIM_MAMMARY_STEM_CELL_UP"                       
##  [2] "LIM_MAMMARY_STEM_CELL_DN"                       
##  [3] "LIM_MAMMARY_LUMINAL_MATURE_DN"                  
##  [4] "CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_DN"
##  [5] "LIU_PROSTATE_CANCER_DN"                         
##  [6] "VECCHI_GASTRIC_CANCER_ADVANCED_VS_EARLY_UP"     
##  [7] "WANG_SMARCE1_TARGETS_UP"                        
##  [8] "PETRETTO_CARDIAC_HYPERTROPHY"                   
##  [9] "HAN_JNK_SINGALING_DN"                           
## [10] "LI_WILMS_TUMOR_VS_FETAL_KIDNEY_2_DN"

The EGSEA report generated for this dataset is available online at http://bioinf.wehi.edu.au/folders/EGSEA/mam-ma-egsea-report/index.html. Reanalysis of this data retrieves similar c2 gene sets to those identified by analysis of RNA-seq data. These included the Lim et al. (2010) (Lim et al. 2010) gene sets as well as those derived from populations with similar cellular origin (sets 4 and 7).

4 Discussion

In this workflow article, we have demonstrated how to use the EGSEA package to combine the results obtained from many gene signature databases and multiple GSE methods to find an ensemble solution. A key benefit of an EGSEA analysis is the detailed and comprehensive HTML report that includes tables prioritising gene signatures according to the user specified analysis options, and both gene set specific and summary graphics. The approach taken by EGSEA is underpinned by the diverse range of gene set testing algorithms and plotting capabilities available within Bioconductor.

It is worth noting that users are free to choose a single GSE algorithm in their analysis rather than an ensemble across many if they prefer. In this scenario, users can still benefit from EGSEA’s comprehensive reporting capability that can be easily incorporated in their gene expression analysis workflow.

5 Software and code used

This workflow makes use of various packages from version 3.5 of the Bioconductor project, running on R version 3.4.0 or higher (R Core Team 2017). The packages used and their version numbers are listed below. The EGSEAdata package contains gene signatures used by EGSEA. Code to perform this analysis is available in the EGSEA123 package available as a Bioconductor workflow from https://www.bioconductor.org/help/workflows/.

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.21.0     org.Mm.eg.db_3.4.1   EGSEA_1.5.2         
##  [4] pathview_1.17.2      org.Hs.eg.db_3.4.1   topGO_2.29.0        
##  [7] SparseM_1.77         GO.db_3.4.1          graph_1.55.0        
## [10] AnnotationDbi_1.39.1 IRanges_2.11.12      S4Vectors_0.15.5    
## [13] gage_2.27.1          Biobase_2.37.2       BiocGenerics_0.23.0 
## [16] EGSEAdata_1.5.0      edgeR_3.19.3         limma_3.33.5        
## [19] BiocStyle_2.5.8     
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.3-2         hwriter_1.3.2           
##   [3] rprojroot_1.2            R2HTML_2.3.2            
##   [5] XVector_0.17.0           base64_2.0              
##   [7] KEGGdzPathwaysGEO_1.15.0 DT_0.2                  
##   [9] bit64_0.9-7              Glimma_1.5.0            
##  [11] KEGG.db_3.2.3            codetools_0.2-15        
##  [13] splines_3.4.1            geneplotter_1.55.0      
##  [15] knitr_1.16               shinythemes_1.1.1       
##  [17] jsonlite_1.5             annotate_1.55.0         
##  [19] png_0.1-7                globaltest_5.31.0       
##  [21] shiny_1.0.3              compiler_3.4.1          
##  [23] httr_1.2.1               backports_1.1.0         
##  [25] assertthat_0.2.0         Matrix_1.2-10           
##  [27] lazyeval_0.2.0           org.Rn.eg.db_3.4.1      
##  [29] htmltools_0.3.6          tools_3.4.1             
##  [31] bindrcpp_0.2             glue_1.1.1              
##  [33] gtable_0.2.0             dplyr_0.7.2             
##  [35] doRNG_1.6.6              Rcpp_0.12.12            
##  [37] Biostrings_2.45.3        gdata_2.18.0            
##  [39] nlme_3.1-131             iterators_1.0.8         
##  [41] stringr_1.2.0            mime_0.5                
##  [43] rngtools_1.2.4           gtools_3.5.0            
##  [45] XML_3.98-1.9             zlibbioc_1.23.0         
##  [47] scales_0.4.1             KEGGgraph_1.37.1        
##  [49] hgu133a.db_3.2.3         RColorBrewer_1.1-2      
##  [51] yaml_2.1.14              memoise_1.1.0           
##  [53] ggplot2_2.2.1            pkgmaker_0.22           
##  [55] stringi_1.1.5            RSQLite_2.0             
##  [57] GSVA_1.25.4              foreach_1.4.3           
##  [59] caTools_1.17.1           safe_3.17.0             
##  [61] GSA_1.03                 rlang_0.1.1             
##  [63] pkgconfig_2.0.1          matrixStats_0.52.2      
##  [65] bitops_1.0-6             evaluate_0.10.1         
##  [67] lattice_0.20-35          purrr_0.2.2.2           
##  [69] bindr_0.1                labeling_0.3            
##  [71] htmlwidgets_0.9          HTMLUtils_0.1.7         
##  [73] bit_1.1-12               GSEABase_1.39.0         
##  [75] plyr_1.8.4               magrittr_1.5            
##  [77] R6_2.2.2                 gplots_3.0.1            
##  [79] DBI_0.7                  survival_2.41-3         
##  [81] KEGGREST_1.17.0          RCurl_1.95-4.8          
##  [83] tibble_1.3.3             KernSmooth_2.23-15      
##  [85] plotly_4.7.0             rmarkdown_1.6           
##  [87] locfit_1.5-9.1           data.table_1.10.4       
##  [89] blob_1.1.0               metap_0.8               
##  [91] digest_0.6.12            xtable_1.8-2            
##  [93] tidyr_0.6.3              httpuv_1.3.5            
##  [95] illuminaio_0.19.0        openssl_0.9.6           
##  [97] munsell_0.4.3            viridisLite_0.2.0       
##  [99] registry_0.3             hgu133plus2.db_3.2.3    
## [101] PADOG_1.19.0

6 Author contributions

All authors were involved in writing and contributing code for the article.

7 Competing interests

The authors declare that they have no competing interests.

8 Grant information

This worked was funded by National Health and Medical Research Council (NHMRC) Fellowship GNT1104924 to MER, Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIISS.

9 ReferencesShow in New WindowClear OutputExpand/Collapse Output

trying URL ‘http://bioinf.wehi.edu.au/EGSEA/mam.rnaseq.rdata’ Content type ‘unknown’ length 1786017 bytes (1.7 MB) ================================================== downloaded 1.7 MB

Alexa, A., and J. Rahnenfuhrer. 2016. TopGO: Enrichment Analysis for Gene Ontology.

Alhamdoosh, M., M. Ng, and M. E. Ritchie. 2017. EGSEA: Ensemble of Gene Set Enrichment Analyses. https://bioconductor.org/packages/devel/bioc/html/EGSEA.html.

Alhamdoosh, M., M. Ng, N. J. Wilson, J. M. Sheridan, H. Huynh, M. J. Wilson, and M. E. Ritchie. 2017. “Combining Multiple Tools Outperforms Individual Methods in Gene Set Enrichment Analyses.” Bioinformatics 33 (3): 414. doi:10.1093/bioinformatics/btw623.

Araki, H., C. Knapp, P. Tsai, and C. Print. 2012. “GeneSetDB: A comprehensive meta-database, statistical and visualisation framework for gene set analysis.” FEBS Open Bio 2: 76–82.

Barbie, D. A., P. Tamayo, J. S. Boehm, S. Y. Kim, S. E. Moody, I. F. Dunn, A. C. Schinzel, et al. 2009. “Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.” Nature 462 (7269): 108–12.

Barry, W. T., A. B. Nobel, and F. A. Wright. 2005. “Significance analysis of functional categories in gene expression studies: a structured permutation approach.” Bioinformatics 21 (9): 1943–9.

Bioconductor Core Team. 2015. Mus.musculus: Annotation package for the Mus.musculus object. http://bioconductor.org/packages/Mus.musculus/.

Goeman, J. J., S. A. van de Geer, F. de Kort, and H. C. van Houwelingen. 2004. “A global test for groups of genes: testing association with a clinical outcome.” Bioinformatics 20 (1): 93–99.

Hänzelmann, S., R. Castelo, and J. Guinney. 2013. “GSVA: gene set variation analysis for microarray and RNA-seq data.” BMC Bioinformatics 14: 7.

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.

Kanehisa, M., and S. Goto. 2000. “KEGG: kyoto encyclopedia of genes and genomes.” Nucleic Acids Research 28 (1): 27–30.

Law, C. W., M. Alhamdoosh, S. Su, G. K. Smyth, and M. E. Ritchie. 2016. “RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR.” F1000Research 5: 1408. doi:10.12688/f1000research.9005.1.

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biology 15 (2): R29.

Lee, E., H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee. 2008. “Inferring pathway activity toward precise disease classification.” PLoS Computational Biology 4 (11): e1000217.

Liao, Y., G. K. Smyth, and W. Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Research 41 (10): e108.

———. 2014. “featureCounts: an Efficient General-Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.

Lim, E., D. Wu, B. Pal, T. Bouras, M. L. Asselin-Labat, F. Vaillant, H. Yagita, G. J. Lindeman, G. K. Smyth, and J. E. Visvader. 2010. “Transcriptome analyses of mouse and human mammary cell subpopulations reveal multiple conserved genes and pathways.” Breast Cancer Research 12 (2): R21.

Luo, W., and C. Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. doi:10.1093/bioinformatics/btt285.

Luo, W., M. S. Friedman, K. Shedden, K. D. Hankenson, and P. J. Woolf. 2009. “GAGE: generally applicable gene set enrichment for pathway analysis.” BMC Bioinformatics 10: 161.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47.

Robinson, M. D., and A. Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-seq data.” Genome Biology 11 (3): R25.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.

Sheridan, J. M., M. E. Ritchie, S. A. Best, K. Jiang, T. J. Beck, F. Vaillant, K. Liu, et al. 2015. “A pooled shRNA screen for regulators of primary mammary stem and progenitor cells identifies roles for Asap1 and Prox1.” BMC Cancer 15 (1). BioMed Central: 221.

Shi, W., A. Oshlack, and G.K. Smyth. 2010. “Optimizing the Noise Versus Bias Trade-Off for Illumina Whole Genome Expression BeadChips.” Nucleic Acids Research 38 (22). Oxford University Press: e204.

Smith, M. L., K. A. Baggerly, H. Bengtsson, M. E. Ritchie, and K. D. Hansen. 2013. “illuminaio: an Open Source Idat Parsing Tool.” F1000Research 2: 264. http://f1000research.com/articles/2-264/.

Smyth, G. K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Stat Appl Genet Mol Biol 3 (1): Article 3.

Su, S., C. W. Law, C. Ah-Cann, M. L. Asselin-Labat, M. E. Blewitt, and M. E. Ritchie. 2017. “Glimma: Interactive Graphics for Gene Expression Analysis.” Bioinformatics 33 (13): 2050–2. doi:10.1093/bioinformatics/btx094.

Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc Natl Acad Sci U S A 102 (43): 15545–50.

Tarca, A. L., S. Draghici, G. Bhatti, and R. Romero. 2012. “Down-weighting overlapping genes improves gene set analysis.” BMC Bioinformatics 13: 136.

Tavazoie, S., J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. 1999. “Systematic determination of genetic network architecture.” Nature Genetics 22 (3): 281–5.

Tomfohr, J., J. Lu, and T. B. Kepler. 2005. “Pathway level analysis of gene expression using singular value decomposition.” BMC Bioinformatics 6: 225.

Wilkinson, B. 1954. “A Statistical Consideration in Psychological Research.” Psychological Bulletin 48: 156–8.

Wu, D., and G. K. Smyth. 2012. “Camera: a competitive gene set test accounting for inter-gene correlation.” Nucleic Acids Research 40 (17): e133.

Wu, D., E. Lim, F. Vaillant, M. L. Asselin-Labat, J. E. Visvader, and G. K. Smyth. 2010. “ROAST: rotation gene set tests for complex microarray experiments.” Bioinformatics 26 (17): 2176–82.

Easy and efficient ensemble gene set testing with EGSEA

Monther Alhamdoosh

Charity W. Law

Luyi Tian

Julie M. Sheridan

Milica Ng

Matthew E. Ritchie

14 July 2017

Contents