Using fgsea package

Alexey Sergushichev

2016-06-22

fgsea is an R-package for fast preranked gene set enrichment analysis (GSEA). The performance is achieved by using an algorithm for cumulative GSEA-statistic calculation. This allows to reuse samples between different gene set sizes. See the preprint for algorithmic details.

Loading necessary libraryries

Quick run

Loading example pathways and gene-level statistics:

data(examplePathways)
data(exampleRanks)

Running fgsea:

fgseaRes <- fgsea(pathways = examplePathways, 
                  stats = exampleRanks,
                  minSize=15,
                  maxSize=500,
                  nperm=10000)

The resulting table contains enrichment scores and p-values:

head(fgseaRes[order(pval), ])
##                                pathway         pval        padj        ES
## 1:                  5990980_Cell_Cycle 0.0001225340 0.001826057 0.5388497
## 2:         5990979_Cell_Cycle,_Mitotic 0.0001256755 0.001826057 0.5594755
## 3:    5991210_Signaling_by_Rho_GTPases 0.0001318218 0.001826057 0.4238512
## 4:                     5991454_M_Phase 0.0001366867 0.001826057 0.5576247
## 5: 5991023_Metabolism_of_carbohydrates 0.0001391208 0.001826057 0.4944766
## 6:        5991209_RHO_GTPase_Effectors 0.0001394506 0.001826057 0.5248796
##         NES nMoreExtreme size                                leadingEdge
## 1: 2.676639            0  369   66336,66977,12442,107995,66442,19361,...
## 2: 2.736077            0  317   66336,66977,12442,107995,66442,12571,...
## 3: 2.007249            0  231 66336,66977,20430,104215,233406,107995,...
## 4: 2.543706            0  173   66336,66977,12442,107995,66442,52276,...
## 5: 2.227687            0  160    11676,21991,15366,58250,12505,20527,...
## 6: 2.359986            0  157 66336,66977,20430,104215,233406,107995,...

It takes about ten seconds to get results with significant hits after FDR correction:

sum(fgseaRes[, padj < 0.01])
## [1] 80

One can make an enrichment plot for a pathway:

plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
               exampleRanks) + labs(title="Programmed Cell Death")

Or make a table plot for a bunch of selected pathways:

topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes, 
              gseaParam = 0.5)

From the plot above one can see that there are very similar pathways in the table (for example 5991502_Mitotic_Metaphase_and_Anaphase and 5991600_Mitotic_Anaphase). To select only independent pathways one can use collapsePathways function:

collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.01], 
                                      examplePathways, exampleRanks)
mainPathways <- fgseaRes[pathway %in% collapsedPathways$mainPathways][
                         order(-NES), pathway]
plotGseaTable(examplePathways[mainPathways], exampleRanks, fgseaRes, 
              gseaParam = 0.5)

To save the results data:table::fwrite function can be used:

library(data.table)
fwrite(fgseaRes, file="fgseaRes.txt", sep="\t", sep2=c("", " ", ""))

To make leading edge more human-readable it can be converted using mapIds function and a corresponding database (here org.Mm.eg.db for mouse):

library(org.Mm.eg.db)
fgseaResMain <- fgseaRes[match(mainPathways, pathway)]
fgseaResMain[, leadingEdge := lapply(leadingEdge, mapIds, x=org.Mm.eg.db, keytype="ENTREZID", column="SYMBOL")]
fwrite(fgseaResMain, file="fgseaResMain.txt", sep="\t", sep2=c("", " ", ""))

Performance considerations

Please, be aware that fgsea function takes about O(nk^{3/2}) time, where n is number of permutations and k is a maximal size of the pathways. That means that setting maxSize parameter with a value of ~500 is strongly recommended.

Also, fgsea is parallelized using BiocParallel package. By default the first registered backend returned by bpparam() is used. To tweak the parallelization one can either specify BPPARAM parameter used for bclapply of set nproc parameter, which is a shorthand for setting BPPARAM=MulticoreParam(workers = nproc).

Using Reactome pathways

For convenience there is reactomePathways function that obtains pathways from Reactome for given set of genes. Package reactome.db is required to be installed.

pathways <- reactomePathways(names(exampleRanks))
fgseaRes <- fgsea(pathways, exampleRanks, nperm=1000, maxSize=500)
head(fgseaRes)
##                                                            pathway
## 1:                      5-Phosphoribose 1-diphosphate biosynthesis
## 2: A tetrasaccharide linker sequence is required for GAG synthesis
## 3:                           ABC transporters in lipid homeostasis
## 4:                          ABC-family proteins mediated transport
## 5:                       ADP signalling through P2Y purinoceptor 1
## 6:                      ADP signalling through P2Y purinoceptor 12
##          pval      padj         ES        NES nMoreExtreme size
## 1: 0.84749035 0.9279258  0.4267378  0.6861681          438    2
## 2: 0.64990689 0.8572659  0.3218180  0.8324689          348   11
## 3: 0.33557047 0.6732838 -0.3850118 -1.1074888          149   13
## 4: 0.15349544 0.4637333  0.3098537  1.2560422          100   68
## 5: 0.01388889 0.1113237  0.6228710  1.7985470            7   16
## 6: 0.03423423 0.1893732  0.5959844  1.6274476           18   13
##                                 leadingEdge
## 1:                             328099,19139
## 2: 14733,20971,20970,12032,29873,218271,...
## 3: 19299,27403,11307,11806,217265,27409,...
## 4:  18669,56199,17463,26440,26444,19179,...
## 5:  14696,14702,14700,14682,14676,66066,...
## 6:  14696,14702,14700,66066,14697,14693,...

Starting from files

One can also start from .rnk and .gmt files as in original GSEA:

rnk.file <- system.file("extdata", "naive.vs.th1.rnk", package="fgsea")
gmt.file <- system.file("extdata", "mouse.reactome.gmt", package="fgsea")

Loading ranks:

ranks <- read.table(rnk.file,
                    header=TRUE, colClasses = c("character", "numeric"))
ranks <- setNames(ranks$t, ranks$ID)
str(ranks)
##  Named num [1:12000] -63.3 -49.7 -43.6 -41.5 -33.3 ...
##  - attr(*, "names")= chr [1:12000] "170942" "109711" "18124" "12775" ...

Loading pathways:

pathways <- gmtPathways(gmt.file)
str(head(pathways))
## List of 6
##  $ 1221633_Meiotic_Synapsis                                                : chr [1:64] "12189" "13006" "15077" "15078" ...
##  $ 1368092_Rora_activates_gene_expression                                  : chr [1:9] "11865" "12753" "12894" "18143" ...
##  $ 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression           : chr [1:16] "11865" "11998" "12753" "12952" ...
##  $ 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane                   : chr [1:55] "11461" "11465" "11651" "11652" ...
##  $ 186574_Endocrine-committed_Ngn3+_progenitor_cells                       : chr [1:4] "18012" "18088" "18506" "53626"
##  $ 186589_Late_stage_branching_morphogenesis_pancreatic_bud_precursor_cells: chr [1:4] "11925" "15205" "21410" "246086"

And runnig fgsea:

fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize=500, nperm=1000)
head(fgseaRes)
##                                                                                    pathway
## 1:                                                                1221633_Meiotic_Synapsis
## 2:                                   1445146_Translocation_of_Glut4_to_the_Plasma_Membrane
## 3: 442533_Transcriptional_Regulation_of_Adipocyte_Differentiation_in_3T3-L1_Pre-adipocytes
## 4:                                                                  508751_Circadian_Clock
## 5:                                               5334727_Mus_musculus_biological_processes
## 6:                                        573389_NoRC_negatively_regulates_rRNA_expression
##         pval      padj         ES        NES nMoreExtreme size
## 1: 0.5515464 0.7328938  0.2885754  0.9294618          320   27
## 2: 0.6903226 0.8462950  0.2387284  0.8391098          427   39
## 3: 0.0900000 0.2397273 -0.3640706 -1.3246027           35   31
## 4: 0.8021201 0.9091729  0.2516324  0.7177504          453   17
## 5: 0.3842730 0.5978372  0.2469065  1.0486661          258  106
## 6: 0.4257951 0.6403785  0.3607407  1.0289683          240   17
##                                 leadingEdge
## 1:                  15270,12189,71846,19357
## 2:  17918,19341,20336,22628,22627,20619,...
## 3: 76199,19014,26896,229003,17977,17978,...
## 4:                        20893,59027,19883
## 5:  60406,19361,15270,20893,12189,68240,...
## 6:                 60406,20018,245688,20017