License: GPL (>= 2)

1 Quick start

GSVA is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:

install.packages("BiocManager")
BiocManager::install("GSVA")

Once GSVA is installed, it can be loaded with the following command.

library(GSVA)

Given a gene expression data matrix, which we shall call X, with rows corresponding to genes and columns to samples, such as this one simulated from random Gaussian data:

p <- 10000 ## number of genes
n <- 30    ## number of samples
## simulate expression values from a standard Gaussian distribution
X <- matrix(rnorm(p*n), nrow=p,
            dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
X[1:5, 1:5]
           s1          s2         s3         s4         s5
g1 -0.4569632 -0.86080840 -0.7739104 -0.8647978 -0.1864430
g2  1.4439029 -2.49754087  0.1736650 -0.4029554 -0.2956561
g3  0.7817078  0.05577512  0.3051028 -0.4960997  0.6091930
g4 -0.5615320 -0.51811190 -0.5634798 -1.7362284  0.8721959
g5  0.1441283  0.15219240  1.2299267 -1.2360226 -0.3307466

Given a collection of gene sets stored, for instance, in a list object, which we shall call gs, with genes sampled uniformly at random without replacement into 100 different gene sets:

## sample gene set sizes
gs <- as.list(sample(10:100, size=100, replace=TRUE))
## sample gene sets
gs <- lapply(gs, function(n, p)
                   paste0("g", sample(1:p, size=n, replace=FALSE)), p)
names(gs) <- paste0("gs", 1:length(gs))

We can calculate GSVA enrichment scores as follows. First we should build a parameter object for the desired methodology. Here we illustrate it with the GSVA algorithm of Hänzelmann, Castelo, and Guinney (2013) by calling the function gsvaParam(), but other parameter object constructor functions are available; see in the next section below.

gsvaPar <- gsvaParam(X, gs)
gsvaPar
A GSVA::gsvaParam object
expression data:
  matrix [10000, 30]
    rows: g1, g2, ..., g10000 (10000 total)
    cols: s1, s2, ..., s30 (30 total)
using assay: none
using annotation:
  geneIdType: Null
gene sets:
  list
    names: gs1, gs2, ..., gs100 (100 total)
    unique identifiers: g2127, g5478, ..., g8331 (4058 total)
gene set size: [1, Inf]
kcdf: auto
kcdfNoneMinSampleSize: 50
tau: 1
maxDiff: TRUE
absRanking: FALSE
sparse: TRUE

The first argument to the gsvaParam() function constructing this parameter object is the gene expression data matrix, and the second is the collection of gene sets. In this example, we provide expression data and gene sets into base R matrix and list objects, respectively, to the gsvaParam() function, but it can take also different specialized containers that facilitate the access and manipulation of molecular and phenotype data, as well as their associated metadata.

Second, we call the gsva() function with the parameter object as first argument. Other additional arguments to the gsva() function are verbose to control progress reporting and BPPPARAM to perform calculations in parallel through the package BiocParallel.

gsva.es <- gsva(gsvaPar, verbose=FALSE)
dim(gsva.es)
[1] 100  30
gsva.es[1:5, 1:5]
             s1          s2          s3          s4            s5
gs1 -0.20849414  0.15253918 -0.33939482  0.20971326  0.0702530829
gs2 -0.03304830 -0.06123733 -0.16791992  0.08270303 -0.1706338873
gs3 -0.08458909  0.31653546  0.18562071 -0.05835282 -0.3594488533
gs4  0.13276229  0.07656845 -0.07412136  0.06527689 -0.0005632903
gs5  0.05560283 -0.11336352 -0.01310951 -0.10594109  0.1739966819

2 Introduction

Gene set variation analysis (GSVA) provides an estimate of pathway activity by transforming an input gene-by-sample expression data matrix into a corresponding gene-set-by-sample expression data matrix. This resulting expression data matrix can be then used with classical analytical methods such as differential expression, classification, survival analysis, clustering or correlation analysis in a pathway-centric manner. One can also perform sample-wise comparisons between pathways and other molecular data types such as microRNA expression or binding data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).

The GSVA package provides an implementation of this approach for the following methods:

  • plage (Tomfohr, Lu, and Kepler 2005). Pathway level analysis of gene expression (PLAGE) standardizes expression profiles over the samples and then, for each gene set, it performs a singular value decomposition (SVD) over its genes. The coefficients of the first right-singular vector are returned as the estimates of pathway activity over the samples. Note that, because of how SVD is calculated, the sign of its singular vectors is arbitrary.

  • zscore (Lee et al. 2008). The z-score method standardizes expression profiles over the samples and then, for each gene set, combines the standardized values as follows. Given a gene set \(\gamma=\{1,\dots,k\}\) with standardized values \(z_1,\dots,z_k\) for each gene in a specific sample, the combined z-score \(Z_\gamma\) for the gene set \(\gamma\) is defined as: \[ Z_\gamma = \frac{\sum_{i=1}^k z_i}{\sqrt{k}}\,. \]

  • ssgsea (Barbie et al. 2009). Single sample GSEA (ssGSEA) is a non-parametric method that calculates a gene set enrichment score per sample as the normalized difference in empirical cumulative distribution functions (CDFs) of gene expression ranks inside and outside the gene set. By default, the implementation in the GSVA package follows the last step described in (Barbie et al. 2009, online methods, pg. 2) by which pathway scores are normalized, dividing them by the range of calculated values. This normalization step may be switched off using the argument ssgsea.norm in the call to the gsva() function; see below.

  • gsva (Hänzelmann, Castelo, and Guinney 2013). This is the default method of the package and similarly to ssGSEA, is a non-parametric method that uses the empirical CDFs of gene expression ranks inside and outside the gene set, but it starts by calculating an expression-level statistic that brings gene expression profiles with different dynamic ranges to a common scale.

The interested user may find full technical details about how these methods work in their corresponding articles cited above. If you use any of them in a publication, please cite them with the given bibliographic reference.

3 Overview of the GSVA functionality

The workhorse of the GSVA package is the function gsva(), which takes a parameter object as its main input. There are four classes of parameter objects corresponding to the methods listed above, and may have different additional parameters to tune, but all of them require at least the following two input arguments:

  1. A normalized gene expression dataset, which can be provided in one of the following containers:
    • A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
    • An ExpressionSet object; see package Biobase.
    • A SummarizedExperiment object, see package SummarizedExperiment.
  2. A collection of gene sets; which can be provided in one of the following containers:
    • A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
    • A GeneSetCollection object; see package GSEABase.

One advantage of providing the input data using specialized containers such as ExpressionSet, SummarizedExperiment and GeneSetCollection is that the gsva() function will automatically map the gene identifiers between the expression data and the gene sets (internally calling the function mapIdentifiers() from the package GSEABase), when they come from different standard nomenclatures, i.e., Ensembl versus Entrez, provided the input objects contain the appropriate metadata; see next section.

If either the input gene expression data is provided as a matrix object or the gene sets are provided in a list object, or both, it is then the responsibility of the user to ensure that both objects contain gene identifiers following the same standard nomenclature.

Before the actual calculations take place, the gsva() function will apply the following filters:

  1. Discard genes in the input expression data matrix with constant expression.

  2. Discard genes in the input gene sets that do not map to a gene in the input gene expression data matrix.

  3. Discard gene sets that, after applying the previous filters, do not meet a minimum and maximum size, which by default is one for the minimum size and has no limit for the maximum size.

If, as a result of applying these three filters, either no genes or gene sets are left, the gsva() function will prompt an error. A common cause for such an error at this stage is that gene identifiers between the expression data matrix and the gene sets do not belong to the same standard nomenclature and could not be mapped. This may happen because either the input data were not provided using some of the specialized containers described above or the necessary metadata in those containers that allows the software to successfully map gene identifiers, is missing.

The method employed by the gsva() function is determined by the class of the parameter object that it receives as an input. An object constructed using the gsvaParam() function runs the method described by Hänzelmann, Castelo, and Guinney (2013), but this can be changed using the parameter constructor functions plageParam(), zscoreParam(), or ssgseaParam(), corresponding to the methods briefly described in the introduction; see also their corresponding help pages.

When using gsvaParam(), the user can additionally tune the following parameters, whose default values cover most of the use cases:

  • kcdf: The first step of the GSVA algorithm brings gene expression profiles to a common scale by calculating an expression statistic through the estimation of the CDF across samples. The way in which such an estimation is performed by GSVA is controlled by the kcdf parameter, which accepts the following four possible values: (1) "auto", the default value, lets GSVA automatically decide the estimation method; (2) "Gaussian", use a Gaussian kernel, suitable for continuous expression data, such as microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs, log-RPKMs or log-TPMs units of expression; (2) "Poisson", use a Poisson kernel, suitable for integer counts, such as those derived from RNA-seq alignments; (3) "none", which will perform a direct estimation of the CDF without a kernel function.

  • kcdfNoneMinSampleSize: When kcdf="auto", this parameter decides at what minimum sample size kcdf="none", i.e., the estimation of the empirical cumulative distribution function (ECDF) of expression levels across samples is performed directly without using a kernel; see the previous kcdf parameter.

  • tau: Exponent defining the weight of the tail in the random walk. By default tau=1.

  • maxDiff: The last step of the GSVA algorithm calculates the gene set enrichment score from two Kolmogorov-Smirnov random walk statistics. This parameter is a logical flag that allows the user to specify two possible ways to do such calculation: (1) TRUE, the default value, where the enrichment score is calculated as the magnitude difference between the largest positive and negative random walk deviations. This default value gives larger enrichment scores to gene sets whose genes are concordantly activated in one direction only; (2) FALSE, where the enrichment score is calculated as the maximum distance of the random walk from zero. This approach produces a distribution of enrichment scores that is bimodal, but it can be give large enrichment scores to gene sets whose genes are not concordantly activated in one direction only.

  • absRanking: Logical flag used only when maxDiff=TRUE. By default, absRanking=FALSE and it implies that a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When absRanking=TRUE the original Kuiper statistic is used, by which the largest positive and negative random walk deviations are added together.

  • sparse: Logical flag used only when the input expression data is stored in a sparse matrix (e.g., a dgCMatrix or a SingleCellExperiment object storing the expression data in a dgCMatrix). In such as case, when sparse=TRUE (default), a sparse version of the GSVA algorithm will be applied. Otherwise, when sparse=FALSE, the classical version of the GSVA algorithm will be used.

In general, the default values for the previous parameters are suitable for most analysis settings, which usually consist of some kind of normalized continuous expression values.

4 Gene set definitions

Gene sets constitute a simple, yet useful, way to define pathways because we use pathway membership definitions only, neglecting the information on molecular interactions. Gene set definitions are a crucial input to any gene set enrichment analysis because if our gene sets do not capture the biological processes we are studying, we will likely not find any relevant insights in our data from an analysis based on these gene sets.

There are multiple sources of gene sets, the most popular ones being The Gene Ontology (GO) project and The Molecular Signatures Database (MSigDB). Sometimes gene set databases will not include the ones we need. In such a case we should either curate our own gene sets or use techniques to infer them from data.

The most basic data container for gene sets in R is the list class of objects, as illustrated before in the quick start section, where we defined a toy collection of three gene sets stored in a list object called gs:

class(gs)
[1] "list"
length(gs)
[1] 100
head(lapply(gs, head))
$gs1
[1] "g2127" "g5478" "g2215" "g5471" "g855"  "g5530"

$gs2
[1] "g7157" "g6299" "g4575" "g718"  "g2025" "g5543"

$gs3
[1] "g2958" "g2496" "g4187" "g5636" "g2811" "g4480"

$gs4
[1] "g8675" "g1447" "g8769" "g1503" "g4379" "g5054"

$gs5
[1] "g8103" "g4719" "g8225" "g3765" "g3595" "g7441"

$gs6
[1] "g6247" "g8871" "g9798" "g9046" "g1923" "g4735"

Using a Bioconductor organism-level package such as org.Hs.eg.db we can easily build a list object containing a collection of gene sets defined as GO terms with annotated Entrez gene identifiers, as follows:

library(org.Hs.eg.db)

goannot <- select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns="GO")
head(goannot)
  ENTREZID         GO EVIDENCE ONTOLOGY
1        1 GO:0002764      IBA       BP
2        1 GO:0003674       ND       MF
3        1 GO:0005576      HDA       CC
4        1 GO:0005576      IDA       CC
5        1 GO:0005576      TAS       CC
6        1 GO:0005615      HDA       CC
genesbygo <- split(goannot$ENTREZID, goannot$GO)
length(genesbygo)
[1] 18678
head(genesbygo)
$`GO:0000002`
 [1] "291"   "1890"  "4205"  "4358"  "4976"  "9361"  "10000" "55186" "55186"
[10] "80119" "84275" "84275" "92667"

$`GO:0000009`
[1] "55650" "79087"

$`GO:0000012`
 [1] "1161"      "2074"      "3981"      "7141"      "7374"      "7515"     
 [7] "23411"     "54840"     "54840"     "54840"     "55247"     "55775"    
[13] "55775"     "200558"    "100133315"

$`GO:0000014`
 [1] "2021"  "2067"  "2072"  "4361"  "4361"  "5932"  "6419"  "6419"  "6419" 
[10] "9941"  "10111" "28990" "64421"

$`GO:0000015`
[1] "2023"   "2023"   "2026"   "2027"   "387712"

$`GO:0000016`
[1] "3938" "3938" "3938"

A more sophisticated container for gene sets is the GeneSetCollection object class defined in the GSEABase package, which also provides the function getGmt() to import gene matrix transposed (GMT) files such as those provided by MSigDB into a GeneSetCollection object. The experiment data package GSVAdata provides one such object with the old (3.0) version of the C2 collection of curated genesets from MSigDB, which can be loaded as follows.

library(GSEABase)
library(GSVAdata)

data(c2BroadSets)
class(c2BroadSets)
[1] "GeneSetCollection"
attr(,"package")
[1] "GSEABase"
c2BroadSets
GeneSetCollection
  names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
  unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

The documentation of GSEABase contains a description of the GeneSetCollection class and its associated methods.

5 Quantification of pathway activity in bulk microarray and RNA-seq data

Here we illustrate how GSVA provides an analogous quantification of pathway activity in both microarray and RNA-seq data by using two such datasets that have been derived from the same biological samples. More concretely, we will use gene expression data of lymphoblastoid cell lines (LCL) from HapMap individuals that have been profiled using both technologies (Huang et al. 2007, @pickrell_understanding_2010). These data form part of the experimental package GSVAdata and the corresponding help pages contain details on how the data were processed. We start loading these data and verifying that they indeed contain expression data for the same genes and samples, as follows:

library(Biobase)

data(commonPickrellHuang)

stopifnot(identical(featureNames(huangArrayRMAnoBatchCommon_eset),
                    featureNames(pickrellCountsArgonneCQNcommon_eset)))
stopifnot(identical(sampleNames(huangArrayRMAnoBatchCommon_eset),
                    sampleNames(pickrellCountsArgonneCQNcommon_eset)))

Next, for the current analysis we use the subset of canonical pathways from the C2 collection of MSigDB Gene Sets. These correspond to the following pathways from KEGG, REACTOME and BIOCARTA:

canonicalC2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
                                      grep("^REACTOME", names(c2BroadSets)),
                                      grep("^BIOCARTA", names(c2BroadSets)))]
canonicalC2BroadSets
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
  unique identifiers: 55902, 2645, ..., 8544 (6744 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

Additionally, we extend this collection of gene sets with two formed by genes with sex-specific expression, which also form part of the GSVAdata experiment data package. Here we use the constructor function GeneSet from the GSEABase package to build the objects that we add to the GeneSetCollection object canonicalC2BroadSets.

data(genderGenesEntrez)

MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
               collectionType=BroadCollection(category="c2"),
               setName="MSY")
MSY
setName: MSY 
geneIds: 266, 84663, ..., 353513 (total: 34)
geneIdType: EntrezId
collectionType: Broad
  bcCategory: c2 (Curated)
  bcSubCategory: NA
details: use 'details(object)'
XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
               collectionType=BroadCollection(category="c2"),
               setName="XiE")
XiE
setName: XiE 
geneIds: 293, 8623, ..., 1121 (total: 66)
geneIdType: EntrezId
collectionType: Broad
  bcCategory: c2 (Curated)
  bcSubCategory: NA
details: use 'details(object)'

canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))
canonicalC2BroadSets
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., XiE (835 total)
  unique identifiers: 55902, 2645, ..., 1121 (6810 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

We calculate now GSVA enrichment scores for these gene sets using first the normalized microarray data and then the normalized RNA-seq integer count data. Note that the only requirement to do the latter is to set the argument kcdf="Poisson", which is "Gaussian" by default. Note, however, that if our RNA-seq normalized expression levels would be continuous, such as log-CPMs, log-RPKMs or log-TPMs, the default value of the kcdf argument should remain unchanged.

huangPar <- gsvaParam(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets,
                      minSize=5, maxSize=500)
esmicro <- gsva(huangPar)
pickrellPar <- gsvaParam(pickrellCountsArgonneCQNcommon_eset,
                         canonicalC2BroadSets, minSize=5, maxSize=500,
                         kcdf="Poisson")
esrnaseq <- gsva(pickrellPar)

We are going to assess how gene expression profiles correlate between microarray and RNA-seq data and compare those correlations with the ones derived at pathway level. To compare gene expression values of both technologies, we will transform first the RNA-seq integer counts into log-CPM units of expression using the cpm() function from the edgeR package.

library(edgeR)

lcpms <- cpm(exprs(pickrellCountsArgonneCQNcommon_eset), log=TRUE)

We calculate Spearman correlations between gene expression profiles of the previous log-CPM values and the microarray RMA values.

genecorrs <- sapply(1:nrow(lcpms),
                    function(i, expmicro, exprnaseq)
                      cor(expmicro[i, ], exprnaseq[i, ], method="spearman"),
                    exprs(huangArrayRMAnoBatchCommon_eset), lcpms)
names(genecorrs) <- rownames(lcpms)

Now calculate Spearman correlations between GSVA enrichment scores derived from the microarray and the RNA-seq data.

pwycorrs <- sapply(1:nrow(esmicro),
                   function(i, esmicro, esrnaseq)
                     cor(esmicro[i, ], esrnaseq[i, ], method="spearman"),
                   exprs(esmicro), exprs(esrnaseq))
names(pwycorrs) <- rownames(esmicro)

Figure 1 below shows the two distributions of these correlations and we can see that GSVA enrichment scores provide an agreement between microarray and RNA-seq data comparable to the one observed between gene-level units of expression.

par(mfrow=c(1, 2), mar=c(4, 5, 3, 2))
hist(genecorrs, xlab="Spearman correlation",
     main="Gene level\n(RNA-seq log-CPMs vs microarray RMA)",
     xlim=c(-1, 1), col="grey", las=1)
hist(pwycorrs, xlab="Spearman correlation",
     main="Pathway level\n(GSVA enrichment scores)",
     xlim=c(-1, 1), col="grey", las=1)
Comparison of correlation values of gene and pathway expression profiles derived from microarray and RNA-seq data.

Figure 1: Comparison of correlation values of gene and pathway expression profiles derived from microarray and RNA-seq data

Finally, in Figure 2 we compare the actual GSVA enrichment scores for two gene sets formed by genes with sex-specific expression. Concretely, one gene set (XIE) formed by genes that escape chromosome X-inactivation in females (Carrel and Willard 2005) and another gene set (MSY) formed by genes located on the male-specific region of chromosome Y (Skaletsky et al. 2003).

par(mfrow=c(1, 2))
rmsy <- cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])
plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ],
     xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
     main=sprintf("MSY R=%.2f", rmsy), las=1, type="n")
fit <- lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ])
abline(fit, lwd=2, lty=2, col="grey")
maskPickrellFemale <- pickrellCountsArgonneCQNcommon_eset$Gender == "Female"
maskHuangFemale <- huangArrayRMAnoBatchCommon_eset$Gender == "Female"
points(exprs(esrnaseq["MSY", maskPickrellFemale]),
       exprs(esmicro)["MSY", maskHuangFemale],
       col="red", pch=21, bg="red", cex=1)
maskPickrellMale <- pickrellCountsArgonneCQNcommon_eset$Gender == "Male"
maskHuangMale <- huangArrayRMAnoBatchCommon_eset$Gender == "Male"
points(exprs(esrnaseq)["MSY", maskPickrellMale],
       exprs(esmicro)["MSY", maskHuangMale],
       col="blue", pch=21, bg="blue", cex=1)
legend("topleft", c("female", "male"), pch=21, col=c("red", "blue"),
       pt.bg=c("red", "blue"), inset=0.01)
rxie <- cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])
plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ],
     xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
     main=sprintf("XiE R=%.2f", rxie), las=1, type="n")
fit <- lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ])
abline(fit, lwd=2, lty=2, col="grey")
points(exprs(esrnaseq["XiE", maskPickrellFemale]),
       exprs(esmicro)["XiE", maskHuangFemale],
       col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["XiE", maskPickrellMale],
       exprs(esmicro)["XiE", maskHuangMale],
       col="blue", pch=21, bg="blue", cex=1)
legend("topleft", c("female", "male"), pch=21, col=c("red", "blue"),
       pt.bg=c("red", "blue"), inset=0.01)