Typical RNA-seq call from DESeq2

Note: the typical RNA-seq workflow for users would be to call apeglm estimation from within the lfcShrink function from the DESeq2 package. The unevaluated code chunk shows how to obtain apeglm shrinkage estimates after running DESeq. See the DESeq2 vignette for more details. The lfcShrink wrapper function takes care of many details below, and unifies the interface for multiple shrinkage estimators. The coefficient to shrink can be specified either by name or by number (following the order in resultsNames(dds)). Be aware that DESeq2’s lfcShrink interface provides LFCs on the log2 scale, while apeglm provides coefficients on the natural log scale.

res <- lfcShrink(dds, coef=2, type="apeglm")

Acknowledgments

Joshua Zitovsky contributed fast C++ code for the beta-binomial likelihood, demonstrated in a later section of the vignette.

We have benefited in the development of apeglm from feedback or contributions from the following individuals:

Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson

Example RNA-seq analysis

Here we show example code which mimics what will happen inside the lfcShrink function when using the apeglm method (Zhu, Ibrahim, and Love 2018).

Load a prepared SummarizedExperiment:

library(airway)
data(airway)
head(assay(airway))
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0

For demonstration, we will use 3000 genes of the airway dataset, the first from those genes with at least 10 counts across all samples.

keep <- head(which(rowSums(assay(airway)) >= 10), 3000)
airway <- airway[keep,]

First run a DESeq2 differential expression analysis (Love, Huber, and Anders 2014) (size factors, and dispersion estimates could similarly be estimated using edgeR):

library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)

Defining data and parameter objects necessary for apeglm. We must multiply the coefficients from DESeq2 by a factor, because apeglm provides natural log coefficients. Again, this would be handled inside of lfcShrink in DESeq2 for a typical RNA-seq analysis.

x <- model.matrix(design(dds), colData(dds))
param <- dispersions(dds)
mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE)
offset <- matrix(log(sizeFactors(dds)),
                 ncol=ncol(dds),
                 nrow=nrow(dds),byrow=TRUE)

Running apeglm

Here apeglm on 3000 genes takes a few seconds on a laptop. It scales with number of genes, the number of samples and the number of variables in the design formula, where here we have 5 coefficients (one for the four cell cultures and one for the difference due to dexamethasone treatment).

We provide apeglm with the SummarizedExperiment although the function can also run on a matrix of counts or other observed data. We specify a coef as well as a threshold which we discuss below. Note that we multiple the threshold by log(2) to convert from log2 scale to natural log scale.

The original interface to apeglm looked as follows, but was slow, we therefore recommend using the faster interfaces shown below.

# original interface, also may give Lapack error
fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x),
              threshold=log(2) * 1, mle=mle, offset=offset)

There are better, faster implementations of apeglm specifically for negative binomial likelihoods. The version nbinomR is ~5 times faster than the default method="general".

We will use this run for downstream analysis. Here we pass the SummarizedExperiment, which will pass back the ranges. For faster apeglm runs, provide only the counts matrix (as seen in subsequent chunks).

library(apeglm)
system.time({
  fitR <- apeglm(Y=airway, x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR")
})
##    user  system elapsed 
##   2.893   0.011   2.904
fit <- fitR
names(fit)
## [1] "map"           "sd"            "prior.control" "fsr"          
## [5] "svalue"        "interval"      "thresh"        "diag"         
## [9] "ranges"
str(fit$prior.control)
## List of 7
##  $ no.shrink            : int [1:4] 1 2 3 4
##  $ prior.mean           : num 0
##  $ prior.scale          : num 0.304
##  $ prior.df             : num 1
##  $ prior.no.shrink.mean : num 0
##  $ prior.no.shrink.scale: num 15
##  $ prior.var            : num 0.0921

The version nbinomCR is ~10 times faster than the default general.

system.time({
  fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR")
})
##    user  system elapsed 
##   1.577   0.017   1.595

The version nbinomC returns only the MAP coefficients and can be ~50-100 times faster than the default general. The MAP coefficients are the same as returned by nbinomCR above, we just skip the calculation of posterior SD. A variant of nbinomC is nbinomC* which includes random starts.

system.time({
  fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC")
})
##    user  system elapsed 
##   0.116   0.000   0.116

Among other output, we have the estimated coefficients attached to the ranges of the SummarizedExperiment used as input:

class(fit$ranges)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
mcols(fit$ranges, use.names=TRUE)
## DataFrame with 3000 rows and 5 columns
##                 X.Intercept. cellN061011 cellN080611 cellN61311     dextrt
##                    <numeric>   <numeric>   <numeric>  <numeric>  <numeric>
## ENSG00000000003      6.64788   0.0546137   0.2242529 -0.1570180 -0.2602199
## ENSG00000000419      6.23638  -0.0912014  -0.0206726 -0.0526312  0.1203700
## ENSG00000000457      5.45524   0.0540325  -0.0611456  0.0435641  0.0133132
## ENSG00000000460      3.77417   0.5394114   0.2564594  0.3450027 -0.0420874
## ENSG00000000971      8.53432   0.1733404   0.1364808 -0.4750196  0.2732967
## ...                      ...         ...         ...        ...        ...
## ENSG00000107731      6.63160  0.00129427  -0.3588497  0.2327120 -0.6946087
## ENSG00000107736      2.06827 -0.47986419  -0.5995966 -0.4171680 -0.0224223
## ENSG00000107738      8.62310 -0.14041395  -0.8643571 -0.2631779  0.2137555
## ENSG00000107742      2.06485  0.10029416  -0.4432328  0.0207201 -0.0116752
## ENSG00000107745      7.45809  0.01255731   0.0417601 -0.1318218 -0.0729778

We can compare the coefficients from apeglm with the "normal" shrinkage type from the original DESeq2 paper (2014). This method, which makes use of a Normal-based prior, is no longer the default shrinkage estimator for lfcShrink. apeglm provides coefficients on the natural log scale, so we must convert to log2 scale by multiplying by log2(exp(1)). Note that DESeq2’s lfcShrink function converts apeglm coefficients to the log2 scale internally.

system.time({
  res.shr <- lfcShrink(dds, coef=5, type="normal")
})
##    user  system elapsed 
##   1.350   0.013   1.362
DESeq2.lfc <- res.shr$log2FoldChange
apeglm.lfc <- log2(exp(1)) * fit$map[,5]

Here we plot apeglm estimators against DESeq2:

plot(DESeq2.lfc, apeglm.lfc)
abline(0,1)

Here we plot MLE, DESeq2 and apeglm estimators against the mean of normalized counts:

par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
     ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
     ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
     ylim=lims, main="apeglm", xlab=xlab)
hline()