Abstract
apeglm provides empirical Bayes shrinkage estimators for effect sizes for a variety of GLM models; apeglm stands for “Approximate Posterior Estimation for GLM”. apeglm package version: 1.29.0
Note: the typical RNA-seq workflow for users would
be to call apeglm estimation from within the
lfcShrink
function from the DESeq2 package. The
unevaluated code chunk shows how to obtain apeglm shrinkage
estimates after running DESeq
. See the DESeq2 vignette for
more details. The lfcShrink
wrapper function takes care of
many details below, and unifies the interface for multiple shrinkage
estimators. The coefficient to shrink can be specified either by name or
by number (following the order in resultsNames(dds)
). Be
aware that DESeq2’s lfcShrink
interface provides
LFCs on the log2 scale, while apeglm provides coefficients on
the natural log scale.
Joshua Zitovsky contributed fast C++ code for the beta-binomial likelihood, demonstrated in a later section of the vignette.
We have benefited in the development of apeglm
from
feedback or contributions from the following individuals:
Wolfgang Huber, Cecile Le Sueur, Charlotte Soneson
Here we show example code which mimics what will happen inside the
lfcShrink
function when using the apeglm method
(Zhu, Ibrahim, and Love 2018).
Load a prepared SummarizedExperiment
:
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
For demonstration, we will use 3000 genes of the airway
dataset, the first from those genes with at least 10 counts across all
samples.
First run a DESeq2 differential expression analysis (Love, Huber, and Anders 2014) (size factors, and dispersion estimates could similarly be estimated using edgeR):
library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)
Defining data and parameter objects necessary for
apeglm
. We must multiply the coefficients from
DESeq2 by a factor, because apeglm provides natural
log coefficients. Again, this would be handled inside of
lfcShrink
in DESeq2 for a typical RNA-seq
analysis.
Here apeglm
on 3000 genes takes a few seconds on a
laptop. It scales with number of genes, the number of samples and the
number of variables in the design formula, where here we have 5
coefficients (one for the four cell cultures and one for the difference
due to dexamethasone treatment).
We provide apeglm
with the SummarizedExperiment
although the function can also run on a matrix of counts or
other observed data. We specify a coef
as well as a
threshold
which we discuss below. Note that we multiple the
threshold
by log(2)
to convert from log2 scale
to natural log scale.
The original interface to apeglm looked as follows, but was slow, we therefore recommend using the faster interfaces shown below.
# original interface, also may give Lapack error
fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x),
threshold=log(2) * 1, mle=mle, offset=offset)
There are better, faster implementations of apeglm specifically for
negative binomial likelihoods. The version nbinomR
is ~5
times faster than the default method="general"
.
We will use this run for downstream analysis. Here we pass the
SummarizedExperiment, which will pass back the ranges. For
faster apeglm
runs, provide only the counts matrix (as seen
in subsequent chunks).
library(apeglm)
system.time({
fitR <- apeglm(Y=airway, x=x, log.lik=NULL, param=param, coef=ncol(x),
threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR")
})
## user system elapsed
## 2.893 0.011 2.904
## [1] "map" "sd" "prior.control" "fsr"
## [5] "svalue" "interval" "thresh" "diag"
## [9] "ranges"
## List of 7
## $ no.shrink : int [1:4] 1 2 3 4
## $ prior.mean : num 0
## $ prior.scale : num 0.304
## $ prior.df : num 1
## $ prior.no.shrink.mean : num 0
## $ prior.no.shrink.scale: num 15
## $ prior.var : num 0.0921
The version nbinomCR
is ~10 times faster than the
default general
.
system.time({
fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR")
})
## user system elapsed
## 1.577 0.017 1.595
The version nbinomC
returns only the MAP coefficients
and can be ~50-100 times faster than the default general
.
The MAP coefficients are the same as returned by nbinomCR
above, we just skip the calculation of posterior SD. A variant of
nbinomC
is nbinomC*
which includes random
starts.
system.time({
fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC")
})
## user system elapsed
## 0.116 0.000 0.116
Among other output, we have the estimated coefficients attached to the ranges of the SummarizedExperiment used as input:
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
## DataFrame with 3000 rows and 5 columns
## X.Intercept. cellN061011 cellN080611 cellN61311 dextrt
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 6.64788 0.0546137 0.2242529 -0.1570180 -0.2602199
## ENSG00000000419 6.23638 -0.0912014 -0.0206726 -0.0526312 0.1203700
## ENSG00000000457 5.45524 0.0540325 -0.0611456 0.0435641 0.0133132
## ENSG00000000460 3.77417 0.5394114 0.2564594 0.3450027 -0.0420874
## ENSG00000000971 8.53432 0.1733404 0.1364808 -0.4750196 0.2732967
## ... ... ... ... ... ...
## ENSG00000107731 6.63160 0.00129427 -0.3588497 0.2327120 -0.6946087
## ENSG00000107736 2.06827 -0.47986419 -0.5995966 -0.4171680 -0.0224223
## ENSG00000107738 8.62310 -0.14041395 -0.8643571 -0.2631779 0.2137555
## ENSG00000107742 2.06485 0.10029416 -0.4432328 0.0207201 -0.0116752
## ENSG00000107745 7.45809 0.01255731 0.0417601 -0.1318218 -0.0729778
We can compare the coefficients from apeglm with the
"normal"
shrinkage type from the original DESeq2
paper (2014). This method, which makes use of a Normal-based prior, is
no longer the default shrinkage estimator for lfcShrink
.
apeglm provides coefficients on the natural log scale, so we
must convert to log2 scale by multiplying by log2(exp(1))
.
Note that DESeq2’s lfcShrink
function converts
apeglm coefficients to the log2 scale internally.
## user system elapsed
## 1.350 0.013 1.362
Here we plot apeglm estimators against DESeq2:
Here we plot MLE, DESeq2 and apeglm estimators against the mean of normalized counts:
par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
ylim=lims, main="apeglm", xlab=xlab)
hline()