RNA-seq workflow: gene-level exploratory analysis and differential expression

CSAMA 2019 version

Michael I. Love1,2, Charlotte Soneson3,4, Simon Anders5, Vladislav Kim6, Johannes Rainer7 and Wolfgang Huber6

1Department of Biostatistics, UNC-Chapel Hill, Chapel Hill, NC, US
2Department of Genetics, UNC-Chapel Hill, Chapel Hill, NC, US
3Friedrich Miescher Institute for Biomedical Research, Switzerland
4SIB Swiss Institute of Bioinformatics, Switzerland
5Centre for Molecular Biology, Univ. Heidelberg (ZMBH), Germany
6European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
7Institute for Biomedicine, Eurac Research, Bolzano, Italy

22 July 2019

1 Abstract
2 Introduction
- 2.1 Experimental data
3 Preparing count matrices
4 The DESeqDataSet object, sample information and the design formula
- 4.1 Creating a DGEList for use with edgeR
5 Exploratory analysis and visualization
6 Differential expression analysis
7 Plotting results
8 Removing hidden batch effects
- 8.1 Using SVA with DESeq2
- 8.2 Using RUV with DESeq2
9 Session information
References

1 Abstract

We walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were quantified with respect to a reference transcriptome, and prepare a count matrix which tallies the number of RNA-seq fragments mapped to each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis (DGE) and visually explore the results.

2 Introduction

Bioconductor has many packages which support analysis of high-throughput sequence data, including RNA sequencing (RNA-seq). The packages which we will use in this workflow include core packages maintained by the Bioconductor core team for storing genomic data sets and working with gene annotations (Huber et al. 2015). We will also use contributed packages for statistical analysis and visualization of sequencing data.

A conductor ensures the orchestra plays in harmony. (source)

Through scheduled releases every 6 months, the Bioconductor project ensures that all the packages within a release will work together in harmony (hence the “conductor” metaphor). The packages used in this workflow are loaded with the library function and can be installed by following the Bioconductor installation instructions.

A published (but essentially similar) version of this workflow, including reviewer reports and comments is available at F1000Research.

If you have questions about this workflow or any Bioconductor software, please post these to the Bioconductor support site. If the questions concern a specific package, you can tag the post with the name of the package, or for general questions about the workflow, tag the post with rnaseqgene. Note the posting guide for crafting an optimal question for the support site.

2.1 Experimental data

The data used in this workflow is stored in an R package, airway2, containing quantification data for eight RNA-seq samples. The quantification files summarize an RNA-seq experiment wherein airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. For more description of the experiment see the PubMed entry 24926665 and for raw data see the GEO entry GSE52778. An alternative version of the workflow uses the airway package, but this version uses the newer airway2 because it contains Salmon quantification files, which will be discussed later.

3 Preparing count matrices

As input, the count-based statistical methods, such as DESeq2 (Love, Huber, and Anders 2014), edgeR (Robinson, McCarthy, and Smyth 2009), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and baySeq (Hardcastle and Kelly 2010), expect data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of counts. The value in the i-th row and the j-th column of the matrix tells how many reads (or fragments, for paired-end RNA-seq) have been assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond e.g., to binding regions (with ChIP-Seq), species of bacteria (with metagenomic datasets), or peptide sequences (with quantitative mass spectrometry).

The values in the matrix should be counts of sequencing reads/fragments. This is important for the statistical models used by DESeq2 and edgeR to hold, as only counts allow assessing the measurement precision correctly. It is important to not provide counts that were pre-normalized for sequencing depth/library size, as the statistical model is most powerful when applied to un-normalized counts and is designed to account for library size differences internally.

3.1 Transcript abundances and the tximport pipeline

In this workflow, we will use transcript abundances as quantified by the Salmon (Patro et al. 2017) software package. Salmon and other methods, such as Sailfish (Patro, Mount, and Kingsford 2014), kallisto (Bray et al. 2016), or RSEM (Li and Dewey 2011), estimate the relative abundances of all (known, annotated) transcripts without aligning reads. Because estimating the abundance of the transcripts involves an inference step, the counts are estimated. Most methods either use a statistical framework called Expectation-Maximization or Bayesian techniques to estimate the abundances and counts. Following quantification, we will use the tximeta (or r Biocpkg("tximport") (Soneson, Love, and Robinson 2015)) package for assembling estimated count and offset matrices for use with Bioconductor differential gene expression packages.

The advantages of using the transcript abundance quantifiers in conjunction with tximeta/tximport to produce gene-level count matrices and normalizing offsets, are: this approach corrects for any potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013); some of these methods are substantially faster and require less memory and disk usage compared to alignment-based methods; and it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence (Robert and Watson 2015). Note that transcript abundance quantifiers skip the generation of large files which store read alignments (SAM or BAM files), instead producing smaller files which store estimated abundances, counts and effective lengths per transcript. For more details, see the manuscript describing this approach (Soneson, Love, and Robinson 2015) and the tximport package vignette for software details.

A full tutorial on how to use the Salmon software for quantifying transcript abundance can be found here, but in the next section we will provide the specific command line code that was used to quantify the eight samples that will be used in this workflow.

3.2 Salmon quantification

The Salmon software logo)

We begin by providing Salmon with the sequence of all of the reference transcripts, which we will call the reference transcriptome. We used version 27 of the GENCODE human transcripts, downloaded from the GENCODE website. As of writing, GENCODE has already released newer versions, but for computational reproducibility, all previous versions are available from the website. On the command line, creating the transcriptome index looks like:

salmon index -i gencode.v27_salmon_0.8.2 -t gencode.v27.transcripts.fa.gz

The 0.8.2 refers to the version of Salmon that was used. As of writing, Salmon is up to version 0.14.1 and it’s always best to use the latest version of software available.

To quantify an individual sample, sample_01, the following command can be used:

salmon quant -i gencode.v27_salmon_0.8.2 -p 6 --libType A \
  --gcBias --biasSpeedSamp 5 \
  -1 sample_01_1.fastq.gz -2 sample_01_2.fastq.gz \
  -o sample_01

In simple English, this command says to “quantify a sample using this transcriptome index, with 6 threads, using automatic library type detection, using GC bias correction (the bias speed part is now longer needed with current versions of Salmon), here are the first and second read, and use this output directory.” The output directory will be created if it doesn’t exist, though if earlier parts of the path do not exist, it will give an error. A single sample of human RNA-seq usually takes ~5 minutes with the GC bias correction.

Rather than writing the above command on the command line, for quantifying these eight samples, a simple R script was written to perform the following Salmon calls from R, looping over the samples. It is also possible to loop over files using a bash loop, or more advanced workflow management systems such as Snakemake (Köster and Rahmann 2012) or Nextflow (Di Tommaso et al. 2017).

3.3 Importing into R with tximport

Following quantification, we can use tximport to import the data into R and perform statistical analysis using Bioconductor packages. Normally, we would simply point tximport to the quant.sf files on our machine. However, because we are distributing these files as part of an R package, we have to do some extra steps, to figure out where the R package, and so the files, are located on your machine.

The output directories from the above Salmon quantification calls has been stored in the extdata directory of the R package called airway2.

library("devtools")
install_github("mikelove/airway2")

The R function system.file can be used to find out where on your computer the files from a package have been installed. Here we ask for the full path to the extdata directory, where R packages store external data, that is part of the airway2 package.

library("airway2")
dir <- system.file("extdata", package="airway2")
dir
list.files(dir)

The quantification directories are in the quants directory.

list.files(file.path(dir,"quants"))

The identifiers used here are the SRR identifiers from the Sequence Read Archive. For this experiment, we downloaded a table that links the SRR identifiers to the sample information about each experiment. From this Run Selector view of the experiment, we clicked the button: RunInfo Table. This downloads a file called SraRunTable.txt. We included this file in the airway2 package, and we read it in now:

coldata <- read.delim(file.path(dir, "SraRunTable.txt"))
colnames(coldata)
idx <- c("Run","cell_line","treatment")
coldata[,idx]

For simplicity we restrict to these three columns, and do a little cleanup of the column names and the levels of the factors.

coldata <- coldata[,idx]
colnames(coldata) <- c("run","cell","dex")
coldata$cell
coldata$dex

We can rename the levels of the dex column, but we have to be careful that our new level names correspond 1-1 to the previous levels. Here we replace Dexamethasone with trt and Untreated with untrt. There is nothing wrong with the longer names, but sometimes shorter factor levels are easier to work with, and as long as the shorter levels can be understood by other analysts, it saves keystrokes.

levels(coldata$dex)
levels(coldata$dex) <- c("trt","untrt")
levels(coldata$dex)

Note: it is prefered in R that the first level of a factor be the reference level (e.g. control, or untreated samples), so we can relevel the dex factor like so:

coldata$dex <- relevel(coldata$dex, "untrt")

Now we can create a named vector that points to the Salmon quantification files. Because we have gzipped the quants for distribution, we point to quant.sf.gz. Normally would just be quant.sf. (The importing software can import directly from gzipped files.)

files <- file.path(dir,"quants",coldata$run,"quant.sf.gz")
names(files) <- coldata$run
all(file.exists(files))

Because we perform a gene-level analysis we need the mapping of transcripts to genes for GENCODE v27. Since GENCODE uses Ensembl transcript identifiers, we use the ensembldb package (Rainer, Gatto, and Weichenberger 2019) to extract these from one of the available annotation databases. Pre-built EnsDb annotation databases for all species for a large number of Ensembl releases are available through the AnnotationHub resource. We thus load below an EnsDb database with human annotations for Ensembl release 91, which corresponds to GENCODE v27. The parameter localHub = TRUE ensures that we list and load only locally stored resources, due to the very limited internet connectivity at the lab site.

library("AnnotationHub")
library("ensembldb")
ah <- AnnotationHub(localHub = TRUE)

ah_91 <- query(ah, "EnsDb.Hsapiens.v91")
ah_91

edb <- ah[[names(ah_91)]]

We next define a data.frame that connects transcripts to genes. Note that the information in tx2gene needs to directly match the reference transcriptome used by Salmon.

tx2gene <- transcripts(edb, columns = c("tx_id_version", "gene_id", "tx_id"),
                       return.type = "data.frame")
colnames(tx2gene) <- c("TXNAME", "GENEID", "tx_id")
head(tx2gene)

Finally the following line of code imports Salmon transcript quantifications into R, collapsing to the gene level using the information in tx2gene.

library("tximport")
library("jsonlite")
library("readr")
txi <- tximport(files, type="salmon", tx2gene=tx2gene)

The txi object is simply a list of matrices (and one character vector):

names(txi)
txi$counts[1:3,1:3]
txi$length[1:3,1:3]
txi$abundance[1:3,1:3]
txi$countsFromAbundance

3.4 Alternative: Importing into R with tximeta

In the previous section we showed how to import the quantifications from Salmon into a list of matrices in R, using the tximport package. The tximeta package provides a wrapper around tximport, which in addition to importing the quantifications also generates a SummarizedExperiment object (see below). Moreover, it automatically identifies the transcriptome that was used for the quantification, and downlaods the corresponding annotation information. Note that this requires an internet connection. Here, we will illustrate how to use the tximeta package to import the data above.

The input to tximeta is a data frame with sample information. This data frame must contain at least two columns, named names and files, and containing the sample IDs and the paths to the Salmon output files, respectively. Since we already have the coldata table above, we just have to rename the Run column and add the paths to the Salmon files.

coldata2 <- coldata
colnames(coldata2)[colnames(coldata2) == "run"] <- "names"
coldata2$files <- file.path(dir,"quants",coldata2$names,"quant.sf.gz")
all(file.exists(coldata2$files))
coldata2

Next, we call tximeta, using the sample information data frame as input argument.

library("tximeta")
(txim <- tximeta(coldata = coldata2))

By default, tximeta returns a SummarizedExperiment object with transcript-level quantifications. Since we are interested in gene-level differential expression analysis, we summarize the abundances on the gene level.

(txim <- summarizeToGene(txim))

3.5 Branching point

At this point, we have quantified the RNA-seq fragments and summarized to gene-level counts and abundances. This is a branching point where we could use a variety of Bioconductor packages for exploration and differential expression of the count data, including edgeR (Robinson, McCarthy, and Smyth 2009), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and baySeq (Hardcastle and Kelly 2010). Schurch et al. (2016) compared performance of different statistical methods for RNA-seq using a large number of biological replicates and can help users to decide which tools make sense to use, and how many biological replicates are necessary to obtain a certain sensitivity. We will continue using DESeq2 (Love, Huber, and Anders 2014) and edgeR (Robinson, McCarthy, and Smyth 2009) below.

3.6 SummarizedExperiment

The DESeq2 package will store the data and intermediate results in an object called a DESeqDataSet. This object builds on top of a general Bioconductor class of object called the SummarizedExperiment. We will therefore first introduce the SummarizedExperiment and we begin with a diagram of the three component pieces:

The SummarizedExperiment container is diagrammed in the Figure above and discussed in the latest Bioconductor paper (Huber et al. 2015). In our case we have created a single matrix named "counts" that contains the estimated counts for each gene and sample, which is stored in assay. It is also possible to store multiple matrices, accessed with assays. The rowData for our object will keep track of the intermediate calculations for each gene: e.g. the mean of normalized counts, the dispersion estimates, the estimated coefficients and their standard error, etc. We can also include information about the location of each gene, in which case, we would use rowRanges to access the GenomicRanges or GenomicRangesList associated with each gene.The component parts of the SummarizedExperiment are accessed with an R function of the same name: assay (or assays), rowRanges/rowData and colData.

4 The DESeqDataSet object, sample information and the design formula

Bioconductor software packages often define and use a custom class for storing data that makes sure that all the needed data slots are consistently provided and fulfill the requirements. In addition, Bioconductor has general data classes (such as the SummarizedExperiment) that can be used to move data between packages. Additionally, the core Bioconductor classes provide useful functionality: for example, subsetting or reordering the rows or columns of a SummarizedExperiment automatically subsets or reorders the associated rowData and colData, which can help to prevent accidental sample swaps that would otherwise lead to spurious results. With SummarizedExperiment this is all taken care of behind the scenes.

In DESeq2, the custom class is called DESeqDataSet. One of the main differences with SummarizedExperiment is that we will use the counts function to access the assay matrix, and it is required that this matrix stores non-negative integers (this is to prevent mistaken use of the package for non-count data).

A second difference is that the DESeqDataSet has an associated design formula. The experimental design is specified at the beginning of the analysis, as it will inform many of the DESeq2 functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on the design formula). The design formula tells which columns in the sample information table (colData) specify the experimental design and how these factors should be used in the analysis.

The simplest design formula for differential expression would be ~ condition, where condition is a column in colData(dds) that specifies which of two (or more groups) the samples belong to. For the airway experiment, we will specify ~ cell + dex meaning that we want to test for the effect of dexamethasone (dex) controlling for the effect of different cell line (cell).

We can construct a DESeqDataSet object from the tximport object using the function DESeqDataSetFromTximport:

library("DESeq2")
dds <- DESeqDataSetFromTximport(txi, coldata, design = ~cell + dex)

For running DESeq2 or edgeR models, you can use R’s formula notation to express any fixed-effects experimental design. Note that DESeq2 and edgeR use the same formula notation as, for instance, the lm function of base R. If the research aim is to determine for which genes the effect of treatment is different across groups, then interaction terms can be included and tested using a design such as ~ group + treatment + group:treatment. See the manual page for ?results for more examples.

4.1 Creating a DGEList for use with edgeR

The edgeR package uses another type of data container, namely a DGEList object. It is just as easy to create a DGEList object using the count matrix and information about samples. We can additionally add information about the genes:

genetable <- data.frame(gene.id = rownames(txi$counts))

Because we are using counts from tximport, we have a slightly different procedure to import them, which involves calculating a statistical offset that will account for differences in gene length across samples. The following code is from the tximport vignette in the edgeR section.

library("edgeR")
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))

Actually creating the DGEList object simply entails providing the counts, sample information, gene and information. The second line adds an appropriate offset.

dge <- DGEList(counts = cts,
               samples = coldata,
               genes = genetable)
dge <- scaleOffset(dge, t(t(log(normMat)) + o))
names(dge)

A much simpler approach with respect to the amount of code is to generate counts-from-abundance using tximport, which correct for the gene length changes directly. Example for this code is:

txi2 <- tximport(files, type="salmon", tx2gene=tx2gene,
                 countsFromAbundance="lengthScaledTPM")
dge2 <- DGEList(counts = txi2$counts,
                samples = coldata,
                genes = genetable)
names(dge2)

The counts + offset method and the counts-from-abundance method are not statistically identical (obviously), but tend to output roughly similar groups of genes. In this dataset, using counts-from-abundance has 96% of genes in common with count + offset at a target 5% FDR.

Just like the SummarizedExperiment and the DESeqDataSet the DGEList contains all the information we need: the count matrix, information about the samples (columns of the count matrix), and information about the genes (rows of the count matrix).

5 Exploratory analysis and visualization

There are two separate paths in this workflow; the one we will see first involves transformations of the counts in order to visually explore sample relationships. In the second part, we will go back to the original raw counts for statistical testing. This is critical because the statistical testing methods rely on original count data (not scaled or transformed) for calculating the precision of measurements.

5.1 Pre-filtering the dataset

Our count matrix with our DESeqDataSet contains many rows with only zeros, and additionally many rows with only a few fragments in total. In order to reduce the size of the object, and to increase the speed of our functions, we can remove the rows that have no or nearly no information about the amount of gene expression. Here we apply the most minimal filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples. Additional weighting/filtering to improve power is applied at a later step in the workflow.

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

Here, we also filter the DGEList for edgeR in the same way. Note, however, that edgeR does not apply filtering internally, and for this reason, typically a stronger prefiltering criterion is used at this stage for edgeR.

dge <- dge[rowSums(round(dge$counts)) > 1, ]
all(rownames(dge) == rownames(dds))

We use the above code mostly for aiding comparison with DESeq2 in later sections. The recommended filtering code for edgeR (here not evaluated) is as follows:

dge <- dge[filterByExpr(dge),]

5.2 Variance stabilizing transformations

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic. For RNA-seq counts, however, the expected variance grows with the mean. For example, if one performs PCA directly on a matrix of counts or normalized counts (e.g. correcting for differences in sequencing depth), the resulting plot typically depends mostly on the genes with highest counts because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a pseudocount of 1; however, depending on the choice of pseudocount, now the genes with the very lowest counts will contribute a great deal of noise to the resulting plot, because taking the logarithm of small counts actually inflates their variance. We can quickly show this property of counts with some simulated data (here, Poisson counts with a range of lambda from 0.1 to 100). We plot the standard deviation of each row (genes) against the mean:

lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library("vsn")
meanSdPlot(cts, ranks = FALSE)

And for logarithm-transformed counts after adding a pseudocount of 1:

log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)

The logarithm with a small pseudocount amplifies differences when the values are close to 0. The low count genes with low signal-to-noise ratio will overly contribute to sample-sample distances and PCA plots.

As a solution, DESeq2 offers two transformations for count data that stabilize the variance across the mean: (1) the the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend (Anders and Huber 2010), implemented in the vst function, and (2) the regularized-logarithm transformation or rlog (Love, Huber, and Anders 2014).

For genes with high counts, the VST and rlog will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. The VST or rlog-transformed data then becomes approximately homoskedastic, and can be used directly for computing distances between samples, making PCA plots, or as input to downstream methods which perform best with homoskedastic data.

Which transformation to choose? Mike tends to prefer the VST. The VST is much faster to compute via the function vst and is less sensitive to high count outliers than the rlog. Perhaps further development on the rlog would make it less sensitive to outliers. (Why have the rlog at all then? In the DESeq2 paper, the rlog sometimes outperformed the VST in terms of recovering clusters when there was a large range of sequencing depth across samples e.g. x10 from lowest to highest sequencing depth.)

Note that the two transformations offered by DESeq2 are provided for applications other than differential testing. For differential testing we recommend the DESeq function applied to raw counts, as described later in this workflow, which also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step.

Both transformation functions return an object based on the SummarizedExperiment class that contains the transformed values in its assay slot.

vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
meanSdPlot(assay(vsd), ranks = FALSE)

rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
meanSdPlot(assay(rld), ranks = FALSE)

In the above function calls, we specified blind = FALSE, which means that differences between cell lines and treatment (the variables in the design) will not contribute to the expected variance-mean trend of the experiment. The experimental design is not used directly in the transformation, only in estimating the global amount of variability in the counts. For a fully unsupervised transformation, one can set blind = TRUE (which is the default).

To show the effect of the transformation, in the figure below we plot the first sample against the second, first simply using the log2 function of normalized counts (after adding 1, to avoid taking the log of zero), and then using the VST and rlog-transformed values.

library("dplyr")
library("ggplot2")
library("hexbin")
ntd <- normTransform(dds)

df <- bind_rows(
  as_data_frame(assay(ntd)[, 1:2]) %>% mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
colnames(df)[1:2] <- c("x", "y")  

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid(. ~ transformation)

We can see how genes with low counts (bottom left-hand corner) seem to be excessively variable on the ordinary logarithmic scale, while the VST and rlog compress differences for the low count genes for which the data provide little information about differential expression.

5.3 Sample distances

A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment’s design?

We use the R function dist to calculate the Euclidean distance between samples. To ensure we have a roughly equal contribution from all genes, we use it on the VST data. We need to transpose the matrix of values using t, because the dist function expects the different samples to be rows of its argument, and different dimensions (here, genes) to be columns.

sampleDists <- dist(t(assay(vsd)))
sampleDists

We visualize the distances in a heatmap in a figure below, using the function pheatmap from the pheatmap package.

library("pheatmap")
library("RColorBrewer")

In order to plot the sample distance matrix with the rows/columns arranged by the distances in our distance matrix, we manually provide sampleDists to the clustering_distance argument of the pheatmap function. Otherwise the pheatmap function would assume that the matrix contains the data values themselves, and would calculate distances between the rows/columns of the distance matrix, which is not desired. We also manually specify a blue color palette using the colorRampPalette function from the RColorBrewer package.

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$dex, vsd$cell, sep = " - ")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap.

Another option for calculating sample distances is to use the Poisson Distance (Witten 2011), implemented in the PoiClaClu package. This measure of dissimilarity between counts also takes the inherent variance structure of counts into consideration when calculating the distances between samples. The PoissonDistance function takes the original count matrix (not normalized) with samples as rows instead of columns, so we need to transpose the counts in dds.

library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds)))

We plot the heatmap in a Figure below.

samplePoisDistMatrix <- as.matrix(poisd$dd)
rownames(samplePoisDistMatrix) <- paste(vsd$dex, vsd$cell, sep = " - ")
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
         clustering_distance_rows = poisd$dd,
         clustering_distance_cols = poisd$dd,
         col = colors)

5.4 PCA plot

Another way to visualize sample-to-sample distances is a principal components analysis (PCA). In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences (figure below). The x-axis is the direction that separates the data points the most. The values of the samples in this direction are written PC1. The y-axis is a direction (it must be orthogonal to the first direction) that separates the data the second most. The values of the samples in this direction are written PC2. The percent of the total variance that is contained in the direction is printed in the axis label. Note that these percentages do not add to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).

plotPCA(vsd, intgroup = c("dex", "cell"))

Here, we have used the function plotPCA that comes with DESeq2. The two terms specified by intgroup are the interesting groups for labeling the samples; they tell the function to use them to choose colors. We can also build the PCA plot from scratch using the ggplot2 package (Wickham 2009). This is done by asking the plotPCA function to return the data used for plotting rather than building the plot. See the ggplot2 documentation for more details on using ggplot.

pcaData <- plotPCA(vsd, intgroup = c("dex", "cell"), returnData = TRUE)
pcaData
percentVar <- round(100 * attr(pcaData, "percentVar"))

We can then use these data to build up a second plot in a figure below, specifying that the color of the points should reflect dexamethasone treatment and the shape should reflect the cell line.

ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed()

From the PCA plot, we see that the differences between cells (the different plotting shapes) are considerable, though not stronger than the differences due to treatment with dexamethasone (red vs blue color). This shows why it will be important to account for this in differential testing by using a paired design (“paired”, because each dex treated sample is paired with one untreated sample from the same cell line). We are already set up for this design by assigning the formula ~ cell + dex earlier.

5.5 MDS plot

Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don’t have a matrix of data, but only a matrix of distances. Here we compute the MDS for the distances calculated from the VST data and plot these in a figure below.

mds <- as.data.frame(colData(vsd))  %>%
         cbind(cmdscale(sampleDistMatrix))
ggplot(mds, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
  geom_point(size = 3) + coord_fixed()

In a figure below we show the same plot for the PoissonDistance:

mdsPois <- as.data.frame(colData(dds)) %>%
   cbind(cmdscale(samplePoisDistMatrix))
ggplot(mdsPois, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
  geom_point(size = 3) + coord_fixed()

6 Differential expression analysis

6.1 Performing differential expression testing with DESeq2

As we have already specified an experimental design when we created the DESeqDataSet, we can run the differential expression pipeline on the raw counts with a single call to the function DESeq:

dds <- DESeq(dds)

This function will print out a message for the various steps it performs. These are described in more detail in the manual page for DESeq, which can be accessed by typing ?DESeq. Briefly these are: the estimation of size factors (controlling for differences in the sequencing depth of the samples), the estimation of dispersion values for each gene, and fitting a generalized linear model.

A DESeqDataSet is returned that contains all the fitted parameters within it, and the following section describes how to extract out results tables of interest from this object.

6.2 Building the results table

Calling results without any arguments will extract the estimated log2 fold changes and p values for the last variable in the design formula. If there are more than 2 levels for this variable, results will extract the results table for a comparison of the last level over the first level. The comparison is printed at the top of the output: dex trt vs untrt.

res <- results(dds)
res

We could have equivalently produced this results table with the following more specific command. Because dex is the last variable in the design, we could optionally leave off the contrast argument to extract the comparison of the two levels of dex.

res <- results(dds, contrast = c("dex", "trt", "untrt"))

As res is a DataFrame object, it carries metadata with information on the meaning of the columns:

mcols(res, use.names = TRUE)

The first column, baseMean, is a just the average of the normalized count values, divided by the size factors, taken over all samples in the DESeqDataSet. The remaining four columns refer to a specific contrast, namely the comparison of the trt level over the untrt level for the factor variable dex. We will find out below how to obtain other contrasts.

The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of \(2^{1.5} \approx 2.82\).

Of course, this estimate has an uncertainty associated with it, which is available in the column lfcSE, the standard error estimate for the log2 fold change estimate. We can also express the uncertainty of a particular effect size estimate as the result of a statistical test. The purpose of a test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is zero effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability (i.e., the type of variability that you can expect between different samples in the same treatment group). As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue. Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis.

We can also summarize the results with the following line of code, which reports some additional information, that will be covered in later sections.

summary(res)

Note that there are many genes with differential expression due to dexamethasone treatment at the FDR level of 10%. This makes sense, as the smooth muscle cells of the airway are known to react to glucocorticoid steroids. However, there are two ways to be more strict about which set of genes are considered significant:

lower the false discovery rate threshold (the threshold on padj in the results table)
raise the log2 fold change threshold from 0 using the lfcThreshold argument of results

If we lower the false discovery rate threshold, we should also inform the results() function about it, so that the function can use this threshold for the optimal independent filtering that it performs:

res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)

If we want to raise the log2 fold change threshold, so that we test for genes that show more substantial changes due to treatment, we simply supply a value on the log2 scale. For example, by specifying lfcThreshold = 1, we test for genes that show significant effects of treatment on gene counts more than doubling or less than halving, because \(2^1=2\).

resLFC1 <- results(dds, lfcThreshold = 1)
table(resLFC1$padj < 0.05)

Sometimes a subset of the p values in res will be NA (“not available”). This is DESeq’s way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.

6.3 Independent hypothesis weighting within DESeq2

The results function in DESeq2 integrates with a new method for dealing with genes that have low counts and therefore low power for detecting differential gene expression. This method is called Independent Hypothesis Weighting (IHW) (Ignatiadis et al. 2016), and generalizes the idea of filtering low count genes to reduce the multiple testing burden. While the specifics of the method are beyond the scope of this workflow, we demonstrate its usage here:

library("IHW")
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)

The results are similar but not identical to the results table built above, which uses a simple optimization of a filter over mean normalized counts. The new procedure results in a gain of 50 genes.

table(old=res$padj < .1, IHW=resIHW$padj < .1)

We can plot the weighting function which is fit by the IHW package:

plot(metadata(resIHW)$ihwResult)

6.4 Other comparisons

In general, the results for a comparison of any two levels of a variable can be extracted using the contrast argument to results. The user should specify three values: the name of the variable, the name of the level for the numerator, and the name of the level for the denominator. Here we extract results for the log2 of the fold change of one cell line over another:

results(dds, contrast = c("cell", "N061011", "N61311"))

There are additional ways to build results tables for certain comparisons after running DESeq once. If results for an interaction term are desired, the name argument of results should be used. Please see the help page for the results function for details on the additional ways to build results tables. In particular, the Examples section of the help page for results gives some pertinent examples.

6.5 Annotation of RNA-seq results

One of the central steps in the analysis of genomic data is the annotation of the quantified entities to biologically more relevant representations such as transcripts, genes or proteins. Such annotations enable for example pathway enrichment analyses and ease the interpretation of the results. Bioconductor provides a large variety of annotation packages and resources, most of them supporting the AnnotationDbi interface which enables a unified way to retrieve annotations for given identifiers. The annotation resources from Bioconductor range from web-based tools such as biomaRt to packages working with pre-built databases containing all identifier mappings for a certain species (the *.org packages such as org.Hs.eg.db) or providing gene and transcript models, such as databases build by GenomicFeatures (TxDb databases/packages) and ensembldb (EnsDb databases/packages) (Rainer, Gatto, and Weichenberger 2019).

In this section we use the ensembldb package to annotate all tested genes which are identified by Ensembl gene IDs. The Salmon quantification used in the analysis leading to this table was based on GENCODE v27 transcripts which is linked to Ensembl release 91. We thus load below an EnsDb database with human annotations for Ensembl release 91 from the AnnotationHub resource. The parameter localHub = TRUE ensures that we list and load only locally stored resources (due to the very limited internet connectivity at the lab site).

library("AnnotationHub")
library("ensembldb")
ah <- AnnotationHub(localHub = TRUE)

ah_91 <- query(ah, "EnsDb.Hsapiens.v91")
ah_91

#' Load the database
edb <- ah[[names(ah_91)]]

If we had a working internet connection we could simply list all available EnsDb databases with query(ah, "EnsDb") (after loading AnnotationHub with localHub = FALSE). We can now use the edb variable to access the data in the database. With the listColumns function we can for example list all of the available annotation fields (columns) in the database.

listColumns(edb)

We could get data from any of these database columns by passing the respective column name(s) along with the column parameter to any function retrieving annotations from an EnsDb database. Gene-related annotations can be retrieved with the genes function, that by default (and similar to the genes function from the GenomicFeatures package) returns a GRanges object with the genomic coordinates of the genes and additional annotation columns in the GRanges’ metadata columns. Below we use the genes call to retrieve annotations for all human genes. With the parameter return.type = "DataFrame" we tell the function to return the result as a DataFrame instead of the default GRanges.

anns <- genes(edb, return.type = "DataFrame")
rownames(anns) <- anns$gene_id

We can now combine the RNA-seq result with the annotation.

resAnn <- BiocGenerics::cbind(anns[BiocGenerics::rownames(res), ], res)

resAnn

If we had only a reduced set of genes to annotate, e.g. the set of the most significant genes, it is faster and more elegant to extract annotations for the subset of genes using the filter framework from ensembldb. Below we create a top table by selecting the 20 genes with the smallest p-values.

resTop20 <- res[order(res$pvalue), ][1:20, ]

To get only annotations for the selected genes we pass the filter expression ~ gene_id == rownames(resTop20) with the filter parameter to the genes function. Filter expressions have to be written in the form of a formula (i.e. starting with ~) and support any logical R expression and any database column/field in the EnsDb database (use supportedFilters(edb) to list all supported fields). Also, by specifying the respective field names with the columns parameter we extract only the gene name (equivalent with the HGNC symbol for most genes), description the gene biotype and the NCBI Entrezgene ID from the database.

anns <- genes(edb, filter = ~ gene_id == rownames(resTop20),
              columns = c("gene_name", "description", "gene_biotype",
                          "entrezid"), return.type = "DataFrame")

Note that the order of the genes in the returned data.frame is not the same as in resTop20. We thus below re-order the annotations to match the order of genes in the top table and subsequently join the two data frames.

resTop20 <- cbind(anns[match(rownames(resTop20), anns$gene_id), ],
                  resTop20)

Be aware that mappings between Ensembl and NCBI identifiers is not necessarily a 1:1 mapping. In our case we got for some Ensembl gene IDs more than one NCBI Entrezgene ID. The column "entrezid" in our result table is thus a list with eventually more than one Entrezgene ID in one row. Exporting such a table could be problematic and we collapse therefore Entrezgene IDs in this columns into a single character string, with multiple IDs separated by a semicolon (";").

resTop20$entrezid <- sapply(resTop20$entrezid, function(z) {
    if (any(is.na(z))) z
    else paste(z, collapse = ";")
})

The 20 most significantly regulated genes in the present analysis are listed below.

6.5.1 Using ensembldb to query for genes encoding a specific protein domain

For some experiments it might be interesting to search for genes, or rather transcripts, that encode a protein with a certain protein domain. For the present data set we could for example search for genes with proteins encoding the ligand binding domain of nuclear hormone receptors (such as the glucocorticoid receptor, the gene activated by treatment with the synthetic glucocorticoid dexamethasone and hence being mainly responsible for the transcriptional changes observed in the present dataset). Below we thus query the database for all such genes using the protein domain ID PF00104 from Pfam. The filter used in the query below will return all genes with a transcript annotated to the specific protein domain restricting to genes that are detected in the present analysis.

hormoneR <- genes(edb,
                  filter = ~ protein_domain_id == "PF00104" &
                      gene_id == rownames(res),
                  return.type = "data.frame")
rownames(hormoneR) <- hormoneR$gene_id
nrow(hormoneR)

How many genes were identified? Next we identify nuclear hormone receptors which are significantly differentially expressed at a 5% FDR in the present data set.

resHormoneR <- res[hormoneR$gene_id, ]
resHormoneR <- resHormoneR[which(resHormoneR$padj < 0.05), ]
resHormoneR <- cbind(hormoneR[rownames(resHormoneR), ],
                     resHormoneR)

The table below lists the significantly differentially expressed hormone receptors.

knitr::kable(resHormoneR[order(resHormoneR$pvalue),
                         c("gene_name", "padj", "log2FoldChange")])

Among the regulated hormone receptors are NR4A3, encoding a member of the steroid-thyroid hormone-retinoid receptor superfamily (4-fold upregulated), but also the glucocorticoid receptor (NR3C1) itself, which is known to down-regulate in certain cell types its own gene as part of a negative feedback loop.

6.6 Performing differential expression testing with edgeR

Next we will show how to perform differential expression analysis with edgeR. Recall that we have a DGEList dge, containing three objects:

names(dge)

We first define a design matrix, using the same formula syntax as for DESeq2 above.

design <- model.matrix(~ cell + dex, dge$samples)

Then, we calculate normalization factors and estimate the dispersion for each gene. Note that we need to specify the design in the dispersion calculation.

dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design)

Finally, we fit the generalized linear model and perform the test. In the glmQLFTest function, we indicate which coefficient (which column in the design matrix) that we would like to test for. It is possible to test more general contrasts as well, and the user guide contains many examples on how to do this. The topTags function extracts the top-ranked genes. You can indicate the adjusted p-value cutoff, and/or the number of genes to keep.

fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = ncol(design))
tt <- topTags(qlf, n = nrow(dge), p.value = 0.1)
tt10 <- topTags(qlf) # just the top 10 by default
tt10

We can compare to see how the results from the two software overlap:

tt.all <- topTags(qlf, n = nrow(dge), sort.by = "none")
table(DESeq2 = res$padj < 0.1, edgeR = tt.all$table$FDR < 0.1)

We can also compare the two lists by the ranks:

common <- !is.na(res$padj)
plot(rank(res$padj[common]), 
     rank(tt.all$table$FDR[common]), cex = .1,
     xlab = "DESeq2", ylab = "edgeR")

Also with edgeR we can test for significance relative to a fold-change threshold, using the function glmTreat instead of glmLRT. Below we set the log fold-change threshold to 1 (i.e., fold change threshold equal to 2), as for DESeq2 above. We also compare the results to those obtained with DESeq2.

treatres <- glmTreat(fit, coef = ncol(design), lfc = 1)
tt.treat <- topTags(treatres, n = nrow(dge), sort.by = "none")
table(DESeq2 = resLFC1$padj < 0.1, edgeR = tt.treat$table$FDR < 0.1)

6.7 Citing packages used in research

If you use the results from an R analysis package in published research, you can find the proper citation for the software by typing citation("pkgName"), where you would substitute the name of the package for pkgName. Citing methods papers helps to support and reward the individuals who put time into open source software for genomic data analysis.

7 Plotting results

7.1 Counts plot with DESeq2

A quick way to visualize the counts for a particular gene is to use the plotCounts function that takes as arguments the DESeqDataSet, a gene name, and the group over which to plot the counts (figure below).

topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = topGene, intgroup = c("dex"))

We can also make custom plots using the ggplot function from the ggplot2 package (figures below).

library("ggbeeswarm")
geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex","cell"),
                         returnData = TRUE)
ggplot(geneCounts, aes(x = dex, y = count, color = cell)) +
  scale_y_log10() +  geom_beeswarm(cex = 3)

ggplot(geneCounts, aes(x = dex, y = count, color = cell, group = cell)) +
  scale_y_log10() + geom_point(size = 3) + geom_line()

7.2 MA plot with DESeq2

An MA plot (Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”. You may hear this plot also referred to as a mean-difference plot, or a Bland-Altman plot.

Before making the MA plot, we use the lfcShrink function to shrink the log2 fold changes for the comparison of dex treated vs untreated samples. This typically takes about 20 seconds on a laptop.

library("apeglm")
resultsNames(dds)
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-5,5))

The DESeq2 package uses a Bayesian procedure to moderate (or “shrink”) log2 fold changes from genes with very low counts and highly variable counts, as can be seen by the narrowing of the vertical spread of points on the left side of the MA plot. As shown above, the lfcShrink function performs this operation. For a detailed explanation of the rationale of moderated fold changes, please see the DESeq2 paper (Love, Huber, and Anders 2014).

If we had not used statistical moderation to shrink the noisy log2 fold changes, we would have instead seen the following plot:

res.noshr <- results(dds)
DESeq2::plotMA(res.noshr, xlim=c(5,1e6), ylim=c(-5,5))

We can label individual points on the MA plot as well. Here we use the with R function to plot a circle and text for a selected row of the results object. Within the with function, only the baseMean and log2FoldChange values for the selected rows of res are used.

DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col = "dodgerblue", cex = 2, lwd = 2)
  text(baseMean, log2FoldChange, topGene, pos = 2, col = "dodgerblue")
})

Another useful diagnostic plot is the histogram of the p values (figure below). This plot is best formed by excluding genes with very small counts, which otherwise generate spikes in the histogram.

hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")

7.3 Counts plot with edgeR

As we made a counts plot previously, we can also make this with the top gene from the edgeR analysis:

cpms <- cpm(dge)
topGene <- as.character(tt10$table$gene.id[1])
ord <- order(dge$samples$dex)
col <- as.integer(dge$samples$dex[ord])
par(mar=c(10,4,4,2))
barplot(cpms[topGene,ord],
        col=c("skyblue","orange")[col],
        names.arg=paste(dge$sample$dex[ord],"--",dge$sample$cell[ord]),
        las=2, ylab="counts per million (CPM)")

7.4 MA / Smear plot with edgeR

In edgeR, the MA plot is obtained via the plotSmear function.

plotSmear(qlf, de.tags = tt$table$gene.id)

7.5 Gene clustering

In the sample distance heatmap made previously, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, for demonstration, let us select the 20 genes with the highest variance across samples. We will work with the VST data:

library("genefilter")
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)

The heatmap becomes more interesting if we do not look at absolute expression strength but rather at the amount by which each gene deviates in a specific sample from the gene’s average across all samples. Hence, we center each genes’ values across samples, and plot a heatmap (figure below). We provide a data.frame that instructs the pheatmap function how to label the columns.

mat  <- assay(vsd)[ topVarGenes, ]
mat  <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)

7.6 Exporting results

You can easily save the results table in a CSV file that you can then share or load with a spreadsheet program such as Excel. The call to as.data.frame is necessary to convert the DataFrame object (IRanges package) to a data.frame object that can be processed by write.csv. Here, we take just the top 100 genes for demonstration.

resOrderedDF <- as.data.frame(res[order(res$pvalue), ])[1:100, ]
write.csv(resOrderedDF, file = "results.csv")

A more sophisticated way for exporting results the Bioconductor package ReportingTools (Huntley et al. 2013). ReportingTools will automatically generate dynamic HTML documents, including links to external databases using gene identifiers and boxplots summarizing the normalized counts across groups. See the ReportingTools vignettes for full details. The simplest version of creating a dynamic ReportingTools report is performed with the following code:

library("ReportingTools")
htmlRep <- HTMLReport(shortName = "report", title = "My report",
                      reportDirectory = "./report")
publish(resOrderedDF %>% tibble::rownames_to_column("gene"), htmlRep)
url <- finish(htmlRep)
browseURL(url)

7.7 Exploring results interactively

The iSEE package enables interactive exploration of any data stored in a SummarizedExperiment container. Since the DESeqDataSet class directly extends SummarizedExperiment, it can also be used as input to iSEE. To open an iSEE instance, we simply type

library("iSEE")
if (interactive()) {
  iSEE(dds)
}

We can also add additional information, such as the results of the statistical test, to the rowData of the object before providing it to iSEE.

stopifnot(all(rownames(dds) == rownames(res)))
res$neglog10pvalue <- -log10(res$pvalue)
rowData(dds)$res <- res
if (interactive()) {
  iSEE(dds)
}

8 Removing hidden batch effects

Suppose we did not know that there were different cell lines involved in the experiment, only that there was treatment with dexamethasone. The cell line effect on the counts then would represent some hidden and unwanted variation that might be affecting many or all of the genes in the dataset. We can use statistical methods designed for RNA-seq from the sva package (Leek 2014) or the RUVSeq package (Risso et al. 2014) in Bioconductor to detect such groupings of the samples, and then we can add these to the DESeqDataSet design, in order to account for them.

The SVA package uses the term surrogate variables for the estimated variables that we want to account for in our analysis, while the RUV package uses the terms factors of unwanted variation with the acronym “Remove Unwanted Variation” explaining the package title. We first use SVA to find hidden batch effects and then RUV following.

8.1 Using SVA with DESeq2

library("sva")

Below we obtain a matrix of normalized counts for which the average count across samples is larger than 1. As we described above, we are trying to recover any hidden batch effects, supposing that we do not know the cell line information. So we use a full model matrix with the dex variable, and a reduced, or null, model matrix with only an intercept term. Finally we specify that we want to estimate 2 surrogate variables. For more information read the manual page for the svaseq function by typing ?svaseq.

dat  <- counts(dds, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
mod  <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~   1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
svseq$sv

Because we actually do know the cell lines, we can see how well the SVA method did at recovering these variables (figure below).

par(mfrow = c(2, 1), mar = c(3,5,3,1))
for (i in 1:2) {
  stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))
  abline(h = 0)
 }

Finally, in order to use SVA to remove any effect on the counts from our surrogate variables, we simply add these two surrogate variables as columns to the DESeqDataSet and then add them to the design:

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex

We could then produce results controlling for surrogate variables by running DESeq with the new design.

8.2 Using RUV with DESeq2

We can also use the RUV method in the RUVSeq package to detect the hidden batch effects.

library("RUVSeq")

We can use the RUVg function to estimate factors of unwanted variation, analogous to SVA’s surrogate variables. A difference compared to the SVA procedure above, is that we first would run DESeq and results to obtain the p-values for the analysis without knowing about the batches, e.g. just ~ dex. Supposing that we have this results table res, we then pull out a set of empirical control genes by looking at the genes that do not have a small p-value.

set <- newSeqExpressionSet(counts(dds))
idx  <- rowSums(counts(set) > 5) >= 2
set  <- set[idx, ]
set <- betweenLaneNormalization(set, which="upper")
not.sig <- rownames(res)[which(res$pvalue > .1)]
empirical <- rownames(set)[ rownames(set) %in% not.sig ]
set <- RUVg(set, empirical, k=2)
pData(set)

We can plot the factors estimated by RUV:

par(mfrow = c(2, 1), mar = c(3,5,3,1))
for (i in 1:2) {
  stripchart(pData(set)[, i] ~ dds$cell, vertical = TRUE, main = paste0("W", i))
  abline(h = 0)
 }

As before, if we wanted to control for these factors, we simply add them to the DESeqDataSet and to the design:

ddsruv <- dds
ddsruv$W1 <- set$W_1
ddsruv$W2 <- set$W_2
design(ddsruv) <- ~ W1 + W2 + dex

We would then run DESeq with the new design to re-estimate the parameters and results.

9 Session information

As the last part of this document, we call the function sessionInfo, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record of this as it will help to track down what has happened in case an R script ceases to work or gives different results because the functions have been changed in a newer version of one of your packages. By including it at the bottom of a script, your reports will become more reproducible.

The session information should also always be included in any emails to the Bioconductor support site along with all code used in the analysis.

devtools::session_info()

References

Anders, Simon, and Wolfgang Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biology 11 (10). BioMed Central Ltd: R106+. https://doi.org/10.1186/gb-2010-11-10-r106.

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal RNA-Seq Quantification.” Nat. Biotechnol.

Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. “Nextflow Enables Reproducible Computational Workflows.” Nature Biotechnology 35 (4). Nature Research: 316–19.

Dudoit, Rine, Yee H. Yang, Matthew J. Callow, and Terence P. Speed. 2002. “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.” In Statistica Sinica, 111–39. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.9702.

Hardcastle, Thomas, and Krystyna Kelly. 2010. “baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data.” BMC Bioinformatics 11 (1): 422+. https://doi.org/10.1186/1471-2105-11-422.

Himes, Blanca E., Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M. Whitaker, et al. 2014. “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.” PloS One 9 (6). https://doi.org/10.1371/journal.pone.0099625.

Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2). Nature Publishing Group: 115–21. https://doi.org/10.1038/nmeth.3252.

Huntley, Melanie A., Jessica L. Larson, Christina Chaivorapol, Gabriel Becker, Michael Lawrence, Jason A. Hackney, and Joshua S. Kaminker. 2013. “ReportingTools: an automated result processing and presentation toolkit for high-throughput genomic analyses.” Bioinformatics 29 (24). Oxford University Press: 3220–1. https://doi.org/10.1093/bioinformatics/btt551.

Ignatiadis, Nikolaos, Bernd Klaus, Judith Zaugg, and Wolfgang Huber. 2016. “Data-Driven Hypothesis Weighting Increases Detection Power in Genome-Scale Multiple Testing.” Nature Methods. https://doi.org/10.1038/nmeth.3885.

Köster, Johannes, and Sven Rahmann. 2012. “Snakemake—a Scalable Bioinformatics Workflow Engine.” Bioinformatics 28 (19). Oxford University Press: 2520–2.

Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2). BioMed Central Ltd: R29+. https://doi.org/10.1186/gb-2014-15-2-r29.

Leek, Jeffrey T. 2014. “svaseq: removing batch effects and other unwanted noise from sequencing data.” Nucleic Acids Research 42 (21). Oxford University Press: 000. https://doi.org/10.1093/nar/gku864.

Leng, N., J. A. Dawson, J. A. Thomson, V. Ruotti, A. I. Rissman, B. M. G. Smits, J. D. Haag, M. N. Gould, R. M. Stewart, and C. Kendziorski. 2013. “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29 (8). Oxford University Press: 1035–43. https://doi.org/10.1093/bioinformatics/btt087.

Li, Bo, and Colin N. Dewey. 2011. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC Bioinformatics 12: 323+. https://doi.org/10.1186/1471-2105-12-3231.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12). BioMed Central Ltd: 550+. https://doi.org/10.1186/s13059-014-0550-8.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14: 417–19.

Patro, Rob, Stephen M. Mount, and Carl Kingsford. 2014. “Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.” Nature Biotechnology 32: 462–64. https://doi.org/10.1038/nbt.2862.

Rainer, Johannes, Laurent Gatto, and Christian X Weichenberger. 2019. “ensembldb: an R package to create and use Ensembl-based annotation resources.” Bioinformatics 14 (January): 925. https://doi.org/10.1093/bioinformatics/btz031.

Risso, Davide, John Ngai, Terence P. Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-seq data using factor analysis of control genes or samples.” Nature Biotechnology 32 (9). Nature Publishing Group: 896–902. https://doi.org/10.1038/nbt.2931.

Robert, Christelle, and Mick Watson. 2015. “Errors in RNA-Seq quantification affect genes of relevance to human disease.” Genome Biology. https://doi.org/10.1186/s13059-015-0734-x.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2009. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1). Oxford University Press: 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Schurch, Nicholas J, Pietá Schofield, Marek Gierliński, Christian Cole, Alexander Sherstnev, Vijender Singh, Nicola Wrobel, et al. 2016. “How Many Biological Replicates Are Needed in an RNA-seq Experiment and Which Differential Expression Tool Should You Use?” RNA 22 (6): 839–51.

Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (1521). https://doi.org/10.12688/f1000research.7563.1.

Trapnell, Cole, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. 2013. “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology. https://doi.org/10.1038/nbt.2450.

Wickham, Hadley. 2009. ggplot2. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-98141-3.

Witten, Daniela M. 2011. “Classification and clustering of sequencing data using a Poisson model.” The Annals of Applied Statistics 5 (4): 2493–2518. https://doi.org/10.1214/11-AOAS493.

Wu, Hao, Chi Wang, and Zhijin Wu. 2013. “A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data.” Biostatistics 14 (2). Oxford University Press: 232–43. https://doi.org/10.1093/biostatistics/kxs033.