1 Overview

Similar as with differential gene expression, we need to make sure that observed differences of exon usage values between conditions are statistically significant, i.e., are sufficiently unlikely to be just due to random fluctuations such as those seen even between samples from the same condition, i.e., between replicates. To this end, DEXSeq assesses the strength of these fluctuations (quantified by the so-called dispersion) by comparing replicates before comparing the averages between the sample groups.

The preceding description is somewhat simplified (and perhaps over-simplified), and we recommend that users consult the publication for a more complete description, as well as Appendix 9.6 of this vignette, which describes how the current implementation of DEXSeq differs from the original approach described in the paper. Nevertheless, two important aspects should be mentioned already here: First, DEXSeq does not actually work on the exon usage ratios, but on the counts in the numerator and denominator, to be able to make use of the information that is contained in the magnitude of count values. For example, 3000 reads versus 1000 reads is the same ratio as 3 reads versus 1 read, but the latter is a far less reliable estimate of the underlying true value, because of statistical sampling. Second, DEXSeq is not limited to simple two-group comparisons; rather, it uses so-called generalized linear models (GLMs) to permit ANOVA-like analysis of potentially complex experimental designs.

2 Preparations

2.1 Example data

To demonstrate the use of DEXSeq, we use the pasilla dataset, an RNA-Seq dataset generated by (Brooks et al. 2011). They investigated the effect of siRNA knock-down of the gene pasilla on the transcriptome of fly S2-DRSC cells. The RNA-binding protein pasilla protein is thought to be involved in the regulation of splicing. (Its mammalian orthologs, NOVA1 and NOVA2, are well-studied examples of splicing factors.) Brooks et al. prepared seven cell cultures, treated three with siRNA to knock-down pasilla and left four as untreated controls, and performed RNA-Seq on all samples. They deposited the raw sequencing reads with the NCBI Gene Expression Omnibus (GEO) under the accession number GSE18508.

2.2 Executability of the code

Usually, Bioconductor vignettes contain automatically executable code, i.e., you can follow the vignette by directly running the code shown, using functionality and data provided with the package. However, it would not be practical to include the voluminous raw data of the pasilla experiment here. Therefore, the code in the alignment section is not automatically executable. You may download the raw data yourself from GEO, as well as the required extra tools, and follow the work flow shown here and in the pasilla vignette (Reyes 2013). From Section 3 on, code is directly executable, as usual. Therefore, we recommend that you just read this section, and try following our analysis in R only from the next section onwards. Once you work with your own data, you will want to come back and adapt the workflow shown here to your data.

2.3 Alignment

The first step of the analysis is to align the reads to a reference genome. If you are using classic alignment methods (i.e. not pseudo-alignment approaches) it is important to align them to the genome, not to the transcriptome, and to use a splice-aware aligner (i.e., a short-read alignment tool that can deal with reads that span across introns) such as TopHat2 (Kim et al. 2013), GSNAP (Wu and Nacu 2010), or STAR (Dobin et al. 2013). The explanation of the analysis workflow presented here starts with the aligned reads in the SAM format. If you are unfamiliar with the process of aligning reads to obtain SAM files, you can find a summary how we proceeded in preparing the pasilla data in the vignette for the pasilla data package (Reyes 2013) and a more extensive explanation, using the same data set, in our protocol article on differential expression calling (Anders et al. 2013).

2.4 Preparing the annotation

In a transcript annotations such as a GTF file, many exons appear multiple times, once for each transcript that contains them. We need to “collapse” or “flatten” this information to define exon counting bins, i.e., a list of intervals, each corresponding to one exon or part of an exon. Counting bins for parts of exons arise when an exonic region appears with different boundaries in different transcripts. See Figure 1 of the DEXSeq paper (Anders, Reyes, and Huber (2012)) for an illustration.

Here we use the function exonicParts() of the package GenomicFeatures to define exonic counting bins. First, we download a annotation file from ENSEMBL in GTF format, create an txdb object and then run the exonicParts() function.

txdb = makeTxDbFromGFF("Drosophila_melanogaster.BDGP5.25.62.gtf.gz")

## [1] TRUE
flattenedAnnotation = exonicParts( txdb, linked.to.single.gene.only=TRUE )
names(flattenedAnnotation) =
    sprintf("%s:E%0.3d", flattenedAnnotation$gene_id, flattenedAnnotation$exonic_part)

Note that the code above first converted the annotation GTF file into a transcriptDb object, but transcripts could also start from txdb objects available as Bioconductor packages or from objects available through AnnotationHub.

2.5 Counting reads

For each BAM file, we count the number of reads that overlap with each of the exon counting bin defined in the previous section. The read counting step is performed by the summarizeOverlaps() function of the GenomicAlignments package. To demonstrate this step, we will use a subset of the pasilla BAM files that are available through the pasillaBamSubset package.

bamFiles = c( untreated1_chr4(), untreated3_chr4() )
bamFiles = BamFileList( bamFiles )
seqlevelsStyle(flattenedAnnotation) = "UCSC"
se = summarizeOverlaps(
    flattenedAnnotation, BamFileList(bamFiles), singleEnd=FALSE,
    fragments=TRUE, ignore.strand=TRUE )

All singleEnd=, fragments=, ignore.strand= and strandMode= are important parameters to tune and are dataset specific. Two alternative ways to do the read counting process are the HTSeq python scripts provided with DEXSeq and the Rsubread package. Both of these approaches are described in the appendix section of this vignette.

2.6 Building a DEXSeqDataSet

We can use the function DEXSeqDataSetFromSE() to build an DEXSeqDataSet object. In a well-designed experiment, we would specify the conditions of the experiment in the colData() slot of the object. Since we have only two samples in our example, we just exemplify this step by specify the sample names as conditions.

colData(se)$condition = factor(c("untreated1", "untreated3"))
colData(se)$libType = factor(c("single-end", "paired-end"))
dxd = DEXSeqDataSetFromSE( se, design= ~ sample + exon + condition:exon )

3 Standard analysis workflow

3.1 Loading and inspecting the example data

To demonstrate the rest of the work flow we will use a DEXSeqDataSet that contains a subset of the data with only a few genes. This example object is available through the pasilla package.

data(pasillaDEXSeqDataSet, package="pasilla")

The DEXSeqDataSet class is derived from the DESeqDataSet. As such, it contains the usual accessor functions for the column data, row data, and some specific ones. The core data in an DEXSeqDataSet object are the counts per exon. Each row of the DEXSeqDataSet contains in each column the count data from a given exon (‘this’) as well as the count data from the sum of the other exons belonging to the same gene (‘others’). This annotation, as well as all the information regarding each column of the DEXSeqDataSet, is specified in the colData().

## DataFrame with 14 rows and 4 columns
##           sample condition        type     exon
##         <factor>  <factor>    <factor> <factor>
## 1   treated1fb   treated   single-read     this
## 2   treated2fb   treated   paired-end      this
## 3   treated3fb   treated   paired-end      this
## 4   untreated1fb untreated single-read     this
## 5   untreated2fb untreated single-read     this
## ...          ...       ...         ...      ...
## 10  treated3fb   treated   paired-end    others
## 11  untreated1fb untreated single-read   others
## 12  untreated2fb untreated single-read   others
## 13  untreated3fb untreated paired-end    others
## 14  untreated4fb untreated paired-end    others

We can access the first 5 rows from the count data by doing,

head( counts(dxd), 5 )
##                  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## FBgn0000256:E001   92   28   43   54  131   51   49 1390  829   923  1115  2495
## FBgn0000256:E002  124   80   91   76  224   82   95 1358  777   875  1093  2402
## FBgn0000256:E003  340  241  262  347  670  260  297 1142  616   704   822  1956
## FBgn0000256:E004  250  189  201  219  507  242  250 1232  668   765   950  2119
## FBgn0000256:E005   96   38   39   71   76   57   62 1386  819   927  1098  2550
##                  [,13] [,14]
## FBgn0000256:E001  1054  1052
## FBgn0000256:E002  1023  1006
## FBgn0000256:E003   845   804
## FBgn0000256:E004   863   851
## FBgn0000256:E005  1048  1039

Note that the number of columns is 14, the first seven (we have seven samples) corresponding to the number of reads mapping to out exonic regions and the last seven correspond to the sum of the counts mapping to the rest of the exons from the same gene on each sample.

split( seq_len(ncol(dxd)), colData(dxd)$exon )
## $this
## [1] 1 2 3 4 5 6 7
## $others
## [1]  8  9 10 11 12 13 14

We can also access only the first five rows from the count belonging to the exonic regions (‘this’) (without showing the sum of counts from the rest of the exons from the same gene) by doing,

head( featureCounts(dxd), 5 )
##                  treated1fb treated2fb treated3fb untreated1fb untreated2fb
## FBgn0000256:E001         92         28         43           54          131
## FBgn0000256:E002        124         80         91           76          224
## FBgn0000256:E003        340        241        262          347          670
## FBgn0000256:E004        250        189        201          219          507
## FBgn0000256:E005         96         38         39           71           76
##                  untreated3fb untreated4fb
## FBgn0000256:E001           51           49
## FBgn0000256:E002           82           95
## FBgn0000256:E003          260          297
## FBgn0000256:E004          242          250
## FBgn0000256:E005           57           62

In both cases, the rows are labelled with gene IDs (here Flybase IDs), followed by a colon and the counting bin number. Since a counting bin corresponds to an exon or part of an exon, this ID is called the feature ID or exon ID within DEXSeq. The table content indicates the number of reads that have been mapped to each counting bin in the respective sample.

To see details on the counting bins, we also print the first 3 lines of the feature data annotation:

head( rowRanges(dxd), 3 )
## GRanges object with 3 ranges and 5 metadata columns:
##                    seqnames          ranges strand |   featureID     groupID
##                       <Rle>       <IRanges>  <Rle> | <character> <character>
##   FBgn0000256:E001    chr2L 3872658-3872947      - |        E001 FBgn0000256
##   FBgn0000256:E002    chr2L 3873019-3873322      - |        E002 FBgn0000256
##   FBgn0000256:E003    chr2L 3873385-3874395      - |        E003 FBgn0000256
##                    exonBaseMean exonBaseVar
##                       <numeric>   <numeric>
##   FBgn0000256:E001       64.000     1250.67
##   FBgn0000256:E002      110.286     2769.57
##   FBgn0000256:E003      345.286    22147.90
##                                                transcripts
##                                                     <list>
##   FBgn0000256:E001 FBtr0077511,FBtr0077513,FBtr0077512,...
##   FBgn0000256:E002 FBtr0077511,FBtr0077513,FBtr0077512,...
##   FBgn0000256:E003 FBtr0077511,FBtr0077513,FBtr0077512,...
##   -------
##   seqinfo: 14 sequences from an unspecified genome; no seqlengths

So far, this table contains information on the annotation data, such as gene and exon IDs, genomic coordinates of the exon, and the list of transcripts that contain an exon.

The accessor function annotationData() shows the design table with the sample annotation (which was passed as the second argument to DEXSeqDataSetFromHTSeq()):

sampleAnnotation( dxd )
## DataFrame with 7 rows and 3 columns
##         sample condition        type
##       <factor>  <factor>    <factor>
## 1 treated1fb   treated   single-read
## 2 treated2fb   treated   paired-end 
## 3 treated3fb   treated   paired-end 
## 4 untreated1fb untreated single-read
## 5 untreated2fb untreated single-read
## 6 untreated3fb untreated paired-end 
## 7 untreated4fb untreated paired-end

In the following sections, we will update the object by calling a number of analysis functions, always using the idiom dxd = someFunction( dxd ), which takes the dxd object, fills in the results of the performed computation and writes the returned and updated object back into the variable dxd.

3.2 Normalisation

DEXSeq uses the same method as DESeq and DESeq2, which is provided in the function estimateSizeFactors().

dxd = estimateSizeFactors( dxd )

3.3 Dispersion estimation

To test for differential exon usage, we need to estimate the variability of the data. This is necessary to be able to distinguish technical and biological variation (noise) from real effects on exon usage due to the different conditions. The information on the strength of the noise is inferred from the biological replicates in the data set and characterized by the so-called dispersion. In RNA-Seq experiments the number of replicates is typically too small to reliably estimate variance or dispersion parameters individually exon by exon, and therefore, variance information is shared across exons and genes, in an intensity-dependent manner.

In this section, we discuss simple one-way designs: In this setting, samples with the same experimental condition, as indicated in the condition factor of the design table (see above), are considered as replicates – and therefore, the design table needs to contain a column with the name condition. In Section 5, we discuss how to treat more complicated experimental designs which are not accommodated by a single condition factor.

To estimate the dispersion estimates, DEXSeq uses the approach of the package DESeq2. Internally, the functions from DEXSeq are called, adapting the parameters of the functions for the specific case of the DEXSeq model. Briefly, per-exon dispersions are calculated using a Cox-Reid adjusted profile likelihood estimation, then a dispersion-mean relation is fitted to this individual dispersion values andfinally, the fitted values are taken as a prior in order to shrink the per-exon estimates towards the fitted values. See the DESeq2 paper for the rational behind the shrinkage approach (Love, Huber, and Anders (2014)).

dxd = estimateDispersions( dxd )

As a shrinkage diagnostic, the DEXSeqDataSet inherits the method plotDispEsts() that plots the per-exon dispersion estimates versus the mean normalised count, the resulting fitted valuesand the a posteriori (shrinked) dispersion estimates (Figure 1).

plotDispEsts( dxd )
Fit Diagnostics. The initial per-exon dispersion estimates (shown by black points), the fitted mean-dispersion values function (red line), and the shrinked values in blue

Figure 1: Fit Diagnostics
The initial per-exon dispersion estimates (shown by black points), the fitted mean-dispersion values function (red line), and the shrinked values in blue

4 Testing for differential exon usage

Having the dispersion estimates and the size factors, we can now test for differential exon usage. For each gene, DEXSeq fits a generalized linear model with the formula ~sample + exon + condition:exon and compare it to the smaller model (the null model) ~ sample + exon.

In these formulae (which use the standard notation for linear model formulae in R; consult a text book on R if you are unfamiliar with it), sample is a factor with different levels for each sample, condition is the factor of experimental conditions that we defined when constructing the DEXSeqDataSet object at the beginning of the analysis, and exon is a factor with two levels, this and others, that were specified when we generated our DEXSeqDataSet object. The two models described by these formulae are fit for each counting bin, where the data supplied to the fit comprise two read count values for each sample, corresponding to the two levels of the exon factor: the number of reads mapping to the bin in question (level this), and the sum of the read counts from all other bins of the same gene (level others). Note that this approach differs from the approach described in the paper (Anders, Reyes, and Huber (2012)) and used in older versions of DEXSeq; see Appendix 9.6 for further discussion.

We have an interaction term condition:exon, but denote no main effect for condition. Note, however, that all observations from the same sample are also from the same condition, i.e., the condition main effects are absorbed in the sample main effects, because the sample factor is nested within the condition factor.

The deviances of both fits are compared using a \(\chi^2\)-distribution, providing a \(p\) value. Based on this \(p\)-value, we can decide whether the null model is sufficient to explain the data, or whether it may be rejected in favour of the alternative model, which contains an interaction coefficient for condition:exon. The latter means that the fraction of the gene’s reads that fall onto the exon under the test differs significantly between the experimental conditions.

The function testForDEU() performs these tests for each exon in each gene.

dxd = testForDEU( dxd )

The resulting DEXSeqDataSet object contains slots with information regarding the test.

For some uses, we may also want to estimate relative exon usage fold changes. To this end, we call estimateExonFoldChanges(). Exon usage fold changes are calculated based on the coefficients of a GLM fit that uses the formula

count ~ condition + exon + condition:exon

where condition can be replaced with any of the column names of colData() (see man pages for details). The resulting coefficients allow the estimation of changes on the usage of exons across different conditions. Note that the differences on exon usage are distinguished from gene expression differences across conditions. For example, consider the case of a gene that is differentially expressed between conditions and has one exon that is differentially used between conditions. From the coefficients of the fitted model, it is possible to distinguish overall gene expression effects, that alter the counts from all the exons, from exon usage effects, that are captured by the interaction term condition:exon and that affect the each exon individually.

dxd = estimateExonFoldChanges( dxd, fitExpToVar="condition")

So far in the pipeline, the intermediate and final results have been stored in the meta data of a DEXSeqDataSet object, they can be accessed using the function mcols(). In order to summarize the results without showing the values from intermediate steps, we call the function DEXSeqResults(). The result is a DEXSeqResults object, which is a subclass of a DataFrame object.

dxr1 = DEXSeqResults( dxd )
## LRT p-value: full vs reduced
## DataFrame with 498 rows and 13 columns
##                      groupID   featureID exonBaseMean dispersion        stat
##                  <character> <character>    <numeric>  <numeric>   <numeric>
## FBgn0000256:E001 FBgn0000256        E001      58.3431 0.01719172 8.23523e-06
## FBgn0000256:E002 FBgn0000256        E002     103.3328 0.00734737 1.57577e+00
## FBgn0000256:E003 FBgn0000256        E003     326.4763 0.01048065 3.57358e-02
## FBgn0000256:E004 FBgn0000256        E004     253.6548 0.01100420 1.69541e-01
## FBgn0000256:E005 FBgn0000256        E005      60.6377 0.04402967 2.91053e-02
## ...                      ...         ...          ...        ...         ...
## FBgn0261573:E012 FBgn0261573        E012     23.08422  0.0219840    8.400785
## FBgn0261573:E013 FBgn0261573        E013      9.79723  0.2475512    1.155232
## FBgn0261573:E014 FBgn0261573        E014     87.49678  0.0330407    1.118567
## FBgn0261573:E015 FBgn0261573        E015    268.25566  0.0120126    2.598309
## FBgn0261573:E016 FBgn0261573        E016    304.15134  0.0998526    0.146584
##                      pvalue      padj   treated untreated
##                   <numeric> <numeric> <numeric> <numeric>
## FBgn0000256:E001   0.997710         1   10.8820   11.0275
## FBgn0000256:E002   0.209370         1   13.5722   13.2500
## FBgn0000256:E003   0.850062         1   18.6029   18.9103
## FBgn0000256:E004   0.680520         1   17.2692   17.7029
## FBgn0000256:E005   0.864536         1   11.1006   10.9937
## ...                     ...       ...       ...       ...
## FBgn0261573:E012 0.00375059  0.097736   8.47412   6.61692
## FBgn0261573:E013 0.28245649  1.000000   3.79997   5.97080
## FBgn0261573:E014 0.29022728  1.000000  11.94851  13.19130
## FBgn0261573:E015 0.10697781  0.965623  17.26506  18.42453
## FBgn0261573:E016 0.70182189  1.000000  18.14410  18.91295
##                  log2fold_untreated_treated              genomicData
##                                   <numeric>                <GRanges>
## FBgn0000256:E001                  0.0513596  chr2L:3872658-3872947:-
## FBgn0000256:E002                 -0.1036538  chr2L:3873019-3873322:-
## FBgn0000256:E003                  0.0895553  chr2L:3873385-3874395:-
## FBgn0000256:E004                  0.1283218  chr2L:3874450-3875302:-
## FBgn0000256:E005                 -0.0375699  chr2L:3878895-3879067:-
## ...                                     ...                      ...
## FBgn0261573:E012                  -0.832683 chrX:19421654-19421867:+
## FBgn0261573:E013                   1.395559 chrX:19422668-19422673:+
## FBgn0261573:E014                   0.410997 chrX:19422674-19422856:+
## FBgn0261573:E015                   0.341489 chrX:19422927-19423634:+
## FBgn0261573:E016                   0.224594 chrX:19423707-19424937:+
##                        countData                             transcripts
##                         <matrix>                                  <list>
## FBgn0000256:E00