# Contents

R version: R version 4.2.1 (2022-06-23)

Bioconductor version: 3.16

Package: 1.19.0

Publication DOI: 10.12688/f1000research.15398.3

# 1 Getting help

If you have questions about this workflow or any Bioconductor software, please post these to the Bioconductor support site. If the questions concern a specific package, you can tag the post with the name of the package, or for general questions about the workflow, tag the post with rnaseqdtu. Note the posting guide for crafting an optimal question for the support site.

# 2 Introduction

RNA-seq experiments can be analyzed to detect differences across groups of samples in total gene expression – the total expression produced by all isoforms of a gene – and additionally differences in transcript isoform usage within a gene. If the amount of expression switches among two or more isoforms of a gene, then the total gene expression may not change by a detectable amount, but the differential transcript usage (DTU) is nevertheless biologically relevant. DTU is common when comparing expression across cell types: recent analysis of the Genotype-Tissue Expression Project (GTEx) (GTEx Consortium et al. 2017) dataset demonstrated that half of all expressed genes contained tissue-specific isoforms (Reyes and Huber 2018). DTU may produce functionally different gene products through alternative splicing and changes to the coding sequence of the transcript, and DTU may result in transcripts with different untranslated regions on the 5’ or 3’ end of the transcript, which may affect binding sites of RNA binding proteins. (Reyes and Huber 2018) found that alternative usage of transcription start and termination sites was a more common event than alternative splicing when examining DTU events across tissues in GTEx. Specific patterns of DTU have been identified in a number of diseases, including cancer, retinal diseases, and neurological disorders, among others (Scotti and Swanson 2018). Large-scale analyses of cancer transcriptomic data, comparing tumor to normal samples, have identified that protein domain losses are a common feature of DTU in cancer, including domains involved in protein-protein interactions (Vitting-Seerup and Sandelin 2017; Climente-González et al. 2017).

While many tutorials and workflows in the Bioconductor project address differential gene expression, there are fewer workflows for performing a differential transcript usage analysis, which provides critical and complementary information to a gene-level analysis. Some of the existing Bioconductor packages and functions that can be used for statistical analysis of DTU include DEXSeq (originally designed for differential exon usage) (Anders, Reyes, and Huber 2012), diffSpliceDGE from the edgeR package (Robinson, McCarthy, and Smyth 2009; McCarthy, Chen, and Smyth 2012), diffSplice from the limma package (Smyth 2004; Law et al. 2014), and DRIMSeq (Nowicka and Robinson 2016). Other Bioconductor packages which are designed around a DTU analysis include stageR (Van den Berge et al. 2017), SGSeq (Goldstein et al. 2016), and IsoformSwitchAnalyzeR (Vitting-Seerup and Sandelin 2018). stageR can be used for post-processing of transcript- and gene-level p-values from DTU detection methods, and will be discussed in this workflow. SGSeq can be used to visualize splice events, and leverages DEXSeq or limma for differential testing of splice variant usage. The Bioconductor package IsoformSwitchAnalyzeR is well documented and the vignette available from the IsoformSwitchAnalyzeR landing page can be seen as an alternative to this workflow. IsoformSwitchAnalyzeR is designed for the downstream analysis of functional consequences of identified isoform switches. It allows for import of data from various quantification methods, including Salmon, and allows for statistical inference using DRIMSeq. In addition, IsoformSwitchAnalyzeR includes functions for obtaining the nucleotide and amino acid sequence consequences of isoform switching, which is not covered in this workflow. Other packages related to splicing can be found at the BiocViews links for DifferentialSplicing and AlternativeSplicing. For more information about the Bioconductor project and its core infrastructure, please refer to the overview by Huber et al. (2015).

We note that there are numerous other methods for detecting differential transcript usage outside of the Bioconductor project. The DRIMSeq publication is a good reference for these, having descriptions and comparisons with many current methods (Nowicka and Robinson 2016). This workflow will build on the methods and vignettes from three Bioconductor packages: DRIMSeq, DEXSeq, and stageR.

Previously, some of the developers of the Bioconductor packages edgeR and DESeq2 have collaborated to develop the tximport package (Soneson, Love, and Robinson 2016) for summarizing the output of fast transcript-level quantifiers, such as Salmon (Patro et al. 2017), Sailfish (Patro, Mount, and Kingsford 2014), and kallisto (Bray et al. 2016). The tximport package focuses on preparing estimated transcript-level counts, abundances and effective transcript lengths, for gene-level statistical analysis using edgeR (Robinson, McCarthy, and Smyth 2009), DESeq2 (Love, Huber, and Anders 2014), or limma-voom (Law et al. 2014). tximport produces an offset matrix to accompany gene-level counts, that accounts for a number of RNA-seq biases as well as differences in transcript usage among transcripts of different length that would bias an estimator of gene fold change based on the gene-level counts (Trapnell et al. 2013). tximport can alternatively produce a matrix of data that is roughly on the scale of counts, by scaling transcript-per-million (TPM) abundances to add up to the total number of mapped reads. This counts-from-abundance approach directly corrects for technical biases and differential transcript usage across samples, obviating the need for the accompanying offset matrix.

Complementary to an analysis of differential gene expression, one can use tximport to import transcript-level estimated counts, and then pass these counts to packages such as DRIMSeq or DEXSeq for statistical analysis of differential transcript usage. Following a transcript-level analysis, one can aggregate evidence of differential transcript usage to the gene level. The stageR package in Bioconductor provides a statistical framework to screen at the gene-level for differential transcript usage with gene-level adjusted p-values, followed by confirmation of which transcripts within the significant genes show differential usage with transcript-level adjusted p-values (Van den Berge et al. 2017). The method controls the overall false discovery rate (OFDR) (Heller et al. 2009) for such a two-stage procedure, which will be discussed in more detail later in the workflow. We believe that stageR represents a principled approach to analyzing transcript usage changes, as the methods can be evaluated against a target error rate in a manner that mimics how the methods will be used in practice. That is, following rejection of the null hypothesis at the gene-level, investigators would likely desire to know which transcripts within a gene participated in the differential usage.

Here we provide a basic workflow for detecting differential transcript usage using Bioconductor packages, following quantification of transcript abundance using the Salmon method (Figure 1). This workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing stage-wise testing of differential transcript usage using the stageR package. For the workflow, we use data that is simulated, so that we can also evaluate the performance of methods for differential transcript usage, as well as differential gene and transcript expression. The simulation was constructed using distributional parameters estimated from the GEUVADIS project RNA-seq dataset (Lappalainen et al. 2013) quantified by the recount2 project (Collado-Torres et al. 2017), including the expression levels of the transcripts, the amount of biological variability of gene expression levels across samples, and realistic coverage of reads along the transcript.

Note that a published version of this workflow exists in the Bioconductor channel of the journal F1000Research. The published version additionally contains evaluations of the methods shown here alongside other popular methods for DTU, DGE, and DTE.

# 3 Methods

## 3.1 Simulation

First we describe details of the simulated data, which will be used in the following workflow. Understanding the details of the simulation will be useful for assessing the methods in the later sections. All of the code used to simulate RNA-seq experiments and write paired-end reads to FASTQ files can be found at an associated GitHub repository for the simulation code (Love 2018a), and the reads and quantification files can be downloaded from Zenodo (Love 2018b). Salmon (Patro et al. 2017) was used to estimate transcript-level abundances for a single sample (ERR188297) of the GEUVADIS project (Lappalainen et al. 2013), and this was used as a baseline for transcript abundances in the simulation. Transcripts that were associated with estimated counts less than 10 had abundance thresholded to 0, all other transcripts were considered “expressed” (n=46,933). alpine (Love, Hogenesch, and Irizarry 2016) was used to estimate realistic fragment GC bias from 12 samples from the GEUVADIS project, all from the same sequencing center (the first 12 samples from CNAG-CRG in Supplementary Table 2 from Love, Hogenesch, and Irizarry (2016)). DESeq2 (Love, Huber, and Anders 2014) was used to estimate mean and dispersion parameters for a Negative Binomial distribution for gene-level counts for 458 GEUVADIS samples provided by the recount2 project (Collado-Torres et al. 2017). Note that, while gene-level dispersion estimates were used to generate underlying transcript-level counts, additional uncertainty on the transcript-level data is a natural consequence of the simulation, as the transcript-level counts must be estimated (the underlying transcript counts are not provided to the methods).

polyester (Frazee et al. 2015) was used to simulate paired-end RNA-seq reads for two groups of 12 samples each, with realistic fragment GC bias, and with dispersion on transcript-level counts drawn from the joint distribution of mean and dispersion values estimated from the GEUVADIS samples. To compare DRIMSeq and DEXSeq in further detail, we generated an additional simulation in which dispersion parameters were assigned to genes via matching on the gene-level count, and then all transcripts of a gene had counts generated using the same per-gene dispersion. The first sample for group 1 and the first sample for group 2 followed the realistic GC bias profile of the same GEUVADIS sample, and so on for all 12 samples. This pairing of the samples was used to generate balanced data, but not used in the statistical analysis. countsimQC (Soneson and Robinson 2017) was used to examine the properties of the simulation relative to the dataset used for parameter estimation, and the full report can be accessed at the associated GitHub repository for simulation code (Love 2018a).

Differential expression across two groups was generated as follows: The 46,933 expressed transcripts as defined above belonged to 15,017 genes. 70% (n=10,514) of these genes with expressed transcripts were set as null genes, where abundance was not changed across the two groups. For 10% (n=1,501) of genes, all isoforms were differentially expressed at a log fold change between 1 and 2.58 (fold change between 2 and 6). The set of transcripts in these genes was classified as DGE (differential gene expression) by construction, and the expressed transcripts were also DTE (differential transcript expression), but they did not count as DTU (differential transcript usage), as the proportions within the gene remained constant. To simulate balanced differential expression, one of the two groups was randomly chosen to be the baseline, and the other group would have its counts multiplied by the fold change. For 10% (n=1,501) of genes, a single expressed isoform was differentially expressed at a log fold change between 1 and 2.58. This set of transcripts was DTE by construction. If the chosen transcript was the only expressed isoform of a gene, this counted also as DGE and not as DTU, but if there were other isoforms that were expressed, this counted for both DGE and DTU, as the proportion of expression among the isoforms was affected. For 10% (n=1,501) of genes, differential transcript usage was constructed by exchanging the TPM abundance of two expressed isoforms, or, if only one isoform was expressed, exchanging the abundance of the expressed isoform with a non-expressed one. This counted for DTU and DTE, but not for DGE. An MA plot of the simulated transcript abundances for the two groups is shown in Figure 2.

## 3.2 Operation

This workflow was designed to work with R 3.5 or higher, and the DRIMSeq, DEXSeq, stageR, and tximport packages for Bioconductor version 3.7 or higher. Bioconductor packages should always be installed following the official instructions. The workflow uses a subset of all genes to speed up the analysis, but the Bioconductor packages can easily be run for this dataset on all human genes on a laptop in less than an hour. Timing for the various packages is included within each section.

## 3.3 Quantification and data import

As described in the Introduction, this workflow uses transcript-level quantification estimates produced by Salmon (Patro et al. 2017) and imported into R/Bioconductor with tximport (Soneson, Love, and Robinson 2016). Details about how to run Salmon, and which type of transcript-level estimated counts should be imported, is covered in the Workflow, with the exact code used to run the DTU analysis. Salmon estimates the relative abundance of each annotated transcript for each sample in units of transcripts-per-million (TPM); the estimated TPM values should be proportional to the abundance of the transcripts in the population of cells that were assayed. One critical point is that Salmon only considers the transcripts that are provided in the annotation; it is not able to detect expression of any novel transcripts. If many un-annotated transcripts are expressed in the particular set of samples, successful application of this workflow may require first building out a more representative set of annotated transcripts via transcriptome assembly or full transcript sequencing.

In addition to relative abundance, Salmon estimates the effective length of each transcript, which can take into account a number of sample-specific technical biases including fragment size distribution (default), fragment GC content, random hexamer priming bias, and positional bias along the transcript. If a transcript had certain properties, related to its length and its sequence content, that made it difficult to produce cDNA fragments and sequence reads from these fragments, then its effective length should be lower for that sample, than if these technical biases were absent. The estimates of TPM and the effective lengths can be used to estimate the number of fragments that each transcript produced, which will be called estimated counts in this workflow. Different types of estimated counts may be correlated with effective transcript length, and so we will discuss in the Workflow our recommended type to use for DTU and DGE analysis (also diagrammed in Figure 1).

## 3.4 DTU testing

We focus on two statistical models for DTU testing in this workflow, DRIMSeq (Nowicka and Robinson 2016) and DEXSeq (Anders, Reyes, and Huber 2012). DEXSeq was published first, as a statistical model for detecting differences in exon usage across samples in different conditions. Most of the DEXSeq functions and documentation refer specifically to exons or exonic parts within a gene, while the final results table refers more generally to these as features within a group. It is this more general usage that we will employ in this workflow, substituting estimated transcript counts in place of exonic part counts.

DEXSeq assumes a Negative Binomial (NB) distribution for the feature counts, and considers the counts for each feature (originally, the exonic parts) relative to counts for all other features of the group (the gene), using an interaction term in a generalized linear model (GLM). The GLM framework is an extension of the linear model (LM), but shares with LM the usage of a design matrix, typically represented by $$\mathbf{X}$$, which is made up of columns of covariates that are multiplied by scalar coefficients, typically represented by $$\mathbf{\beta}$$. The design matrix with its multiple coefficients is useful for extending statistical models beyond simple group comparisons, allowing for more complex situations, such as within-patient comparisons, batch correction, or testing of ratios.

DEXSeq models each feature independently in the GLM fitting stage, and so does not take into account any correlation among the counts across features in a group (e.g. transcript counts within a gene), except insofar as such correlation is accounted for by columns in the design matrix. This last point is important, as correlation induced by DTU across condition groups, or even DTU that can be associated with cell-type heterogeneity, can be modeled in the DEXSeq framework by interaction terms with relevant covariates introduced into the design matrix. In the current workflow, we provide an example of capturing DTU across condition using DEXSeq, but future iterations of this workflow may also include controlling for additional covariates, such as estimated cell type proportions. DEXSeq was evaluated on transcript counts by Soneson et al. (2016) and then later by Nowicka and Robinson (2016), where it was shown in both cases that DEXSeq could be used to detect DTU in this configuration, though isoform pre-filtering greatly improved its performance in reducing the observed false discovery rate (FDR) closer to its nominal level.

In contrast to the NB model, DRIMSeq assumes an Dirichlet Multinomial model (DM) for each gene, where the total count for the gene is considered fixed, and the quantity of interest is the proportion for the transcript within a gene for each sample. If the proportion for one transcript increases, it must result in a decrease for the proportions of the other transcripts of the gene. Genes that are detected as having statistically significant DTU are those in which the proportion changes observed across condition were large, considering the variation in proportions seen within condition. The variation in proportions across biological replicates is modeled using a single precision parameter per gene, which will be discussed in the workflow below. DRIMSeq also uses a design matrix, and so can be used to analyze DTU for complex experimental designs, including within-patient comparisons, batch correction, or testing of ratios.

A critical difference between the two models is that DRIMSeq directly models the correlations among transcript counts within a gene via the DM distribution, and so can capture these correlations across exchangeable samples within a condition. DEXSeq with the NB distribution only can capture correlations among transcript counts within a gene when the DTU is across sample groups defined by covariates in the design matrix. On the other hand, DRIMSeq is limited in modeling a single precision parameter per gene. If there are two transcripts within a gene with very different biological variability – perhaps they have separate promoters under different regulatory control – then DEXSeq may do a better job modeling such heterogeneity in the biological variability of transcript expression, as it estimates a separate dispersion parameter for each transcript.