1 Overview

SGSeq provides a framework for analyzing splice events from RNA-seq data. As part of this framework, the software implements data structures that are designed to store information relevant for splice variant analysis. An overview of SGSeq classes and how they are related is shown in Figure 1 and summarized below:

The TxFeatures class stores discrete transcript features (exons and splice junctions) as they are observed in RNA transcripts. These can be extracted from annotation or predicted from aligned RNA-seq reads.
The SGFeatures class stores features defining a splice graph (Heber et al. 2002). The splice graph is a directed acyclic graph with edges corresponding to exonic regions and splice junctions, and nodes corresponding to transcript starts, ends and splice sites. It is directed from the 5$^\prime$ end to the 3$^\prime$ end of a gene.
The SGVariants class stores splice variants. If two nodes in the splice graph are connected by two or more paths and there are no intervening nodes with all paths intersecting, the alternative paths are considered splice variants. Splice variants sharing the same start and end node, together form a splice event.

$Figure 1. Overview of SGSeq data structures. (a) Schematic illustrating transcripts, discrete transcript features, the splice graph, and splice variants. Splice variants are defined in an augmented graph with a unique source and sink node for each gene, which are connected to alternative starts and ends, respectively (dashed lines). (b) SGSeq representation of concepts shown in (a). Classes are shown in bold and outlined, function names are shown in italics. Dashed arrows indicate functions analyzeFeatures() and analyzeVariants(), which wrap multiple analysis steps. SGSeq makes extensive use of Bioconductor infrastructure for genomic ranges (Lawrence et al. 2013): TxFeatures and SGFeatures extend GRanges, SGVariants extends GRangesList. SGFeatureCounts and SGVariantCounts extend SummarizedExperiment and are containers for per-sample counts (or other expression values) along with corresponding SGFeatures and SGVariants objects. ^\astSGVariantCounts assay countsVariant5pOr3p can only be obtained from BAM files, for details see section Testing for differential splice variant usage.$

Figure 1. Overview of SGSeq data structures. (a) Schematic illustrating transcripts, discrete transcript features, the splice graph, and splice variants. Splice variants are defined in an augmented graph with a unique source and sink node for each gene, which are connected to alternative starts and ends, respectively (dashed lines). (b) SGSeq representation of concepts shown in (a). Classes are shown in bold and outlined, function names are shown in italics. Dashed arrows indicate functions analyzeFeatures() and analyzeVariants(), which wrap multiple analysis steps. SGSeq makes extensive use of Bioconductor infrastructure for genomic ranges (Lawrence et al. 2013): TxFeatures and SGFeatures extend GRanges, SGVariants extends GRangesList. SGFeatureCounts and SGVariantCounts extend SummarizedExperiment and are containers for per-sample counts (or other expression values) along with corresponding SGFeatures and SGVariants objects. $^\ast$SGVariantCounts assay countsVariant5pOr3p can only be obtained from BAM files, for details see section Testing for differential splice variant usage.

2 Preliminaries

library(SGSeq)

When starting a new project, SGSeq requires information on the samples to be analyzed. This information is obtained once initially, and can then be used for all subsequent analyses. Sample information is provided as a data frame with the following columns:

sample_name Character vector with a unique name for each sample
file_bam Character vector or BamFileList specifying BAM files generated with a splice-aware alignment program
paired_end Logical vector indicating whether data are paired-end or single-end
read_length Numeric vector with read lengths
frag_length Numeric vector with average fragment lengths (for paired-end data)
lib_size Numeric vector with the total number of aligned reads for single-end data, or the total number of concordantly aligned read pairs for paired-end data

Sample information can be stored in a data.frame or DataFrame object (if BAM files are specified as a BamFileList, it must be stored in a DataFrame). Sample information can be obtained automatically with function getBamInfo(), which takes as input a data frame with columns sample_name and file_bam and extracts the required information from the specified BAM files.

For SGSeq to work correctly it is essential that reads were mapped with a splice-aware alignment program, such as GSNAP (T. D. Wu and Nacu 2010), HISAT (Kim, Langmead, and Salzberg 2015) or STAR (Dobin et al. 2013), which generate SAM/BAM files with custom tag ‘XS’ for spliced reads, indicating the direction of transcription. Note that lib_size should be the total number of aligned fragments, even if BAM files were subset to genomic regions of interest. The total number of fragments is required for accurate calculation of FPKM values (fragments per kilobase and million sequenced fragments). Here, the term ‘fragment’ denotes a sequenced cDNA fragment, which is represented by a single read in single-end data, or a pair of reads in paired-end data.

This vignette illustrates an analysis of paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in (Seshagiri et al. 2012). RNA-seq reads were mapped to the human reference genome using GSNAP (T. D. Wu and Nacu 2010). The analysis is based on BAM files subset to reads mapping to a single gene of interest (FBXO31). A data.frame si with sample information was generated from the original BAM files with function getBamInfo(). Note that column lib_size reflects the total number of aligned fragments in the original BAM files.

si

##   sample_name file_bam paired_end read_length frag_length lib_size
## 1          N1   N1.bam       TRUE          75         293 12405197
## 2          N2   N2.bam       TRUE          75         197 13090179
## 3          N3   N3.bam       TRUE          75         206 14983084
## 4          N4   N4.bam       TRUE          75         207 15794088
## 5          T1   T1.bam       TRUE          75         284 14345976
## 6          T2   T2.bam       TRUE          75         235 15464168
## 7          T3   T3.bam       TRUE          75         259 15485954
## 8          T4   T4.bam       TRUE          75         247 15808356

The following code block sets the correct BAM file paths for the current SGSeq installation.

path <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(path, "bams", si$file_bam)

3 RNA transcripts and the TxFeatures class

Transcript annotation can be obtained via a TxDb object or imported from GFF format using function importTranscripts(). Alternatively, transcripts can be specified as a GRangesList of exons grouped by transcript. In the following code block, the UCSC knownGene table is loaded as a TxDb object. Transcripts on chromosome 16 (where the FBXO31 gene is located) are retained, and chromosome names are changed to match the naming convention used in the BAM files.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"

To work with the annotation in the SGSeq framework, transcript features are extracted from the TxDb object using function convertToTxFeatures(). Only features overlapping the genomic locus of the FBXO31 gene are retained. The genomic coordinates of FBXO31 are stored in a GRanges object gr.

txf_ucsc <- convertToTxFeatures(txdb)
txf_ucsc <- txf_ucsc[txf_ucsc %over% gr]
head(txf_ucsc)

## TxFeatures object with 6 ranges and 0 metadata columns:
##       seqnames               ranges strand     type
##          <Rle>            <IRanges>  <Rle> <factor>
##   [1]       16 [87362942, 87365116]      -        L
##   [2]       16 [87365116, 87367492]      -        J
##   [3]       16 [87367492, 87367892]      -        I
##   [4]       16 [87367892, 87368910]      -        J
##   [5]       16 [87368910, 87369063]      -        I
##   [6]       16 [87369063, 87369761]      -        J
##                                 txName        geneName
##                        <CharacterList> <CharacterList>
##   [1] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [2] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [3] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [4] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [5] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [6] uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   -------
##   seqinfo: 1 sequence from hg19 genome

convertToTxFeatures() returns a TxFeatures object, which is a GRanges-like object with additional columns. Column type indicates the feature type and can take values

J (splice junction)
I (internal exon)
F (first/5$^\prime$-terminal exon)
L (last/5$^\prime$-terminal exon)
U (unspliced transcript).

Columns txName and geneName indicate the transcript and gene each feature belongs to. Note that a feature can belong to more than one transcript and accordingly the columns can store multiple values for each feature. For TxFeatures, and other data structures defined in SGSeq, columns can be accessed using functions named after the relevant columns, as shown in the following code block.

type(txf_ucsc)

##  [1] L J I J I J I J I J I J I J I F J J F U I J F
## Levels: J I F L U

head(txName(txf_ucsc))

## CharacterList of length 6
## [[1]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[2]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[3]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[4]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[5]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[6]] uc002fjv.3 uc002fjw.3 uc010vot.2

head(geneName(txf_ucsc))

## CharacterList of length 6
## [[1]] 79791
## [[2]] 79791
## [[3]] 79791
## [[4]] 79791
## [[5]] 79791
## [[6]] 79791

4 The splice graph and the SGFeatures class

Exons in TxFeatures correspond to exons spliced together in an RNA molecule. In the context of the splice graph, exons are represented by unique non-overlapping exonic regions. Function convertToSGFeatures() converts TxFeatures to SGFeatures. In the process, overlapping exons are disjoined into non-overlapping exon bins.

sgf_ucsc <- convertToSGFeatures(txf_ucsc)
head(sgf_ucsc)

## SGFeatures object with 6 ranges and 0 metadata columns:
##       seqnames               ranges strand     type  splice5p  splice3p
##          <Rle>            <IRanges>  <Rle> <factor> <logical> <logical>
##   [1]       16 [87362942, 87365116]      -        E      TRUE     FALSE
##   [2]       16 [87365116, 87365116]      -        A      <NA>      <NA>
##   [3]       16 [87365116, 87367492]      -        J      <NA>      <NA>
##   [4]       16 [87367492, 87367492]      -        D      <NA>      <NA>
##   [5]       16 [87367492, 87367892]      -        E      TRUE      TRUE
##   [6]       16 [87367892, 87367892]      -        A      <NA>      <NA>
##       featureID    geneID                           txName        geneName
##       <integer> <integer>                  <CharacterList> <CharacterList>
##   [1]         1         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [2]         2         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [3]         3         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [4]         4         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [5]         5         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [6]         6         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   -------
##   seqinfo: 1 sequence from hg19 genome

Similar to TxFeatures, an SGFeatures object is a GRanges-like object with additional columns. Column type for an SGFeatures object takes values

J (splice junction)
E (disjoint exon bin)
D (splice donor site)
A (splice acceptor site).

By convention, splice donor and acceptor sites correspond to the exonic positions immediately flanking the intron. SGFeatures has additional columns not included in TxFeatures: spliced5p and spliced3p indicate whether exon bins have a mandatory splice at the 5$^\prime$ and 3$^\prime$ end, respectively. This information is used to determine whether a read is structurally compatible with an exon bin, and whether an exon bin is consistent with an annotated transcript. Column featureID provides a unique identifier for each feature, while columnn geneID indicates the unique connected component of the splice graph a feature belongs to.

5 Splice graph analysis based on annotated transcripts

This section illustrates an analysis based on annotated transcripts. Function analyzeFeatures() converts transcript features to splice graph features and obtains compatible fragment counts for each feature and each sample.

sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc

## class: SGFeatureCounts 
## dim: 42 8 
## metadata(0):
## assays(2): counts FPKM
## rownames: NULL
## rowData names(0):
## colnames(8): N1 N2 ... T3 T4
## colData names(6): sample_name file_bam ... frag_length lib_size

analyzeFeatures() returns an SGFeatureCounts object. SGFeatureCounts contains the sample information as colData, splice graph features as rowRanges and assays counts and FPKM, which store compatible fragment counts and FPKMs, respectively. The different data types can be accessed using accessor functions with the same name.

colData(sgfc_ucsc)
rowRanges(sgfc_ucsc)
head(counts(sgfc_ucsc))
head(FPKM(sgfc_ucsc))

Counts for exons and splice junctions are based on structurally compatible fragments. In the case of splice donors and acceptors, counts indicate the number of fragments with reads spanning the spliced boundary (overlapping the splice site and the flanking intronic position).

FPKM values are calculated as $\frac{x}{NL}10^6$, where $x$ is the number of compatible fragments, $N$ is the library size (stored in lib_size) and L is the effective feature length (the number of possible positions for a compatible fragment). For paired-end data it is assumed that fragment length is equal to frag_length.

FPKMs for splice graph features can be visualized with function plotFeatures. plotFeatures generates a two-panel figure with a splice graph shown in the top panel and a heatmap of expression levels for individual features in the bottom panel. For customization of plotFeatures output, see section Visualization. The plotting function invisibly returns a data.frame with details about the splice graph features included in the plot.

df <- plotFeatures(sgfc_ucsc, geneID = 1)

Note that the splice graph includes three alternative transcript start sites (TSSs). However, the heatmap indicates that the first TSS is not used in the samples in this data set.

6 Splice graph analysis based on de novo prediction

Instead of relying on existing annotation, annotation can be augmented with predictions from RNA-seq data, or the splice graph can be constructed from RNA-seq data without the use of annotation. The following code block predicts transcript features supported by RNA-seq reads, converts them into splice graph features, and then obtains compatible fragment counts. For details on how predictions are obtained, please see (Goldstein et al. 2016).

sgfc_pred <- analyzeFeatures(si, which = gr)
head(rowRanges(sgfc_pred))

## SGFeatures object with 6 ranges and 0 metadata columns:
##       seqnames               ranges strand     type  splice5p  splice3p
##          <Rle>            <IRanges>  <Rle> <factor> <logical> <logical>
##   [1]       16 [87362930, 87365116]      -        E      TRUE     FALSE
##   [2]       16 [87365116, 87365116]      -        A      <NA>      <NA>
##   [3]       16 [87365116, 87367492]      -        J      <NA>      <NA>
##   [4]       16 [87367492, 87367492]      -        D      <NA>      <NA>
##   [5]       16 [87367492, 87367892]      -        E      TRUE      TRUE
##   [6]       16 [87367892, 87367892]      -        A      <NA>      <NA>
##       featureID    geneID          txName        geneName
##       <integer> <integer> <CharacterList> <CharacterList>
##   [1]         1         1                                
##   [2]         2         1                                
##   [3]         3         1                                
##   [4]         4         1                                
##   [5]         5         1                                
##   [6]         6         1                                
##   -------
##   seqinfo: 84 sequences from an unspecified genome

For interpretation, predicted features can be annotated with respect to known transcripts. The annotate() function assigns compatible transcripts to each feature and stores the corresponding transcript and gene name in columns txName and geneName, respectively.

sgfc_pred <- annotate(sgfc_pred, txf_ucsc)
head(rowRanges(sgfc_pred))

## SGFeatures object with 6 ranges and 0 metadata columns:
##       seqnames               ranges strand     type  splice5p  splice3p
##          <Rle>            <IRanges>  <Rle> <factor> <logical> <logical>
##   [1]       16 [87362930, 87365116]      -        E      TRUE     FALSE
##   [2]       16 [87365116, 87365116]      -        A      <NA>      <NA>
##   [3]       16 [87365116, 87367492]      -        J      <NA>      <NA>
##   [4]       16 [87367492, 87367492]      -        D      <NA>      <NA>
##   [5]       16 [87367492, 87367892]      -        E      TRUE      TRUE
##   [6]       16 [87367892, 87367892]      -        A      <NA>      <NA>
##       featureID    geneID                           txName        geneName
##       <integer> <integer>                  <CharacterList> <CharacterList>
##   [1]         1         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [2]         2         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [3]         3         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [4]         4         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [5]         5         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   [6]         6         1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791
##   -------
##   seqinfo: 84 sequences from an unspecified genome

The predicted splice graph and FPKMs can be visualized as previously. Splice graph features with missing annotation are highlighted using argument color_novel.

df <- plotFeatures(sgfc_pred, geneID = 1, color_novel = "red")

Note that most exons and splice junctions predicted from RNA-seq data are consistent with transcripts in the UCSC knownGene table (shown in grey). However, in contrast to the previous figure, the predicted gene model does not include parts of the splice graph that are not expressed in the data. Also, an unannotated exon (E3, shown in red) was discovered from the RNA-seq data, which is expressed in three of the four normal colorectal samples (N2, N3, N4).

7 Splice variant identification

Instead of considering the complete splice graph of a gene, the analysis can be focused on individual splice events. Function analyzeVariants() identifies splice events recursively from the graph, obtains representative counts for each splice variant, and estimates their relative usage.

sgvc_pred <- analyzeVariants(sgfc_pred)
sgvc_pred

## class: SGVariantCounts 
## dim: 2 8 
## metadata(0):
## assays(5): countsVariant5p countsVariant3p countsEvent5p
##   countsEvent3p variantFreq
## rownames: NULL
## rowData names(20): from to ... variantType variantName
## colnames(8): N1 N2 ... T3 T4
## colData names(6): sample_name file_bam ... frag_length lib_size

analyzeVariants() returns an SGVariantCounts object. Sample information is stored as colData, and SGVariants as rowRanges. Assay variantFreq stores estimates of relative usage for each splice variant and sample. As previously, the different data types can be accessed using accessor functions with the same name. Information on splice variants is stored in SGVariants metadata columns and can be accessed with function mcols() as shown below. For a detailed description of columns, see the manual page for SGVariants.

mcols(sgvc_pred)

## DataFrame with 2 rows and 20 columns
##              from              to        type   featureID   segmentID
##       <character>     <character> <character> <character> <character>
## 1 D:16:87393901:- A:16:87380856:-           J          28           4
## 2 D:16:87393901:- A:16:87380856:-         JEJ    32,30,27           2
##    closed5p  closed3p closed5pEvent closed3pEvent    geneID   eventID
##   <logical> <logical>     <logical>     <logical> <integer> <integer>
## 1      TRUE      TRUE          TRUE          TRUE         1         1
## 2      TRUE      TRUE          TRUE          TRUE         1         1
##   variantID   featureID5p   featureID3p featureID5pEvent featureID3pEvent
##   <integer> <IntegerList> <IntegerList>    <IntegerList>    <IntegerList>
## 1         1            28            28            28,32            28,27
## 2         2            32            27            28,32            28,27
##                             txName        geneName     variantType
##                    <CharacterList> <CharacterList> <CharacterList>
## 1 uc002fjv.3,uc002fjw.3,uc010vot.2           79791            SE:S
## 2                                            79791            SE:I
##      variantName
##      <character>
## 1 79791_1_1/2_SE
## 2 79791_1_2/2_SE

8 Splice variant quantification

Splice variants are quantified locally, based on structurally compatible fragments that overlap the start or end of each variant. Local estimates of relative usage $\psi_i$ for variant $i$ are obtained as the number of fragments compatible with $i$ divided by the number of fragments compatible with any variant belonging to the same event. For variant start $S$ and variant end $E$ $\psi_i^S = x_i^S / x_.^S$ and $\psi_i^E = x_i^E / x_.^E$, respectively. For variants with valid estimates $\psi_i^S$ and $\psi_i^E$ a single estimate is calculated as a weighted mean of local estimates $\psi_i = x_.^S/(x_.^S + x_.^E) \psi_i^S + x_.^E/(x_.^S + x_.^E) \psi_i^E$.

Estimates of relative usage can be unreliable for events with low read count. If argument min_denominator is specified for functions analyzeVariants() or getSGVariantCounts(), estimates are set to NA unless at least one of $x_.^S$ or $x_.^E$ is equal or greater to the specified value.

Note that SGVariantCounts objects also store raw count data. These data can be used for statistical modeling, for example as suggested in section Testing for differential splice variant usage.

variantFreq(sgvc_pred)

##        N1   N2        N3        N4         T1         T2        T3
## [1,] 0.88 0.56 0.6153846 0.5925926 0.93333333 0.90196078 0.8484848
## [2,] 0.12 0.44 0.3846154 0.4074074 0.06666667 0.09803922 0.1515152
##              T4
## [1,] 0.95833333
## [2,] 0.04166667

Splice variants and estimates of relative usage are visualized with function plotVariants.

plotVariants(sgvc_pred, eventID = 1, color_novel = "red")

plotVariants generates a two-panel figure similar to plotFeatures. The splice graph in the top panel illustrates the selected splice event. In this example, the event consists of two variants, corresponding to a skip or inclusion of the unannotated exon. The heatmap illustrates estimates of relative usage for each splice variant. Samples N2, N3 and N4 show evidence for transcripts that include the exon as well as transcripts that skip the exon. The remaining samples show little evidence for exon inclusion.

9 Splice variant interpretation

The functional consequences of a splice variant can be assessed by predicting its effect on protein-coding potential. Function predictVariantEffects() takes as input an SGVariants object with splice variants of interest, a set of annotated transcripts, and a matching reference genome stored as a BSgenome object.

library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(Hsapiens) <- "NCBI"
vep <- predictVariantEffects(sgv_pred, txdb, Hsapiens)
vep

##   variantID     txName geneName                        RNA_change
## 1         2 uc002fjv.3    79791         r.88_89ins88+1798_88+1884
## 2         2 uc002fjw.3    79791     r.412_413ins412+1798_412+1884
## 3         2 uc010vot.2    79791 r.-105_-104ins-105+1798_-105+1884
##   RNA_variant_type                              protein_change
## 1        insertion   p.K29_L30insRINPRVKSGRFVKILPDYEHMAYRDVYTC
## 2        insertion p.K137_L138insRINPRVKSGRFVKILPDYEHMAYRDVYTC
## 3        insertion                                         p.=
##   protein_variant_type
## 1   in-frame_insertion
## 2   in-frame_insertion
## 3            no_change

The output is a data frame with each row describing the effect of a particular splice variant on an annotated protein-coding transcript. The effect of the variants is described following HGVS recommendations. In its current implementation, variant effect prediction is relatively slow, and it is recommended to run predictVariantEffects() on select variants only.

10 Visualization

Functions plotFeatures() and plotVariants() support many options for customizing figures. The splice graph in the top panel is plotted by function plotSpliceGraph, which can be called directly.

plotFeatures() includes multiple arguments for selecting features to be displayed. The following code block illustrates three different options for plotting the splice graph and expression levels for FBXO31 (Entrez ID 79791).

plotFeatures(sgfc_pred, geneID = 1)
plotFeatures(sgfc_pred, geneName = "79791")
plotFeatures(sgfc_pred, which = gr)

By default, the heatmap generated by plotFeatures() displays splice junctions. Alternatively, exon bins, or both exon bins and splice junctions can be displayed.

plotFeatures(sgfc_pred, geneID = 1, include = "junctions")
plotFeatures(sgfc_pred, geneID = 1, include = "exons")
plotFeatures(sgfc_pred, geneID = 1, include = "both")

Argument toscale controls which parts of the gene model are drawn to scale.

plotFeatures(sgfc_pred, geneID = 1, toscale = "gene")
plotFeatures(sgfc_pred, geneID = 1, toscale = "exon")
plotFeatures(sgfc_pred, geneID = 1, toscale = "none")

Heatmaps allow the visualization of expression values summarized for splice junctions and exon bins. Alternatively, per-base read coverages and splice junction counts can be visualized with function plotCoverage.

par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
  plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}

11 Testing for differential splice variant usage

SGSeq does not implement statistical tests for differential splice variant usage. However, existing software packages such as DEXSeq (Anders, Reyes, and Huber 2012) and limma (Ritchie et al. 2015) can be used for this purpose. These packages allow infererence of differential exon usage within a gene (between groups of samples). In the SGSeq framework, this approach can be used to test for differential splice variant usage within a splice event, whereby splice variants and splice events are used analogously to exons and genes, respectively.

For these methods to be applicable, a single count is required for each splice variant. SGVariantCounts objects as described above store two counts for each splice variant, one for the 5$^\prime$ and one for the 3$^\prime$ end of the variant. These counts can be readily obtained from SGFeatureCounts objects, but are impractical for count-based differential testing. A single count for each variant (based on fragments compatible at either end of the variant) can be obtained from BAM files using function getSGVariantCounts(). The output is an SGVariantCounts object with additional assay countsVariant5pOr3p.

sgv <- rowRanges(sgvc_pred)
sgvc <- getSGVariantCounts(sgv, sample_info = si)
sgvc

## class: SGVariantCounts 
## dim: 2 8 
## metadata(0):
## assays(6): countsVariant5p countsVariant3p ... countsVariant5pOr3p
##   variantFreq
## rownames: NULL
## rowData names(20): from to ... variantType variantName
## colnames(8): N1 N2 ... T3 T4
## colData names(6): sample_name file_bam ... frag_length lib_size

Performing differential tests requires per-variant counts, unique identifiers for each variant, and a variable indicating how variants are grouped by events. All three can be obtained from the SGVariantCounts object.

x <- counts(sgvc)
vid <- variantID(sgvc)
eid <- eventID(sgvc)

Treating these three objects analogously to per-exon counts, exon identifiers and gene identifiers, they can be used to construct a DEXSeqDataSet object for use with DEXSeq or as input for function diffSplice() in combination with voom() for use with limma.

12 Advanced usage

Functions analyzeFeatures() and analyzeVariants() wrap multiple analysis steps for convenience. Alternatively, the functions performing individual steps can be called directly. For example, the analysis based on de novo prediction can be performed as follows.

txf <- predictTxFeatures(si, gr)
sgf <- convertToSGFeatures(txf)
sgf <- annotate(sgf, txf_ucsc)
sgfc <- getSGFeatureCounts(si, sgf)
sgv <- findSGVariants(sgf)
sgvc <- getSGVariantCounts(sgv, sgfc)

predictTxFeatures() predicts features for each sample, merges features across samples, and performs filtering and processing of predicted terminal exons. predictTxFeatures() and getSGFeatureCounts() can also be run on individual samples (e.g. for distribution across a high-performance computing cluster). When using predictTxFeatures() for individual samples, with predictions intended to be merged later, run predictTxFeatures() with argument min_overhang = NULL to suppress processing of terminal exons. Then predictions can subsequently be merged and processed with functions mergeTxFeatures() and processTerminalExons(), respectively.

13 Exercises

Exercise 1 In SGSeq, known transcripts can be specified as a GRangeList of exons, whereby each entry of the GRangesList corresponds to one transcript. (a) Construct a GRangesList for a single transcript with three exons. (b) Convert the GRangesList to a TxFeatures object. (c) Repeat (a) and (b) to create a second transcript with exons that are shared or overlapping with exons in the first transcript. (d) Combine the two transcripts in a single GRangesList and convert it to a TxFeatures object. How does the output change? (e) Convert the TxFeatures object from (d) into an SGFeatures object. How does the TxFeatures object differ from the TxFeatures object? (f) Use the plotSpliceGraph() function to visualize the splice graph corresponding to the two transcripts.

tx_1 <- GRangesList(tx_1 = GRanges("1", IRanges(c(101, 301, 501), c(200, 400, 600)), "+"))
txf_1 <- convertToTxFeatures(tx_1)
tx_2 <- GRangesList(tx_2 = GRanges("1", IRanges(c(101, 351, 701), c(200, 400, 800)), "+"))
txf_2 <- convertToTxFeatures(tx_2)
txf <- convertToTxFeatures(c(tx_1, tx_2))
sgf <- convertToSGFeatures(txf)
par(mfrow = c(1, 1))
plotSpliceGraph(sgf)

The remaining exercises are based on genome-wide predictions obtained from paired-end RNA-seq data from 16 normal human tissues, which are part of the Illumina Body Map 2.0. The data were processed as shown in the following code block (the code will not run, since BAM files are not available as part of the workshop).

sgfc <- analyzeFeatures(si, alpha = 2, beta = 0.2, gamma = 0.2)
sgvc <- analyzeVariants(sgfc, min_denominator = 10)

Load the previously generated objects sgfc and sgvc, and restore seqlevels for the TxDb object (so all chromosomes are available). Note predictions in sgfc and sgvc have already been annotated using transcript features from the UCSC knownGene table.

data(sgfc, sgvc, package = "SGSeqBioC2016")
txdb <- restoreSeqlevels(txdb)
seqlevelsStyle(txdb) <- "NCBI"
txdb <- keepSeqlevels(txdb, c(1:22, "X", "Y"))

Exercise 2 (a) Use function plotSpliceGraph() to plot the gene model for gene KIFAP3 (Entrez ID 22920). (b) Which parts of the predicted gene model are unannotated? (c) Use function plotFeatures() to inspect expression levels for splice junctions or exons across the body map tissues.

sgf <- rowRanges(sgfc)
plotSpliceGraph(sgf, geneName = "22920")
plotSpliceGraph(sgf, geneName = "22920", color_novel = "red")
plotFeatures(sgfc, geneName = "22920", color_novel = "red")
plotFeatures(sgfc, geneName = "22920", color_novel = "red", include = "exons")

Exercise 3 (a) Inspect details for the splice event in the KIFAP3 gene. (b) Plot the unannotated splice event.

mcols(sgvc)[any(geneName(sgvc) == "22920"), ]
plotVariants(sgvc, eventID = 2872, color_novel = "red")

Exercise 4 (a) From the SGVariantCounts object, extract the variant corresponding to the novel cassette exon in the KIFAP3 gene. (b) Assess the effect of this variant on KIFAP3 protein-coding potential.

variant <- rowRanges(sgvc)[variantID(sgvc) == 6796]
variant <- keepSeqlevels(variant, c(1:22, "X", "Y"))
vep <- predictVariantEffects(variant, txdb, Hsapiens)
vep

Exercise 5 Following the approach outlined in Exercises 2-4, analyze genes ABCD3 (Entrez ID 5825), ESYT2 (Entrez ID 57488), ROCK2 (Entrez ID 9475) or other genes of interest.

14 Session information

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0      
##  [2] BSgenome_1.41.2                        
##  [3] rtracklayer_1.33.7                     
##  [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
##  [5] GenomicFeatures_1.25.12                
##  [6] AnnotationDbi_1.35.3                   
##  [7] SGSeq_1.7.6                            
##  [8] SummarizedExperiment_1.3.5             
##  [9] Biobase_2.33.0                         
## [10] Rsamtools_1.25.0                       
## [11] Biostrings_2.41.4                      
## [12] XVector_0.13.2                         
## [13] GenomicRanges_1.25.4                   
## [14] GenomeInfoDb_1.9.1                     
## [15] IRanges_2.7.6                          
## [16] S4Vectors_0.11.5                       
## [17] BiocGenerics_0.19.1                    
## [18] knitr_1.13                             
## [19] BiocStyle_2.1.8                        
## 
## loaded via a namespace (and not attached):
##  [1] igraph_1.0.1            Rcpp_0.12.5            
##  [3] magrittr_1.5            zlibbioc_1.19.0        
##  [5] GenomicAlignments_1.9.3 BiocParallel_1.7.4     
##  [7] stringr_1.0.0           tools_3.3.0            
##  [9] DBI_0.4-1               htmltools_0.3.5        
## [11] yaml_2.1.13             digest_0.6.9           
## [13] formatR_1.4             bitops_1.0-6           
## [15] RUnit_0.4.31            biomaRt_2.29.2         
## [17] RCurl_1.95-4.8          RSQLite_1.0.0          
## [19] evaluate_0.9            rmarkdown_0.9.6        
## [21] stringi_1.1.1           XML_3.98-1.4

References

Anders, S, A Reyes, and W Huber. 2012. “Detecting differential usage of exons from RNA-seq data.” Genome Research 22 (10): 2008–17.

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21.

Goldstein, Leonard D, Yi Cao, Gregoire Pau, Michael Lawrence, Thomas D Wu, Somasekar Seshagiri, and Robert Gentleman. 2016. “Prediction and Quantification of Splice Events from RNA-Seq Data.” PLoS ONE 11 (5): e0156132.

Heber, Steffen, Max Alekseyev, Sing-Hoi Sze, Haixu Tang, and Pavel A Pevzner. 2002. “Splicing graphs and EST assembly problem.” Bioinformatics (Oxford, England) 18 Suppl 1: S181–8.

Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: a fast spliced aligner with low memory requirements.” Nature Methods 12 (4): 357–60.

Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8): e1003118.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research 43 (7): e47.

Seshagiri, Somasekar, Eric W Stawiski, Steffen Durinck, Zora Modrusan, Elaine E Storm, Caitlin B Conboy, Subhra Chaudhuri, et al. 2012. “Recurrent R-spondin fusions in colon cancer.” Nature 488 (7413): 660–64.

Wu, Thomas D, and Serban Nacu. 2010. “Fast and SNP-tolerant detection of complex variants and splicing in short reads.” Bioinformatics 26 (7): 873–81.

Analyzing splice events from RNA-seq data with SGSeq

Leonard D Goldstein

22 June 2016

Abstract

Package version: SGSeqBioC2016 0.0.1

Contents