---
title: "SGSeq"
author: "Leonard D Goldstein"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('SGSeqBioC2015')`"
abstract: >
  SGSeq
vignette: >
  %\VignetteIndexEntry{SGSeq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document:
    toc: true
---

```{r, echo = FALSE, results = 'hide'}
library(knitr)
opts_chunk$set(error = FALSE)
```

```{r style, echo = FALSE, results = 'asis'}
##BiocStyle::markdown()
```

# Splice event detection and quantification from RNA-seq data with *SGSeq*

Leonard D Goldstein [1, 2]

[1] Department of Bioinformatics and Computational Biology,

[2] Department of Molecular Biology, 

Genentech, Inc., South San Francisco, CA, USA.

## Abstract

The *SGSeq* package provides a framework for analyzing annotated and novel splice events from RNA-seq data. *SGSeq* predicts exons and splice junctions from reads aligned against a reference genome and assembles them into a genome-wide splice graph. Splice events are identified from the graph and quantified using reads spanning event boundaries. This workshop provides an introduction to *SGSeq* functionality, including splice event detection, quantification, annotation and visualization. The first part of the workshop illustrates a complete *SGSeq* workflow for analyzing a gene of interest starting from BAM files. The second part of the workshop covers exercises based on a genomewide data set previously processed with *SGSeq*. 

## Preliminaries

```{r, message = FALSE}
library(SGSeq)
```

The first part of this workshop illustrates an analysis of paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in [Seshagiri et al. 2012] (#seshagiri). For the purpose of this vignette, we created BAM files that only include reads mapping to a single gene of interest (*FBXO31*).

When starting a new project, *SGSeq* requires information on the samples to be analyzed. This information can be provided as a *data.frame*, which must include columns *sample_name* (specificying a unique name for each sample) and *file_bam* (specifying the location of the BAM file). Function *getBamInfo* can be used to extract additional library information from BAM files, including paired-end status, median read length, median insert size and the total number of read alignments. This information must be obtained once initially and can then be used for all subsequent analyses. It is essential that BAM files are generated using a splice-aware alignment program that generates the custom tag 'XS' indicating the direction of transcription for spliced reads. In the following, we work with a *data.frame* *si* generated from the original (complete) BAM files with function *getBamInfo*. 

```{r}
si
``` 

The following code block sets the correct BAM file paths in the sample information for this vignette.

```{r}
path <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(path, "bams", si$file_bam)
``` 

## Transcript features and the *TxFeatures* class

We use the UCSC knownGene table as reference annotation, which is available as a *Bioconductor* annotation package `r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")`. We retain transcripts on chromosome 16, where the *FBXO31* gene is located, and change chromosome names in the annotation to match chromosome names in the BAM files.

```{r, message = FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"
``` 

To work with transcript annotation in the *SGSeq* framework, we first extract exons and splice junctions from the *TxDb* object using function *convertToTxFeatures*. We only retain features overlapping the *FBXO31* gene (genomic coordinates of the *FBXO31* gene are stored in the *GRanges* object *gr*).

```{r}
txf_ucsc <- convertToTxFeatures(txdb)
txf_ucsc <- txf_ucsc[txf_ucsc %over% gr]
txf_ucsc
```

*SGSeq* makes extensive use of the *Bioconductor* infrastructure for genomic ranges ([Lawrence et al. 2013] (#lawrence)). The *TxFeatures* class shown above extends the *GRanges* class with additional metadata columns. Column *type* can take values 

* *J* (splice junction)
* *I* (internal exon)
* *F* (first/5$^\prime$-terminal exon)
* *L* (last/3$^\prime$-terminal exon)
* *U* (unspliced).

Columns *txName* and *geneName* indicate the transcript and gene each feature derives from. Note that a feature can belong to more than one transcript. Accordingly these columns can store multiple values for each feature.

Metadata columns can be accessed using accessor functions named after the columns they access (e.g. use function *type* to obtain feature type). 

If transcript annotation is not available as a *TxDb* object, function *convertToTxFeatures* can construct *TxFeatures* from a *GRangesList* of exons grouped by transcript (see [Exercises] (#exercises) below).

## Splice graph features and the *SGFeatures* class

Exons stored as *TxFeatures* can be overlapping (e.g. due to alternative splice sites) resulting in ambiguities (e.g. when when trying to assign reads to individual exons). We therefore partition exonic regions into disjoint exon bins. Splice junctions and disjoint exon bins uniquely determine a genome-wide splice graph ([Heber et al. 2002] (#heber)). To store splice graph features, *SGSeq* implements the *SGFeatures* class. 

```{r}
sgf_ucsc <- convertToSGFeatures(txf_ucsc)
sgf_ucsc
``` 

Similar to *TxFeatures*, *SGFeatures* extends the *GRanges* class with additional metadata columns. Column *type* for an *SGFeatures* object takes values

* *J* (splice junction)
* *E* (disjoint exon bin)
* *D* (splice donor site)
* *A* (splice acceptor site).

By convention, splice donor and acceptor sites correspond to exonic positions immediately upstream and downstream of the intron, respectively. Note that splice sites are redundant in the sense that they are determined by the splice junctions included in the *SGFeatures* object. When assigning read counts to each feature (see below), counts for exons and splice junctions are based on structurally compatible reads. In the case of splice donor and acceptor sites, counts indicate the number of reads that extend across the spliced boundary (i.e. overlapping the splice site, as well as the flanking intronic position). Splice sites are included in the *SGFeatures* object since splice site counts are subsequently used for splice variant quantification.  

*SGFeatures* includes additional metadata columns not included in *TxFeatures*. *spliced5p* and *spliced3p* indicate whether exon bins have a mandatory splice at the 5$^\prime$ and 3$^\prime$ boundary, respectively. This information is used to determine whether a read is structurally compatible with an exon bin, as well as to determine whether an exon bin is consistent with an annotated transcript. 

Column *featureID* provides a unique identifier for each feature, while columnn *geneID* indicates the unique connected component of the splice graph a feature belongs to. 

Both *TxFeatures* and *SGFeatures* objects can be exported to BED files using function *exportFeatures*.

## Analysis based on annotated transcripts

We can now start analyzing the RNA-seq data at the *FBXO31* gene locus. We first perform an analysis based on annotated transcripts. The following example converts the transcript features into splice graph features and obtains counts of compatible RNA-seq reads for each feature and each sample. 

```{r}
sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc
``` 

*analyzeFeatures* returns an object of class *SGFeatureCounts*, which extends the *RangedSummarizedExperiment* class from the `r Biocpkg("SummarizedExperiment")` package. *SGFeatureCounts* contains the sample information as *colData*, splice graph features as *rowRanges* and assays *counts* and *FPKM*, which store structurally compatible counts and FPKMs, respectively. Accessor functions *colData*, *rowRanges*, *counts* and *FPKM* can be used to access the data.

Compatible FPKMs for splice graph features can be visualized with function *plotFeatures*. *plotFeatures* generates a two-panel figure with a splice graph shown in the top panel and a heatmap illustrating expression levels of individual features in the bottom panel. For customization of *plotFeatures* output, see section [Visualization] (#visualization). The plotting function invisibly returns a *data.frame* with information on splice graph features, including genomic coordinates.

```{r figure-1, fig.width=4.5, fig.height=4.5}
df <- plotFeatures(sgfc_ucsc, geneID = 1)
df
``` 

Note that the splice graph derived from annotated transcripts includes three alternative transcript start sites (TSSs). However, the heatmap indicates that the first TSS is not used in the samples in our data set.

## Analysis based on *de novo* prediction

Instead of relying on existing annotation, *SGSeq* can predict features from BAM files directly. The following code block predicts splice graph features with read evidence in our data set.

```{r}
sgfc_pred <- analyzeFeatures(si, which = gr)
``` 

For interpretability, we annotate predicted features with respect to transcripts included in the UCSC knownGene table. The *annotate* function assigns compatible transcripts to each feature and stores them in metadata column *txName*. Metadata column *geneName* behaves transitively, meaning all features belonging to the same connected component of the splice graph (with identical *geneID*) have the same value for *geneName*. This behavior makes it easy to identify unannotated features (with empty *txName*) that belong to an annotated gene (non-empty *geneName*).

```{r}
sgfc_pred <- annotate(sgfc_pred, txf_ucsc)
``` 

Predicted splice graph features and compatible FPKMs can be visualized as previously. Splice graph features with missing annotation can be highlighted using argument *color_novel*.

```{r figure-2, fig.width=4.5, fig.height=4.5}
df <- plotFeatures(sgfc_pred, geneID = 1, color_novel = "red")
df
``` 

Note that most exons and splice junctions predicted from the RNA-seq data are consistent with transcripts in the UCSC knownGene table (shown in gray). However, in contrast to the previous figure, the predicted gene model does not include parts of the splice graph that are not expressed in our data set. Also, an unannotated exon (E3, shown in red) was discovered from the RNA-seq data, which is expressed in three of the four normal colorectal samples (N2, N3, N4). 

## Analysis of predicted splice variants

Instead of considering the complete splice graph of a gene, we can focus our analysis on individual splice events. In the *SGSeq* framework, the splice graph is a directed acyclic graph with nodes corresponding to transcript starts, ends and splice sites, and edges corresponding to disjoint exon bins and splice junctions, directed from 5$^\prime$ to the 3$^\prime$ end. A splice event is defined by a start node and an end node connected by two or more paths and no intervening nodes with all paths intersecting. *SGSeq* identifies splice events recursively from the graph, and estimates relative usage of splice variants based on compatible reads spanning the event boundaries. The following example identifies splice events from the splice graph and obtains representative counts for each splice variant.

```{r}
sgvc_pred <- analyzeVariants(sgfc_pred)
sgvc_pred
``` 

*analyzeVariants* returns an *SGVariantCounts* object. Similar to *SGFeatureCounts*, *SGVariantCounts* extends the *RangedSummarizedExperiment* class. *SGVariantCounts* contains sample information as *colData* and *SGVariants* as *rowRanges*. Assay *variantFreq* stores estimates of relative usage for each splice variant and sample. Accessor functions *colData*, *rowRanges* and *variantFreq* can be used to access the data. Information on splice variants is stored in metadata columns in the *SGVariants* object and can be accessed as follows.

```{r}
mcols(sgvc_pred)
``` 

* *from* and *to* indicate the variant start and end node, respectively, *from* nodes are splice donor sites (*D*) or transcript starts (*S*), while *to* nodes are splice acceptor sites (*A*) or transcript ends (*E*)
* metadata columns *type* and *featureID* describe the variant in terms of the splice graph features that make up the variant
* *segmentID* are unique identifiers labelling unbranched segments of the splice graph (internal use only)
* *closed5p* indicates whether the nodes belonging to a splice variant can be reached from nodes outside of the splice event exclusively through the *from* node
* *closed3p* indicates whether the nodes belonging to a splice variant can reach nodes outside of the splice event exclusively through the *to* node
* *eventID* and *variantID* are unique identifiers for each splice event and splice variant, respectively
* *featureID5p* and *featureID3p* indicate representative features that are used for obtaining relative usage estimates at the variant start and end, respectively
* *variantType* indicates whether a splice variant is consistent with a canonical splice event (for a list of possible values, see the manual page for *annotateSGVariants*)
* *variantName* provides a unique identifier for each splice variant that is intended to be more human-readable than the numeric identifier stored in *variantID* (for details, see the manual page for *makeVariantNames*)

Splice variants and estimates of relative usage can be visualized with function *plotVariants*.

```{r figure-3, fig.width=1.5, fig.height=4.5}
plotVariants(sgvc_pred, eventID = 1, color_novel = "red")
``` 

*plotVariants* generates a two-panel figure similar to *plotFeatures*. The splice graph in the top panel illustrates the selected splice event. In this example, the splice event consists of two splice variants that correspond to a skip or inclusion of the unannotated exon. The heatmap illustrates estimates of relative usage for each splice variant. We observe that samples N2, N3 and N4 show evidence for both transcripts that include the exon as well as transcripts that skip the exon. The remaining samples show little evidence for exon inclusion.

## Visualization

Functions *plotFeatures* and *plotVariants* support many options for customizing figures. Note that the splice graph in the top figure panel is plotted by function *plotSpliceGraph*, which can be called directly. 

*plotFeatures* includes multiple alternative arguments for selecting features to be displayed. The following code illustrates three different ways of selecting and plotting the splice graph and expression levels for *FBXO31* (Entrez ID 79791).

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1)
plotFeatures(sgfc_pred, geneName = "79791")
plotFeatures(sgfc_pred, which = gr)
```

By default, the heatmap generated by *plotFeatures* displays splice junctions. Alternatively, exon bins, or both exon bins and splice junctions can be displayed.

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1, include = "junctions")
plotFeatures(sgfc_pred, geneID = 1, include = "exons")
plotFeatures(sgfc_pred, geneID = 1, include = "both")
```

Argument *toscale* controls which parts of the gene model are drawn to scale. 

```{r, eval = FALSE}
plotFeatures(sgfc_pred, geneID = 1, toscale = "gene")
plotFeatures(sgfc_pred, geneID = 1, toscale = "exon")
plotFeatures(sgfc_pred, geneID = 1, toscale = "none")
```

Heatmaps allow the visualization of expression values summarized for splice junctions and exon bins. Alternatively, per-base read coverages and splice junction counts can be visualized with function *plotCoverage*.

```{r, figure-4, fig.width=4.5, fig.height=4.5}
par(mfrow = c(5, 1), mar = c(1, 3, 1, 1))
plotSpliceGraph(rowRanges(sgfc_pred), geneID = 1, toscale = "none", color_novel = "red")
for (j in 1:4) {
  plotCoverage(sgfc_pred[, j], geneID = 1, toscale = "none")
}
```

## Advanced use

Functions *analyzeFeatures* and *analyzeVariants* wrap multiple analysis steps for convenience. Alternatively, the functions performing individual steps can be called directly. For example, the previous analysis using *de novo* prediction can be performed as follows.

```{r}
txf <- predictTxFeatures(si, gr)
sgf <- convertToSGFeatures(txf)
sgf <- annotate(sgf, txf_ucsc)
sgfc <- getSGFeatureCounts(si, sgf)
sgv <- findSGVariants(sgf)
sgvc <- getSGVariantCounts(sgv, sgfc)
``` 

*predictTxFeatures* and *getSGFeatureCounts* can be run on individual samples (e.g. for distribution across a high-performance computing cluster). *predictTxFeatures* predicts features for each sample, merges features across samples and finally performs filtering and processing of predicted terminal exons. When using *predictTxFeatures* for individual samples, with predictions intended to be merged at a later point in time, run *predictTxFeatures* with argument *min_overhang = NULL* to suppress processing of terminal exons. Then predictions can subsequently be merged and processed with functions *mergeTxFeatures* and *processTerminalExons*, respectively.

## Exercises

**Exercise 1** Construct a *TxFeatures* object for a transcript with three exons. What happens if you add a second transcript with exons that are shared or overlapping with exons in the first transcript? What happens if you convert the *TxFeatures* object to an *SGFeatures* object? 

```{r, eval = FALSE}
tx_1 <- GRangesList(tx_1 = GRanges("1", IRanges(c(101, 301, 501), c(200, 400, 600)), "+"))
tx_2 <- GRangesList(tx_2 = GRanges("1", IRanges(c(101, 351, 701), c(200, 400, 800)), "+"))
txf_1 <- convertToTxFeatures(tx_1)
txf_2 <- convertToTxFeatures(tx_2)
txf <- convertToTxFeatures(c(tx_1, tx_2))
sgf <- convertToSGFeatures(txf)
par(mfrow = c(1, 1))
plotSpliceGraph(sgf)
```

The following exercises are based on genome-wide predictions obtained from paired-end RNA-seq data generated as part of the Illumina Body Map 2.0 ([Farrell et al. 2014] (#farrell)). The data were processed as shown in the following code block (the code will not run as BAM files are not available).

```{r, eval = FALSE}
sgfc_IBM <- analyzeFeatures(si_IBM, alpha = 5, psi = 0.2, beta = 0.2, gamma = 0.2)
sgvc_IBM <- analyzeVariants(sgfc_IBM, min_denominator = 20)
exclude <- eventID(sgvc_IBM)[!closed5p(sgvc_IBM) | !closed3p(sgvc_IBM)]
sgvc_IBM <- sgvc_IBM[!eventID(sgvc_IBM) %in% exclude, ]
```

We load the previously obtained *SGSeq* predictions *sgfc_IBM* and *sgvc_IBM*.

```{r, eval = FALSE}
data(sgfc_IBM, sgvc_IBM, package = "SGSeqBioC2015")
```

**Exercise 2** Annotate the *SGFeatureCounts* and *SGVariantCounts* objects with respect to transcripts included in the UCSC knownGene table (this may take a few minutes). 

```{r, eval = FALSE} 
txdb <- restoreSeqlevels(txdb)
seqlevelsStyle(txdb) <- "NCBI"
txdb <- keepSeqlevels(txdb, c(1:22, "X", "Y"))
txf_ucsc <- convertToTxFeatures(txdb)
sgfc_IBM <- annotate(sgfc_IBM, txf_ucsc)
sgvc_IBM <- annotate(sgvc_IBM, txf_ucsc)
```

**Exercise 3** Plot the gene model for gene *KIFAP3* (Entrez ID 22920). Which parts of the predicted gene model are unannotated and what tissues are they expressed in? Inspect expression levels for both splice junctions and exons.

```{r, eval = FALSE} 
plotFeatures(sgfc_IBM, geneName = "22920", color_novel = "red")
plotFeatures(sgfc_IBM, geneName = "22920", color_novel = "red", include = "exons")
```

**Exercise 4** Plot the unannotated splice event for gene *KIFAP3*.

```{r, eval = FALSE} 
mcols(sgvc_IBM)[any(geneName(sgvc_IBM) == "22920"), ]
plotVariants(sgvc_IBM, eventID = 552, color_novel = "red")
```

**Exercise 5** (Difficult) Find genes with the most highly expressed unannotated cassette exons. Can you interpret the predicted splice graph for the top genes?

```{r, eval = FALSE}
selected <- which(elementLengths(txName(sgvc_IBM)) == 0 & any(variantType(sgvc_IBM) == "SE:I"))
variants <- rowRanges(sgvc_IBM)[selected]
features <- unlist(variants)
exons <- features[type(features) == "E"]
exons_FPKM <- FPKM(sgfc_IBM)[match(featureID(exons), featureID(sgfc_IBM)), ]
exons_FPKM_max <- apply(exons_FPKM, 1, max)
geneIDs <- geneID(exons)[order(exons_FPKM_max, decreasing = TRUE)]
plotFeatures(sgfc_IBM, geneID = geneIDs[1], color_novel = "red")
```

## References

Seshagiri S, Stawiski EW, Durinck S, Modrusan Z, Storm EE, Conboy CB, Chaudhuri S, Guan Y, Janakiraman V, Jaiswal BS, Guillory J, Ha C, Dijkgraaf GJP, Stinson J, Gnad F, Huntley MA, Degenhardt JD, Haverty PM, Bourgon R, Wang W, Koeppen H, Gentleman R, Starr TK, Zhang Z, Largaespada DA, Wu TD, de Sauvage FJ. "Recurrent R-spondin fusions in colon cancer." *Nature* 488, 660--664, 2012.

Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. "Software for Computing and Annotating Genomic Ranges." *PLOS Computational Biology* 9, e1003118, 2013.

Heber S, Alekseyev M, Sze S, Tang H, Pevzner PA. "Splicing graphs and EST assembly problem." *Bioinformatics* 18 Suppl 1, S181--188, 2002.

Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JG, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD. "Current status and new features of the Consensus Coding Sequence database". *Nucleic Acids Research* 42(Database issue):D865-72, 2014.

## Session information

```{r}
sessionInfo()
```