Introduction

SpliceWiz is a graphical interface for differential alternative splicing and visualization in R. It differs from other alternative splicing tools as it is designed for users with basic bioinformatic skills to analyze datasets containing up to hundreds of samples! SpliceWiz contains a number of innovations including:

  • Super-fast handling of alignment BAM files using ompBAM, our developer resource for multi-threaded BAM processing,
  • Alternative splicing event (ASE) filters to remove problematic ASEs from analysis
  • Group-averaged coverage plots: publication-ready figures to clearly visualize differential alternative splicing between biological / experimental conditions
  • Interactive figures, including scatter and volcano plots, gene ontology (GO) analysis, heatmaps, and scrollable coverage plots, powered using the shinyDashboard interface

This vignette is a runnable working example of the SpliceWiz workflow. The purpose is to quickly demonstrate the basic functionalities of SpliceWiz.

We provide here a brief outline of the workflow for users to get started as quickly as possible. However, we also provide more details for those wishing to know more. Many sections will contain extra information that can be displayed when clicked on, such as these:

Click on me for more details
In most sections, we offer more details about each step of the workflow, that can be revealed in text segments like this one. Be sure to click on buttons like these, where available.


FAQ

What are the system memory requirements for running SpliceWiz
We recommend the following memory requirements (RAM) for running various steps of SpliceWiz:

buildRef()

  • Building human / mouse SpliceWiz reference: 8 gigabytes

processBAM()

  • Processing small alignment (BAM) files (~20 million paired end reads): 8 gigabytes
  • Processing large BAM files (~100 million paired end reads): 16 gigabytes

collateData()

  • Collating routine experiments (e.g. 3 replicates, 2 conditions): 8-16 gigabytes
  • Collating large experiments (20+ samples, using lowMemoryMode=TRUE): 32 gigabytes
  • Collating large experiments (20+ samples, using lowMemoryMode=FALSE): 8 gigabytes per thread

Differential analysis

  • Differential analysis (routine experiments): 8 gigabytes
  • Differential analysis (large experiments - 20+ samples): 16 gigabytes
  • DESeq2-based differential analysis of large experiments: 32 gigabytes


How does SpliceWiz measure alternative splicing?

SpliceWiz defines alternative splicing events (ASEs) as binary events between two possibilities, the included and excluded isoform. It detects and measures: skipped (casette) exons (SE), mutually-exclusive exons (MXE), alternative 5’/3’ splice site usage (A5SS / A3SS), alternate first / last exon usage (AFE / ALE), and retained introns (IR or RI).

SpliceWiz uses splice-specific read counts to measure ASEs. Namely, these are junction reads (reads that align across splice sites). The exception is intron retention (IR) whereby the (trimmed) mean read depth across the intron is measured (identical to the method used in IRFinder).

SpliceWiz provides two metrics:

  • Percent spliced in (PSI): is the expression of the included isoform as a proportion of both included/excluded isoform. PSIs are measured for all types of alternative splicing, including annotated retained introns (RI)
  • IR-ratio: For introns, we also measure IR-ratios, which is the expression of IR-transcript as a proportion of IR- and spliced-transcripts. Spliced transcript expression is measured using either SpliceOver or SpliceMax method (the latter is identical to that used in IRFinder)


Does SpliceWiz detect novel splicing events?
Novel splicing events are those in which at least one isoform is not an annotated transcript in the given gene annotation. SpliceWiz DOES detect novel splicing events.

It detects novel events by using novel junctions, using pairs of junctions that originate from or terminate at a common coordinate (novel alternate splice site usage).

Additionally, SpliceWiz detects “tandem junction reads”. These are reads that span across two or more splice junctions. The region between splice junctions can then be annotated as novel exons (if they are not identical to annotated exons). These novel exons can then be used to measure novel casette exon usage.


Workflow from a glance

The basic steps of SpliceWiz are as follows:

  • Building the SpliceWiz reference
  • Process BAM files using SpliceWiz
  • Collate results of individual samples into an experiment
  • Importing the collated experiment as an NxtSE object
  • Alternative splicing event filtering
  • Differential ASE analysis
  • Visualization

Quick-Start

Installation

To install SpliceWiz, start R (devel version) and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SpliceWiz")

Setting up a Conda environment to use SpliceWiz
For those wishing to set up a self-contained environment with SpliceWiz installed (e.g. on a high performance cluster), we recommend using miniconda. For installation instructions, see the documentation on how to install miniconda

After installing miniconda, create a conda environment as follows:

conda create -n envSpliceWiz python=3.9

After following the prompts, activate the environment:

conda activate envSpliceWiz

Next, install R 4.2.1 as follows:

conda install -c conda-forge r-base=4.2.1

NB: We have not been able to successfuly use r-base=4.3, so we recommend using r-base=4.2.1 (until further notice).

Many of SpliceWiz’s dependencies are up-to-date from the conda-forge channel, so they are best installed via conda:

conda install -c conda-forge r-devtools r-essentials r-xml r-biocmanager \
  r-fst r-plotly r-rsqlite r-rcurl

After this is done, the remainder of the packages need to be installed from the R terminal. This is because most Bioconductor packages are from the bioconda channel and appear not to be routinely updated.

So, lets enter the R terminal from the command line:

R

Set up Bioconductor 3.16 (which is the latest version compatible with R 4.2):

BiocManager::install(version = "3.16")

Again, follow the prompts to update any necessary packages.

Once this is done, install SpliceWiz (devel) from github:

# BiocManager::install("SpliceWiz")
devtools::install_github("alexchwong/SpliceWiz")

The last step will install any remaining dependencies, taking approximately 20-30 minutes depending on your system.


Enabling OpenMP (multi-threading) for MacOS users (Optional)
For MacOS users, make sure OpenMP libraries are installed correctly. We recommend users follow this guide, but the quickest way to get started is to install libomp via brew:

brew install libomp


Installing statistical package dependencies (Optional)
SpliceWiz uses established statistical tools to perform alternative splicing differential analysis:

  • limma: models included and excluded counts as log-normal distributions
  • DESeq2: models included and excluded counts as negative binomial distributions
  • edgeR: models included and excluded counts as negative binomial distributions. SpliceWiz uses the quasi-likelihood method which deals better with zero-counts.
  • DoubleExpSeq: models included and excluded counts using beta binomial distributions

To install all of these packages:

install.packages("DoubleExpSeq")

BiocManager::install(c("DESeq2", "limma", "edgeR"))


Loading SpliceWiz

library(SpliceWiz)
Details
The SpliceWiz package loads the NxtIRFdata data package. This data package contains the example “chrZ” genome / annotations and 6 example BAM files that are used in this working example. Also, NxtIRFdata provides pre-generated mappability exclusion annotations for building human and mouse SpliceWiz references


The SpliceWiz Graphics User Interface (GUI)

SpliceWiz offers a graphical user interface (GUI) for interactive users, e.g. in the RStudio environment. To start using SpliceWiz GUI:

if(interactive()) {
    spliceWiz(demo = TRUE)
}


Building the SpliceWiz reference

Why do we need the SpliceWiz reference?
SpliceWiz first needs to generate a set of reference files. The SpliceWiz reference is used to quantitate alternative splicing in BAM files, as well as in downstream collation, differential analysis and visualisation.

SpliceWiz generates a reference from a user-provided genome FASTA and genome annotation GTF file, and is optimised for Ensembl references but can accept other reference GTF files. Alternatively, SpliceWiz accepts AnnotationHub resources, using the record names of AnnotationHub records as input.


Using the example FASTA and GTF files, use the buildRef() function to build the SpliceWiz reference:

ref_path <- file.path(tempdir(), "Reference")
buildRef(
    reference_path = ref_path,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf(),
    ontologySpecies = "Homo sapiens"
)

The SpliceWiz reference can be viewed as data frames using various getter functions. For example, to view the annotated alternative splicing events (ASE):

df <- viewASE(ref_path)

See ?View-Reference-methods for a comprehensive list of getter functions

Using the GUI
After starting the SpliceWiz GUI in demo mode, click the Reference tab from the menu side bar. The following interface will be shown:

Building the SpliceWiz reference using the GUI

Building the SpliceWiz reference using the GUI

  1. The first step to building a SpliceWiz reference is to select a directory in which to create the reference.
  2. SpliceWiz provides an interface to retrieve the genome sequence (FASTA) and transcriptome annotation (GTF) files from the Ensembl FTP server, by first selecting the “Release” and then “Species” from the drop-down boxes.
  3. Alternatively, users can provide their own FASTA and GTF files.
  4. Human (hg38, hg19) and mouse genomes (mm10, mm9) have the option of further refining IR analysis using built-in mappability exclusion annotations, allowing SpliceWiz to ignore intronic regions of low mappability.
For now, to continue with the demo and create the reference using the GUI, click on the Load Demo FASTA/GTF (5), and then click Build Reference (6)

Where did the FASTA and GTF files come from?
The helper functions chrZ_genome() and chrZ_gtf() returns the paths to the example genome (FASTA) and transcriptome (GTF) file included with the NxtIRFdata package that contains the working example used by SpliceWiz:

# Provides the path to the example genome:
chrZ_genome()
#> [1] "/home/biocbuild/bbs-3.19-bioc/R/site-library/NxtIRFdata/extdata/genome.fa"

# Provides the path to the example gene annotation:
chrZ_gtf()
#> [1] "/home/biocbuild/bbs-3.19-bioc/R/site-library/NxtIRFdata/extdata/transcripts.gtf"

What is the chrZ genome?
For the purpose of generating a running example to demonstrate SpliceWiz, we created an artificial genome / gene annotation. This was created using 7 human genes (SRSF1, SRSF2, SRSF3, TRA2A, TRA2B, TP53 and NSUN5). The SRSF and TRA family of genes all contain poison exons flanked by retained introns. Additionally, NSUN5 contains an annotated IR event in its terminal intron. Sequences from these 7 genes were aligned into one sequence to create an artificial chromosome Z (chrZ). The gene annotations were modified to only contain the 7 genes with the modified genomic coordinates.

What is the gene ontology species?
SpliceWiz supports gene ontology analysis. To enable this capability, we first need to generate the gene ontology annotations for the appropriate species.

To see a list of supported species:

getAvailableGO()
#>  [1] "Anopheles gambiae"        "Arabidopsis thaliana"    
#>  [3] "Bos taurus"               "Canis familiaris"        
#>  [5] "Gallus gallus"            "Pan troglodytes"         
#>  [7] "Escherichia coli"         "Drosophila melanogaster" 
#>  [9] "Homo sapiens"             "Mus musculus"            
#> [11] "Sus scrofa"               "Rattus norvegicus"       
#> [13] "Macaca mulatta"           "Caenorhabditis elegans"  
#> [15] "Xenopus laevis"           "Saccharomyces cerevisiae"
#> [17] "Danio rerio"

Note that, if genome_type is specified to a supported genome, the human / mouse gene ontology annotation will be automatically generated.


What is Mappability and why should I care about it?
For the most part, the SpliceWiz reference can be built with just the FASTA and GTF files. This is sufficient for assessment for most forms of alternative splicing events.

For intron retention, accurate assessment of intron depth is important. However, introns contain many repetitive regions that are difficult to map. We refer to these regions as “mappability exclusions”.

We adopt IRFinder’s algorithm to identify these mappability exclusions. This is determined empirically by generating synthetic reads systematically from the genome, then aligning these reads back to the same genome. Regions that contain less than the expected coverage depth of reads define “mappability exclusions”.

See the vignette: SpliceWiz cookbook for details on how to generate “mappability exclusions” for any genome.

How do I use pre-built mappability exclusions to generate human and mouse references?
For human and mouse genomes, SpliceWiz provides pre-built mappability exclusion references that can be used to build the SpliceWiz reference. SpliceWiz provides these annotations via the NxtIRFdata package.

Simply specify the genome in the parameter genome_type in the buildRef() function (which accepts hg38, hg19, mm10 and mm9).

Additionally, a reference for non-polyadenylated transcripts is used. This has a minor role in QC of samples (to assess the adequacy of polyA capture).

For example, assuming your genome file "genome.fa" and a transcript annotation "transcripts.gtf" are in the working directory, a SpliceWiz reference can be built using the built-in hg38 low mappability regions and non-polyadenylated transcripts as follows:

## NOT RUN

ref_path_hg38 <- "./Reference"
buildRef(
    reference_path = ref_path_hg38,
    fasta = "genome.fa",
    gtf = "transcripts.gtf",
    genome_type = "hg38"
)


Process BAM files using SpliceWiz

The function SpliceWiz_example_bams() retrieves 6 example BAM files from ExperimentHub and places a copy of these in the temporary directory.

bams <- SpliceWiz_example_bams()
What are these example BAM files and how were they generated?
In this vignette, we provide 6 example BAM files. These were generated based on aligned RNA-seq BAMs of 6 samples from the Leucegene AML dataset (GSE67039). Sequences aligned to hg38 were filtered to only include genes aligned to that used to create the chrZ chromosome. These sequences were then re-aligned to the chrZ reference using STAR.

How can I easily locate multiple BAM files?
Often, alignment pipelines process multiple samples. SpliceWiz provides convenience functions to recursively locate all the BAM files in a given folder, and tries to ascertain sample names. Often sample names can be gleaned when: * The BAM files are named by their sample names, e.g. “sample1.bam”, “sample2.bam”. In this case, level = 0 * The BAM files have generic names but are contained inside parent directories labeled by their sample names, e.g. “sample1/Unsorted.bam”, “sample2/Unsorted.bam”. In this case, level = 1

# as BAM file names denote their sample names
bams <- findBAMS(tempdir(), level = 0) 

# In the case where BAM files are labelled using sample names as parent 
# directory names (which oftens happens with the STAR aligner), use level = 1


Process these BAM files using SpliceWiz:

pb_path <- file.path(tempdir(), "pb_output")
processBAM(
    bamfiles = bams$path,
    sample_names = bams$sample,
    reference_path = ref_path,
    output_path = pb_path
)

Using the GUI
After building the demo reference as shown in the previous section, start SpliceWiz GUI in demo mode. Then, click the Experiment tab from the menu side bar. The following interface will be shown: