Abstract
This vignette is a guide containing example code for performing real-life tasks. Importantly, it covers some functionality that were not covered in the Quick-Start vignette (because they are too computationally intensive to be reproducible in a vignette). Version 1.1.5
For instructions on installing and configuring SpliceWiz, please see the Quick-Start vignette.
library(SpliceWiz)
#> Loading required package: NxtIRFdata
#> SpliceWiz package loaded with 2 threads
#> Use setSWthreads() to set the number of SpliceWiz threads
First, define the path to the directory in which the reference should be stored. This directory will be made by SpliceWiz, but its parent directory must exist, otherwise an error will be returned.
"./Reference" ref_path <-
Note that setting genome_path = "hg38"
will prompt SpliceWiz to use the default files for nonPolyA and Mappability exclusion references in the generation of its reference. Valid options for genome_path
are “hg38”, “hg19”, “mm10” and “mm9”.
buildRef(
reference_path = ref_path,
fasta = "genome.fa", gtf = "transcripts.gtf",
genome_type = "hg38"
)
buildRef()
first runs getResources()
, which prepares the genome and gene annotations by storing a compressed local copy in the resources
subdirectory of the given reference path. Specifically, a binary compressed version of the FASTA file (a.k.a. TwoBitFile), and a gzipped GTF file. If fasta
and/or gtf
are https or ftp links, the resources will be downloaded from the internet (which may take a while).
After local compressed versions of the genome and gene annotations are prepared, buildRef()
will proceed to generate the SpliceWiz reference.
Note that these two steps can be run separately. getResources()
will prepare local compressed copies of the FASTA / GTF resources without generating the SpliceWiz reference. Running buildRef()
, with reference_path
specifying where the resources were prepared previously with getResources()
, will perform the 2nd step (SpliceWiz reference generation) without needing to prepare the genome resources (in this case, set the parameters fasta = ""
and gtf = ""
).
As an example, the below steps:
getResources(
reference_path = ref_path,
fasta = "genome.fa",
gtf = "transcripts.gtf"
)
buildRef(
reference_path = ref_path,
fasta = "", gtf = "",
genome_type = "hg38"
)
is equivalent to this:
buildRef(
reference_path = ref_path,
fasta = "genome.fa",
gtf = "transcripts.gtf"
genome_type = "hg38"
)
To re-build and overwrite an existing reference, using the same resource annotations, set overwrite = TRUE
# Assuming hg38 genome:
buildRef(
reference_path = ref_path,
genome_type = "hg38",
overwrite = TRUE
)
If buildRef()
is run without setting overwrite = TRUE
, it will terminate if the file SpliceWiz.ref.gz
is found within the reference directory.
The following will first download the genome and gene annotation files from the online resource and store a local copy of it in a file cache, facilitated by BiocFileCache. Then, it uses the downloaded resource to create the SpliceWiz reference.
"ftp://ftp.ensembl.org/pub/release-94/"
FTP <-
buildRef(
reference_path = ref_path,
fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
"Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"),
gtf = paste0(FTP, "gtf/homo_sapiens/",
"Homo_sapiens.GRCh38.94.chr.gtf.gz"),
genome_type = "hg38"
)
AnnotationHub contains Ensembl references for many genomes. To browse what is available:
require(AnnotationHub)
#> Loading required package: AnnotationHub
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
AnnotationHub()
ah <-#> snapshotDate(): 2022-12-16
query(ah, "Ensembl")
#> AnnotationHub with 33668 records
#> # snapshotDate(): 2022-12-16
#> # $dataprovider: Ensembl, FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,E...
#> # $species: Homo sapiens, Mus musculus, Danio rerio, Rattus norvegicus, Pan ...
#> # $rdataclass: TwoBitFile, GRanges, EnsDb, SQLiteFile, data.frame, OrgDb, list
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH5046"]]'
#>
#> title
#> AH5046 | Ensembl Genes
#> AH5160 | Ensembl Genes
#> AH5311 | Ensembl Genes
#> AH5434 | Ensembl Genes
#> AH5435 | Ensembl EST Genes
#> ... ...
#> AH109476 | Ensembl 108 EnsDb for Xiphophorus couchianus
#> AH109477 | Ensembl 108 EnsDb for Xiphophorus maculatus
#> AH109478 | Ensembl 108 EnsDb for Xenopus tropicalis
#> AH109479 | Ensembl 108 EnsDb for Zonotrichia albicollis
#> AH109480 | Ensembl 108 EnsDb for Zalophus californianus
For a more specific query:
query(ah, c("Homo Sapiens", "release-94"))
#> AnnotationHub with 9 records
#> # snapshotDate(): 2022-12-16
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: TwoBitFile, GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH64628"]]'
#>
#> title
#> AH64628 | Homo_sapiens.GRCh38.94.abinitio.gtf
#> AH64629 | Homo_sapiens.GRCh38.94.chr.gtf
#> AH64630 | Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
#> AH64631 | Homo_sapiens.GRCh38.94.gtf
#> AH65744 | Homo_sapiens.GRCh38.cdna.all.2bit
#> AH65745 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit
#> AH65746 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#> AH65747 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#> AH65748 | Homo_sapiens.GRCh38.ncrna.2bit
We wish to fetch “AH65745” and “AH64631” which contains the desired FASTA and GTF files, respectively. To build a reference using these resources:
buildRef(
reference_path = ref_path,
fasta = "AH65745",
gtf = "AH64631",
genome_type = "hg38"
)
Build-Reference-methods
will recognise the inputs of fasta
and gtf
as AnnotationHub resources if they begin with “AH”.
For human and mouse genomes, we highly recommend specifying genome_type
as the default mappability file is used to exclude intronic regions with repeat sequences from intron retention analysis. For other species, one could generate a SpliceWiz reference without this reference:
buildRef(
reference_path = ref_path,
fasta = "genome.fa", gtf = "transcripts.gtf",
genome_type = ""
)
If one wishes to prepare a Mappability Exclusion for species other than human or mouse, please see the Calculating Mappability Exclusions using STAR
section below.
To use STAR
to align FASTQ files, one must be using a system with STAR
installed. This software is not available in Windows. To check if STAR
is available:
STAR_version()
#> Dec 21 18:16:57 STAR is not installed
#> NULL
"./Reference"
ref_path =
# Ensure genome resources are prepared from genome FASTA and GTF file:
if(!dir.exists(file.path(ref_path, "resource"))) {
getResources(
reference_path = ref_path,
fasta = "genome.fa",
gtf = "transcripts.gtf"
)
}
# Generate a STAR genome reference:
STAR_BuildRef(
reference_path = ref_path,
n_threads = 8
)
Note that, by default, STAR_BuildRef
will store the STAR genome reference in the STAR
subdirectory within reference_path
. To override this setting, set the STAR_ref_path
parameter to a directory path of your choice, e.g.:
STAR_BuildRef(
reference_path = ref_path,
STAR_ref_path = "/path/to/another/directory",
n_threads = 8
)
The STAR_mappability
wrapper function will use the STAR aligner to calculate regions of low mappability within the given genome.
STAR_mappability(
reference_path = ref_path,
STAR_ref_path = file.path(ref_path, "STAR"),
map_depth_threshold = 4,
n_threads = 4,
read_len = 70,
read_stride = 10,
error_pos = 35
)
In the above example, STAR_mappability()
will use the given STAR reference (inside the STAR_ref_path
directory), and the genome found within the reference_path
SpliceWiz reference, to generate synthetic reads.
read_len
specifies the length of these synthetic reads (default 70
)read_stride
specifies the nucleotide distance between adjacent synthetic reads (default 10
). These will be generated with alternate +
/ -
stranderror_pos
introduces a single nucleotide error at the specified position (default 35
), which will generate an SNP at the center of the 70-nt synthetic read.These synthetic reads will then be aligned back to the STAR genome to create a BAM file, which is later processed to measure the coverage depth of the genome by these synthetic reads.
Finally, regions with coverage depth of map_depth_threshold
or below will be defined as regions of “low mappability”. In the above example, 70-nt reads of 10-nt stride will produce synthetic reads such that each nucleotide is expected to have a coverage of 70 / 10 = 7
nucleotides. A coverage of 4
nucleotides or less equates to a coverage of < ~60% of expected depth.
If STAR
is available on the same computer or server where R/RStudio is being run, we can use the one-line function buildFullRef
. This function will:
getResources
)STAR_BuildRef
)STAR_mappability
)buildRef
)buildFullRef(
reference_path = ref_path,
fasta = "genome.fa", gtf = "transcripts.gtf",
genome_type = "",
use_STAR_mappability = TRUE,
n_threads = 4
)
n_threads
specify how many threads should be used to build the STAR reference and to calculate the low mappability regions
If STAR
is not available, Rsubread
is available on Bioconductor for alignment and can be used to perform mappability calculations. The example code in the manual is displayed here for convenience, to demonstrate how this would be done:
require(Rsubread)
# (1a) Creates genome resource files
file.path(tempdir(), "Reference")
ref_path <-
getResources(
reference_path = ref_path,
fasta = chrZ_genome(),
gtf = chrZ_gtf()
)
# (1b) Systematically generate reads based on the SpliceWiz example genome:
generateSyntheticReads(
reference_path = ref_path
)
# (2) Align the generated reads using Rsubread:
# (2a) Build the Rsubread genome index:
file.path(ref_path, "Rsubread")
subreadIndexPath <-if(!dir.exists(subreadIndexPath)) dir.create(subreadIndexPath)
::buildindex(
Rsubreadbasename = file.path(subreadIndexPath, "reference_index"),
reference = chrZ_genome()
)
# (2b) Align the synthetic reads using Rsubread::subjunc()
::subjunc(
Rsubreadindex = file.path(subreadIndexPath, "reference_index"),
readfile1 = file.path(ref_path, "Mappability", "Reads.fa"),
output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"),
useAnnotation = TRUE,
annot.ext = chrZ_gtf(),
isGTF = TRUE
)
# (3) Analyse the aligned reads in the BAM file for low-mappability regions:
calculateMappability(
reference_path = ref_path,
aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam")
)
# (4) Build the SpliceWiz reference using the calculated Mappability Exclusions
buildRef(ref_path, genome_type = "", MappabilityRef = "")
If MappabilityRef = ""
and genome_type = ""
are set, buildRef
will automatically use the file Mappability/MappabilityExclusion.bed.gz
as its Mappability Exclusion resource. The default nonPolyARef
can be set by the getNonPolyARef()
function, e.g.:
buildRef(ref_path, genome_type = "", MappabilityRef = "",
nonPolyARef = getNonPolyARef("hg38")
)
# Aligning Raw RNA-seq data using SpliceWiz’s STAR wrappers
First, remember to check that STAR is available via command line:
STAR_version()
#> Dec 21 18:16:57 STAR is not installed
#> NULL
STAR_alignReads(
fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq",
STAR_ref_path = file.path(ref_path, "STAR"),
BAM_output_path = "./bams/sample1",
n_threads = 8,
trim_adaptor = "AGATCGGAAG"
)
Note that by default, STAR_alignReads()
will “trim” Illumina adapters (in fact they will be soft-clipped using STAR’s --clip3pAdapterSeq
option). To disable this feature, set trim_adapter = ""
in the STAR_alignReads()
function.
data.frame(
Experiment <-sample = c("sample_A", "sample_B"),
forward = file.path("raw_data", c("sample_A", "sample_B"),
c("sample_A_1.fastq", "sample_B_1.fastq")),
reverse = file.path("raw_data", c("sample_A", "sample_B"),
c("sample_A_2.fastq", "sample_B_2.fastq"))
)
STAR_alignExperiment(
Experiment = Experiment,
STAR_ref_path = file.path("Reference_FTP", "STAR"),
BAM_output_path = "./bams",
n_threads = 8,
two_pass = FALSE
)
To use two-pass mapping, set two_pass = TRUE
. We recommend disabling this feature, as one-pass mapping is adequate in typical-use cases. Two-pass mapping is recommended if one expects a large number of novel splicing events or if the gene annotations (of transcript isoforms) is likely to be incomplete.
SpliceWiz can identify sequencing FASTQ files recursively from a given directory. It assumes that forward and reverse reads are suffixed as _1
and _2
, respectively. Users can choose to identify such files using a specified file extension. For example, to recursively identify FASTQ files of the format {sample}_1.fq.gz
and {sample}_2.fq.gz
, use the following:
# Assuming sequencing files are named by their respective sample names
findFASTQ(
fastq_files <-sample_path = "./sequencing_files",
paired = TRUE,
fastq_suffix = ".fq.gz", level = 0
)
For gzipped fastq files, fastq_suffix
should be ".fq.gz"
or ".fastq.gz"
. For uncompressed fastq files, it should be ".fq"
or ".fastq"
. Please check your files in order to correctly set this option.
findFASTQ()
will return a 2- or 3-column data frame (depending if paired
was set to FALSE
or TRUE
, respectively). The first column is the sample name (the file name, if level = 0
, or the parent directory name, if level = 1
). The subsequent columns are the paths of the forward and reverse reads.
The data.frame returned by the findFASTQ()
function can be parsed into the STAR_alignExperiment
function. This will align all samples contained in the data.frame parsed via the Experiment
parameter.
STAR_alignExperiment(
Experiment = fastq_files,
STAR_ref_path = file.path("Reference_FTP", "STAR"),
BAM_output_path = "./bams",
n_threads = 8,
two_pass = FALSE
)
Note that, if a directory contains multiple forward and reverse FASTQ files, they will be aligned to the same BAM file. This can be done by setting level = 1
in the findFASTQ()
function, resulting in multiple rows with the same sample name.
# Running processBAM on BAM files
To conveniently find all BAM files recursively in a given path:
findBAMS("./bams", level = 1) bams <-
This convenience function returns the putative sample names, either from BAM file names themselves (level = 0
), or from the names of their parent directories (level = 1
).
First, ensure that a SpliceWiz reference has been generated using the buildRef()
function. This reference should be parsed into the reference_path
parameter of the processBAM()
function.
To run processBAM()
using 4 OpenMP threads:
# assume SpliceWiz reference has been generated in `ref_path` using the
# `buildRef()` function.
processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = "./pb_output",
n_threads = 4,
useOpenMP = TRUE
)
` ### Creating COV files from BAM files without running processBAM
Sometimes one may wish to create a COV file from a BAM file without running processBAM()
. One reason might be because a SpliceWiz reference is not available.
To convert a list of BAM files, run BAM2COV()
. This is a function structurally similar to processBAM()
but without the need to give the path to the SpliceWiz reference:
BAM2COV(
bamfiles = bams$path,
sample_names = bams$sample,
output_path = "./cov_output",
n_threads = 4,
useOpenMP = TRUE
)
Note that, by default, processBAM
and BAM2COV
will use OpenMP where available (which is natively supported on Windows and Linux). For MacOS, if OpenMP is not available, these functions will use BiocParallel’s MulticoreParam
to multi-thread process BAM files (1 BAM per thread). Beware that this may take a lot of RAM! (Typically 5-10 Gb per BAM file). We highly suggest considering installing OpenMP libraries on MacOS, as this will lower RAM usage.
# Collating the experiment
Assuming the SpliceWiz reference is in ref_path
, after running processBAM()
as shown in the previous section, use the convenience function findSpliceWizOutput()
to tabulate a list of samples and their corresponding processBAM()
outputs:
findSpliceWizOutput("./pb_output") expr <-
This data.frame can be directly used to run collateData
:
collateData(
Experiment = expr,
reference_path = ref_path,
output_path = "./NxtSE_output"
)
novelSplicing = TRUE
. See the Quick-Start vignette for more details about the various parameters associated with novel splicing detection.collateData(
Experiment = expr,
reference_path = ref_path,
output_path = "./NxtSE_output_novelSplicing",
novelSplicing = TRUE
)
Then, the collated data can be imported as a NxtSE
object, which is an object that inherits SummarizedExperiment
and has specialized containers to hold additional data required by SpliceWiz.
makeSE("./NxtSE_output") se <-
Please refer to SpliceWiz: Quick-Start vignette for worked examples using the example dataset.
sessionInfo()
#> R Under development (unstable) (2022-12-10 r83428)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] AnnotationHub_3.7.0 BiocFileCache_2.7.1 dbplyr_2.2.1
#> [4] BiocGenerics_0.45.0 SpliceWiz_1.1.5 NxtIRFdata_1.5.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.4
#> [3] magrittr_2.0.3 rmarkdown_2.19
#> [5] fs_1.5.2 BiocIO_1.9.1
#> [7] zlibbioc_1.45.0 vctrs_0.5.1
#> [9] memoise_2.0.1 Rsamtools_2.15.0
#> [11] DelayedMatrixStats_1.21.0 RCurl_1.98-1.9
#> [13] webshot_0.5.4 progress_1.2.2
#> [15] htmltools_0.5.4 curl_4.3.3
#> [17] Rhdf5lib_1.21.0 rhdf5_2.43.0
#> [19] shinyFiles_0.9.3 sass_0.4.4
#> [21] bslib_0.4.2 htmlwidgets_1.6.0
#> [23] plotly_4.10.1 cachem_1.0.6
#> [25] GenomicAlignments_1.35.0 iterators_1.0.14
#> [27] mime_0.12 lifecycle_1.0.3
#> [29] pkgconfig_2.0.3 Matrix_1.5-3
#> [31] R6_2.5.1 fastmap_1.1.0
#> [33] GenomeInfoDbData_1.2.9 MatrixGenerics_1.11.0
#> [35] shiny_1.7.4 digest_0.6.31
#> [37] colorspace_2.0-3 AnnotationDbi_1.61.0
#> [39] S4Vectors_0.37.3 GenomicRanges_1.51.4
#> [41] RSQLite_2.2.19 seriation_1.4.0
#> [43] filelock_1.0.2 fansi_1.0.3
#> [45] httr_1.4.4 compiler_4.3.0
#> [47] withr_2.5.0 bit64_4.0.5
#> [49] BiocParallel_1.33.7 viridis_0.6.2
#> [51] DBI_1.1.3 heatmaply_1.4.0
#> [53] dendextend_1.16.0 HDF5Array_1.27.0
#> [55] R.utils_2.12.2 rappdirs_0.3.3
#> [57] DelayedArray_0.25.0 rjson_0.2.21
#> [59] tools_4.3.0 interactiveDisplayBase_1.37.0
#> [61] httpuv_1.6.7 fst_0.9.8
#> [63] R.oo_1.25.0 glue_1.6.2
#> [65] restfulr_0.0.15 rhdf5filters_1.11.0
#> [67] promises_1.2.0.1 grid_4.3.0
#> [69] generics_0.1.3 gtable_0.3.1
#> [71] BSgenome_1.67.1 ca_0.71.1
#> [73] R.methodsS3_1.8.2 tidyr_1.2.1
#> [75] hms_1.1.2 data.table_1.14.6
#> [77] utf8_1.2.2 XVector_0.39.0
#> [79] foreach_1.5.2 BiocVersion_3.17.1
#> [81] pillar_1.8.1 stringr_1.5.0
#> [83] genefilter_1.81.0 later_1.3.0
#> [85] splines_4.3.0 dplyr_1.0.10
#> [87] lattice_0.20-45 ompBAM_1.3.0
#> [89] survival_3.4-0 rtracklayer_1.59.0
#> [91] bit_4.0.5 annotate_1.77.0
#> [93] tidyselect_1.2.0 registry_0.5-1
#> [95] Biostrings_2.67.0 knitr_1.41
#> [97] rhandsontable_0.3.8 gridExtra_2.3
#> [99] IRanges_2.33.0 SummarizedExperiment_1.29.1
#> [101] stats4_4.3.0 xfun_0.36
#> [103] shinydashboard_0.7.2 Biobase_2.59.0
#> [105] matrixStats_0.63.0 pheatmap_1.0.12
#> [107] DT_0.26 stringi_1.7.8
#> [109] lazyeval_0.2.2 yaml_2.3.6
#> [111] shinyWidgets_0.7.5 evaluate_0.19
#> [113] codetools_0.2-18 tibble_3.1.8
#> [115] BiocManager_1.30.19 cli_3.5.0
#> [117] xtable_1.8-4 munsell_0.5.0
#> [119] jquerylib_0.1.4 Rcpp_1.0.9
#> [121] GenomeInfoDb_1.35.8 png_0.1-8
#> [123] XML_3.99-0.13 parallel_4.3.0
#> [125] fstcore_0.9.12 ellipsis_0.3.2
#> [127] ggplot2_3.4.0 assertthat_0.2.1
#> [129] blob_1.2.3 prettyunits_1.1.1
#> [131] sparseMatrixStats_1.11.0 bitops_1.0-7
#> [133] viridisLite_0.4.1 scales_1.2.1
#> [135] purrr_1.0.0 crayon_1.5.2
#> [137] rlang_1.0.6 TSP_1.2-1
#> [139] KEGGREST_1.39.0