Warning: Due to space limitations, we demonstrate the useful functions in this workflow without images. In order to see an HTML report generated based on the full genome annotations and complete datasets described in the vignette, please see here.
For the most up-to-date functionality, usage and installation instructions, and example outputs, see our github repository here.
The first release of RCAS was designed to produce quality controls and exploratory plots for the analysis of transcriptome-wide regions of interest with regard to their feature specific distributions, sequence motif signals, and biological function enrichments. However, this workflow didn’t include functions to directly compare multiple samples. However, it may be often desirable to see how the biological content of one sample relates to the content obtained from different samples. For instance, one can design a CLIP-seq based experiment with multiple conditions each containing multiple biological replicates and want to address the following common questions:
Since the release of RCAS 1.4.0, it is possible to process more than one input BED files and generate a self-containing HTML report with plots and tables that compare/contrast the input samples.
In this vignette, we will use example datasets from CLIP based sequencing experiments to show how to use the meta-analysis functionality of the RCAS package. However, the same workflow can be applied to any kind of high-throughput dataset that pertains to transripts and can be expressed as genomic ranges in a BED format file.
When starting from scratch, the two necessary inputs are:
In this vignette, we will analyse the peak regions discovered via CLIP-sequencing experiments of the RNA-binding protein FUS by Nakaya et al, 2013, Synaptic Functional Regulator FMR1 by Ascano et al. 2012, and Eukaryotic initiation factor 4A-III by Sauliere et al, 2012. We use two replicates from both experiments and constrain the analysis to a small portion of the genome (the first 1 million base pairs of chromosome 1). Both the genome annotations and the peaks detected within these regions are available as built-in data in the RCAS package. To obtain the full datasets, please refer to the doRiNa database.
First, we detect the paths to these datasets:
FUS_rep1_path <- system.file('extdata', 'FUS_Nakaya2013c_hg19.bed', package = 'RCAS') FUS_rep2_path <- system.file('extdata', 'FUS_Nakaya2013d_hg19.bed', package = 'RCAS') FMR1_rep1_path <- system.file('extdata', 'FMR1_Ascano2012a_hg19.bed', package = 'RCAS') FMR1_rep2_path <- system.file('extdata', 'FMR1_Ascano2012b_hg19.bed', package = 'RCAS') EIF4A3_rep1_path <- system.file('extdata', 'EIF4A3Sauliere20121a.bed', package = 'RCAS') EIF4A3_rep2_path <- system.file('extdata', 'EIF4A3Sauliere20121b.bed', package = 'RCAS')
Then, we create a file containing the sample names and the locations of the BED files for each sample.
projData <- data.frame('sampleName' = c('FUS_1', 'FUS_2', 'FMR1_1', 'FMR1_2', 'EIF4A3_1', 'EIF4A3_2'), 'bedFilePath' = c(FUS_rep1_path, FUS_rep2_path, FMR1_rep1_path, FMR1_rep2_path, EIF4A3_rep1_path, EIF4A3_rep2_path), stringsAsFactors = FALSE) projDataFile <- file.path(getwd(), 'myProjDataFile.tsv') write.table(projData, projDataFile, sep = '\t', quote =FALSE, row.names = FALSE)
The second input is the path to the GTF file containing genome annotations. Here we use the first 1 million bases of the chromosome 1 from the Ensembl Database (version 75 - corresponds to GRCh37 build - hg19 in UCSC).
In order to avoid redundant preprocessing of the same input files, we have developed a function
RCAS::createDB to save all preprocessed data into an SQLite database using the R package RSQLite. With this function, all input files from the project data file are processed, the resulting processed data are saved in the following SQL tables:
gtfFilePathargument. The fields of this table are: seqnames, start, end, width, strand, source, type, score, gene_id, gene_name, transcript_id, exon_id, and exon_number.
To create a database with preprocessed data, use
RCAS::createDB() function. When starting a fresh database, we can provide a file name that does not exist at the corresponding folder must be provided. If a database already exists at the given location, an error will be returned.
However, it is also possible to update an existing database by setting the
update argument to
TRUE. This is useful when you want to add new datasets to an existing project. This will only add processed data for samples that don’t already exist in the database. This will not attempt to overwrite existing data for existing samples.
createDB(dbPath = databasePath, projDataFile = projDataFile, gtfFilePath = gtfFilePath, genomeVersion = 'hg19', update = TRUE)
It is also possible to turn off certain analysis modules by setting
motifAnalysis modules to
FALSE. For instance, the following command will create the same database except for the
createDB(dbPath = databasePath, projDataFile = projDataFile, gtfFilePath = gtfFilePath, genomeVersion = 'hg19', motifAnalysis = FALSE)
If the user wishes to overwrite datasets for existing samples, all relevant data for the given samples must be first purged from the database. It is possible to do so via:
Here we erase all entries relevant to samples ‘FMR1_1’ and ‘FMR1_2’.
To have a quick look into the contents of any given database, we can use the
RCAS::summarizeDatabaseContent() function. This returns a table of number of row entries for each sample in any given table that exists in the database.
Each table of the database can be read into memory using the
RSQLite::dbReadTable() function. To read a specific table from a database, firstly a connection to the sqlite dump needs to be created.
List the tables that exist in the connection:
A table can be read into an R object:
Warning: It is important to consider that the tables read into an R object from an sqlite database are of the
data.frame class. For most tables it is the desired format, however some of the tables need converting from
data.frame to the desired class (e.g.
gtfData table) for downstream processing.
Once an sqlite database is obtained,
RCAS::runReportMetaAnalysis() function can be utilized to quickly generate comparative analysis reports between whichever sample groups are desired to be compared. To generate this report, only two inputs are required:
sampleNamecolumn. For example, replicates of the same biological condition can be assigned the same
Here is how to generate a stand-alone HTML report with interactive figures and tables from a pre-calculated RCAS database that compares CLIP datasets from two replicates of
FUS and two replicates of
Let’s first create a sample data file. The sample names found in this file should be a subset of the original project data file used to generate the sqlite database.
sampleData <- data.frame('sampleName' = c('FUS_1', 'FUS_2', 'EIF4A3_1', 'EIF4A3_2'), 'sampleGroup' = c('FUS', 'FUS', 'EIF4A3', 'EIF4A3'), stringsAsFactors = FALSE) sampleDataFile <- file.path(getwd(), 'mySampleDataTable.tsv') write.table(sampleData, sampleDataFile, sep = '\t', quote =FALSE, row.names = FALSE)
Now, we are ready to get an HTML report:
RCAS is developed in the group of Altuna Akalin (head of the Scientific Bioinformatics Platform) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.
To cite RCAS in publications use:
Bora Uyar, Dilmurat Yusuf, Ricardo Wurmus, Nikolaus Rajewsky, Uwe Ohler, Altuna Akalin; RCAS: an RNA centric annotation system for transcriptome-wide regions of interest. Nucleic Acids Res 2017 gkx120. doi: 10.1093/nar/gkx120