Creating Tallies from within R

In this vignette we show how to create tally files from bam files within an interactive R session (note that this is not a good idea for large datasets). We will also provide examples of how to leverage parallel computing infrastructure using the BiocParallel and BatchJobs packages (which is a good idea for large datasets).

Dear Windows users, please be advised that the code in this vignette is not guaranteed to run under Windows because of an issue in the HDF5 implementation for that platform. Evaluation is disabled to avoid problems, so you will not see the plots / outputs for this vignette.

Preparations

First we load necessary packages and check for the example bam files shipped with the h5vcData package. Rsamtools is used below to read header info from the files.

library("h5vc")
library("rhdf5")
library("Rsamtools")
## Loading required package: XVector
## 
## Attaching package: 'XVector'
## 
## The following object is masked from 'package:plyr':
## 
##     compact
## 
## Loading required package: Biostrings
files <- c("NRAS.AML.bam", "NRAS.Control.bam")
bamFiles <- file.path(system.file("extdata", package = "h5vcData"), files)

These bam files contain reads from a matched pair of samples from an AML patient focusing on the region containing the gene NRAS (Chromosome 1: 115,247,090-115,259,515).

We will now create the tally file and the groups that represent the study and chromosome we want to work on. Before we do this, we need to find out how big our datasets have to be in their genomic-position dimension. To do this we look into the headers of the bam files and extract the sequence length information.

chromdim <- sapply(scanBamHeader(bamFiles), function(x) x$targets)
colnames(chromdim) <- files
head(chromdim)
##   NRAS.AML.bam NRAS.Control.bam
## 1    249250621        249250621
## 2    243199373        243199373
## 3    198022430        198022430
## 4    191154276        191154276
## 5    180915260        180915260
## 6    171115067        171115067

Both files have the same header information and are fully compatible, so we can just pick one file and take the chromosome lengths from there. Note that although we will only tally the NRAS gene, we still instantiate the datasets in the tally file with the full chromosome length, so that the index along the genomic-position axis corresponds directly to the position in the genome. The internal compression of HDF5 takes care of the large blocks of zeros, so the effective file size is similar to what it would be if we created the datasets to hold only the NRAS gene region.

chrom <- "1"
chromlength <- chromdim[chrom, 1]
study <- "/NRAS"
tallyFile <- file.path(tempdir(), "NRAS.tally.hfs5")
if (file.exists(tallyFile)) {
    file.remove(tallyFile)
}
if (prepareTallyFile(tallyFile, study, chrom, chromlength, nsamples = 2)) {
    h5ls(tallyFile)
} else {
    message(paste("Preparation of:", tallyFile, "failed"))
}
##     group      name       otype  dclass                    dim
## 0       /      NRAS   H5I_GROUP                               
## 1   /NRAS         1   H5I_GROUP                               
## 2 /NRAS/1    Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER              249250621
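
As a quick, informal check of the compression remark above, we can look at the size of the freshly created file on disk; the exact value will vary, but it should be orders of magnitude below the tens of gigabytes that the uncompressed integer datasets listed above would occupy.

file.info(tallyFile)$size / 1024^2  # size of the tally file on disk in megabytes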

Now we attach the sampleData to the group.

sampleData <- data.frame(Sample = c("AML", "Control"), Column = c(1, 2), Patient = c("Pt1", 
    "Pt1"), Type = c("Case", "Control"), stringsAsFactors = FALSE)
group <- paste(study, chrom, sep = "/")
setSampleData(tallyFile, group, sampleData)
getSampleData(tallyFile, group)
##   Column Patient  Sample    Type
## 1      1     Pt1     AML    Case
## 2      2     Pt1 Control Control

Note a little complication: R indexes arrays in a 1-based manner, while HDF5 internally uses 0-based indexing (like, e.g., C). So we set the columns to 1 and 2, respectively, while within the tally file the values 0 and 1 are stored. The functions setSampleData and getSampleData automatically subtract and add 1 when writing to and reading from the file, respectively.
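
As a small illustration of this behaviour (a sketch, assuming that the sample metadata is stored as HDF5 attributes on the group and that the attributes keep the column names of the data.frame, which is how h5vc describes it), the raw attribute shows the 0-based values while getSampleData returns the 1-based ones:

h5readAttributes(tallyFile, group)$Column  # raw, 0-based values as stored in the file
getSampleData(tallyFile, group)$Column     # 1-based values after the automatic conversion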

Extracting tallies from the bam files

Now it is time to extract the tally information from the bam files. We use the high-level function applyTallies to do this for us (have a look at the code of that function to see what the separate steps are).

startpos <- 115247090
endpos <- 115259515
theData <- applyTallies(bamFiles, chrom = chrom, start = startpos, stop = endpos, 
    ncycles = 10)  # the first and last 10 sequencing cycles are called unreliable

The resulting object is a list with one slot per dataset, using the same layout that you will get from calls to h5readBlock or h5dapply. We could have provided a reference sequence as a DNAString to the applyTallies function; since we did not, a majority vote is used to determine the reference. This gives wrong reference calls for variants with an allele frequency of 0.5 or higher, but we will use it here anyway.
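
As a hedged sketch of the alternative, the following shows how one could extract the corresponding reference sequence from a BSgenome object and pass it along; this assumes that the BSgenome.Hsapiens.UCSC.hg19 package is installed and that the reference is passed via an argument named reference (check ?applyTallies for the exact signature of your h5vc version).

library("BSgenome.Hsapiens.UCSC.hg19")
referenceSequence <- getSeq(Hsapiens, "chr1", start = startpos, end = endpos)  # DNAString covering the tallied region
theData <- applyTallies(bamFiles, chrom = chrom, start = startpos, stop = endpos, 
    ncycles = 10, reference = referenceSequence)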

Writing to the HDF5 file

Now we write the data to the tally file.

h5write(theData$Counts, tallyFile, paste(group, "Counts", sep = "/"), index = list(NULL, 
    NULL, NULL, startpos:endpos))
h5write(theData$Coverages, tallyFile, paste(group, "Coverages", sep = "/"), 
    index = list(NULL, NULL, startpos:endpos))
h5write(theData$Deletions, tallyFile, paste(group, "Deletions", sep = "/"), 
    index = list(NULL, NULL, startpos:endpos))
h5write(theData$Reference, tallyFile, paste(group, "Reference", sep = "/"), 
    index = list(startpos:endpos))
h5ls(tallyFile)
##     group      name       otype  dclass                    dim
## 0       /      NRAS   H5I_GROUP                               
## 1   /NRAS         1   H5I_GROUP                               
## 2 /NRAS/1    Counts H5I_DATASET INTEGER 12 x 2 x 2 x 249250621
## 3 /NRAS/1 Coverages H5I_DATASET INTEGER      2 x 2 x 249250621
## 4 /NRAS/1 Deletions H5I_DATASET INTEGER      2 x 2 x 249250621
## 5 /NRAS/1 Reference H5I_DATASET INTEGER              249250621

This very explicit step-by-step implementation can be streamlined quite a bit to avoid most of the code duplication.
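
For example, since the genomic position is always the last dimension of each dataset, the four h5write calls could be collapsed into a single loop over the dataset names; the following is just a sketch of that idea and writes exactly the same data as the explicit calls above.

for (dataset in c("Counts", "Coverages", "Deletions", "Reference")) {
    x <- theData[[dataset]]
    ndim <- if (is.null(dim(x))) 1 else length(dim(x))
    # select everything along all dimensions except the last one (genomic position)
    index <- c(rep(list(NULL), ndim - 1), list(startpos:endpos))
    h5write(x, tallyFile, paste(group, dataset, sep = "/"), index = index)
}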

Checking if everything worked

We will use the h5readBlock function provided by h5vc to extract the data again and have a look at it.

data <- h5readBlock(filename = tallyFile, group = group, range = c(startpos, 
    endpos))
str(data)
## List of 5
##  $ Counts      : int [1:12, 1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Coverages   : int [1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Deletions   : int [1:2, 1:2, 1:12426] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Reference   : int [1:12426(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ h5dapplyInfo:List of 4
##   ..$ Blockstart: num 1.15e+08
##   ..$ Blockend  : num 1.15e+08
##   ..$ Datasets  :'data.frame':   4 obs. of  7 variables:
##   .. ..$ group   : chr [1:4] "/NRAS/1" "/NRAS/1" "/NRAS/1" "/NRAS/1"
##   .. ..$ name    : chr [1:4] "Counts" "Coverages" "Deletions" "Reference"
##   .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5 5 5 5
##   .. ..$ dclass  : chr [1:4] "INTEGER" "INTEGER" "INTEGER" "INTEGER"
##   .. ..$ dim     : chr [1:4] "12 x 2 x 2 x 249250621" "2 x 2 x 249250621" "2 x 2 x 249250621" "249250621"
##   .. ..$ DimCount: int [1:4] 4 3 3 1
##   .. ..$ PosDim  : int [1:4] 4 3 3 1
##   ..$ Group     : chr "/NRAS/1"

We will call variants within this gene now:

vars <- h5dapply(filename = tallyFile, group = group, blocksize = 1000, FUN = callVariantsPaired, 
    sampledata = getSampleData(tallyFile, group), cl = vcConfParams(returnDataPoints = TRUE), 
    range = c(startpos, endpos))
vars <- do.call(rbind, vars)
vars
##                     Chrom     Start       End Sample altAllele refAllele
## 115258090:115259089     1 115258747 115258747    AML         G         C
##                     caseCountFwd caseCountRev caseCoverageFwd
## 115258090:115259089            9           22              29
##                     caseCoverageRev controlCountFwd controlCountRev
## 115258090:115259089              49               0               0
##                     controlCoverageFwd controlCoverageRev
## 115258090:115259089                 59                 47

By cleverly selecting the example data we have found exactly one variant that seems to be interesting, and we will now plot the region in question to also check that the mismatchPlot function works with the tally data we created.

position <- vars$End[1]
windowsize <- 50
data <- h5readBlock(filename = tallyFile, group = group, range = c(position - 
    windowsize, position + windowsize))
p <- mismatchPlot(data, getSampleData(tallyFile, group), samples = c("AML", 
    "Control"), windowsize = windowsize, position = position)
print(p)

(Plot: mismatchPlot of the region around the candidate variant in the AML and Control samples)

It would seem that we have found a reliable variant here.

Creating tallies in parallel

For larger projects the methods outlined above can be infeasible since sequential processing of all samples would take too much time. In such cases it can be helpful to employ parallelization techniques. Within R/Bioconductor the BatchJobs and BiocParallel packages provide interfaces to help realise such a solution. We will demonstrate simple use cases here (applyTallies uses bplapply internally and we can configure the function to use parallelization through the BPPARAM argument).

It is important to note that efficient parallel computation of tallies requires an appropriate compute infrastructure, i.e. many cores, ideally spread over many nodes, with access to the bam files on a high-performance file server. A combination of local SSDs and a multicore machine might also be a feasible solution.

Simple Example

To switch our tallying pipeline to parallel execution, all we have to do is set the BPPARAM argument of applyTallies appropriately; this lets us switch between serial, multicore and HPC parallelization.

We will now execute the normal (serial) version and the multicore version of applyTallies and compare runtimes.

library(BiocParallel)
serial.time <- system.time(theData <- applyTallies(bamFiles, chrom = chrom, 
    start = startpos, stop = endpos, ncycles = 10))
multicore.time <- system.time(theData <- applyTallies(bamFiles, chrom = chrom, 
    start = startpos, stop = endpos, ncycles = 10, BPPARAM = MulticoreParam()))

The serial computation time was:

serial.time
##    user  system elapsed 
##    0.38    0.00    0.38

The multicore parallel computation time was:

multicore.time
##    user  system elapsed 
##   0.440   0.248   0.368

Once we start thinking about larger cohorts we will quickly reach the limits of what multicore processing can do for us. We can configure bplapply to use an available HPC (high-performance compute cluster) by providing a different BPPARAM argument.
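
For orientation, these are some of the back-ends provided by BiocParallel that can be handed to bplapply (and to applyTallies) as BPPARAM; the worker counts are just illustrative values.

library("BiocParallel")
SerialParam()                # no parallelization, useful for debugging
MulticoreParam(workers = 4)  # fork-based parallelism on a single machine
SnowParam(workers = 4)       # socket cluster of local (or remote) workers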

As an example of how to set up such an environment we will illustrate the process for the case of LSF.

LSF Example setup

First we need to configure the BatchJobs package to use LSF. We do this by putting a configuration file called .BatchJobs.R into our home directory or the current working directory (if both exist, the latter overrides the former).

This is the example configuration I use.

cluster.functions <- makeClusterFunctionsLSF("/home/pyl/batchjobs/lsf.tmpl")
mail.start <- "first"
mail.done <- "last"
mail.error <- "all"
mail.from <- "<paul.theodor.pyl@embl.de>"
mail.to <- "<pyl@embl.de>"
mail.control <- list(smtpServer = "smtp.embl.de")

For explanations of how to customize this, have a look at the BatchJobs documentation.

The important part is the first line in which we specify that LSF shall be used. The call to makeClusterFunctionsLSF has one parameter specifying a template file for the cluster calls. This template file has the following content.

## Default resources can be set in your .BatchJobs.R by defining the variable 
## 'default.resources' as a named list.

## remove everything in [] if your cluster does not support arrayjobs
#BSUB-J <%= job.name %>[1-<%= arrayjobs %>]         # name of the job / array jobs
#BSUB-o <%= log.file %>                             # output is sent to logfile, stdout + stderr by default
#BSUB-n <%= resources$ncpus %>                      # Number of CPUs on the node
#BSUB-q <%= resources$queue %>                      # Job queue
#BSUB-W <%= resources$walltime %>                   # Walltime in minutes
#BSUB-M <%= resources$memory %>                     # Memory requirements in Kbytes

# we merge R output with stdout from LSF, which gets then logged via -o option
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout

Once this setup is functional we can test it with the following little script.

library("BiocParallel")
library("BatchJobs")
cf <- makeClusterFunctionsLSF("/home/pyl/batchjobs/lsf.tmpl")
bjp <- BatchJobsParam(cluster.functions = cf, resources = list(queue = "medium_priority", 
    memory = "4000", ncpus = "4", walltime = "00:30"))
bplapply(1:10, sqrt)
bplapply(1:10, sqrt, BPPARAM = bjp)

Note how we explicitly define the cluster.functions to point to the same template file as before, telling BatchJobsParam to use the LSF cluster. The resources parameter is a named list specifying the resources required for the cluster jobs; the fields correspond to options of the bsub command, as illustrated in the template file.

We could now do the tallying on the cluster like so:

theData <- applyTallies(bamFiles, chrom = chrom, start = startpos, stop = endpos, 
    ncycles = 10, BPPARAM = bjp)  #the first and last 10 sequencing cycles are called unreliable

Obviously in this case we do not gain anything from parallelizing the calls in this way since the overhead for handling the cluster calls etc. is disproportionately higher than the actual compute time.

If we think about scaling the tallying up to processing whole genomes of some 20 samples in blocks of e.g. 100 kilobases, then this setup can be really powerful.
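
The following is a minimal sketch (not run) of what such a block-wise setup could look like: we walk along the chromosome in 100 kb blocks, tally each block with applyTallies (parallelized over samples through BPPARAM) and write the result into the tally file before moving on, so that all writes to the HDF5 file come from a single process. All loop variables here are illustrative.

blocksize <- 100000
starts <- seq(1, chromlength, by = blocksize)
stops <- pmin(starts + blocksize - 1, chromlength)
for (i in seq_along(starts)) {
    block <- applyTallies(bamFiles, chrom = chrom, start = starts[i], stop = stops[i], 
        ncycles = 10, BPPARAM = bjp)
    h5write(block$Counts, tallyFile, paste(group, "Counts", sep = "/"), 
        index = list(NULL, NULL, NULL, starts[i]:stops[i]))
    # ... analogous h5write() calls for Coverages, Deletions and Reference
}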

Sometimes finer control is needed and we have to switch from BiocParallel::bplapply to lower-level calls to functions from the BatchJobs package. I plan to outline an example workflow here, when I get to it :)