Scalable nucleotide tallies with HDF5

Abstract

With the increased affordability of genomic sequencing technologies we have at our disposal a flood of whole genome/exome datasets from many patients. When using the .bam file format large scale comparative analyses become cumbersome to implement and a lot of time is spend parsing files and collating data. Given that the tally (the table of counts of nucleotides x sample x strand x genomic position) is at the center of many genomics analyses (e.g. variant calling, copy-number estimation, mutation spectrum analysis, etc.) it is crucial to collate this information in an easily accessible format. Here we propose the use of HDF5 as this format and give concrete examples of the internal layout of a tally-HDF5 file and possible applications. We introduce tools for the creation of a tally file from a set of .bam files and tools for interacting with the tally file and performing genomics analyses.

Motivation

Currently there is a large interest in analyses of cancer genome data across large cohorts, but the current standard file formats are ill-suited to the task. The BAM-file format is too low-level and does not scale well (large file size; slicing by genomic position is usually inefficient) while the VCF format is too high-level with an emphasis on (positive) calls with low FP rate and no control over negative calls and FN rate (just considering every position not mentioned in the VCF as 'no variant' would have have a high FN rate, esp. in the face of subclonality and coverage fluctuations). There is a need for an intermediate format that is scalable, compact and cross-platform accessible.

The 'tally' data structure

The tally data structure proposed here consists of 4 datasets that are stored for each chromosome (or contig). Those datasets are:

We outline the basic layout of this set of tables here:

Name Dimensions Datatype
Counts [ #bases, #samples, #strands, #positions ] int
Coverages [ #samples, #strands, #positions ] int
Deletions [ #samples, #strands, #positions ] int
Reference [ #positions ] int

An HDF5 file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:

The layout of a tally `HDF5` file.

A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.

Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5 attributes. Convenience functions for extracting the metadata are provided, see examples below.

Interacting with tally files

In the first step we load the h5vc package and have a look at the example tally file provided with the h5vcData package.

library(h5vc)
## Loading required package: grid
## Loading required package: gridExtra
## Loading required package: ggplot2
tallyFile <- system.file("extdata", "example.tally.hfs5", package = "h5vcData")

Now we can have a look into the tally file. We use the h5ls function from the rhdf5 package.

library(rhdf5)
h5ls(tallyFile)
##               group         name       otype  dclass                   dim
## 0                 / ExampleStudy   H5I_GROUP                              
## 1     /ExampleStudy           16   H5I_GROUP                              
## 2  /ExampleStudy/16       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3  /ExampleStudy/16    Coverages H5I_DATASET INTEGER      6 x 2 x 90354753
## 4  /ExampleStudy/16    Deletions H5I_DATASET INTEGER      6 x 2 x 90354753
## 5  /ExampleStudy/16    Reference H5I_DATASET INTEGER              90354753
## 6     /ExampleStudy           22   H5I_GROUP                              
## 7  /ExampleStudy/22       Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8  /ExampleStudy/22    Coverages H5I_DATASET INTEGER      6 x 2 x 51304566
## 9  /ExampleStudy/22    Deletions H5I_DATASET INTEGER      6 x 2 x 51304566
## 10 /ExampleStudy/22    Reference H5I_DATASET INTEGER              51304566

This returns a table of all the objects present within the tally file: /home/biocbuild/bbs-3.0-bioc/R/library/h5vcData/extdata/example.tally.hfs5. This table encodes the tree-like structure shown in Figure 1.

We can extract the separately stored sample data from the tally file by using the getSampleData function:

sampleData <- getSampleData(tallyFile, "/ExampleStudy/16")
print(sampleData)
##                      SampleFiles           Sample Column    Type  Patient
## 1     ../Input/PT8PrimaryDNA.bam    PT8PrimaryDNA      6    Case Patient8
## 2     ../Input/PT5PrimaryDNA.bam    PT5PrimaryDNA      2    Case Patient5
## 3     ../Input/PT5RelapseDNA.bam    PT5RelapseDNA      3    Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA      5    Case Patient8
## 5     ../Input/PT5ControlDNA.bam    PT5ControlDNA      1 Control Patient5
## 6     ../Input/PT8ControlDNA.bam    PT8ControlDNA      4 Control Patient8

Extracting data from a tally file

We use the h5dapply function as our general purpose accessor to data stored in HDF5 based tally files and provide the following parameters:

Using h5dapply blank to get the data directly

When not specifying the function to be applied (argument FUN) we get the data returned that would otherwise be supplied to the function as it's first argument (i.e. FUN = function(x) x).

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"), 
    blocksize = 5000, range = c(2.9e+07, 29010000))
str(data)
## List of 2
##  $ 2.9e+07:29004999 :List of 2
##   ..$ Coverages   : int [1:6, 1:2, 1:5000] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ h5dapplyInfo:List of 4
##   .. ..$ Blockstart: num 2.9e+07
##   .. ..$ Blockend  : num 2.9e+07
##   .. ..$ Datasets  :'data.frame':    1 obs. of  7 variables:
##   .. .. ..$ group   : chr "/ExampleStudy/16"
##   .. .. ..$ name    : chr "Coverages"
##   .. .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5
##   .. .. ..$ dclass  : chr "INTEGER"
##   .. .. ..$ dim     : chr "6 x 2 x 90354753"
##   .. .. ..$ DimCount: int 3
##   .. .. ..$ PosDim  : int 3
##   .. ..$ Group     : chr "/ExampleStudy/16"
##  $ 29005000:29009999:List of 2
##   ..$ Coverages   : int [1:6, 1:2, 1:5000] 24 26 17 12 30 19 25 33 9 9 ...
##   ..$ h5dapplyInfo:List of 4
##   .. ..$ Blockstart: num 2.9e+07
##   .. ..$ Blockend  : num 2.9e+07
##   .. ..$ Datasets  :'data.frame':    1 obs. of  7 variables:
##   .. .. ..$ group   : chr "/ExampleStudy/16"
##   .. .. ..$ name    : chr "Coverages"
##   .. .. ..$ otype   : Factor w/ 7 levels "H5I_FILE","H5I_GROUP",..: 5
##   .. .. ..$ dclass  : chr "INTEGER"
##   .. .. ..$ dim     : chr "6 x 2 x 90354753"
##   .. .. ..$ DimCount: int 3
##   .. .. ..$ PosDim  : int 3
##   .. ..$ Group     : chr "/ExampleStudy/16"

As you can see the data object is a list with one element per block and each of those is a list with one slot for each dataset (only “Coverages” in our example; note the dimensions of the “Coverage” array are [1:6, 1:2, 1:10000], i.e. 6 samples, 2 strands and 10000 position in the current block). Furthermore the slot “h5dapplyInfo” is being attached to the list data and is itself a list containing information about the current block (“Blockstart”,“Blockend”), the Datasets represented here (slot “Datasets”) and the group we are processing right now (“/ExampleStudy/16”). This information can be used within the call to FUN to, e.g. calculate the correct offset for genomic positions by adding the block start etc.

Example of using h5dapply to extract coverage data

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/16", names = c("Coverages"), 
    blocksize = 1000, range = c(2.9e+07, 29010000), FUN = function(x) rowSums(x$Coverages), 
    verbose = TRUE)
## Processing Block #1: 2.9e+07:29000999
## Processing Block #2: 29001000:29001999
## Processing Block #3: 29002000:29002999
## Processing Block #4: 29003000:29003999
## Processing Block #5: 29004000:29004999
## Processing Block #6: 29005000:29005999
## Processing Block #7: 29006000:29006999
## Processing Block #8: 29007000:29007999
## Processing Block #9: 29008000:29008999
## Processing Block #10: 29009000:29009999
coverages <- do.call(rbind, data)
colnames(coverages) <- sampleData$Sample[order(sampleData$Column)]
coverages
##                   PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA
## 2.9e+07:29000999          32315         40117         18158         18998
## 29001000:29001999         38044         48938         17083         19910
## 29002000:29002999         43580         58418         22193         21437
## 29003000:29003999         40858         55464         22724         21332
## 29004000:29004999         42560         50930         23465         21304
## 29005000:29005999         44662         60327         22169         21701
## 29006000:29006999         43503         60565         20120         17943
## 29007000:29007999         41379         49689         23940         19213
## 29008000:29008999         41048         52724         21879         19402
## 29009000:29009999         37825         47701         19167         18208
##                   PT8EarlyStageDNA PT8PrimaryDNA
## 2.9e+07:29000999             36377         36113
## 29001000:29001999            38014         36222
## 29002000:29002999            49731         43917
## 29003000:29003999            48404         41585
## 29004000:29004999            40319         36121
## 29005000:29005999            44560         43149
## 29006000:29006999            45475         40463
## 29007000:29007999            47064         44508
## 29008000:29008999            48728         39106
## 29009000:29009999            35987         34718

Here we want to apply the function function(x) rowSums(x$Coverages) to the genomic interval 16:29000000-29010000 in blocks of size 1000 and for each of the blocks we want to only extract the dataset “Coverages” since we only access this dataset in the applied function and extracting the others would be unneccessary additional work.

The return value of h5dapply is a list with one element per block containing the output of the function FUN that will be applied to each block. In the example above we compute the sum of coverages per sample within each block, then use do.call(rbind, data) to merge the results into a matrix and set the column names to the respective sample ids that are specified in the sampleData object we loaded from the tally file.

Example of more involved coverage analysis

Let's extract the coverage in 500bp bins from a small region on chromosome 22. We use the binnedCoverage function suplied by h5vc, which is almost identical to the function(x) rowSums(x$Coverages) used above plus some checks and reformatting of the output.

data <- h5dapply(filename = tallyFile, group = "/ExampleStudy/22", names = c("Coverages"), 
    blocksize = 500, range = c(38750000, 39250000), FUN = binnedCoverage, sampledata = sampleData)
data <- do.call(rbind, data)
rownames(data) = NULL
head(data)
##   PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA PT8EarlyStageDNA
## 1         17276         25603         13203         25869            24315
## 2         19973         25611         11227         26917            22674
## 3         22130         25876         11735         23568            22064
## 4         26193         22309         11056         21408            19888
## 5         20538         21349          9439         22409            18602
## 6         23880         26834         14427         25372            23910
##   PT8PrimaryDNA    Start      End
## 1          7596 38750000 38750499
## 2         10032 38750500 38750999
## 3         10086 38751000 38751499
## 4         13041 38751500 38751999
## 5         10443 38752000 38752499
## 6          8864 38752500 38752999

The library size can be estimated from the colSums of the data object.

librarySizes <- colSums(data[, 1:6])
sizeFactors <- librarySizes/mean(librarySizes)
sizeFactors
##    PT5ControlDNA    PT5PrimaryDNA    PT5RelapseDNA    PT8ControlDNA 
##           1.1424           1.3641           0.5777           1.2359 
## PT8EarlyStageDNA    PT8PrimaryDNA 
##           1.1440           0.5359
for (n in names(sizeFactors)) {
    data[[n]] <- data[[n]]/sizeFactors[n]
}

Next we can plot the pairwise log2 ratios of the size factor adjusted coverage in this region.

suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(ggplot2))
pairWiseRatio <- function(data, sampledata, trans = log2) {
    sampledata <- subset(sampledata, Sample %in% colnames(data))
    ret <- data.frame(Start = data$Start)
    for (patient in sampledata$Patient) {
        control <- subset(sampledata, Patient == patient & Type == "Control")$Sample
        if (length(control) != 1) {
            stop(paste("Exactly one control sample per patient is expected. Found", 
                length(control), "in patient", patient))
        }
        for (case in subset(sampledata, Patient == patient & Type == "Case")$Sample) {
            ret[[paste(case, control, sep = "/")]] <- trans(data[[case]]/data[[control]])
        }
    }
    return(ret)
}
plot_data <- melt(pairWiseRatio(data, sampleData), id.vars = c("Start"))
colnames(plot_data) = c("Position", "Samples", "Ratio")
ggplot(plot_data, aes(x = Position, y = Ratio, color = Samples)) + geom_line() + 
    facet_wrap(~Samples, ncol = 1)

plot of chunk plotLog2Ratios

Tools for variant calling and visualization

In this part of the vignette we will perform some basic variant calling on a tally file and explore the visualization tools provided by h5vc. We will use the example tally file from the previous section to work on.