In this document we will illustrate the use of the h5vc package for creating and analysing nucleotide tallies of next generation sequencing experiments.
h5vc is a tool that is designed to provide researchers with a more intuitive and effective way of interacting with data from large cohorts of samples that have been sequenced with next generation sequencing technologies.
The challenges researchers face with the advent of massive sequencing efforts aimed at sequencing RNA and DNA of thousands of samples will need to be addressed now, before the flood of data becomes a real problem.
The effects of the infeasibility of handling the sequencing data of large cohorts with the current standards (BAM, VCF, BCF, GTF, etc.) have become apparent in recent publications that performed population level analyses of mutations in many cancer samples and work exclusively on the level of preprocessed variant calls stored in VCF/MAF files simply because there is no way to look at the data itself with reasonable resource usage (e.g. in Kandoth et. al 2013).
This challenge can be adressed by augmenting the available legacy formats typically used for sequencing analyses (SAM/BAM files) with an intermediate file format that stores only the most essential information and provides efficient access to the cohort level data whilst reducing the information loss relative to the raw alignments.
This file format will store nucleotide tallies rather than alignments and allow for easy and efficient real-time random access to the data of a whole cohort of samples. The details are described in the following section.
The tally data structure proposed here consists of 5 datasets that are stored for each chromosome (or contig). Those datasets are: * Counts: A table that contains the number of observed mismatches at any combination of base, sample, strand and genomic position, * Coverages: A table that contains the number of reads overlapping at any combination of sample, strand and genomic position * Deletions: A Table that contains the number of observed deletions of bases at any combination of sample, strand and genomic position * Insertions: A Table that contains the number of observed insertions of bases at any combination of sample, strand and genomic position (showing insertions following the position) * Reference: A Table that contains the reference base against which the calls in the ‘Deletions’ and ‘Counts’ table were made.
We outline the basic layout of this set of tables here:
Name | Dimensions | Datatype |
---|---|---|
Counts | [ #bases, #samples, #strands, #positions ] | int |
Coverages | [ #samples, #strands, #positions ] | int |
Deletions | [ #samples, #strands, #positions ] | int |
Insertions | [ #samples, #strands, #positions ] | int |
Reference | [ #positions ] | int |
An HDF5
file has an internal structure that is similar to a file system, where groups are the directories and datasets are the files. The layout of the tally file is as follows:
A tally file can contain data from more than one study but each study will reside in a separte tree with a group named with the study-ID at its root and sub-groups for all the chromosomes / contigs that are present in the study. Attached to each of the chromosome groups are the 4 datasets described above.
Additionally each chromsome group stores sample data about the samples involved in the experiment (patientID, type, location in the sample dimension) as HDF5
attributes. Convenience functions for extracting the metadata are provided, see examples below.
Before we get into the details of how to generate an HDF5 tally file for a set of sequencing experiments we will show some examples of the possible analyses one can perform on such a file. The tally file we will use is provided with the h5vcData package and if you have not installed this so far you should do so now.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("h5vcData")
The first thing we do is set up the session by loading the h5vc and rhdf5 packages and finding the location of the example tally file.
suppressPackageStartupMessages(library(h5vc))
## Warning: replacing previous import 'Biostrings::pattern' by 'grid::pattern'
## when loading 'h5vc'
suppressPackageStartupMessages(library(rhdf5))
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
We can inspect the data contained in this file with the h5ls
function provided by rhdf5.
h5ls(tallyFile)
## group name otype dclass dim
## 0 / ExampleStudy H5I_GROUP
## 1 /ExampleStudy 16 H5I_GROUP
## 2 /ExampleStudy/16 Counts H5I_DATASET INTEGER 12 x 6 x 2 x 90354753
## 3 /ExampleStudy/16 Coverages H5I_DATASET INTEGER 6 x 2 x 90354753
## 4 /ExampleStudy/16 Deletions H5I_DATASET INTEGER 6 x 2 x 90354753
## 5 /ExampleStudy/16 Reference H5I_DATASET INTEGER 90354753
## 6 /ExampleStudy 22 H5I_GROUP
## 7 /ExampleStudy/22 Counts H5I_DATASET INTEGER 12 x 6 x 2 x 51304566
## 8 /ExampleStudy/22 Coverages H5I_DATASET INTEGER 6 x 2 x 51304566
## 9 /ExampleStudy/22 Deletions H5I_DATASET INTEGER 6 x 2 x 51304566
## 10 /ExampleStudy/22 Reference H5I_DATASET INTEGER 51304566
In the resulting output we can find the different groups and datasets present in the file and we can extract the relevant sample data attached to those groups in the following way.
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
sampleData
## SampleFiles Sample Column Type Patient
## 1 ../Input/PT8PrimaryDNA.bam PT8PrimaryDNA 6 Case Patient8
## 2 ../Input/PT5PrimaryDNA.bam PT5PrimaryDNA 2 Case Patient5
## 3 ../Input/PT5RelapseDNA.bam PT5RelapseDNA 3 Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA 5 Case Patient8
## 5 ../Input/PT5ControlDNA.bam PT5ControlDNA 1 Control Patient5
## 6 ../Input/PT8ControlDNA.bam PT8ControlDNA 4 Control Patient8
The sampleData
object is a data.frame
that contains information about the samples whose nucleotide tallies are present in the file. We can modify this object (e.g. add new columns) and write it back to the file using the setSampleData
function, but we must be aware that a certain set of columns have to be present (Sample
, Patient
, Column
and Type
).
sampleData$ClinicalVariable <- rnorm(nrow(sampleData))
setSampleData( tallyFile, "/ExampleStudy/16", sampleData )
sampleData
## SampleFiles Sample Column Type Patient
## 1 ../Input/PT8PrimaryDNA.bam PT8PrimaryDNA 6 Case Patient8
## 2 ../Input/PT5PrimaryDNA.bam PT5PrimaryDNA 2 Case Patient5
## 3 ../Input/PT5RelapseDNA.bam PT5RelapseDNA 3 Case Patient5
## 4 ../Input/PT8PreLeukemiaDNA.bam PT8EarlyStageDNA 5 Case Patient8
## 5 ../Input/PT5ControlDNA.bam PT5ControlDNA 1 Control Patient5
## 6 ../Input/PT8ControlDNA.bam PT8ControlDNA 4 Control Patient8
## ClinicalVariable
## 1 -0.1222107
## 2 0.4634323
## 3 -0.1524877
## 4 -0.2706385
## 5 1.0789223
## 6 0.8229085
Now that we can find the sample metadata in the file it is time to extract some of the nuclotide tally data stored there. We can use two functions to achieve this, h5readBlock
can be used to read a specified block of data along a given dimension (e.g. a region along the genomic position) and h5dapply
can be used to apply a function in a blockwise fashion along a specified dimension (e.g. calculating coverage in bins of a certain size along the genomic position dimension).
We can read out a block of data in the following way:
data <- h5readBlock(
filename = tallyFile,
group = "/ExampleStudy/16",
names = c( "Coverages", "Counts" ),
range = c(29000000,29001000)
)
str(data)
## List of 3
## $ Coverages : int [1:6, 1:2, 1:1001] 0 0 0 0 0 0 0 0 0 0 ...
## $ Counts : int [1:12, 1:6, 1:2, 1:1001] 0 0 0 0 0 0 0 0 0 0 ...
## $ h5dapplyInfo:List of 4
## ..$ Blockstart: int 29000000
## ..$ Blockend : int 29001000
## ..$ Datasets :'data.frame': 2 obs. of 3 variables:
## .. ..$ Name : chr [1:2] "Coverages" "Counts"
## .. ..$ DimCount: int [1:2] 3 4
## .. ..$ PosDim : int [1:2] 3 4
## ..$ Group : chr "/ExampleStudy/16"
When we want to read multiple blocks of data we can use the h5dapply
function which supports the usage of IRanges
and GRanges
to define regions of interest, although a simpler system where the user specifies only a start
, end
and blocksize
parameter exists.
suppressPackageStartupMessages(require(GenomicRanges))
data <- h5dapply(
filename = tallyFile,
group = "/ExampleStudy",
names = c( "Coverages" ),
dims = c(3),
range = GRanges("16", ranges = IRanges(start = seq(29e6, 30e6, 5e6), width = 1000))
)
str(data)
## List of 1
## $ 16:List of 1
## ..$ 29000000:29000999:List of 2
## .. ..$ Coverages : int [1:6, 1:2, 1:1000] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..$ h5dapplyInfo:List of 4
## .. .. ..$ Blockstart: int 29000000
## .. .. ..$ Blockend : int 29000999
## .. .. ..$ Datasets :'data.frame': 1 obs. of 3 variables:
## .. .. .. ..$ Name : chr "Coverages"
## .. .. .. ..$ DimCount: int 3
## .. .. .. ..$ PosDim : num 3
## .. .. ..$ Group : chr "/ExampleStudy/16"
Usually we do not want to load the data of all those blocks into memory (unless we need it for plotting). A typical workflow will involve some form of processing of the data and as long as this processing can be expressed as an R function that can be applied to each block separately, we can simply provide h5dapply
with this function and only retrieve the results. In the following example we calculate the coverage of the samples present in the example tally file by applying the binnedCoverage
function to blocks defined in a GRanges object.
#rangeA <- GRanges("16", ranges = IRanges(start = seq(29e6, 29.5e6, 1e5), width = 1000))
#rangeB <- GRanges("22", ranges = IRanges(start = seq(39e6, 39.5e6, 1e5), width = 1000))
range <- GRanges(
rep(c("16", "22"), each = 6),
ranges = IRanges(
start = c(seq(29e6, 29.5e6, 1e5),seq(39e6, 39.5e6, 1e5)),
width = 1000
))
coverages <- h5dapply(
filename = tallyFile,
group = "/ExampleStudy",
names = c( "Coverages" ),
dims = c(3),
range = range,
FUN = binnedCoverage,
sampledata = sampleData
)
#options(scipen=10)
coverages <- do.call( rbind, lapply( coverages, function(x) do.call(rbind, x) ))
#rownames(coverages) <- NULL #remove block-ids used as row-names
coverages
## PT5ControlDNA PT5PrimaryDNA PT5RelapseDNA PT8ControlDNA
## 16.29000000:29000999 36113 40117 18158 36377
## 16.29100000:29100999 47081 58134 20670 56402
## 16.29200000:29200999 51110 46735 22911 46598
## 16.29300000:29300999 41432 50366 23395 42309
## 16.29400000:29400999 22083 29795 11629 30813
## 16.29500000:29500999 101 101 0 202
## 22.39000000:39000999 47677 50990 23573 46070
## 22.39100000:39100999 42456 48328 19178 46106
## 22.39200000:39200999 49749 64833 25383 52946
## 22.39300000:39300999 55279 56861 26796 57834
## 22.39400000:39400999 35512 55103 18155 48502
## 22.39500000:39500999 41028 54033 20925 51626
## PT8EarlyStageDNA PT8PrimaryDNA Chrom Start End
## 16.29000000:29000999 32315 18998 16 29000000 29000999
## 16.29100000:29100999 53987 17228 16 29100000 29100999
## 16.29200000:29200999 50203 20391 16 29200000 29200999
## 16.29300000:29300999 45285 18961 16 29300000 29300999
## 16.29400000:29400999 24986 9293 16 29400000 29400999
## 16.29500000:29500999 101 0 16 29500000 29500999
## 22.39000000:39000999 47232 20825 22 39000000 39000999
## 22.39100000:39100999 45135 18827 22 39100000 39100999
## 22.39200000:39200999 53028 24135 22 39200000 39200999
## 22.39300000:39300999 51304 23554 22 39300000 39300999
## 22.39400000:39400999 39690 21945 22 39400000 39400999
## 22.39500000:39500999 37666 18943 22 39500000 39500999
Note that binnedCoverage
takes an additional parameter sampleData
which we provide to the function as well. Furthermore we specify the blocksize
to be 500 bases and we specify the dims
parameter to tell h5dapply
along which dimension of the selected datasets (“Coverages” in this case) shall be processed (dimension number 3 is the genomic position in the “Coverages” dataset). The explicit specification of dims
is only neccessary when we are not extracting the “Counts” dataset, otherwise it defaults to the genomic position.
In the same way we can perform variant calling by using h5dapply
together with a variant calling function like callVariantsPaired
or callVariantsSingle
.
variants <- h5dapply(
filename = tallyFile,
group = "/ExampleStudy/16",
names = c( "Coverages", "Counts", "Deletions", "Reference" ),
range = c(29950000,30000000),
blocksize = 10000,
FUN = callVariantsPaired,
sampledata = sampleData,
cl = vcConfParams(returnDataPoints = TRUE)
)
variants <- do.call( rbind, variants )
variants$AF <- (variants$caseCountFwd + variants$caseCountRev) / (variants$caseCoverageFwd + variants$caseCoverageRev)
variants <- variants[variants$AF > 0.2,]
rownames(variants) <- NULL # remove rownames to save some space on output :D
variants
## Chrom Start End Sample altAllele refAllele caseCountFwd
## 1 16 29950746 29950746 PT8PrimaryDNA - T 10
## 2 16 29983015 29983015 PT5PrimaryDNA G C 12
## 3 16 29983015 29983015 PT5RelapseDNA G C 3
## 4 16 29983015 29983015 PT8EarlyStageDNA G C 8
## caseCountRev caseCoverageFwd caseCoverageRev controlCountFwd controlCountRev
## 1 3 34 29 0 0
## 2 13 29 27 0 0
## 3 4 10 9 0 0
## 4 14 19 30 0 0
## controlCoverageFwd controlCoverageRev backgroundFrequencyFwd
## 1 10 15 0.1
## 2 11 19 0.0
## 3 11 19 0.0
## 4 13 10 0.0
## backgroundFrequencyRev pValueFwd pValueRev caseCount controlCount
## 1 0.0754717 0.001365664 0.3761485 13 0
## 2 0.0000000 0.000000000 0.0000000 25 0
## 3 0.0000000 0.000000000 0.0000000 7 0
## 4 0.0000000 0.000000000 0.0000000 22 0
## caseCoverage controlCoverage AF
## 1 63 25 0.2063492
## 2 56 30 0.4464286
## 3 19 30 0.3684211
## 4 49 23 0.4489796
For details about the parameters and behaviour of callVariantsPaired
have a look at the corresponding manual page ( i.e. ?callVariantsPaired
).
A function has to have a named parameter data
as its first argument in order to be compatible with h5dapply
, in this case data is a list of the same structure as the one returned by h5readBlock
.
Once we have determined the location of an interesting variant, like 16:29983015-29983015:C/G
in our case, we can create a mismatchPlot
in the region around it to get a better feeling for the variant. To this end we use the mismatchPlot
function on the tallies in the region in the following way:
windowsize <- 35
position <- variants$Start[2]
data <- h5readBlock(
filename = tallyFile,
group = paste( "/ExampleStudy", variants$Chrom[2], sep="/" ),
names = c("Coverages","Counts","Deletions", "Reference"),
range = c( position - windowsize, position + windowsize)
)
patient <- sampleData$Patient[sampleData$Sample == variants$Sample[2]]
samples <- sampleData$Sample[sampleData$Patient == patient]
p <- mismatchPlot(
data = data,
sampledata = sampleData,
samples = samples,
windowsize = windowsize,
position = position
)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
print(p)
This plot shows the region 35 bases up and downstream of the variant. It shows one panel for each sample associated with the patient that carries the variant (selected by the line sampleData$Sample[sampleData$Patient == patient]
) and each panel is centered on the varian position in the x-axis and the y-axis encodes coverage and mismatches (negative values are on the reverse strand). The grey area is the coverage and the coloured boxes are mismatches. For more details on this plot see ?mismatchPlot
.
The object returned by mismatchPlot
is a ggplot
object which can be manipulated in the same way as any other plot generated through a call to ggplot
. We can for example apply a theme to the plot (see ?ggplot2::theme
for a list of possible options).
print(p + theme(strip.text.y = element_text(family="serif", size=16, angle=0)))