1 Getting started

GenomicScores is an R package distributed as part of the Bioconductor project. To install the package, start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicScores")

Once GenomicScores is installed, it can be loaded with the following command.

library(GenomicScores)

Often, however, GenomicScores will be automatically loaded when working with an annotation package that uses GenomicScores, such as phastCons100way.UCSC.hg19.

2 Genomewide position-specific scores

Genomewide scores assign each genomic position a numeric value denoting an estimated measure of constraint or impact on variation at that position. They are commonly used to filter single nucleotide variants or assess the degree of constraint or functionality of genomic features. Genomic scores are built on the basis of different sources of information such as sequence homology, functional domains, physical-chemical changes of amino acid residues, etc.
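As an illustration of the filtering use case mentioned above, the following base R sketch keeps only those variants whose position has a score at or above a cutoff. The variant table, its column names and the cutoff `min_score` are hypothetical and chosen only for this example; they are not part of GenomicScores.

```r
# Hypothetical variant table; positions and scores are made up for illustration.
variants <- data.frame(
  chrom = c("chr7", "chr7", "chr12", "chr19"),
  pos   = c(117232380, 117232381, 25398284, 11200000),
  score = c(0.8, 0.1, 1.0, NA),
  stringsAsFactors = FALSE
)

# Assumed cutoff: keep variants at positions with a score of at least 0.5.
min_score <- 0.5

# Positions without a score (NA) are dropped rather than silently kept.
keep <- !is.na(variants$score) & variants$score >= min_score
conserved <- variants[keep, ]
print(conserved)
```

Whether positions without a score should be dropped or kept depends on the analysis; here they are dropped so that the result contains only variants with evidence of constraint.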

One example of genomic scores is the phastCons score, a measure of conservation obtained from genomewide alignments using the program phast (Phylogenetic Analysis with Space/Time models) from Siepel et al. (2005). The GenomicScores package allows one to retrieve these scores through annotation packages (Section 4) or as AnnotationHub resources (Section 5).

Genomic scores such as phastCons are often used within workflows running on top of R and Bioconductor. The purpose of the GenomicScores package is to provide easy, interactive access to genomic scores within those workflows.

3 Lossy storage of genomic scores with compressed vectors

Storing and accessing genomic scores within R is challenging when their values cover large regions of the genome, resulting in gigabytes of double-precision numbers. This is the case, for instance, for phastCons (Siepel et al. 2005), CADD (Kircher et al. 2014) or M-CAP (Jagadeesh et al. 2016) scores.

We address this problem by using lossy compression, also called quantization, coupled with run-length encoding (Rle) vectors. Lossy compression attempts to trade off precision for compression without compromising the scientific integrity of the data (Zender 2016).

Sometimes, measurements and statistical estimates under certain models generate false precision. False precision is essentially noise: it wastes storage space and is meaningless from a scientific point of view (Zender 2016). In those circumstances, lossy compression not only saves storage space, but also removes false precision.

The use of lossy compression leads to a subset of quantized values much smaller than the original set of genomic scores, resulting in long runs of identical values along the genome. These runs of identical values can be further compressed using the implementation of Rle vectors available in the S4Vectors Bioconductor package.
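The effect of quantization followed by run-length encoding can be sketched in base R on simulated data. This is only an illustration under stated assumptions: the toy scores are made up, rounding to one decimal digit is an assumed quantization scheme, and base R `rle()` stands in for the `Rle` vectors of S4Vectors.

```r
set.seed(1)  # reproducible toy data

# Hypothetical raw scores: three plateaus plus small noise that stands in
# for false precision.
plateaus <- rep(c(0.12, 0.88, 0.42), times = c(400, 300, 300))
raw <- plateaus + rnorm(length(plateaus), sd = 0.005)

# Quantize by rounding to one decimal digit (an assumed scheme for this sketch).
quantized <- round(raw, digits = 1)

# Run-length encode the quantized vector with base R.
runs <- rle(quantized)

length(raw)                # number of stored values before encoding
length(runs$lengths)       # number of runs after encoding
max(abs(raw - quantized))  # largest quantization error, at most about 0.05

# Decoding the runs recovers the quantized values exactly.
stopifnot(identical(inverse.rle(runs), quantized))
```

The raw vector has as many distinct values as positions, so run-length encoding alone would not compress it; only after quantization do long runs of identical values appear.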

To enable seamless access to genomic scores stored as quantized values in compressed vectors, the GenomicScores package defines the GScores class of objects. This class manages the location, loading and dequantization of genomic scores stored separately for each chromosome. A further class, called MafDb, is derived from GScores to store minor allele frequency (MAF) data, accounting for specific features of this kind of data such as its organization into populations of individuals.

4 Retrieval of genomic scores through annotation packages

There are currently four different annotation packages that store genomic scores and can be accessed using the GenomicScores package; see Table 1.


Table 1: Bioconductor annotation packages storing genomic scores
Annotation package          Description
phastCons100way.UCSC.hg19   phastCons scores derived from the alignment of the human genome (hg19) to 99 other vertebrate species.
phastCons100way.UCSC.hg38   phastCons scores derived from the alignment of the human genome (hg38) to 99 other vertebrate species.
phastCons7way.UCSC.hg38     phastCons scores derived from the alignment of the human genome (hg38) to 6 other mammal species.
fitCons.UCSC.hg19           fitCons scores: fitness consequences of functional annotation for the human genome (hg19).

This is an example of how genomic scores can be retrieved using the phastCons100way.UCSC.hg19 package. Here, a GScores object is created when the package is loaded.

library(phastCons100way.UCSC.hg19)
library(GenomicRanges)
gsco <- phastCons100way.UCSC.hg19
class(gsco)
## [1] "GScores"
## attr(,"package")
## [1] "GenomicScores"

The help page of the GScores class describes the different methods to access the information and metadata stored in a GScores object. To retrieve genomic scores for specific positions, use the function scores(), as follows.

scores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=1)))
## GRanges object with 1 range and 1 metadata column:
##       seqnames                 ranges strand |    scores
##          <Rle>              <IRanges>  <Rle> | <numeric>
##   [1]     chr7 [117232380, 117232380]      * |       0.8
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

To save memory, the GenomicScores package loads scores data only from one sequence, which it needs to retrieve metadata, and from the sequences that are being queried. Note that the GScores object has now loaded the scores from chr7.

gsco
## GScores object 
## # organism: Homo sapiens (UCSC, hg19)
## # provider: UCSC
## # provider version: 09Feb2014
## # download date: Mar 17, 2017
## # loaded sequences: chr19_gl000208_random, chr7
## # maximum abs. error: 0.05
## # use 'citation()' to know how to cite these data in publications
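The on-demand loading shown above can be pictured with a small base R sketch. It is not the implementation used by GenomicScores: the functions `load_chromosome()` and `get_scores()`, the `cache` environment and the integer coding of scores are all hypothetical, and the per-chromosome data are generated in memory instead of being read from disk.

```r
# Hypothetical per-chromosome store; in practice each entry would be read
# from disk. Scores are kept as integer codes 0..10 for 0.0, 0.1, ..., 1.0.
load_chromosome <- function(chrom) {
  message("loading ", chrom)
  switch(chrom,
         chr7  = rle(rep(c(8L, 1L), times = c(5, 5))),
         chr12 = rle(rep(c(10L, 3L), times = c(4, 6))),
         stop("unknown sequence: ", chrom))
}

cache <- new.env()  # holds only the sequences queried so far

get_scores <- function(chrom, positions) {
  # Load a sequence the first time it is queried, then reuse it.
  if (!exists(chrom, envir = cache, inherits = FALSE)) {
    assign(chrom, load_chromosome(chrom), envir = cache)
  }
  codes <- inverse.rle(get(chrom, envir = cache))
  codes[positions] / 10  # dequantize integer codes back to scores
}

get_scores("chr7", c(1, 6))  # loads chr7 on first use
get_scores("chr7", 3)        # served from the cache, no reload
ls(cache)                    # only chr7 has been loaded
```

Keeping the cache in an environment means that repeated queries on the same sequence pay the loading cost once, while sequences that are never queried never occupy memory.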

The bibliographic reference to cite the genomic scores stored in a GScores object can be accessed using the citation() method either on the package name or on the GScores object. The latter is implemented in the GenomicScores package and provides a bibentry object.

citation(gsco)
## Adam Siepel, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs,
## Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W.
## Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson,
## Richard A. Gibbs, W. James Kent, Webb Miller and David Haussler
## (2005). "Evolutionarily conserved elements in vertebrate, insect,
## worm, and yeast genomes." _Genome Research_, *15*, pp. 1034-1050.
## doi: 10.1101/gr.3715005 (URL: http://doi.org/10.1101/gr.3715005).

Other methods tracing provenance and other metada