The SeqArray package is designed for the R programming environment and enables high-performance computing on multi-core symmetric multiprocessing (SMP) systems and loosely coupled computer clusters. Its features are extended by other existing R packages for WGS data analyses, and the R code for demonstration is available in the package vignette R Integration.

Figure 1: SeqArray framework and flowchart. The SeqArray format is built on top of the Genomic Data Structure (GDS) format, a generic data container with a hierarchical structure for storing multiple array-oriented data sets. A high-level R interface to GDS files is provided in the gdsfmt package with a C++ library, and the SeqArray package offers functionality specific to sequencing data. At a minimum, a SeqArray file contains sample and variant identifiers, and the position, chromosome, and reference and alternate alleles of each variant. SeqArray supports parallel computing environments such as multi-core computer clusters. The functionality of SeqArray is extended by SeqVarTools, SNPRelate, GENESIS and other R/Bioconductor packages for WGS analyses.

library(SeqArray)
## Loading required package: gdsfmt
# open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22)
file <- seqOpen(seqExampleFileName("KG_Phase1"))

seqSummary(file)
## File: /tmp/Rtmpx1Y7nP/Rinst2d9224431095/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds
## Format Version: v1.0
## Reference: GRCh37
## Ploidy: 2
## Number of samples: 1,092
## Number of variants: 19,773
## Chromosomes:
##     Chr22: 19773
## Alleles:
##     DEL, Deletion
##     tabulation: 2, 19773(100.0%)
## Annotation, Quality:
##     Min: 0, 1st Qu: 100, Median: 100, Mean: 110.146536457016, 3rd Qu: 100, Max: 3002, NA's: 10
## Annotation, FILTER:
##     PASS, , 19773(100.0%)
## Annotation, INFO variable(s):
##     <None>
## Annotation, FORMAT variable(s):
##     GT, 1, String, Genotype
## Annotation, sample variable(s):
##     Family.ID, String, <NA>
##     Population, String, <NA>
##     Gender, String, <NA>

1 SeqArray Functions

1.1 Key R Functions

Table 1: The key functions in the SeqArray package.

Function      Description
seqVCF2GDS    Reformat VCF files.
seqSetFilter  Define a data subset of samples or variants.
seqGetData    Get data from a SeqArray file with a defined filter.
seqApply      Apply a user-defined function over array margins.
seqParallel   Apply functions in parallel.

Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the R programming language. Table 1 lists five key functions provided in the SeqArray package and many data analyses can be done using just these functions.

seqVCF2GDS() converts VCF files to SeqArray format. Multiple cores in an SMP architecture within one or more compute nodes in a compute cluster can be used to simultaneously reformat the data. seqVCF2GDS() utilizes R’s connection interface to read VCF files incrementally, and it is able to import data from http/ftp texts and the standard output of a command-line tool via a pipe.
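As a minimal sketch of the conversion step, the code below reformats the small example VCF file shipped with the package (located via seqExampleFileName("vcf")) into a temporary SeqArray file. The names vcf_fn, out_fn and f are placeholders chosen for this illustration and do not appear elsewhere in this vignette.

```r
library(SeqArray)

# example VCF file shipped with SeqArray
vcf_fn <- seqExampleFileName("vcf")
# placeholder output file in the session's temporary directory
out_fn <- tempfile(fileext = ".gds")

# convert the VCF file to SeqArray format
seqVCF2GDS(vcf_fn, out_fn, verbose = FALSE)

# open the new file and count the imported samples and variants
f <- seqOpen(out_fn)
n_samp <- length(seqGetData(f, "sample.id"))
n_var  <- length(seqGetData(f, "variant.id"))
seqClose(f)

# remove the temporary file
unlink(out_fn)
```

For large inputs, the parallel argument of seqVCF2GDS() (for example parallel=2) should request the multi-core conversion described above; it is left at its default here to keep the sketch small.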

seqSetFilter() and seqGetData() can be used together to retrieve data for a selected set of samples from a defined genomic region. GRanges and GRangesList objects defined in the Bioconductor core packages are supported via seqSetFilter() (Gentleman et al. 2004; Lawrence et al. 2013).
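A short sketch of the filter-then-retrieve pattern follows. It assumes the second example file returned by seqExampleFileName("gds"), opened under the placeholder handle f so that it does not interfere with the file handle used in the rest of this vignette; the choice of the first five samples and of chromosome 22 is arbitrary.

```r
library(SeqArray)

# a second example file from the package, opened separately from `file`
f <- seqOpen(seqExampleFileName("gds"))

# keep the first five samples (arbitrary choice for illustration)
samp <- seqGetData(f, "sample.id")
seqSetFilter(f, sample.id = head(samp, 5), verbose = FALSE)

# keep only the variants on chromosome 22
seqSetFilterChrom(f, 22, verbose = FALSE)

# data are returned for the selected samples and variants only;
# genotypes come back as a ploidy x sample x variant array
geno <- seqGetData(f, "genotype")
pos  <- seqGetData(f, "position")
dim(geno)

# clear the filters and close the file
seqResetFilter(f, verbose = FALSE)
seqClose(f)
```

The sample filter and the variant filter are set independently here, so the later call to seqSetFilterChrom() does not undo the sample selection.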

seqApply() applies a user-defined function to array margins of genotypes and annotations. The function that is applied can be defined in R as is typical, or via C/C++ code using the Rcpp package (Eddelbuettel et al. 2011). seqParallel() utilizes the facilities in the parallel package (Rossini, Tierney, and Li 2007; R Core Team 2016) to perform calculations on a SeqArray file in parallel.

1.2 Calculating Allele Frequencies

We illustrate the SeqArray functions with an example that calculates the frequency of the reference allele for every variant across all chromosomes. If a genomic region is specified via seqSetFilter(), the calculation is restricted to that region instead of using all variants. seqApply() applies a user-defined function over a margin of the genotypes, and the R code is as follows:

af <- seqApply(file, "genotype", as.is="double", margin="by.variant",
    FUN=function(x) { mean(x==0L, na.rm=TRUE) })
head(af)
## [1] 0.6950549 0.9432234 0.9995421 0.9995421 0.9386447 0.9990842

where file is a SeqArray file, as.is="double" indicates that a numeric vector is returned, and margin="by.variant" applies the function variant by variant. The variable x in the user-defined function is an allele-by-sample integer matrix at a site, and 0L denotes the reference allele, where the suffix L indicates that the number is an integer.
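To make the user-defined function concrete, here is a small base-R sketch on a hand-made allele-by-sample matrix for a single variant. The matrix values are invented for illustration and are not taken from the example file.

```r
# toy allele-by-sample matrix for one variant: 2 alleles x 4 samples
# (0 = reference allele, 1 = alternate allele, NA = missing)
x <- matrix(c(0L, 0L,
              0L, 1L,
              1L, 1L,
              NA, 0L), nrow = 2)

# x == 0L marks the reference alleles; mean() over the non-missing
# entries gives the reference allele frequency
ref_freq <- mean(x == 0L, na.rm = TRUE)
ref_freq   # 4 reference alleles out of 7 non-missing alleles
```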

The Rcpp package simplifies integration of compiled C++ code with R (Eddelbuettel et al. 2011), and the function can be dynamically defined with inlined C/C++ code:

library(Rcpp)

cppFunction("
    double CalcAlleleFreq(IntegerVector x)
    {
        int len=x.size(), n=0, n0=0;
        for (int i=0; i < len; i++)
        {
            int g = x[i];
            if (g != NA_INTEGER)
            {
                n++;
                if (g == 0) n0++;
            }
        }
        return double(n0) / n;
    }")

where IntegerVector indicates that the input variable x is an integer vector and NA_INTEGER is the missing value; the function counts the zeros and the non-missing values to calculate the frequency. The name CalcAlleleFreq can be passed to seqApply() directly:

af <- seqApply(file, "genotype", as.is="double", margin="by.variant", FUN=CalcAlleleFreq)
head(af)
## [1] 0.6950549 0.9432234 0.9995421 0.9995421 0.9386447 0.9990842

The C++ version is several times faster than the R implementation, so C/C++ is an efficient option when real-time performance is required.

The calculation can also be run in parallel. The genotypes of a SeqArray file are automatically split into non-overlapping parts by variants or samples, and the results from the client processes are collected internally:

af <- seqApply(file, "genotype", as.is="double",
    margin="by.variant", FUN=function(x) { mean(x==0L, na.rm=TRUE) }, parallel=2)
head(af)
## [1] 0.6950549 0.9432234 0.9995421 0.9995421 0.9386447 0.9990842

Here parallel specifies the number of cores.
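The split-and-collect pattern can be sketched in base R with the parallel package. This is an illustration on simulated genotypes, not a description of SeqArray's internal implementation; geno, ref_freq and parts are names invented for the sketch.

```r
library(parallel)

set.seed(100)
n_samp <- 10; n_var <- 40
# simulated allele-by-sample genotypes: one 2 x n_samp matrix per variant
geno <- lapply(seq_len(n_var), function(j)
    matrix(rbinom(2 * n_samp, 1, 0.3), nrow = 2))

# reference allele frequency for one variant
ref_freq <- function(x) mean(x == 0L, na.rm = TRUE)

# serial result for comparison
af_serial <- vapply(geno, ref_freq, numeric(1))

# split the variant indices into two non-overlapping, contiguous parts
parts <- splitIndices(n_var, 2)

# forking is unavailable on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(parts,
    function(idx) vapply(geno[idx], ref_freq, numeric(1)),
    mc.cores = cores)

# collect the per-process results in variant order
af_parallel <- unlist(res)
```

Because the parts are contiguous and returned in order, concatenating them reproduces the serial result.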

1.3 PCA R Implementation

Principal Component Analysis (PCA) is a common tool used in exploratory data analysis for high-dimensional data. PCA typically involves calculating a covariance matrix, and the following R code implements the calculation proposed by Patterson, Price, and Reich (2006). The user-defined function computes the covariance contribution of each variant and adds it to a running total matrix s. The argument .progress=TRUE enables the display of progress information during the calculation.
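The quantities computed by the R code in this section can be written compactly as follows. The notation is chosen for this sketch and follows the code rather than the paper: $x_{ij}$ is the dosage of sample $i$ at variant $j$, $n_j$ is the number of samples with a non-missing dosage at variant $j$, $n$ is the total number of samples, $g_j$ is the vector of $g_{ij}$ over samples, and $g_{ij}$ is set to 0 where $x_{ij}$ is missing.

```latex
\hat{p}_j = \frac{1}{2 n_j} \sum_{i:\, x_{ij}\ \text{observed}} x_{ij},
\qquad
g_{ij} = \frac{x_{ij} - 2\hat{p}_j}{\sqrt{\hat{p}_j \,(1 - \hat{p}_j)}},
\qquad
S = \sum_{j} g_j \, g_j^{\top},
\qquad
\tilde{S} = \frac{n}{\operatorname{tr}(S)} \, S
```

The eigen-decomposition is then applied to the scaled matrix $\tilde{S}$.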

# covariance variable with an initial value
s <- 0

seqApply(file, "$dosage", function(x)
    {
        p <- 0.5 * mean(x, na.rm=TRUE)      # allele frequency
        g <- (x - 2*p) / sqrt(p*(1-p))      # normalized by allele frequency
        g[is.na(g)] <- 0                    # correct missing values
        s <<- s + (g %o% g)                 # update the cov matrix s in the parent environment
    }, margin="by.variant", .progress=TRUE)

# scaled by the number of samples over the trace
s <- s * (nrow(s) / sum(diag(s)))

# eigen-decomposition
eig <- eigen(s)
[..................................................]  0%, ETC: --- 
[==================>...............................] 36%, ETC: 4.1m
...
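[==================================================] 100%, completed in 14.3m

The same calculation runs much faster when variants are processed in blocks with seqBlockApply(), which replaces the per-variant outer product with a matrix cross-product (14s versus 14.3m in the runs shown here):
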
# covariance variable with an initial value
s <- 0

seqBlockApply(file, "$dosage", function(x)
    {
        p <- 0.5 * colMeans(x, na.rm=TRUE)     # allele frequencies (a vector)
        g <- (t(x) - 2*p) / sqrt(p*(1-p))      # normalized by allele frequency
        g[is.na(g)] <- 0                       # correct missing values
        s <<- s + crossprod(g)                 # update the cov matrix s in the parent environment
    }, margin="by.variant", .progress=TRUE)

# scaled by the number of samples over the trace
s <- s * (nrow(s) / sum(diag(s)))

# eigen-decomposition
eig <- eigen(s)
[..................................................]  0%, ETC: ---
[==================>...............................] 35%, ETC: 9s
[======================================>...........] 75%, ETC: 3s
[==================================================] 100%, completed in 14s
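The per-variant update with an outer product and the block update with crossprod() accumulate the same matrix. The base-R sketch below checks this on simulated dosages; the simulation settings and the names x, s1 and s2 are assumptions made for the illustration, and monomorphic variants are dropped to avoid dividing by zero.

```r
set.seed(1)
n_samp <- 20; n_var <- 50
freq <- runif(n_var, 0.2, 0.8)

# simulated sample-by-variant dosage matrix (0, 1 or 2 allele copies)
x <- sapply(freq, function(f) rbinom(n_samp, 2, f))
x[sample(length(x), 10)] <- NA              # a few missing dosages

# drop monomorphic variants, for which p*(1-p) would be zero
p <- 0.5 * colMeans(x, na.rm = TRUE)
x <- x[, p > 0 & p < 1, drop = FALSE]

# per-variant accumulation, as in the seqApply() version
s1 <- 0
for (j in seq_len(ncol(x))) {
    xj <- x[, j]
    pj <- 0.5 * mean(xj, na.rm = TRUE)       # allele frequency
    g  <- (xj - 2*pj) / sqrt(pj * (1 - pj))  # normalized dosages
    g[is.na(g)] <- 0                         # missing values contribute 0
    s1 <- s1 + (g %o% g)
}

# block computation, as in the seqBlockApply() version
p <- 0.5 * colMeans(x, na.rm = TRUE)
g <- (t(x) - 2*p) / sqrt(p * (1 - p))        # variant-by-sample matrix
g[is.na(g)] <- 0
s2 <- crossprod(g)

all.equal(s1, s2)
```

The block version does the same arithmetic with a single matrix product per block, which is a plausible reason for the difference in running time between the two progress logs above.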

seqParallel() uses the facilities of the R parallel package to perform calculations within a cluster or SMP environment, and the genotypes are automatically split into non-overlapping parts. The parallel implementation in R is shown below; a C-optimized implementation is also available in the SNPRelate package.

# the dataset is automatically split into two non-overlapping parts (one per process)
genmat <- seqParallel(2, file, FUN = function(f)
    {
        s <- 0  # covariance variable with an initial value
        seqBlockApply(f, "$dosage", function(x)
            {
                p <- 0.5 * colMeans(x, na.rm=TRUE)     # allele frequencies (a vector)
                g <- (t(x) - 2*p) / sqrt(p*(1-p))      # normalized by allele frequency
                g[is.na(g)] <- 0                       # correct missing values
                s <<- s + crossprod(g)                 # update the cov matrix s in the parent environment
            }, margin="by.variant")
        s  # output
    }, .combine = "+",    # sum "s" of different processes together
    split = "by.variant")

# scaled by the number of samples over the trace
genmat <- genmat * (nrow(genmat) / sum(diag(genmat)))

# eigen-decomposition
eig <- eigen(genmat, symmetric=TRUE)
# figure
plot(eig$vectors[,1], eig$vectors[,2], xlab="PC 1", ylab="PC 2")

More examples can be found in the vignette SeqArray Data Format and Access.

1.4 Parallel Implementation

The default setting for the analysis functions in the SeqArray package is serial implementation, but users can set up a cluster computing environment manually via seqParallelSetup() and distribute the calculations to multiple cores or even to more than 100 cluster nodes.

# use 2 cores for demonstration
seqParallelSetup(2)
## Enable the computing cluster with 2 forked R processes.
# numbers of distinct alleles per site
table(seqNumAllele(file))
## 
##     2 
## 19773
# reference allele frequencies
summary(seqAlleleFreq(file, ref.allele=0L))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.9725  0.9963  0.9286  0.9991  1.0000
# close the cluster environment
seqParallelSetup(FALSE)
## Stop the computing cluster.

2 Bioconductor Features

2.1 GRanges and GRangesList

In this section, we illustrate how to work with Bioconductor core packages for performing common queries to retrieve data from a SeqArray file. The GRanges and GRangesList classes manipulate genomic range data and can be used in the function seqSetFilter() to define a data subset. For example, the annotation information of each exon, the coding range and transcript ID are stored in the TxDb.Hsapiens.UCSC.hg19.knownGene object for the UCSC known gene annotations on hg19.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
# get the exons grouped by gene
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txs <- exonsBy(txdb, "gene")

where exonsBy() returns a GRangesList object for all known genes in the database.

seqSetFilter(file, txs)  # define an exon filter
## # of selected variants: 1,050
# VCF export with exon variants
seqGDS2VCF(file, "exons.vcf.gz")
## Tue Jan  3 15:43:55 2017
## VCF Export: exons.vcf.gz
##     1,092 samples, 1,050 variants
##     INFO Field: <none>
##     FORMAT Field: <none>
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed in 0s    
## Tue Jan  3 15:43:55 2017    Done.

If random-access memory is sufficiently large, users can load all exon variants at once via seqGetData(file, "genotype"); otherwise, the data have to be loaded in chunks, or a user-defined function can be applied over variants with seqApply().

2.2 VariantAnnotation

SeqArray can also export data with selected variants and samples as a VCF object for use with the VariantAnnotation package (Obenchain et al. 2014):

library(VariantAnnotation)
# select a region [10Mb, 30Mb] on chromosome 22
seqSetFilterChrom(file, 22, from.bp=10000000, to.bp=30000000)
## # of selected variants: 7,066
vcf <- seqAsVCF(file, chr.prefix="chr")
vcf
## class: CollapsedVCF 
## dim: 7066 1092 
## rowRanges(vcf):
##   GRanges with 9 metadata columns: ID, REF, ALT, QUAL, FILTER, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 0 columns: 
## geno(vcf):
##   SimpleList of length 1: GT
## geno(header(vcf)):
##       Number Type   Description
##    GT 1      String Genotype
locateVariants(vcf, txdb, CodingVariants())
## GRanges object with 524 ranges and 9 metadata columns:
##       seqnames               ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID         CDSID
##          <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList>
##     1    chr22 [17071862, 17071862]      - |   coding      1579      1579       128       74436        216505
##     2    chr22 [17073170, 17073170]      - |   coding       271       271       129       74436        216505
##     3    chr22 [17589225, 17589225]      + |   coding      1116      1116       377       73481        214034
##     4    chr22 [17601466, 17601466]      - |   coding       552       552       382       74444        216522
##     5    chr22 [17629357, 17629357]      - |   coding       424       424       394       74446        216528
##   ...      ...                  ...    ... .      ...       ...       ...       ...         ...           ...
##   520    chr22 [29913278, 29913278]      - |   coding      1567      1567      7023       74771        217273
##   521    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74768        217279
##   522    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74769        217279
##   523    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74770        217279
##   524    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74771        217279
##            GENEID       PRECEDEID        FOLLOWID
##       <character> <CharacterList> <CharacterList>
##     1      150160                                
##     2      150160                                
##     3       23765                                
##     4       27439                                
##     5       27440                                
##   ...         ...             ...             ...
##   520        8563                                
##   521        8563                                
##   522        8563                                
##   523        8563                                
##   524        8563                                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

3 Integration with SeqVarTools

The SeqVarTools package, available on Bioconductor, defines S4 classes and methods for other common operations and analyses on SeqArray datasets. The SeqVarTools vignette is available at http://www.bioconductor.org/packages/release/bioc/vignettes/SeqVarTools/inst/doc/SeqVarTools.pdf.

3.1 Linear Regression

The SeqVarTools package extends SeqArray by providing methods for many tasks common to quality control and analysis of sequence data. Methods include: transition/transversion ratio, heterozygosity and homozygosity rates, singleton counts, Hardy-Weinberg equilibrium, Mendelian error checking, and linear and logistic regression. Additionally, SeqVarTools defines a new class to link the information present in the SeqArray file to additional sample and variant annotation provided by the user, such as sex and phenotype. One could select a subset of samples in a file and run a linear regression on all variants:

library(Biobase)
library(SeqVarTools)
data(KG_P1_SampData)
KG_P1_SampData
## An object of class 'AnnotatedDataFrame'
##   rowNames: 1 2 ... 1092 (1092 total)
##   varLabels: sample.id sex age phenotype
##   varMetadata: labelDescription
head(pData(KG_P1_SampData))  # show KG_P1_SampData
##   sample.id    sex age   phenotype
## 1   HG00096   male  55 -0.65582105
## 2   HG00097 female  34  1.28337670
## 3   HG00099 female  39  0.05563847
## 4   HG00100 female  62  0.11139003
## 5   HG00101   male  60  0.34933331
## 6   HG00102 female  28  0.36536723
# link sample data to SeqArray file
seqData <- SeqVarData(file, KG_P1_SampData)

# set a sample filter (females only)
female <- sampleData(seqData)$sex == "female"
seqSetFilter(seqData, sample.sel=female)

# run linear regression
res <- regression(seqData, outcome="phenotype", covar="age")
head(res)
##   variant.id   n      freq