.

.

Phenome-wide association study (PheWAS) is known to be a powerful tool in discovery and replication of genetic association studies. The recent development in the UK Biobank resource with deep genomic and phenotyping data has provided unparalleled research opportunities. To reduce the computational complexity and cost of PheWAS, the SAIGE (scalable and accurate implementation of generalized mixed model [1]) method was proposed recently, controlling for case-control imbalance and sample structure in single variant association studies. However, it is still computationally challenging to analyze the associations of thousands of phenotypes with whole-genome variant data, especially for disease diagnoses using the ICD-10 codes of deep phenotyping.

Here we develop a new high-performance statistical package (SAIGEgds) for large-scale PheWAS using mixed models [2]. In this package, the SAIGE method is implemented with optimized C++ codes taking advantage of sparse structure of genotype dosages. SAIGEgds supports efficient genomic data structure (GDS) files [3] including both integer genotypes and numeric imputed dosages. Benchmarks using the UKBiobank White British genotype data (N=430K) with coronary heart disease and simulated cases, show that SAIGEgds is 5 to 6 times faster than the SAIGE R package in the steps of fitting null models and p-value calculations. When used in conjunction with high-performance computing (HPC) clusters and/or cloud resources, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.

.

1 Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SAIGEgds")

The BiocManager::install() approach may require that you build from source, i.e. make and compilers must be installed on your system – see the R FAQ for your operating system; you may also need to install dependencies manually.

.

.

2 Examples

library(SAIGEgds)

# the genotype file in SeqArray GDS format (1000 Genomes Phase1, chromosome 22)
(geno_fn <- seqExampleFileName("KG_Phase1"))
## [1] "/home/biocbuild/bbs-3.10-bioc/R/library/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds"

2.1 Preparing SNP data for genetic relationship matrix

# open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22)
gds <- seqOpen(geno_fn)

The LD pruning is provided by snpgdsLDpruning() in the pacakge SNPRelate:

library(SNPRelate)
## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
set.seed(1000)
snpset <- snpgdsLDpruning(gds)
## SNV pruning based on LD:
## Calculating allele counts/frequencies ...
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed, 1s
## Excluding 122 SNVs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## Working space: 1,092 samples, 19,651 SNVs
##     using 1 (CPU) core
##     sliding window: 500,000 basepairs, Inf SNPs
##     |LD| threshold: 0.2
##     method: composite
## Chromosome 22: 37.70%, 7,455/19,773
## 7,455 markers are selected in total.
is(snpset)
## [1] "list"             "vector"           "list_OR_List"     "vector_OR_factor" "vector_OR_Vector"
names(snpset)
## [1] "chr22"
snpset.id <- unlist(snpset, use.names=FALSE)  # get the variant IDs of a LD-pruned set
head(snpset.id)
## [1] 1 3 4 6 7 8

Create a genotype file for genetic relationship matrix (GRM) using the LD-pruned SNP set:

grm_fn <- "grm_geno.gds"
seqSetFilter(gds, variant.id=snpset.id)
## # of selected variants: 7,455
# export to a GDS genotype file without annotation data
seqExport(gds, grm_fn, info.var=character(), fmt.var=character(), samp.var=character())
## Export to 'grm_geno.gds'
##     sample.id (1,092)  [md5: bd2a93b49750ae227793ed23c575b620]
##     variant.id (7,455)  [md5: 67e7665f2ea9a4132b49a9f5bd84fbae]
##     position  [md5: a62a3752b030b2742f45dc3a5e751e51]
##     chromosome  [md5: 6cc8b780b298da2fb8237da650af7fe8]
##     allele  [md5: 3ff1146981d1807658f5b0243c2d47b0]
##     genotype  [md5: 5bd348af9af0c1f14bbb75c2b9a8b7e6]
##     phase  [md5: 922eca75d5098e1b684dcf0bae70e86b]
##     annotation/id  [md5: 7cfc098439330617251554e95fbb2f79]
##     annotation/qual  [md5: ce8ebcb77b11f0c53309778b928a7393]
##     annotation/filter  [md5: 11301697d2af5964436ec4eb2294906f]
## Done.
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'grm_geno.gds' (412.9K)
##     # of fragments: 91
##     save to 'grm_geno.gds.tmp'
##     rename 'grm_geno.gds.tmp' (412.5K, reduced: 444B)
##     # of fragments: 54
# close the file
seqClose(gds)

.

2.2 Fitting the null model

A simulated phenotype data is used to demonstrate the model fitting:

set.seed(1000)
sampid <- seqGetData(grm_fn, "sample.id")  # sample IDs in the genotype file

pheno <- data.frame(
    sample.id = sampid,
    y  = sample(c(0, 1), length(sampid), replace=TRUE, prob=c(0.95, 0.05)),
    x1 = rnorm(length(sampid)),
    x2 = rnorm(length(sampid)),
    stringsAsFactors = FALSE)
head(pheno)
##   sample.id y          x1         x2
## 1   HG00096 0  0.02329127 -0.1700062
## 2   HG00097 0 -1.38629108 -1.0058530
## 3   HG00099 0  0.56339700 -0.6989284
## 4   HG00100 0 -0.25703246 -1.5884806
## 5   HG00101 0  0.02926429  0.7645377
## 6   HG00102 0 -0.44480750  1.9964501
# model fitting
glmm <- seqFitNullGLMM_SPA(y ~ x1 + x2, pheno, grm_fn, trait.type="binary", sample.col="sample.id")
## SAIGE association analysis:
## Tue Oct 29 23:53:45 2019
## Open the genotype file 'grm_geno.gds'
## Filtering variants:
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed, 1s
## Fit the null model: y ~ x1 + x2 + var(GRM)
##     # of samples: 1,092
##     # of variants: 2,079
##     using 1 thread
## Transform on the design matrix with QR decomposition:
##     new formula: y ~ x0 + x1 + x2 - 1
## Start loading SNP genotypes:
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed, 0s
##     using 1.5M (sparse matrix)
## Binary outcome: y
##        y Number Proportion
##        0   1034 0.94688645
##        1     58 0.05311355
## Initial fixed-effect coefficients:
##           x0          x1         x2
##     2.884823 -0.09445083 0.01380474
## Initial variance component estimates, tau:
##     Sigma_E: 1, Sigma_G: 0.49888
## Iteration 1:
##     tau: (1, 0.4988798)
##     fixed coeff: (2.884823, -0.09445083, 0.01380474)
## Final tau: (1, 0)
##     fixed coeff: (2.884823, -0.09445083, 0.01380474)
## Calculate the average ratio of variances:
## Tue Oct 29 23:53:46 2019
##      1, maf: 0.0101, var1: 0.0488, var2: 0.0488, ratio: 1.00000
##      2, maf: 0.0261, var1: 0.0544, var2: 0.0544, ratio: 1.00000
##      3, maf: 0.0797, var1: 0.0479, var2: 0.0479, ratio: 1.00000
##      4, maf: 0.2074, var1: 0.0402, var2: 0.0402, ratio: 1.00000
##      5, maf: 0.2967, var1: 0.0371, var2: 0.0371, ratio: 1.00000
##      6, maf: 0.1992, var1: 0.0454, var2: 0.0454, ratio: 1.00000
##      7, maf: 0.0101, var1: 0.049, var2: 0.049, ratio: 1.00000
##      8, maf: 0.4446, var1: 0.0298, var2: 0.0298, ratio: 1.00000
##      9, maf: 0.1268, var1: 0.0471, var2: 0.0471, ratio: 1.00000
##     10, maf: 0.2427, var1: 0.0305, var2: 0.0305, ratio: 1.00000
##     11, maf: 0.0206, var1: 0.0598, var2: 0.0598, ratio: 1.00000
##     12, maf: 0.0385, var1: 0.0532, var2: 0.0532, ratio: 1.00000
##     13, maf: 0.0183, var1: 0.0525, var2: 0.0525, ratio: 1.00000
##     14, maf: 0.2056, var1: 0.0455, var2: 0.0455, ratio: 1.00000
##     15, maf: 0.0096, var1: 0.0535, var2: 0.0535, ratio: 1.00000
##     16, maf: 0.0114, var1: 0.0487, var2: 0.0487, ratio: 1.00000
##     17, maf: 0.0408, var1: 0.049, var2: 0.049, ratio: 1.00000
##     18, maf: 0.1983, var1: 0.04, var2: 0.04, ratio: 1.00000
##     19, maf: 0.0490, var1: 0.0479, var2: 0.0479, ratio: 1.00000
##     20, maf: 0.0114, var1: 0.0515, var2: 0.0515, ratio: 1.00000
##     21, maf: 0.0188, var1: 0.0592, var2: 0.0592, ratio: 1.00000
##     22, maf: 0.1937, var1: 0.043, var2: 0.043, ratio: 1.00000
##     23, maf: 0.0128, var1: 0.0527, var2: 0.0527, ratio: 1.00000
##     24, maf: 0.4020, var1: 0.0348, var2: 0.0348, ratio: 1.00000
##     25, maf: 0.0316, var1: 0.0463, var2: 0.0463, ratio: 1.00000
##     26, maf: 0.0357, var1: 0.0517, var2: 0.0517, ratio: 1.00000
##     27, maf: 0.0989, var1: 0.0481, var2: 0.0481, ratio: 1.00000
##     28, maf: 0.0096, var1: 0.0585, var2: 0.0585, ratio: 1.00000
##     29, maf: 0.0275, var1: 0.0525, var2: 0.0525, ratio: 1.00000
##     30, maf: 0.1255, var1: 0.0455, var2: 0.0455, ratio: 1.00000
##     ratio avg. is 1, sd: 8.525292e-16
## Tue Oct 29 23:53:46 2019
## Done.

.

2.3 P-value calculations

# using 2 processes
assoc <- seqAssocGLMM_SPA(geno_fn, glmm, mac=10, parallel=2)
## Open '/home/biocbuild/bbs-3.10-bioc/R/library/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds' ...
## SAIGE association analysis:
##     # of samples: 1,092
##     # of variants: 19,773
##     p-value threshold for SPA adjustment: 0.05
##     variance ratio for approximation: 1
##     # of processes: 2
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed, 1s
## # of variants after filtering MAF/MAC: 9,371
## Done.
head(assoc)
##   id chr      pos       rs.id                ref alt     AF.alt AC.alt  num       beta        SE      pval
## 1  1  22 16051497 rs141578542                  A   G 0.30494505    666 1092  0.2214607 0.2194430 0.3128813
## 2  2  22 16059752 rs139717388                  G   A 0.05677656    124 1092  0.4227084 0.4240575 0.3188526
## 3  5  22 16060995   rs2843244                  G   A 0.06135531    134 1092  0.3044670 0.4100805 0.4578107
## 4  8  22 16166919             ATATTTTCTGCACATATT   A 0.01190476     26 1092 -1.0851502 0.8804521 0.2177653
## 5  9  22 16173887                             GT   G 0.03067766     67 1092 -0.8195449 0.5587294 0.1424302
## 6 10  22 16205515 rs144309057                  G   A 0.01098901     24 1092 -1.0826898 0.9340256 0.2463890
##   pval.noadj converged
## 1  0.3128813      TRUE
## 2  0.3188526      TRUE
## 3  0.4578107      TRUE
## 4  0.2177653      TRUE
## 5  0.1424302      TRUE
## 6  0.2463890      TRUE
# filtering based on p-value
assoc[assoc$pval < 5e-4, ]
##         id chr      pos      rs.id ref alt    AF.alt AC.alt  num     beta        SE         pval   pval.noadj
## 8422 17856  22 48507315 rs57751251   T   C 0.1666667    364 1092 0.971123 0.2555021 1.442051e-04 5.999175e-05
## 8423 17857  22 48509092  rs7292083   C   T 0.1625458    355 1092 1.008351 0.2585365 9.610252e-05 3.358045e-05
## 8630 18282  22 49060987  rs4925399   A   G 0.5929487   1295 1092 0.708757 0.1956659 2.920160e-04 2.363433e-04
##      converged
## 8422      TRUE
## 8423      TRUE
## 8630      TRUE

.

2.4 QQ plot for p-values

n <- nrow(assoc)
# expected p-values
exp.pval <- (1:n) / n
# observed p-values
obs.pval <- sort(assoc$pval)

plot(-log10(obs.pval), -log10(exp.pval), xlab="-log10(expected P)", ylab="-log10(observed P)", cex=2/3)
abline(0, 1, col="red", lty=2)

.

.

3 Session Information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SNPRelate_1.20.0 SAIGEgds_1.0.0   Rcpp_1.0.2       SeqArray_1.26.0  gdsfmt_1.22.0   
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.25             XVector_0.26.0         magrittr_1.5           GenomicRanges_1.38.0  
##  [5] BiocGenerics_0.32.0    zlibbioc_1.32.0        IRanges_2.20.0         rlang_0.4.1           
##  [9] stringr_1.4.0          GenomeInfoDb_1.22.0    tools_3.6.1            parallel_3.6.1        
## [13] xfun_0.10              htmltools_0.4.0        RcppParallel_4.4.4     yaml_2.2.0            
## [17] digest_0.6.22          crayon_1.3.4           GenomeInfoDbData_1.2.2 S4Vectors_0.24.0      
## [21] bitops_1.0-6           RCurl_1.95-4.12        evaluate_0.14          rmarkdown_1.16        
## [25] stringi_1.4.3          compiler_3.6.1         Biostrings_2.54.0      SPAtest_3.0.0         
## [29] stats4_3.6.1

.

4 References

  1. Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei WQ, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, Lee S. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet (2018). Sep;50(9):1335-1341. DOI: 10.1038/s41588-018-0184-y
  2. Zheng X, Davis J.Wade, SAIGEgds – an efficient statistical tool for large-scale PheWAS with mixed models; (Abstract 1920276). The Annual Meeting of The American Society of Human Genetics (ASHG), Oct 15-19, 2019, Houston, US.
  3. Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D. SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics (2017). DOI: 10.1093/bioinformatics/btx145.

.

5 Also See

SeqArray: Data Management of Large-scale Whole-genome Sequence Variant Calls

SNPRelate: Parallel Computing Toolset for Relatedness and Principal Component Analysis of SNP Data