June 15, 2017

Basic idea

We've made some assumptions in the basic curriculum

  • local computing
    • you have control over the whole system
  • local data
    • all the data are on your machine
  • data in memory
  • functions return results relatively quickly
  • results are digestible in concise form

Exceptions to these basic assumptions

  • huge data
    • pan-cancer, TCGA, 1000 genomes and beyond
  • algorithms with very large RAM requirements
    • alignments; fragment-level bias modeling for RNA-seq (alpine)
  • analysis with very large numbers of tests
    • all SNPs vs all (transcripts, histone marks, …)
    • long runtimes, intermittent failures, massive outputs
  • can Bioconductor practices still be followed?

What are the practices to be followed? (a selected few!)

  • self-documenting data structures
    • integrate assay, sample-level, and experiment-level data
  • functional object-oriented programming
    • X[G, S] is a faithful restriction/reorganization of X and retains the type and behavior of X
    • critical elements of metadata, such as the genome build, are bound tightly to the entities whose interpretation requires them (see the sketch after this list)
  • all operations are evaluations of R functions
  • R packages are fundamental for organizing, documenting, and testing
    • software, annotation, and data
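
A minimal sketch of the X[G, S] idea using SummarizedExperiment; the toy counts, the
condition labels, and the genome tag stored in metadata() are invented for illustration.

library(SummarizedExperiment)
mat <- matrix(rpois(40, 10), nrow = 8,
              dimnames = list(paste0("gene", 1:8), paste0("samp", 1:5)))
se <- SummarizedExperiment(
    assays = list(counts = mat),
    colData = DataFrame(condition = rep(c("A", "B"), c(2, 3))),
    metadata = list(genome = "GRCh38"))
sub <- se[1:4, se$condition == "A"]   # X[G, S]: restrict features and samples together
class(sub)             # still a SummarizedExperiment
metadata(sub)$genome   # experiment-level metadata travels with the subset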

Example: GWAS catalog

library(gwascat)
data(ebicat38)  # EMBL-EBI GWAS catalog, GRCh38 coordinates
ebicat38
## gwasloc instance with 36761 records and 37 attributes per record.
## Extracted:  2017-05-20 
## Genome:  GRCh38 
## Excerpt:
## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames                 ranges strand | DISEASE/TRAIT        SNPS
##          <Rle>              <IRanges>  <Rle> |   <character> <character>
##   [1]        1 [203186754, 203186754]      * | YKL-40 levels   rs4950928
##   [2]       13 [ 39776775,  39776775]      * |     Psoriasis   rs7993214
##   [3]       15 [ 78513681,  78513681]      * |   Lung cancer   rs8034191
##   [4]        1 [159711078, 159711078]      * |   Lung cancer   rs2808630
##   [5]        3 [190632672, 190632672]      * |   Lung cancer   rs7626795
##         P-VALUE
##       <numeric>
##   [1]     1e-13
##   [2]     2e-06
##   [3]     3e-18
##   [4]     7e-06
##   [5]     8e-06
##   -------
##   seqinfo: 23 sequences from GRCh38 genome
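
Because ebicat38 extends GRanges, standard GenomicRanges verbs apply directly. A short
sketch follows; the genomic window queried here is arbitrary and chosen only for illustration.

library(GenomicRanges)
hits <- subsetByOverlaps(ebicat38, GRanges("15", IRanges(78e6, 79e6)))
mcols(hits)[, c("DISEASE/TRAIT", "SNPS", "P-VALUE")]        # catalog fields for hits in the window
sort(table(mcols(ebicat38)[["DISEASE/TRAIT"]]), decreasing = TRUE)[1:5]  # most frequently reported traits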

Beyond local computing 1: 1000 genomes in the cloud

library(ldblock)
path(s3_1kg("22"))
## [1] "http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
# can then use VariantAnnotation::scanVcfHeader(), or
#   readVcf(..., param = ScanVcfParam(which = <a GRanges>)) for targeted retrieval
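
A hedged sketch of targeted retrieval from the remote VCF, assuming VariantAnnotation is
installed and the S3 URL above is reachable; the query region is arbitrary.

library(ldblock)
library(VariantAnnotation)
tf <- s3_1kg("22")                        # Tabix-indexed VCF hosted on S3
scanVcfHeader(tf)                         # header only; no genotypes are transferred
rng <- GRanges("22", IRanges(2.0e7, 2.005e7))
vcf <- readVcf(tf, genome = "b37", param = ScanVcfParam(which = rng))
dim(geno(vcf)$GT)                         # variants x samples, restricted to the region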

Beyond local computing 2: TCGA in BigQuery, thanks to ISB/NCI

# The cgcR package provides isbApp(bq) for interactive exploration.
# getBQ, shown here, opens a BigQuery connection to the ISB-CGC TCGA dataset
# (the billing project must be supplied).
getBQ <- function () 
{
    library(dplyr)
    library(bigrquery)
    con <- DBI::dbConnect(dbi_driver(), project = "isb-cgc", 
        dataset = "tcga_201607_beta", billing = "...")
    con
}
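
A hedged usage sketch, assuming valid Google Cloud billing credentials; the table name
"Clinical_data" is an assumption here, so check DBI::dbListTables(con) for the names the
dataset actually exposes.

con <- getBQ()
DBI::dbListTables(con)                    # enumerate TCGA tables in the dataset
clin <- dplyr::tbl(con, "Clinical_data")  # hypothetical table name; a lazy reference
dplyr::collect(head(clin, 5))             # the query executes in BigQuery, not locally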

Beyond local computing 3: shinyapps.io
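
Shiny applications wrapping Bioconductor workflows can be published with the rsconnect
package. A minimal sketch, assuming a shinyapps.io account; the credentials and the app
directory name "gwasBrowser" are placeholders.

library(rsconnect)
setAccountInfo(name = "<account>", token = "<token>", secret = "<secret>")
deployApp(appDir = "gwasBrowser")   # bundles the app directory and its detected dependencies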

Conclusions

  • The Bioconductor site includes considerable documentation on how to build bioc-enabled clusters in EC2
  • New technical options for remote data, computation, and deployment are emerging
  • Required: curating these environments and strategies so that Bioconductor principles (self-identifying data structures, packaging with test disciplines, …) are preserved as we work