June 15, 2017

Basic idea

We've made some assumptions in the basic curriculum

  • local computing
    • you have control over the whole system
  • local data
    • all the data are on your machine
  • data in memory
  • functions return results relatively quickly
  • results are digestible in concise form

Exceptions to these basic assumptions

  • huge data
    • pan-cancer, TCGA, 1000 genomes and beyond
  • algorithms with very large RAM requirements
    • alignments; fragment-level bias modeling for RNA-seq (alpine)
  • analysis with very large numbers of tests
    • all SNPs vs all (transcripts, histone marks, …)
    • long runtimes, intermittent failures, massive outputs
  • can Bioconductor practices still be followed?

What are the practices to be followed? (a selected few!)

  • self-documenting data structures
    • integrate assay, sample-level, and experiment-level data
  • functional object-oriented programming
    • X[G, S] is a faithful restriction/reorganization of X and retains the type and behavior of X
    • critical elements of metadata, such as the genome build, are bound tightly to the entities whose interpretation requires them (see the sketch after this list)
  • all operations are evaluations of R functions
  • R packages are fundamental for organizing, documenting, and testing
    • software, annotation, and data
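
A minimal sketch of the X[G, S] idea using SummarizedExperiment; the toy counts, the
condition labels, and the genome tag stored in metadata() are invented for illustration.

library(SummarizedExperiment)
mat <- matrix(rpois(40, 10), nrow = 8,
              dimnames = list(paste0("gene", 1:8), paste0("samp", 1:5)))
se <- SummarizedExperiment(
    assays = list(counts = mat),
    colData = DataFrame(condition = rep(c("A", "B"), c(2, 3))),
    metadata = list(genome = "GRCh38"))
sub <- se[1:4, se$condition == "A"]   # X[G, S]: restrict features and samples together
class(sub)             # still a SummarizedExperiment
metadata(sub)$genome   # experiment-level metadata travels with the subset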

Example: GWAS catalog

library(gwascat)
data(ebicat38)  # EMBL-EBI GWAS catalog, GRCh38 coordinates
ebicat38
## gwasloc instance with 36761 records and 37 attributes per record.
## Extracted:  2017-05-20 
## Genome:  GRCh38 
## Excerpt:
## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames                 ranges strand | DISEASE/TRAIT        SNPS
##          <Rle>              <IRanges>  <Rle> |   <character> <character>
##   [1]        1 [203186754, 203186754]      * | YKL-40 levels   rs4950928
##   [2]       13 [ 39776775,  39776775]      * |     Psoriasis   rs7993214
##   [3]       15 [ 78513681,  78513681]      * |   Lung cancer   rs8034191
##   [4]        1 [159711078, 159711078]      * |   Lung cancer   rs2808630
##   [5]        3 [190632672, 190632672]      * |   Lung cancer   rs7626795
##         P-VALUE
##       <numeric>
##   [1]     1e-13
##   [2]     2e-06
##   [3]     3e-18
##   [4]     7e-06
##   [5]     8e-06
##   -------
##   seqinfo: 23 sequences from GRCh38 genome
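
Because ebicat38 extends GRanges, standard GenomicRanges verbs apply directly. A short
sketch follows; the genomic window queried here is arbitrary and chosen only for illustration.

library(GenomicRanges)
hits <- subsetByOverlaps(ebicat38, GRanges("15", IRanges(78e6, 79e6)))
mcols(hits)[, c("DISEASE/TRAIT", "SNPS", "P-VALUE")]        # catalog fields for hits in the window
sort(table(mcols(ebicat38)[["DISEASE/TRAIT"]]), decreasing = TRUE)[1:5]  # most frequently reported traits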

Beyond local computing 1: 1000 genomes in the cloud

library(ldblock)
path(s3_1kg("22"))
## [1] "http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
# can then use VariantAnnotation::scanVcfHeader(), or
#   readVcf(..., param = ScanVcfParam(which = <a GRanges>)) for targeted retrieval
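
A hedged sketch of targeted retrieval from the remote VCF, assuming VariantAnnotation is
installed and the S3 URL above is reachable; the query region is arbitrary.

library(ldblock)
library(VariantAnnotation)
tf <- s3_1kg("22")                        # Tabix-indexed VCF hosted on S3
scanVcfHeader(tf)                         # header only; no genotypes are transferred
rng <- GRanges("22", IRanges(2.0e7, 2.005e7))
vcf <- readVcf(tf, genome = "b37", param = ScanVcfParam(which = rng))
dim(geno(vcf)$GT)                         # variants x samples, restricted to the region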

Beyond local computing 2: TCGA in BigQuery, thanks to ISB/NCI

# The cgcR package provides isbApp(bq) for interactive exploration.
# getBQ, shown here, opens a BigQuery connection to the ISB-CGC TCGA dataset
# (the billing project must be supplied).
getBQ <- function () 
{
    library(dplyr)
    library(bigrquery)
    con <- DBI::dbConnect(dbi_driver(), project = "isb-cgc", 
        dataset = "tcga_201607_beta", billing = "...")
    con
}
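
A hedged usage sketch, assuming valid Google Cloud billing credentials; the table name
"Clinical_data" is an assumption here, so check DBI::dbListTables(con) for the names the
dataset actually exposes.

con <- getBQ()
DBI::dbListTables(con)                    # enumerate TCGA tables in the dataset
clin <- dplyr::tbl(con, "Clinical_data")  # hypothetical table name; a lazy reference
dplyr::collect(head(clin, 5))             # the query executes in BigQuery, not locally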

Beyond local computing 3: shinyapps.io
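
Shiny applications wrapping Bioconductor workflows can be published with the rsconnect
package. A minimal sketch, assuming a shinyapps.io account; the credentials and the app
directory name "gwasBrowser" are placeholders.

library(rsconnect)
setAccountInfo(name = "<account>", token = "<token>", secret = "<secret>")
deployApp(appDir = "gwasBrowser")   # bundles the app directory and its detected dependencies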

Conclusions

  • The Bioconductor site includes considerable documentation on how to build bioc-enabled clusters in EC2
  • New technical options for remote data, computation, and deployment are emerging
  • Required: curating these environments and strategies so that Bioconductor principles (self-identifying data structures, packaging with test disciplines, …) are preserved as we work