2013-09-30 ~ 2013-10-01
This intermediate course is aimed at R / Bioconductor users who, in order to get the most out of high-throughput sequence and other analyses, want to understand more about how R and Bioconductor work. (1) The course begins by reviewing R data types, memory management, and other aspects of internal computation. We use this as a basis for understanding how to write, debug, and assess the performance of efficient R code, including straightforward approaches to iteration, vectorization, and parallel evaluation. (2) We then explore R objects, especially the S4 object system. We learn how to specify simple and more complicated S4 classes, and how to implement essential methods for single and multiple dispatch. We use insights about performance and the S4 class system to explore strategies for efficient representation of large structured data, especially the classes in the IRanges, GenomicRanges, VariantAnnotation, and Biostrings packages. (3) The availability of programming libraries (such as samtools), or performance needs, may sometimes point to the use of C or C++ code integrated into R. We develop some simple C functions, and explore Rcpp as a relatively painless way to incorporate C or C++ code. We take a brief look at R's internal data representations, and explore how to debug and profile C code. (4) Finally, we investigate how R can be used to interact with other important resources: databases, web sites, and visualization facilities like shiny. Use of some of these facilities is illustrated by packages such as AnnotationDbi and biomaRt.