Contents

1 About

1.1 Bioconductor: Analysis and comprehension of high-throughput genomic data

  • Statistical analysis: rigorous treatment of large data, technological artifacts, and designed experiments
  • Comprehension: biological context, visualization, reproducibility
  • High-throughput
    • Sequencing: RNASeq, ChIPSeq, variants, copy number, …
    • Microarrays: expression, SNP, …
    • Flow cytometry, proteomics, images, …

1.2 Packages, vignettes, work flows

  • 1296 software packages; also…

    • ‘Annotation’ packages – static databases of identifier maps, gene models, pathways, etc.; e.g., TxDb.Hsapiens.UCSC.hg19.knownGene
    • ‘Experiment’ packages – data sets used to illustrate software functionality; e.g., airway
  • Discover and navigate via biocViews
  • Package ‘landing page’

    • Title, author / maintainer, short description, citation, installation instructions, …, download statistics
  • All user-visible functions have help pages, most with runnable examples
  • ‘Vignettes’ are an important feature of Bioconductor – narrative documents illustrating how to use the package, with integrated code
  • ‘Release’ (every six months) and ‘devel’ branches
  • Support site; videos, recent courses

1.3 Package installation and use

  • A package needs to be installed once, using the instructions on the package landing page (e.g., DESeq2).

    source("https://bioconductor.org/biocLite.R")
    biocLite(c("DESeq2", "org.Hs.eg.db"))
  • biocLite() installs Bioconductor, CRAN, and GitHub packages.

  • Once installed, the package can be loaded into an R session

    library(GenomicRanges)

    and the help system queried interactively, as outlined above:

    help(package="GenomicRanges")
    vignette(package="GenomicRanges")
    vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
    ?GRanges
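The biocLite() script used for the installation step in this section dates from older Bioconductor releases; releases from 3.8 onward document the BiocManager package from CRAN instead. The following is a minimal sketch of the equivalent step under that assumption. The helper name missing_packages() is hypothetical and introduced only for illustration, and the install call is guarded by interactive() so that sourcing the code in a script does not trigger downloads.

```r
# Hypothetical helper (not part of Bioconductor): which of the requested
# packages are not yet installed in the current library paths?
missing_packages <- function(pkgs) {
    setdiff(pkgs, rownames(installed.packages()))
}

needed <- missing_packages(c("DESeq2", "org.Hs.eg.db"))

# Assumption: a recent R / Bioconductor where BiocManager replaces biocLite().
if (interactive() && length(needed) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(needed)
}
```

Checking first for what is missing keeps repeated runs cheap: a package still only needs to be installed once.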

2 Key concepts

2.1 Goals

  • Reproducibility
  • Interoperability
  • Use

2.2 What a few lines of R have to say

x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)
plot(Y ~ X, df)
fit <- lm(Y ~ X, df)
anova(fit)
## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## X           1  999.71  999.71  929.56 < 2.2e-16 ***
## Residuals 998 1073.32    1.08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
abline(fit)
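Because rnorm() draws new random numbers on every call, the ANOVA table above will differ slightly from run to run. A minimal sketch, assuming repeatable output is wanted (in line with the reproducibility goal), fixes the random seed first. The seed value 123 and the helper name simulate_fit() are arbitrary choices made for this illustration.

```r
# Hypothetical helper: simulate the data and fit the model from a fixed seed.
simulate_fit <- function(seed, n = 1000) {
    set.seed(seed)                       # fix the random number stream
    x <- rnorm(n)
    y <- x + rnorm(n)
    lm(Y ~ X, data.frame(X = x, Y = y))
}

fit1 <- simulate_fit(123)
fit2 <- simulate_fit(123)

# Same seed, same simulated data, hence the same coefficient estimates.
identical(coef(fit1), coef(fit2))
```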

2.3 Classes and methods – “S3”

  • data.frame()

    • Defines class to coordinate data
    • Creates an instance or object
  • plot(), lm(), anova(), abline(): methods defined on generics to transform instances

  • Discovery and help

    class(fit)
    methods(class=class(fit))
    methods(plot)
    ?"plot"
    ?"plot.formula"
  • tab completion!
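The S3 ideas listed in this section (a class that coordinates data, an instance of it, and methods dispatched from generics) can be sketched with a toy class. The class name "reading", the constructor reading(), and the generic describe() are hypothetical names made up for this illustration; they are not part of base R.

```r
# Constructor: an S3 instance is an ordinary object carrying a class attribute.
reading <- function(value, unit) {
    stopifnot(is.numeric(value), is.character(unit), length(unit) == 1)
    structure(list(value = value, unit = unit), class = "reading")
}

# A new generic: UseMethod() dispatches on the class of the first argument.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no more specific method exists.
describe.default <- function(x, ...) {
    paste("object of class", class(x)[1])
}

# Method for the toy class.
describe.reading <- function(x, ...) {
    sprintf("%d values, mean %.2f %s", length(x$value), mean(x$value), x$unit)
}

# A method on an existing generic: print() now handles "reading" objects.
print.reading <- function(x, ...) {
    cat(describe(x), "\n")
    invisible(x)
}

r <- reading(c(1.5, 2.5, 3.5), "mm")
describe(r)                  # dispatches to describe.reading()
describe(1:3)                # falls back to describe.default()
methods(class = "reading")   # discovery, in the same way as for class(fit)
```

The naming convention generic.class is all that ties a method to its generic in S3, which is why methods() can list them by simple lookup.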

2.4 Bioconductor classes and methods – “S4”

  • Example: working with DNA sequences

    library(Biostrings)
    dna <- DNAStringSet(c("AACAT", "GGCGCCT"))
    reverseComplement(dna)
    ##   A DNAStringSet instance of length 2
    ##     width seq
    ## [1]     5 ATGTT
    ## [2]     7 AGGCGCC
    data(phiX174Phage)
    phiX174Phage
    ##   A DNAStringSet instance of length 6
    ##     width seq                                            names               
    ## [1]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA Genbank
    ## [2]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA RF70s
    ## [3]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA SS78
    ## [4]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA Bull
    ## [5]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA G97
    ## [6]  5386 GAGTTTTATCGCTTCCATGACG...GATTGGCGTATCCAACCTGCA NEB03
    letterFrequency(phiX174Phage, "GC", as.prob=TRUE)
    ##            G|C
    ## [1,] 0.4476420
    ## [2,] 0.4472707
    ## [3,] 0.4472707
    ## [4,] 0.4470850
    ## [5,] 0.4472707
    ## [6,] 0.4470850
  • Discovery and help

    class(dna)
    ?"DNAStringSet-class"
    ?"reverseComplement,DNAStringSet-method"
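To show what "S4" adds over S3, here is a minimal sketch of a formal class with a typed slot, a validity check, a generic, and a method, using only the methods package that ships with R. The class ToySeq and the generic revComp() are hypothetical names for this illustration; they are far simpler than DNAStringSet and reverseComplement() and do not reflect how Biostrings is implemented.

```r
library(methods)

# Formal class definition: slots are declared together with their types.
setClass("ToySeq", representation(seq = "character"))

# Validity check run by new(): only the letters A, C, G, T are allowed.
setValidity("ToySeq", function(object) {
    if (all(grepl("^[ACGT]*$", object@seq))) TRUE
    else "sequences may contain only A, C, G, T"
})

# Generic and method are registered explicitly rather than by naming convention.
setGeneric("revComp", function(x, ...) standardGeneric("revComp"))

setMethod("revComp", "ToySeq", function(x, ...) {
    comp <- chartr("ACGT", "TGCA", x@seq)    # complement each base
    reverse_one <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
    new("ToySeq",
        seq = vapply(comp, reverse_one, character(1), USE.NAMES = FALSE))
})

toy <- new("ToySeq", seq = c("AACAT", "GGCGCCT"))
revComp(toy)@seq                   # "ATGTT" "AGGCGCC"
existsMethod("revComp", "ToySeq")  # discovery of the registered method
```

The two strings returned match the reverseComplement(dna) output shown earlier in this section, and new("ToySeq", seq = "AXT") would stop with the validity message.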

3 High-throughput sequence analysis work flows

4 Bioconductor sequencing ecosystem