1 R
2 Bioconductor
3 End matter
- 3.1 Session Info
- 3.2 Acknowledgements

Author: Martin Morgan
Date: 22 July, 2019

1 R

1.1 History of R and CRAN

Statistical programming language. Conceived 1992, initial version 1996, stable beta version in 2000; an implementation of S. CRAN started in 1997.
‘Free’ software: no cost, open source, broad use.
Extensible: packages (15,000 on CRAN, 1,750 on Bioconductor)
Key features
- Intrinsic statistical concepts
- Vectorized computation
- ‘Old-school’ scripts rather than graphical user interface – great for reproducibility!
- (Advanced) copy-on-change semantics
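
A minimal sketch of what copy-on-change means in practice. The names a, b, and set_first() are illustrative only and do not come from the lecture.

```r
a <- c(1, 2, 3)
b <- a          # no copy yet: 'a' and 'b' share the same values
b[1] <- 10      # changing 'b' triggers a copy; 'a' is left untouched
a               # 1 2 3
b               # 10 2 3

# The same holds inside functions: arguments behave as if passed by value
set_first <- function(v) {
    v[1] <- 0   # modifies only the function's local copy
    v
}
set_first(a)    # 0 2 3
a               # still 1 2 3
```

The practical consequence is that a function cannot silently alter the caller's data; to keep a change, assign the function's return value.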

1.2 Vectors and data frames

1 + 2

## [1] 3

x = c(1, 2, 3)
1:3             # sequence of integers from 1 to 3

## [1] 1 2 3

x + c(4, 5, 6)  # vectorized

## [1] 5 7 9

x + 4           # recycling

## [1] 5 6 7

Vectors

numeric(), character(), logical(), integer(), complex(), …
NA: ‘not available’
factor(): values from restricted set of ‘levels’.

Operations

numeric: ==, <, <=, >, >=, …
logical: | (or), & (and), ! (not)
subset: [, e.g., x[c(2, 3)]
assignment: [<-, e.g., x[c(1, 3)] = x[c(1, 3)]
other: is.na()
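
The vector types and operations listed above can be tried out as follows. This is a small sketch; the vector v and the factor levels "ctrl" and "trt" are made up for illustration.

```r
v <- c(1, 2, NA, 4)          # numeric vector with a missing value
is.na(v)                     # FALSE FALSE  TRUE FALSE
v > 1                        # FALSE  TRUE    NA  TRUE -- comparisons propagate NA
v[!is.na(v) & v > 1]         # 2 4 -- subset with a logical vector
v[is.na(v)] <- 0             # subset-assignment replaces the missing value
v                            # 1 2 0 4

f <- factor(c("ctrl", "trt", "ctrl"), levels = c("ctrl", "trt"))
levels(f)                    # "ctrl" "trt"
table(f)                     # counts per level: ctrl 2, trt 1
```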

Functions

x = rnorm(100)
y = x + rnorm(100)
plot(x, y)

Many!

data.frame

df <- data.frame(Independent = x, Dependent = y)
head(df)

##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887
## 6   0.7109163 -0.3363154

df[1:5, 1:2]

##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887

df[1:5, ]

##   Independent  Dependent
## 1  -0.4338047 -0.5779168
## 2  -0.2769985 -1.0665115
## 3  -1.6966211 -1.8769578
## 4  -0.6481076 -0.9540841
## 5  -2.1015776 -1.1166887

plot(Dependent ~ Independent, df)  # 'formula' interface

List of equal-length vectors
Vectors can be of different type
Two-dimensional subset and assignment
Column access: df[, 1], df[, "Independent"], df[[1]], df[["Independent"]], df$Independent
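
A short sketch of column access and two-dimensional assignment, using a small stand-in data.frame d (hypothetical values) so that the results are easy to check by eye.

```r
d <- data.frame(Independent = c(1, 2, 3), Dependent = c(2, 4, 6))

# Five equivalent ways to extract the first column as a plain vector
identical(d[, 1], d[, "Independent"])        # TRUE
identical(d[[1]], d[["Independent"]])        # TRUE
identical(d$Independent, d[[1]])             # TRUE

# Single-bracket access without a comma returns a one-column data.frame
class(d["Independent"])                      # "data.frame"

# Two-dimensional assignment: update selected rows of one column
d[d$Independent > 1, "Dependent"] <- 0
d$Dependent                                  # 2 0 0
```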

Exercise: plot only values with Dependent > 0, Independent > 0

Select rows

ridx <- (df$Dependent > 0) & (df$Independent > 0)

Plot subset

plot(Dependent ~ Independent, df[ridx, ])

Skin the cat another way

plot(
    Dependent ~ Independent, df,
    subset = (Dependent > 0) & (Independent > 0)
)

1.3 Analysis: functions, classes, methods

fit <- lm(Dependent ~ Independent, df)  # linear model -- regression
anova(fit)                              # summary table

## Analysis of Variance Table
## 
## Response: Dependent
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## Independent  1  92.664  92.664   70.32 3.787e-13 ***
## Residuals   98 129.139   1.318                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(Dependent ~ Independent, df)
abline(fit)

lm(): plain-old function
fit: an object of class “lm”
anova(): a generic with a specific method for class “lm”
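
Other generics have "lm" methods too. The sketch below refits the same kind of model on freshly simulated data; the seed, the data.frame d, and the name fit2 are assumptions for illustration, so the numbers will differ from the output shown in this section.

```r
set.seed(123)                          # arbitrary seed, for reproducibility only
d <- data.frame(Independent = rnorm(100))
d$Dependent <- d$Independent + rnorm(100)

fit2 <- lm(Dependent ~ Independent, d)
coef(fit2)                             # intercept and slope estimates
head(residuals(fit2))                  # observed minus fitted values
predict(fit2, data.frame(Independent = c(-1, 0, 1)))   # fitted values at new x
summary(fit2)$r.squared                # proportion of variance explained
```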

class(fit)

## [1] "lm"

methods(class="lm")

##  [1] add1           alias          anova          case.names    
##  [5] coerce         confint        cooks.distance deviance      
##  [9] dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula       
## [17] hatvalues      influence      initialize     kappa         
## [21] labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print         
## [29] proj           qr             residuals      rstandard     
## [33] rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code
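
To see how a function, a class, and a method fit together, here is a minimal sketch of the S3 system that lm() and anova() rely on. The class "temperature" and the generic describe() are hypothetical and exist only for this example.

```r
# Constructor: a plain-old function returning an object of class "temperature"
temperature <- function(celsius) {
    structure(list(celsius = celsius), class = "temperature")
}

# Generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Method for class "temperature", named <generic>.<class>
describe.temperature <- function(x, ...) {
    sprintf("%.1f degrees C", x$celsius)
}

# Fallback used for any class without its own method
describe.default <- function(x, ...) "no description available"

t1 <- temperature(21.5)
class(t1)          # "temperature"
describe(t1)       # "21.5 degrees C"
describe(1:3)      # "no description available"
```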

1.4 Help!

?"plot"          # plain-old-function or generic
?"plot.formula"  # method
?"plot.lm"       # method for object of class 'lm', plot(fit)
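
A few further ways to find help from the command line, sketched with functions from base R's utils package. The choice of anova and lm as examples is arbitrary.

```r
help("anova.lm")                      # same page as ?"anova.lm"
apropos("^anova")                     # objects on the search path whose names start with "anova"
args(lm)                              # argument list, without opening the help page
m <- getS3method("anova", "lm")       # retrieve a method even when it is not exported
is.function(m)                        # TRUE
```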

1.5 Packages

library(ggplot2)
ggplot(df, aes(x = Independent, y = Dependent)) +
    geom_point() + geom_smooth(method = "lm")

General purpose: >15,000 packages on CRAN
Gain contributor’s domain expertise and weird (or other) idiosyncrasies
Installation (once only per computer) versus load (via library(ggplot2), once per session)
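
The install-versus-load distinction can be sketched as below. install_if_missing() is a hypothetical helper, not part of any package. The example uses "stats", which ships with R, so nothing is downloaded; substituting "ggplot2" would work the same way but needs a network connection on first use.

```r
# Hypothetical helper: install from CRAN only when the package is missing
install_if_missing <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE))
        install.packages(pkg)      # once per computer
    invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("stats")        # already installed, so no download happens
library(stats)                     # load: once per session

# A single function can be used without attaching the whole package
stats::median(c(1, 5, 9))          # 5

# Bioconductor packages are installed with BiocManager::install() instead
```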

2 Bioconductor

Started 2002 as a platform for understanding analysis of microarray data

2.1 Packages

1,750 packages. Domains of expertise:

Sequencing (RNASeq, ChIPSeq, single-cell, called variants, …)
Microarrays (methylation, expression, copy number, …)
flow cytometry
proteomics
…

Important themes

Reproducible research
Interoperability between packages & workflows
Usability

Resources

https://bioconductor.org
https://bioconductor.org/packages – software, annotation, experiment, workflow
https://support.bioconductor.org
Community Slack (sign-up)

2.2 Objects

A distinctive feature of Bioconductor – use of objects for representing data

library(Biostrings)
dna <- DNAStringSet(c("AACTCC", "CTGCA"))
dna

##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 AACTCC
## [2]     5 CTGCA

reverseComplement(dna)

##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 GGAGTT
## [2]     5 TGCAG

Biostrings: DNA, RNA, AA representation and manipulation
GenomicRanges: Coordinates in genome space
SummarizedExperiment: coordinating ‘assay’ data (e.g., counts from an RNASeq experiment) with row and column annotations (e.g., information about samples and experimental treatments).
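
As a sketch of ‘coordinates in genome space’, the following builds a small GRanges object. It assumes the GenomicRanges package is already installed; the coordinates and the score column are made up for illustration.

```r
library(GenomicRanges)

# Two genomic intervals; seqnames, ranges, and strand are the core components
gr <- GRanges(
    seqnames = c("chr1", "chr1"),
    ranges = IRanges(start = c(100, 150), end = c(199, 300)),
    strand = c("+", "+"),
    score = c(5, 10)        # illustrative metadata column
)
width(gr)                   # 100 151
reduce(gr)                  # merges the overlapping ranges into chr1:100-300
countOverlaps(gr, gr)       # 2 2 -- each range overlaps itself and the other
```

As with DNAStringSet, the object carries its own display and a vocabulary of operations that make sense for the data it represents.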

2.3 High-throughput sequence workflow

Web site, https://bioconductor.org

1,750 ‘software’ packages, https://bioconductor.org/packages

Sequence analysis (RNASeq, ChIPSeq, called variants, copy number, single cell)
Microarrays (methylation, copy number, classical expression, …)
Annotation (more about annotations later this morning…)
Flow cytometry
Proteomics, image analysis, …

Discovery and use, e.g., DESeq2

Landing pages: title, description (abstract), installation instructions, badges
Vignettes!

Also:

‘Annotation’ packages
‘Experiment data’ packages
Workflows
Course material, …

3 End matter

3.1 Session Info

sessionInfo()

## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] Biostrings_2.53.2   XVector_0.25.0      IRanges_2.19.10    
## [4] S4Vectors_0.23.17   BiocGenerics_0.31.5 ggplot2_3.2.0      
## [7] BiocStyle_2.13.2   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1         pillar_1.4.2       compiler_3.6.1    
##  [4] BiocManager_1.30.4 zlibbioc_1.31.0    tools_3.6.1       
##  [7] digest_0.6.20      evaluate_0.14      tibble_2.1.3      
## [10] gtable_0.3.0       pkgconfig_2.0.2    rlang_0.4.0       
## [13] yaml_2.2.0         xfun_0.8           withr_2.1.2       
## [16] stringr_1.4.0      dplyr_0.8.3        knitr_1.23        
## [19] grid_3.6.1         tidyselect_0.2.5   glue_1.3.1        
## [22] R6_2.4.0           rmarkdown_1.14     bookdown_0.12     
## [25] purrr_0.3.2        magrittr_1.5       scales_1.0.0      
## [28] codetools_0.2-16   htmltools_0.3.6    assertthat_0.2.1  
## [31] colorspace_1.4-1   labeling_0.3       stringi_1.4.3     
## [34] lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4

3.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974).

Lecture 1 – Introduction to R / Bioconductor
