# Contents

Author: Martin Morgan
Date: 22 July, 2019

# 1 R

## 1.1 History of R and CRAN

• Statistical programming language. Conceived 1992, initial version 1996, stable beta version in 2000; an implementation of S. CRAN started in 1997.
• ‘Free’ software: no cost, open source, broad use.
• Extensible: packages (15,000 on CRAN, 1750 on Bioconductor)
• Key features:
  • Intrinsic statistical concepts
  • Vectorized computation
  • ‘Old-school’ scripts rather than graphical user interface – great for reproducibility!

## 1.2 Vectors and data frames

1 + 2
## [1] 3
x = c(1, 2, 3)
1:3             # sequence of integers from 1 to 3
## [1] 1 2 3
x + c(4, 5, 6)  # vectorized
## [1] 5 7 9
x + 4           # recycling
## [1] 5 6 7

Vectors

• numeric(), character(), logical(), integer(), complex(), …
• NA: ‘not available’
• factor(): values from restricted set of ‘levels’.
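
A minimal sketch of the vector concepts above, using made-up values (`v` and `f` are illustrative names, not part of the tutorial's data). It shows how NA propagates through a summary unless explicitly removed, and how a factor carries a fixed set of levels.

```r
v <- c(1.5, NA, 3)            # numeric vector with one missing value
class(v)                      # "numeric"
is.na(v)                      # FALSE TRUE FALSE
mean(v)                       # NA: missing values propagate
mean(v, na.rm = TRUE)         # 2.25, computed from the non-missing values

# A factor stores each value as one of a restricted set of levels
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(f)                     # "low" "high"
table(f)                      # counts per level: low 2, high 1
```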

Operations

• numeric: ==, <, <=, >, >=, …
• logical: | (or), & (and), ! (not)
• subset: [, e.g., x[c(2, 3)]
• assignment: [<-, e.g., x[c(1, 3)] = c(10, 30)
• other: is.na()
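
The operations above combine naturally. A short sketch, assuming a small made-up vector `z` (an illustrative name chosen so that `x` is left untouched):

```r
z <- c(5, NA, 8, 2)
z > 4                         # TRUE NA TRUE FALSE: comparison with NA is NA
keep <- !is.na(z) & z > 4     # TRUE FALSE TRUE FALSE
big <- z[keep]                # logical subset: 5 8
z[is.na(z)] <- 0              # subset-assignment replaces the missing value
z                             # 5 0 8 2
```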

Functions

x = rnorm(100)
y = x + rnorm(100)
plot(x, y)

• Many!

data.frame

df <- data.frame(Independent = x, Dependent = y)
head(df)
##   Independent  Dependent
## 1   0.8658385  0.8357491
## 2  -1.2530897 -2.3004453
## 3   0.6287058  1.8726218
## 4  -0.4357103 -1.7256617
## 5  -0.9183898 -0.8309443
## 6  -0.1622652 -1.0660857
df[1:5, 1:2]
##   Independent  Dependent
## 1   0.8658385  0.8357491
## 2  -1.2530897 -2.3004453
## 3   0.6287058  1.8726218
## 4  -0.4357103 -1.7256617
## 5  -0.9183898 -0.8309443
df[1:5, ]
##   Independent  Dependent
## 1   0.8658385  0.8357491
## 2  -1.2530897 -2.3004453
## 3   0.6287058  1.8726218
## 4  -0.4357103 -1.7256617
## 5  -0.9183898 -0.8309443
plot(Dependent ~ Independent, df)  # 'formula' interface

• List of equal-length vectors
• Vectors can be of different type
• Two-dimensional subset and assignment
• Column access: df[, 1], df[, "Independent"], df[[1]], df[["Independent"]], df$Independent

Exercise: plot only values with Dependent > 0, Independent > 0

1. Select rows

ridx <- (df$Dependent > 0) & (df$Independent > 0)

2. Plot subset

plot(Dependent ~ Independent, df[ridx, ])

3. Skin the cat another way

plot(
    Dependent ~ Independent, df,
    subset = (Dependent > 0) & (Independent > 0)
)
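
The data.frame properties listed earlier in this section can be checked on a small example. This is a sketch on hypothetical toy data (`toy`, `id` and `score` are made-up names), not on the simulated `df`:

```r
# Two equal-length columns of different type
toy <- data.frame(
    id = c("a", "b", "c"),
    score = c(0.5, -1.2, 2.0),
    stringsAsFactors = FALSE      # keep 'id' as character in R < 4.0
)
sapply(toy, class)                # id: "character", score: "numeric"

# Different ways to reach the same column
identical(toy[, 2], toy[["score"]])     # TRUE
identical(toy[["score"]], toy$score)    # TRUE

# Two-dimensional subset and assignment
positive <- toy[toy$score > 0, ]  # rows 1 and 3, all columns
toy[2, "score"] <- 0              # replace a single cell
```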

## 1.3 Analysis: functions, classes, methods

fit <- lm(Dependent ~ Independent, df)  # linear model -- regression
anova(fit)                              # summary table
## Analysis of Variance Table
##
## Response: Dependent
##             Df Sum Sq Mean Sq F value    Pr(>F)
## Independent  1 89.009  89.009  97.886 < 2.2e-16 ***
## Residuals   98 89.113   0.909
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Dependent ~ Independent, df)
abline(fit)

• lm(): plain-old function
• fit: an object of class “lm”
• anova(): a generic with a specific method for class “lm”
class(fit)
## [1] "lm"
methods(class="lm")
##  [1] add1           alias          anova          case.names
##  [5] coerce         confint        cooks.distance deviance
##  [9] dfbeta         dfbetas        drop1          dummy.coef
## [13] effects        extractAIC     family         formula
## [17] hatvalues      influence      initialize     kappa
## [21] labels         logLik         model.frame    model.matrix
## [25] nobs           plot           predict        print
## [29] proj           qr             residuals      rstandard
## [33] rstudent       show           simulate       slotsFromS3
## [37] summary        variable.names vcov
## see '?methods' for accessing help and source code
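
To make the generic / method distinction concrete, here is a sketch of a user-defined S3 generic. The names `describe`, `describe.default`, `describe.lm` and `fit0` are hypothetical and exist only for illustration; the built-in `cars` data set is used so the example does not depend on `df`.

```r
# A generic: its only job is to dispatch on the class of 'x'
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "some object"

# Method chosen when class(x) includes "lm"
describe.lm <- function(x, ...) {
    paste("linear model with", length(coef(x)), "coefficients")
}

fit0 <- lm(dist ~ speed, cars)
describe(fit0)    # dispatches to describe.lm
describe(1:3)     # falls back to describe.default
```

`anova()` works the same way: calling `anova(fit)` dispatches to the method written for class "lm".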

## 1.4 Help!

?"plot"          # plain-old-function or generic
?"plot.formula"  # method
?"plot.lm"       # method for object of class 'lm', plot(fit)
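
When the name of a function or method is not known in advance, a few other standard tools can help. A brief sketch using only functions shipped with R (`m` is an illustrative variable name):

```r
apropos("^anova")                # objects on the search path whose name starts with 'anova'
args(lm)                         # argument list of a plain-old function
m <- getS3method("plot", "lm")   # the method that plot(fit) dispatches to
is.function(m)                   # TRUE
```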

## 1.5 Packages

library(ggplot2)
ggplot(df, aes(x = Independent, y = Dependent)) +
    geom_point() + geom_smooth(method = "lm")

• General purpose: >15,000 packages on CRAN
• Gain contributor’s domain expertise and weird (or other) idiosyncrasies
• Installation (once per computer) versus loading (via library(ggplot2), once per session)
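
The install-once, load-each-session distinction might look like this in a script. This is a sketch, assuming internet access and a configured CRAN mirror if the package is not yet installed; `pkg` is an illustrative variable name.

```r
pkg <- "ggplot2"

# Install only if the package is missing (once per computer)
if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
}

# Attach the package to the search path (once per session)
library(pkg, character.only = TRUE)
```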

# 2 Bioconductor

Started in 2002 as a platform for understanding and analyzing microarray data.

## 2.1 Packages

1,750 packages. Domains of expertise:

• Sequencing (RNASeq, ChIPSeq, single-cell, called variants, …)
• Microarrays (methylation, expression, copy number, …)
• Flow cytometry
• Proteomics

Important themes

• Reproducible research
• Interoperability between packages & workflows
• Usability

Resources

## 2.2 Objects

A distinctive feature of Bioconductor – use of objects for representing data

library(Biostrings)
dna <- DNAStringSet(c("AACTCC", "CTGCA"))
dna
##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 AACTCC
## [2]     5 CTGCA
reverseComplement(dna)
##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 GGAGTT
## [2]     5 TGCAG
• Biostrings: DNA, RNA, AA representation and manipulation
• GenomicRanges: Coordinates in genome space
• SummarizedExperiment: coordinating ‘assay’ data (e.g., counts from an RNASeq experiment) with row and column annotations (e.g., information about samples and experimental treatments).
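
As a companion to the Biostrings example, here is a sketch of 'coordinates in genome space' with GenomicRanges. It assumes the GenomicRanges package is installed; the chromosome names and positions are made up for illustration.

```r
library(GenomicRanges)

# Three ranges, each 10 bases wide, on two (hypothetical) chromosomes
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(10, 50, 20), width = 10),
    strand = c("+", "-", "+")
)

width(gr)                              # 10 10 10
chr1 <- gr[seqnames(gr) == "chr1"]     # vector-like subsetting: 2 ranges
moved <- shift(gr, 5)                  # move every range 5 bases
start(moved)                           # 15 55 25
```

As with DNAStringSet, the object behaves like a vector (length, `[` subsetting) while keeping the coordinates consistent.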

## 2.3 High-throughput sequence work flow

Web site, https://bioconductor.org

1750 ‘software’ packages, https://bioconductor.org/packages

• Sequence analysis (RNASeq, ChIPSeq, called variants, copy number, single cell)
• Microarrays (methylation, copy number, classical expression, …)
• Annotation (more about annotations later this morning…)
• Flow cytometry
• Proteomics, image analysis, …

Discovery and use, e.g., DESeq2

• Landing pages: title, description (abstract), installation instructions, badges
• Vignettes!
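
The installation instructions on a landing page typically follow the pattern below. This is a sketch, assuming internet access and an R version supported by the current Bioconductor release; DESeq2 is used only because it is the example named above.

```r
# BiocManager itself is distributed on CRAN
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install the Bioconductor package and its dependencies if missing
if (!requireNamespace("DESeq2", quietly = TRUE))
    BiocManager::install("DESeq2")

# List the vignettes shipped with the installed package
vignette(package = "DESeq2")
```

In an interactive session, `browseVignettes("DESeq2")` opens the same list in a web browser.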

Also:

• ‘Annotation’ packages
• ‘Experiment data’ packages
• Workflows
• Course material, …

# 3 End matter

## 3.1 Session Info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
##
## Matrix products: default
## BLAS:   /home/msmith/Applications/R/R-3.6.0/lib/libRblas.so
## LAPACK: /home/msmith/Applications/R/R-3.6.0/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
## [1] Biostrings_2.52.0   XVector_0.24.0      IRanges_2.18.1
## [4] S4Vectors_0.22.0    BiocGenerics_0.30.0 ggplot2_3.2.0
## [7] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1         pillar_1.4.2       compiler_3.6.0
##  [4] BiocManager_1.30.4 zlibbioc_1.30.0    tools_3.6.0
##  [7] digest_0.6.20      evaluate_0.13      tibble_2.1.3
## [10] gtable_0.3.0       pkgconfig_2.0.2    rlang_0.4.0
## [13] yaml_2.2.0         xfun_0.7           withr_2.1.2
## [16] stringr_1.4.0      dplyr_0.8.1        knitr_1.23
## [19] grid_3.6.0         tidyselect_0.2.5   glue_1.3.1
## [22] R6_2.4.0           rmarkdown_1.12     bookdown_0.10
## [25] purrr_0.3.2        magrittr_1.5       scales_1.0.0
## [28] codetools_0.2-16   htmltools_0.3.6    assertthat_0.2.1
## [31] colorspace_1.4-1   labeling_0.3       stringi_1.4.3
## [34] lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4

## 3.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974).