Author: Martin Morgan
Date: 22 July, 2019

# 1 R

## 1.1 History of R and CRAN

• Statistical programming language. Conceived 1992, initial version 1996, stable beta version in 2000; an implementation of S. CRAN started in 1997.
• ‘Free’ software: no cost, open source, broad use.
• Extensible: packages (15,000 on CRAN, 1,750 on Bioconductor)
• Key features
    • Intrinsic statistical concepts
    • Vectorized computation
    • ‘Old-school’ scripts rather than graphical user interface – great for reproducibility!

## 1.2 Vectors and data frames

1 + 2
## [1] 3
x = c(1, 2, 3)
1:3             # sequence of integers from 1 to 3
## [1] 1 2 3
x + c(4, 5, 6)  # vectorized
## [1] 5 7 9
x + 4           # recycling
## [1] 5 6 7
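
Recycling is not limited to a single number. The short sketch below uses an illustrative vector `v` (not part of the original material) to show a length-2 vector being recycled, and what happens when the lengths do not divide evenly.

```r
v <- c(1, 2, 3, 4, 5, 6)

# A length-2 vector is repeated three times to match length(v)
by_two <- v + c(10, 20)
by_two

# A single number is the most common case of recycling
doubled <- v * 2
doubled

# When the longer length is not a multiple of the shorter one, R still
# recycles but signals a warning; capture the message instead of printing it
msg <- tryCatch(v + c(1, 2, 3, 4), warning = function(w) conditionMessage(w))
msg
```

Silent recycling is convenient for scalars but can hide mistakes with longer vectors, so the warning in the last case is worth taking seriously.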

Vectors

• numeric(), character(), logical(), integer(), complex(), …
• NA: ‘not available’
• factor(): values from restricted set of ‘levels’.
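
A brief sketch of these vector types, using made-up values; the variable names are illustrative only.

```r
# Each atomic vector holds values of a single type
num <- c(1.5, 2, NA)        # numeric, with one missing value
chr <- c("a", "b", NA)      # character
lgl <- c(TRUE, FALSE, NA)   # logical
int <- 1:3                  # integer

class(num)
class(int)

# NA propagates through arithmetic unless it is explicitly removed
total_with_na <- sum(num)
total_no_na <- sum(num, na.rm = TRUE)

# A factor stores values drawn from a fixed set of levels
treatment <- factor(
    c("control", "treated", "control"),
    levels = c("control", "treated")
)
levels(treatment)
table(treatment)
```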

Operations

• numeric: ==, <, <=, >, >=, …
• logical: | (or), & (and), ! (not)
• subset: [, e.g., x[c(2, 3)]
• assignment: [<-, e.g., x[c(1, 3)] = x[c(1, 3)]
• other: is.na()
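
The operations above can be combined, as in this small sketch. The vector `v` and its values are invented for illustration.

```r
v <- c(5, NA, 12, 8)

v > 6                  # comparison is vectorized; NA stays NA
is.na(v)               # test for missing values
!is.na(v) & v > 6      # combine conditions with ! and &

v[c(1, 3)]                     # subset by position
big <- v[!is.na(v) & v > 6]    # subset by logical vector
big

v[is.na(v)] <- 0       # subset-assignment replaces the selected elements
v
```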

Functions

x = rnorm(100)
y = x + rnorm(100)
plot(x, y)

• Many!

data.frame

df <- data.frame(Independent = x, Dependent = y)
head(df)
##   Independent  Dependent
## 1   0.8658385  0.8357491
## 2  -1.2530897 -2.3004453
## 3   0.6287058  1.8726218
## 4  -0.4357103 -1.7256617
## 5  -0.9183898 -0.8309443
## 6  -0.1622652 -1.0660857
df[1:5, 1:2]
##   Independent  Dependent
## 1   0.8658385  0.8357491
## 2  -1.2530897 -2.3004453
## 3   0.6287058  1.8726218
## 4  -0.4357103 -1.7256617
## 5  -0.9183898 -0.8309443
df[1:5, ]
##   Independent  Dependent
## 1   0.8658385  0.8357491
## 2  -1.2530897 -2.3004453
## 3   0.6287058  1.8726218
## 4  -0.4357103 -1.7256617
## 5  -0.9183898 -0.8309443
plot(Dependent ~ Independent, df)  # 'formula' interface

• List of equal-length vectors
• Vectors can be of different type
• Two-dimensional subset and assignment
• Column access: df[, 1], df[, "Independent"], df[[1]], df[["Independent"]], df$Independent

Exercise: plot only values with Dependent > 0, Independent > 0

1. Select rows

ridx <- (df$Dependent > 0) & (df$Independent > 0)

2. Plot subset

plot(Dependent ~ Independent, df[ridx, ])

3. Skin the cat another way

plot(
    Dependent ~ Independent, df,
    subset = (Dependent > 0) & (Independent > 0)
)
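
Because `df` holds random numbers, the results of the exercise differ from run to run. The sketch below repeats the column access and row selection on a small hand-made data frame, so each step can be checked by eye. The name `toy` and its values are illustrative assumptions, not part of the course data.

```r
# A small data frame with fixed values
toy <- data.frame(
    Independent = c(-1.0, 0.5, 1.2, 2.0),
    Dependent   = c(-0.8, -0.1, 1.5, 2.2)
)

# These forms all return the same column as a plain vector
identical(toy[, 1], toy[["Independent"]])
identical(toy[[1]], toy$Independent)

# Logical row index: TRUE where both columns are positive
keep <- (toy$Dependent > 0) & (toy$Independent > 0)
keep

by_index <- toy[keep, ]     # two-dimensional subset, all columns
by_name <- subset(toy, (Dependent > 0) & (Independent > 0))

# Two-dimensional assignment: mark negative Dependent values as missing
toy[toy$Dependent < 0, "Dependent"] <- NA
toy
```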

## 1.3 Analysis: functions, classes, methods

fit <- lm(Dependent ~ Independent, df)  # linear model -- regression
anova(fit)                              # summary table
## Analysis of Variance Table
##
## Response: Dependent
##             Df Sum Sq Mean Sq F value    Pr(>F)
## Independent  1 89.009  89.009  97.886 < 2.2e-16 ***
## Residuals   98 89.113   0.909
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Dependent ~ Independent, df)
abline(fit)

• lm(): plain-old function
• fit: an object of class “lm”
• anova(): a generic with a specific method for class “lm”

class(fit)
## [1] "lm"
methods(class="lm")
##  [1] add1           alias          anova          case.names
##  [5] coerce         confint        cooks.distance deviance
##  [9] dfbeta         dfbetas        drop1          dummy.coef
## [13] effects        extractAIC     family         formula
## [17] hatvalues      influence      initialize     kappa
## [21] labels         logLik         model.frame    model.matrix
## [25] nobs           plot           predict        print
## [29] proj           qr             residuals      rstandard
## [33] rstudent       show           simulate       slotsFromS3
## [37] summary        variable.names vcov
## see '?methods' for accessing help and source code
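
To see how a generic chooses a method, it can help to write a tiny one. In the sketch below, `describe()` and its methods are hypothetical names invented for this example; `cars` is a data set that ships with R.

```r
# A generic: its body only dispatches on the class of 'x'
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>
describe.default <- function(x, ...) "an object without a specific method"
describe.lm <- function(x, ...) {
    paste("a linear model with", length(coef(x)), "coefficients")
}

fit_cars <- lm(dist ~ speed, data = cars)
class(fit_cars)

describe(fit_cars)   # dispatches to describe.lm()
describe(1:3)        # falls back to describe.default()

# anova() works the same way; its "lm" method can be retrieved explicitly
is.function(getS3method("anova", "lm"))
```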

## 1.4 Help!

?"plot"          # plain-old-function or generic
?"plot.formula"  # method
?"plot.lm"       # method for object of class 'lm', plot(fit)
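
Help topics for methods follow the `<generic>.<class>` naming pattern, so the topic name can be worked out from the class of an object. A minimal sketch, assuming a model fitted to the built-in `cars` data; the `help()` calls are commented out because they open a help page.

```r
fit_cars <- lm(dist ~ speed, data = cars)

# Build the help topic name from the generic and the object's class
topic <- paste("plot", class(fit_cars), sep = ".")
topic

# Confirm that a method exists for this class
is.function(getS3method("plot", "lm"))

# help("plot.lm")            # same page as ?"plot.lm"
# help(package = "stats")    # index of help pages in a package
```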

## 1.5 Packages

library(ggplot2)
ggplot(df, aes(x = Independent, y = Dependent)) +
    geom_point() + geom_smooth(method = "lm")

• General purpose: >15,000 packages on CRAN
• Gain contributor’s domain expertise and weird (or other) idiosyncrasies
• Installation (once only per computer) versus load (via library(ggplot2), once per session)
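
The difference between installing and loading can be seen without a network connection by using `tools`, a package that ships with R but is not attached by default. It stands in here for a CRAN package such as ggplot2; the `install.packages()` call is shown only as a comment.

```r
pkg <- "tools"   # stand-in for a contributed package such as "ggplot2"

# 'Installed' means the package is present in a library directory on disk
requireNamespace(pkg, quietly = TRUE)

# install.packages("ggplot2")   # once per computer; needs a network connection

# 'Loaded' (attached) means its functions are on the search path
library(pkg, character.only = TRUE)   # once per session
attached <- paste0("package:", pkg) %in% search()
attached

# A function from 'tools' is now available by name
ext <- file_ext("analysis.R")
ext
```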

# 2 Bioconductor

Started in 2002 as a platform for the analysis and understanding of microarray data.

## 2.1 Packages

1,750 packages. Domains of expertise:

• Sequencing (RNASeq, ChIPSeq, single-cell, called variants, …)
• Microarrays (methylation, expression, copy number, …)
• Flow cytometry
• Proteomics

Important themes

• Reproducible research
• Interoperability between packages & work flows
• Usability

## 2.2 Objects

A distinctive feature of Bioconductor – use of objects for representing data

library(Biostrings)
dna <- DNAStringSet(c("AACTCC", "CTGCA"))
dna
##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 AACTCC
## [2]     5 CTGCA
reverseComplement(dna)
##   A DNAStringSet instance of length 2
##     width seq
## [1]     6 GGAGTT
## [2]     5 TGCAG

• Biostrings: DNA, RNA, AA representation and manipulation
• GenomicRanges: Coordinates in genome space
• SummarizedExperiment: coordinating ‘assay’ data (e.g., counts from an RNASeq experiment) with row and column annotations (e.g., information about samples and experimental treatments).
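
To make concrete what `reverseComplement()` computes, here is a base-R sketch on plain character vectors. `rev_comp()` is a hypothetical helper written for this illustration. It is not how Biostrings implements the operation, and it assumes upper-case A, C, G, T only.

```r
# Reverse complement of plain character strings (illustration only)
rev_comp <- function(dna) {
    # Complement each base: A <-> T, C <-> G
    comp <- chartr("ACGT", "TGCA", dna)
    # Reverse the order of the letters within each string
    vapply(
        strsplit(comp, ""),
        function(bases) paste(rev(bases), collapse = ""),
        character(1)
    )
}

rc <- rev_comp(c("AACTCC", "CTGCA"))
rc

# Plain strings accept any letters, so invalid input passes through silently
rev_comp("AACXZ")
```

The last line hints at why a dedicated class is useful: an object that knows it holds DNA can check that its contents are valid, which a bare character vector cannot.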

## 2.3 High-throughput sequence work flow

Web site, https://bioconductor.org

1,750 ‘software’ packages, https://bioconductor.org/packages

• Sequence analysis (RNASeq, ChIPSeq, called variants, copy number, single cell)
• Microarrays (methylation, copy number, classical expression, …)
• Annotation (more about annotations later this morning…)
• Flow cytometry
• Proteomics, image analysis, …

Discovery and use, e.g., DESeq2

• Landing pages: title, description (abstract), installation instructions, badges
• Vignettes!

Also:

• ‘Annotation’ packages
• ‘Experiment data’ packages
• Workflows
• Course material, …

