Work flow

1. Experimental design

Keep it simple

Replicate

Avoid confounding experimental factors with other factors

Record co-variates

Be aware of batch effects

HapMap samples from one facility, ordered by date of processing.

2. Wet-lab

Confounding factors

Artifacts of your particular protocols

3. Sequencing

Axes of variation

Application-specific, e.g.,

4. Alignment

Alignment strategies

Splice-aware aligners (and Bioconductor wrappers)

5. Reduction to ‘count tables’

6. Analysis

Unique statistical aspects

Summarization

Normalization

Appropriate error model

Pre-filtering

Borrowing information

7. Comprehension

Placing differentially expressed regions in context

Copy number / expression QC Correlation between genomic copy number and mRNA expression identified 38 mis-labeled samples in the TCGA ovarian cancer Affymetrix microarray dataset.

Experimental and statistical issues in depth

Normalization

DESeq2 estimateSizeFactors(), Anders and Huber, 2010

edgeR calcNormFactors() TMM method of Robinson and Oshlack, 2010

Dispersion

DESeq2 estimateDispersions()

edgeR estimateDisp()

Analysis of designed experiments in R

Example: t-test

t.test()

head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
plot(extra ~ group, data = sleep)

## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
## 
##  Welch Two Sample t-test
## 
## data:  extra[group == 1] and extra[group == 2]
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean of x mean of y 
##      0.75      2.33
## Formula interface
t.test(extra ~ group, sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33
## equal variance between groups
t.test(extra ~ group, sleep, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 18, p-value = 0.07919
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.363874  0.203874
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

lm() and anova()

## linear model; compare to t.test(var.equal=TRUE)
fit <- lm(extra ~ group, sleep)
anova(fit)
## Analysis of Variance Table
## 
## Response: extra
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## group      1 12.482 12.4820  3.4626 0.07919 .
## Residuals 18 64.886  3.6048                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## underlying model, used in `lm.fit()`
model.matrix(extra ~ group, sleep)     # last column indicates group effect
##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      0
## 8            1      0
## 9            1      0
## 10           1      0
## 11           1      1
## 12           1      1
## 13           1      1
## 14           1      1
## 15           1      1
## 16           1      1
## 17           1      1
## 18           1      1
## 19           1      1
## 20           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
model.matrix(extra ~ 0 + group, sleep) # contrast between columns
##    group1 group2
## 1       1      0
## 2       1      0
## 3       1      0
## 4       1      0
## 5       1      0
## 6       1      0
## 7       1      0
## 8       1      0
## 9       1      0
## 10      1      0
## 11      0      1
## 12      0      1
## 13      0      1
## 14      0      1
## 15      0      1
## 16      0      1
## 17      0      1
## 18      0      1
## 19      0      1
## 20      0      1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
fit0 <- lm(extra ~ ID, sleep)
fit1 <- lm(extra ~ ID + group, sleep)
anova(fit0, fit1)
## Analysis of Variance Table
## 
## Model 1: extra ~ ID
## Model 2: extra ~ ID + group
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     10 19.290                                
## 2      9  6.808  1    12.482 16.501 0.002833 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t.test(extra ~ group, sleep, var.equal=TRUE, paired=TRUE)
## 
##  Paired t-test
## 
## data:  extra by group
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58

genefilter::rowttests()

Limitations

Consequences

Common experimental designs

Practical: RNA-Seq gene-level differential expression

Adapted from Love, Anders, and Huber’s Bioconductor work flow

Michael Love [1], Simon Anders [2], Wolfgang Huber [2]

[1] Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, US;

[2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.

1. Experimental design

The data used in this workflow is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

2, 3, and 4: Wet lab, sequencing, and alignment

5. Reduction

We use the airway package to illustrate reduction. The package provides sample information, a subset of eight BAM files, and the known gene models required to count the reads.

library(airway)
path <- system.file(package="airway", "extdata")
dir(path)
##  [1] "GSE52778_series_matrix.txt"       
##  [2] "Homo_sapiens.GRCh37.75_subset.gtf"
##  [3] "SRR1039508_subset.bam"            
##  [4] "SRR1039509_subset.bam"            
##  [5] "SRR1039512_subset.bam"            
##  [6] "SRR1039513_subset.bam"            
##  [7] "SRR1039516_subset.bam"            
##  [8] "SRR1039517_subset.bam"            
##  [9] "SRR1039520_subset.bam"            
## [10] "SRR1039521_subset.bam"            
## [11] "SraRunInfo_SRP033351.csv"         
## [12] "sample_table.csv"

Setup

The ingredients for counting include are:

  1. Metadata describing samples. Read this using read.csv().
csvfile <- dir(path, "sample_table.csv", full=TRUE)
sampleTable <- read.csv(csvfile, row.names=1)
head(sampleTable)
##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
  1. BAM files containing aligned reads. Create an object that references these files. What does the yieldSize argument mean?
library(Rsamtools)
filenames <- dir(path, ".bam$", full=TRUE)
bamfiles <- BamFileList(filenames, yieldSize=1000000)
names(bamfiles) <- sub("_subset.bam", "", basename(filenames))
  1. Known gene models. These might come from an existing TxDb package, or created from biomart or UCSC, or from a GTF file. We’ll take the hard road, making a TxDb object from the GTF file used to align reads and using the TxDb to get all exons, grouped by gene.
library(GenomicFeatures)
gtffile <- file.path(path, "Homo_sapiens.GRCh37.75_subset.gtf")
txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())
## Prepare the 'metadata' data frame ... metadata: OK
genes <- exonsBy(txdb, by="gene")

Counting

After these preparations, the actual counting is easy. The function summarizeOverlaps() from the GenomicAlignments package will do this. This produces a SummarizedExperiment object, which contains a variety of information about an experiment

library(GenomicAlignments)
se <- summarizeOverlaps(features=genes, reads=bamfiles,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=TRUE,
                        fragments=TRUE)
colData(se) <- as(sampleTable, "DataFrame")
se
## class: SummarizedExperiment 
## dim: 20 8 
## exptData(0):
## assays(1): counts
## rownames(20): ENSG00000009724 ENSG00000116649 ... ENSG00000271794
##   ENSG00000271895
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(se)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
rowData(se)
## Warning: 'rowData' is deprecated.
## Use 'rowRanges' instead.
## See help("Deprecated")
## GRangesList object of length 20:
## $ENSG00000009724 
## GRanges object with 18 ranges and 2 metadata columns:
##        seqnames               ranges strand   |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle>   | <integer>     <character>
##    [1]        1 [11086580, 11087705]      -   |        98 ENSE00000818830
##    [2]        1 [11090233, 11090307]      -   |        99 ENSE00000472123
##    [3]        1 [11090805, 11090939]      -   |       100 ENSE00000743084
##    [4]        1 [11094885, 11094963]      -   |       101 ENSE00000743085
##    [5]        1 [11097750, 11097868]      -   |       103 ENSE00003520086
##    ...      ...                  ...    ... ...       ...             ...
##   [14]        1 [11106948, 11107176]      -   |       111 ENSE00003467404
##   [15]        1 [11106948, 11107176]      -   |       112 ENSE00003489217
##   [16]        1 [11107260, 11107280]      -   |       113 ENSE00001833377
##   [17]        1 [11107260, 11107284]      -   |       114 ENSE00001472289
##   [18]        1 [11107260, 11107290]      -   |       115 ENSE00001881401
## 
## ...
## <19 more elements>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(assay(se))
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000009724         38         28         66         24         42
## ENSG00000116649       1004       1255       1122       1313       1100
## ENSG00000120942        218        256        233        252        269
## ENSG00000120948       2751       2080       3353       1614       3519
## ENSG00000171819          4         50         19        543          1
## ENSG00000171824        869       1075       1115       1051        944
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000009724         41         47         36
## ENSG00000116649       1879        745       1536
## ENSG00000120942        465        207        400
## ENSG00000120948       3716       2220       1990
## ENSG00000171819         10         14       1067
## ENSG00000171824       1405        748       1590

6. Analysis using DESeq2

The previous section illustrates the reduction step on a subset of the data; here’s the full data set

data(airway)
se <- airway

This object contains an informative colData slot – prepared as described in the airway vignette. In particular, the colData() include columns describing the cell line cell and treatment dex for each sample

colData(se)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677

DESeq2 makes the analysis particularly easy, simply add the experimental design, run the pipeline, and extract the results

library(DESeq2)
dds <- DESeqDataSet(se, design = ~ cell + dex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)

Simple visualizations / sanity checks include

Many additional diagnostic approaches are described in the DESeq2 (and edgeR) vignettes, and in the RNA-seq gene differential expression work flow.

7. Comprehension

see Part E, Gene Set Enrichment