Differential gene- and exon-level expression analyses for RNA-seq data using edgeR, voom and featureCounts

Mark D. Robinson, University of Zurich
31.07.2014

Rough plan

  • Gene-level RNA-seq DE analysis with edgeR: counts -> diagnostics -> statistics (15 min)
  • Hitchhiker's guide to RNA-seq gene-level DE (15 min)
  • Exercise 1: use Example 2 data (15 min)
  • Break (10 min)
  • Counting: BAMs -> counts (5 min)
  • Exercise 2: Count BAMs from pasillaSubset (15 min)
  • Repeat gene-level analysis with voom (10 min)
  • Exon-level differential analysis with diffSplice NEW
  • Exercise 3: Play with voom and/or diffSplice

Let's get started

  • Load edgeR
library("edgeR")
  • To avoid clashes, start with fresh session (save anything you need first)
rm(list=ls())

Example dataset 1 (preprocessed)

Example dataset 1 (details)

ls()
[1] "counts"
head(counts[,1:7],3)
                ES.07985 DE.07981 GT.66339 FG.08004 PE.07980 FE.66350
ENSG00000254468        0        0        0        1        0        0
ENSG00000177951       44       50       24       37       38       41
ENSG00000188076        0        0        0        0        0        0
                ES.66342
ENSG00000254468        1
ENSG00000177951      162
ENSG00000188076        0

Example dataset 1 (details)

dim(counts)
[1] 30727    27
grp <- as.factor(substr(colnames(counts),1,2))
table(grp)
grp
DE ES FE FG GT IS PE PH 
 3  3  4  3  3  2  6  3 

Example dataset 2 [NOT RUN]

setwd("/data/RNA_SEQ_edgeR_voom_featureCounts/")
load("ds2.Rdata")
setwd("~")

Example dataset 2 [NOT RUN]

  • How does the data come (Example 2)?
ls()
head(counts,3)
class(counts)
md
class(md)
grp <- md$condition

Simple explorations

  • Good place to start: scatter plots
o <- order(grp)
pairs(log2(1+counts[,o[1:7]]), pch=".",lower.panel=NULL)

plot of chunk pairsplot

Create a DGEList container, normalize

  • Specify: grouping variable, counts
d <- DGEList(counts=counts, group=grp)
  • Normalize: TMM
d <- calcNormFactors(d)
d$samples
         group lib.size norm.factors
ES.07985    ES   769440       0.9721
DE.07981    DE   947798       1.0560
GT.66339    GT   548227       0.8668
FG.08004    FG   726088       1.2161
PE.07980    PE   688903       1.0602
FE.66350    FE   521573       1.1357
ES.66342    ES  2281754       0.9036
DE.66333    DE  2834208       0.9781
GT.66341    GT  1596550       0.7663
FG.66346    FG  2070020       1.1650
PE.66344    PE  1978504       1.0385
FE.66331    FE  1517276       1.0618
PH.66332    PH   840380       1.3760
PH.66336    PH   852548       1.3735
PE.66345    PE   915042       1.2394
PE.66330    PE   938173       1.2296
FE.66348    FE   629687       1.1131
FE.66334    FE   647313       1.1129
ES.66335    ES  2048439       0.7027
DE.66349    DE  1843746       0.7489
GT.66337    GT  1761089       0.8636
FG.66351    FG  1417106       0.9454
PE.66338    PE  2317354       0.8958
PH.66340    PH  4087566       1.1389
PE.66329    PE  3700620       1.0793
IS.66347    IS 12583285       0.5621
IS.66343    IS 33286768       0.9297

Filter low-expressed genes

  • Filter (e.g., at least 1 CPM in at least small group of replicates)
dim(d)
[1] 30727    27
cps <- cpm(d)
k <- rowSums(cps>=1)>2
d <- d[k,]
dim(d)
[1] 21707    27

Always start with an MDS (or similar) plot

  • Filter (e.g., at least 1 CPM in at least small group of replicates)
cols <- as.numeric(d$samples$group)
plotMDS(d,col=cols)

plot of chunk stdmdsplot

Some useful variations within MDS plots

  • Filter (e.g., at least 1 CPM in at least small group of replicates)
par(mfrow=c(2,2))
plotMDS(d,col=cols, main="500 / lLFC")
plotMDS(d,col=cols, method="bcv", main="500 / BCV")
plotMDS(d,col=cols, top=2000, main="2000 / lLFC")
plotMDS(d,col=cols, top=2000, method="bcv", main="2000 / BCV")