Single sample
Designed experiments
MACS: Model-based Analysis for ChIP-Seq, Liu et al., 2008
ID | Mode | Library | Operation |
---|---|---|---|
1 | Single-sample | Individual | Union |
2 | Single-sample | Individual | Intersection |
3 | Single-sample | Individual | At least 2 |
4 | Single-sample | Pooled over group | Union |
5 | Single-sample | Pooled over group | Intersection |
6 | Two-sample | Pooled over group | Union |
7 | Single-sample | Pooled over all | - |
How to choose? – Lun & Smyth,
ID | Error rate | ||
---|---|---|---|
0.01 | 0.05 | 0.1 | |
RA | 0.010 (0.000) | 0.051 (0.001) | 0.100 (0.002) |
1 | 0.002 (0.000) | 0.019 (0.001) | 0.053 (0.001) |
2 | 0.003 (0.000) | 0.030 (0.000) | 0.073 (0.001) |
3 | 0.006 (0.000) | 0.042 (0.001) | 0.092 (0.001) |
4 | 0.033 (0.001) | 0.145 (0.001) | 0.261 (0.002) |
5 | 0.000 (0.000) | 0.001 (0.000) | 0.005 (0.000) |
6 | 0.088 (0.006) | 0.528 (0.013) | 0.893 (0.006) |
7 | 0.010 (0.000) | 0.049 (0.001) | 0.098 (0.001) |
## 100,000 t-tests under the null, n = 6
n <- 6; m <- matrix(rnorm(n * 100000), ncol=n)
P <- genefilter::rowttests(m, factor(rep(1:2, each=3)))$p.value
quantile(P, c(.001, .01, .05))
## 0.1% 1% 5%
## 0.000954 0.010309 0.049512
hist(P, breaks=20)
This exercise is based on the csaw vignette, where more detail can be found.
The experiment involves changes in binding profiles of the NFYA protein between embryonic stem cells and terminal neurons. It is a subset of the data provided by Tiwari et al. 2012 available as GSE25532. There are two es (embryonic stem cell) and two tn (terminal neuron) replicates. Single-end FASTQ files were extracted from GEO, aligned using Rsubread, and post-processed (sorted and indexed) using Rsamtools with the script available at
file.path("ChIPSeq", "NFYA", "scripts", "preprocess.R")
Create a data frame summarizing the files used.
acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417",
tn_2="SRR074418")
files <- data.frame(Treatment=sub("_.*", "", names(acc)),
Replicate=sub(".*_", "", names(acc)),
sra=sprintf("%s.sra", acc),
fastq=sprintf("%s.fastq.gz", acc),
bam=sprintf("%s.fastq.gz.subread.BAM", acc),
row.names=acc, stringsAsFactors=FALSE)
Change to the directory where the BAM files are located
setwd("ChIPSeq/NFYA/bam")
Load the csaw library and count reads in overlapping windows. This returns a SummarizedExperiment
, so explore it a bit…
library(csaw)
library(GenomicRanges)
frag.len <- 110
system.time({
data <- windowCounts(files$bam, width=10, ext=frag.len)
}) # 156 seconds
acc <- sub(".fastq.*", "", data$bam.files)
colData(data) <- cbind(files[acc,], colData(data))
For further discussion about the reduction step, see Chapter 2 of the csaw vignette.
Filtering (vignette Chapter 3) Start by filtering low-count windows. There are likely to be many of these (how many?). Is there a rational way to choose the filtering threshold?
library(edgeR) # for aveLogCPM()
keep <- aveLogCPM(assay(data)) >= -1
data <- data[keep,]
Normalization (composition bias) (vignette Chapter 4) csaw uses binned counts in normalization. The bins are large relative to the ChIP peaks, on the assumption that the bins primarily represent non-differentially bound regions. The sample bin counts are normalized using the edgeR TMM (trimmed median of M values) method seen in the RNASeq differential expression lab. Explore vignette chapter 4 for more on normalization (this is a useful resource when seeking to develop normalization methods for other protocols!).
system.time({
binned <- windowCounts(files$bam, bin=TRUE, width=10000)
}) #139 second
normfacs <- normalize(binned)
Experimental design and Differential binding (vignette Chapter 5) Differential binding will be assessed using edgeR, where we need to specify the experimental design
design <- model.matrix(~Treatment, colData(data))
Apply a standard edgeR work flow to identify differentially bound regions. Creatively explore the results.
y <- asDGEList(data, norm.factors=normfacs)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit, contrast=c(0, 1))
head(results$table)
## logFC logCPM F PValue
## 1 1.029 -2.38 0.39466 0.52986
## 2 1.052 -2.39 0.41573 0.51907
## 3 0.118 -2.16 0.00698 0.93341
## 4 -0.847 -1.79 0.50517 0.47724
## 5 -5.885 -2.15 7.98493 0.00472
## 6 -5.629 -2.31 6.58683 0.01027
Multiple testing (vignette Chapter 6) The challenge is that FDR across all detected differentially bound regions is what one is interested in, but what is immediately available is the FDR across differentially bound windows; region will often consist of multiple overlapping windows. As a first step, we’ll take a ‘quick and dirty’ approach to identifying regions by merging ‘high-abundance’ windows that are within, e.g., 1kb of one another
merged <- mergeWindows(rowRanges(data), tol=1000L)
Combine test results across windows within regions. Several strategies are explored in section 6.5 of the vignette.
tabcom <- combineTests(merged$id, results$table)
head(tabcom)
## nWindows logFC.up logFC.down PValue FDR direction
## 1 2 2 0 0.5299 0.999 up
## 2 6 0 5 0.0106 0.999 down
## 3 10 1 5 0.7301 0.999 down
## 4 7 5 2 0.0689 0.999 up
## 5 3 1 0 0.9728 0.999 mixed
## 6 1 0 1 0.3816 0.999 down
Section 6.6 of the vignette discusses approaches to identifying the ‘best’ windows within regions.
Finally, create a GRanges
object that summarized the merged windows and combined test statistics.
final <- merged$region
mcols(final) <- as(tabcom, "DataFrame")
As an example of how results can be anntoted
library(org.Mm.eg.db)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
anno <- detailRanges(
final, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
orgdb=org.Mm.eg.db, promoter=c(3000, 1000), dist=5000
)
mcols(final) <- cbind(mcols(final), DataFrame(anno))
The ‘top table’ of most differentially expressed regions can be obtained by ordering final
by the PValue
column, perhaps first filtering to remove regions that do not overlap known promoters.
annotated <- final[nzchar(final$overlap)]
annotated[order(annotated$PValue)]
## GRanges object with 279384 ranges and 9 metadata columns:
## seqnames ranges strand | nWindows
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr4 [ 70373201, 70379210] * | 37
## [2] chr19 [ 23357351, 23359160] * | 15
## [3] chr6 [103649001, 103650160] * | 10
## [4] chrUn_JH584304 [ 51, 70160] * | 1245
## [5] chr9 [ 3034201, 3038460] * | 49
## ... ... ... ... . ...
## [279380] chr13 [ 38165651, 38166460] * | 4
## [279381] chr5 [118141051, 118144410] * | 16
## [279382] chr6 [108497701, 108502260] * | 17
## [279383] chr8 [ 76776951, 76777110] * | 4
## [279384] chr5 [123945901, 123950210] * | 27
## logFC.up logFC.down PValue FDR direction
## <integer> <integer> <numeric> <numeric> <character>
## [1] 12 12 1.07e-52 9.95e-48 down
## [2] 4 8 1.91e-31 6.90e-27 down
## [3] 0 8 4.14e-26 9.96e-22 down
## [4] 5 1179 1.13e-25 2.53e-21 down
## [5] 0 49 2.43e-24 4.77e-20 down
## ... ... ... ... ... ...
## [279380] 1 0 1 1 mixed
## [279381] 4 0 1 1 mixed
## [279382] 0 5 1 1 mixed
## [279383] 0 0 1 1 mixed
## [279384] 9 7 1 1 mixed
## overlap left right
## <character> <character> <character>
## [1] Cdk5rap2|5|- Cdk5rap2|4|-[981]
## [2] Mamdc2|7|- Mamdc2|8|-[3904] Mamdc2|6|-[4550]
## [3] Chl1|7|+ Chl1|6|+[1672]
## [4] Pisd-ps3|0-11|-
## [5] Mir101c|0|- Mir101c|1|-[209]
## ... ... ... ...
## [279380] Dsp|I|+ Dsp|2|+[785] Dsp|3-4|+[1056]
## [279381] Fbxw8|2|-
## [279382] Itpr1|58|+ Itpr1|57|+[3808] Itpr1|59|+[3617]
## [279383] Gm10649|I|-
## [279384] Ccdc62|6|+ Ccdc62|5|+[1596] Ccdc62|7-9|+[989]
## -------
## seqinfo: 66 sequences from an unspecified genome