This vignette describes steps of a basic analysis of ChIP-seq data. To exemplify this tutorial, we use ChIP-seq data for the lysine 27 acetylation of the histone H3 (i.e H3K27ac).
After completing this vignette, you will be able to:
1. Read a ChIP-seq experiment into R
2. Extend the reads and bin the data (details and relevance discussed later)
3. Generate .bedGraph files
4. Visualize ChIP-seq data with R
5. Perform basic analysis of ChIP-seq peaks
6. Generate average profiles and heatmaps of ChIP-seq enrichment around a set of annotated genomic loci
In the appendix part, we show how to download, preprocess and asses the quality of .fastq files.
H3K27ac is a histone modification associated with active promoters and enhancers. We downloaded data corresponding to a ChIP-seq experiment with two biological replicates of mouse Embryonic Stem cells (mESC) along with the input control sample Histone H3K27ac separates active from poised enhancers and predicts developmental state by Creyghton et al., PNAS 2010.
The first part of ChIP-seq analysis workflow consists in read preprocessing. We will not focus here on these first steps, we outline them and provide the code in the Appendix part of the vignette. The three major steps in the preprocessing are briefly outlined below.
Sequenced reads are saved in .fastq files. The very first step in the analyses of sequencing results consists in quality assessment. The R package ShortRead provides a qa to perform this analysis. The reader can find the code necessary to generate a HTML read quality control report in the Appendix part of the vignette.
Initial parts of the analysis of sequenced reads include: alignment, filtering and peak finding. They can be performed using tools such as Bowtie2, samtools or MACS. We provide all the necessary code in the Appendix part of the vignette.
Two steps from this vignette (visualization and read distribution analysis) require biomart database querying via the internet. In order to avoid many downloads, we provide the necessary objects in the data package EpigeneticsCSAMA. The code for the generation of these objects can be found in the Appendix of the vignette.
Additionally, in order to reduce memory requirements, we restrict our analysis to the filtered reads mapping to chromosome 6.
The data files produced by the steps above are placed in the R objects of a data package called EpigeneticsCSAMA, which we load now. (Note that such a data package is used for convenience in this course, but typically, you would not package up interemediate data in this way.)
library(EpigeneticsCSAMA)
dataDirectory = system.file("bedfiles", package="EpigeneticsCSAMA")
The variable dataDirectory shows the directory containing the data objects necessary for this vignette.
dataDirectory
## [1] "/home/msmith/Applications/R/R-devel/library/EpigeneticsCSAMA/bedfiles"
In order to explore this files, have a look at them with a text editor or via a terminal emulator.
We need to load the GenomicRanges, rtracklayer and IRanges packages. To read the .bam file to R, we use the import.bed function from the rtracklayer package. The result is a GRanges object. This is an extremely useful and powerful class of objects which the readers are already familiar with. Each filtered read is represented here as a genomic interval.
library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
library(rtracklayer)
library(IRanges)
input = import.bed(file.path(dataDirectory, 'ES_input_filtered_ucsc_chr6.bed'))
rep1 = import.bed(file.path(dataDirectory, 'H3K27ac_rep1_filtered_ucsc_chr6.bed'))
rep2 = import.bed(file.path(dataDirectory, 'H3K27ac_rep2_filtered_ucsc_chr6.bed'))
The objects input, rep1 and rep2 hold the genomic annotation of the filtered reads for the input sample and ChIP-seq replicate 1 and replicate 2, respectively. We display the rep1 object. We see that the strand information, read name along with alignment score are included as information for each read.
rep1
## GRanges object with 481412 ranges and 2 metadata columns:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr6 [ 85222462, 85222497] - | SRR066766.303
## [2] chr6 [134189439, 134189474] + | SRR066766.374
## [3] chr6 [ 47920826, 47920861] + | SRR066766.393
## [4] chr6 [143565148, 143565183] + | SRR066766.397
## [5] chr6 [ 39392692, 39392727] - | SRR066766.438
## ... ... ... ... . ...
## [481408] chr6 [ 86800209, 86800244] - | SRR066766.14657325
## [481409] chr6 [ 91422497, 91422532] - | SRR066766.14657340
## [481410] chr6 [ 54433776, 54433811] + | SRR066766.14657362
## [481411] chr6 [120020286, 120020321] - | SRR066766.14657382
## [481412] chr6 [ 85113322, 85113357] - | SRR066766.14657387
## score
## <numeric>
## [1] 40
## [2] 42
## [3] 42
## [4] 42
## [5] 42
## ... ...
## [481408] 42
## [481409] 42
## [481410] 42
## [481411] 42
## [481412] 42
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We see that we have roughly the same number of reads in the input and IP-ed experiments.
length(input)
## [1] 465823
length(rep1)
## [1] 481412
length(rep2)
## [1] 506287
The reads correspond to sequences at the end of each IP-ed fragment (single-end sequencing data). We need to extend these reads in order to represent each IP-ed DNA fragment.
We estimate the mean read length using the estimate.mean.fraglen function from chipseq packege. Next, we extend the reads to the inferred read length using the resize function. We remove any reads for which the coordinates, after the extension, exceed chromosome length. These three analysis steps are wrapped in a single function prepareChIPseq function which we define below.
library(chipseq)
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: Rsamtools
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
prepareChIPseq = function(reads){
frag.len = median( estimate.mean.fraglen(reads) )
cat( paste0( 'Median fragment size for this library is ', round(frag.len)))
reads.extended = resize(reads, width = frag.len)
return( trim(reads.extended) )
}
We next apply it to the input and ChIP-seq samples.
input = prepareChIPseq( input )
## Median fragment size for this library is 236
rep1 = prepareChIPseq( rep1 )
## Median fragment size for this library is 122
rep2 = prepareChIPseq( rep2 )
## Median fragment size for this library is 107
Compare with above to see how rep1 has changed.
rep1
## GRanges object with 481412 ranges and 2 metadata columns:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr6 [ 85222376, 85222497] - | SRR066766.303
## [2] chr6 [134189439, 134189560] + | SRR066766.374
## [3] chr6 [ 47920826, 47920947] + | SRR066766.393
## [4] chr6 [143565148, 143565269] + | SRR066766.397
## [5] chr6 [ 39392606, 39392727] - | SRR066766.438
## ... ... ... ... . ...
## [481408] chr6 [ 86800123, 86800244] - | SRR066766.14657325
## [481409] chr6 [ 91422411, 91422532] - | SRR066766.14657340
## [481410] chr6 [ 54433776, 54433897] + | SRR066766.14657362
## [481411] chr6 [120020200, 120020321] - | SRR066766.14657382
## [481412] chr6 [ 85113236, 85113357] - | SRR066766.14657387
## score
## <numeric>
## [1] 40
## [2] 42
## [3] 42
## [4] 42
## [5] 42
## ... ...
## [481408] 42
## [481409] 42
## [481410] 42
## [481411] 42
## [481412] 42
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The next step in the analysis is to count how many reads map to each of the pre-established genomic intervals (bins).
We will tile the genome into non-overlapping bins of size 200 bp. To this end we need the information about chromosome sizes in the mouse genome (assembly mm9). In the data package, we provide the object si (strand information), which holds these data. The reader can find the code necessary to create the si object in the Obtaining si* object for *mm9** of the Appendix.
data(si)
si
## Seqinfo object with 21 sequences from mm9 genome:
## seqnames seqlengths isCircular genome
## chr1 197195432 FALSE mm9
## chr2 181748087 FALSE mm9
## chr3 159599783 FALSE mm9
## chr4 155630120 FALSE mm9
## chr5 152537259 FALSE mm9
## ... ... ... ...
## chr17 95272651 FALSE mm9
## chr18 90772031 FALSE mm9
## chr19 61342430 FALSE mm9
## chrX 166650296 FALSE mm9
## chrY 15902555 FALSE mm9
Next, we use the tileGenome function from the GenomicRanges package to generate a GRanges object with intervals covering the genome in tiles (bins) of size of 200 bp.
binsize = 200
bins = tileGenome(si['chr6'], tilewidth=binsize,
cut.last.tile.in.chrom=TRUE)
bins
## GRanges object with 747586 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr6 [ 1, 200] *
## [2] chr6 [201, 400] *
## [3] chr6 [401, 600] *
## [4] chr6 [601, 800] *
## [5] chr6 [801, 1000] *
## ... ... ... ...
## [747582] chr6 [149516201, 149516400] *
## [747583] chr6 [149516401, 149516600] *
## [747584] chr6 [149516601, 149516800] *
## [747585] chr6 [149516801, 149517000] *
## [747586] chr6 [149517001, 149517037] *
## -------
## seqinfo: 1 sequence from mm9 genome
We now count how many reads fall into each bin. For this purpose, we define the function BinChIPseq. It takes two arguments, reads and bins which are GRanges objects.
BinChIPseq = function( reads, bins ){
mcols(bins)$score = countOverlaps( bins, reads )
return( bins )
}
Now we apply it to the objects input, rep1 and rep2. We obtain input.200bins, rep1.200bins and rep2.200bins, which are GRanges objects that contain the binned read coverage of the input and ChIP-seq experiments.
input.200bins = BinChIPseq( input, bins )
rep1.200bins = BinChIPseq( rep1, bins )
rep2.200bins = BinChIPseq( rep2, bins )
rep1.200bins
## GRanges object with 747586 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr6 [ 1, 200] * | 0
## [2] chr6 [201, 400] * | 0
## [3] chr6 [401, 600] * | 0
## [4] chr6 [601, 800] * | 0
## [5] chr6 [801, 1000] * | 0
## ... ... ... ... . ...
## [747582] chr6 [149516201, 149516400] * | 0
## [747583] chr6 [149516401, 149516600] * | 0
## [747584] chr6 [149516601, 149516800] * | 0
## [747585] chr6 [149516801, 149517000] * | 0
## [747586] chr6 [149517001, 149517037] * | 0
## -------
## seqinfo: 1 sequence from mm9 genome
We can plot coverage for 1000 bins, starting from bin 200,000.
plot( 200000:201000, rep1.200bins$score[200000:201000],
xlab="chr6", ylab="counts per bin" )
Below, we will see more sophisticated ways of plotting coverage.
At this step of the analysis, the data is ready to be visualized and shared. One of the most common means of sharing ChIP-seq data is to generate .wig, .binWig or .bedGraph files. They are memory and size-efficient files holding the information about the signal along the genome. Moreover, these files can be visualized in genome browsers such as IGV and IGB. We show how to export the binned data as a .bedGraph file.
export(input.200bins,
con='input_chr6.bedGraph',
format = "bedGraph")
export(rep1.200bins,
con='H3K27ac_rep1_chr6.bedGraph',
format = "bedGraph")
export(rep2.200bins,
con='H3K27ac_rep2_chr6.bedGraph',
format = "bedGraph")
In the next section, we see how to visualize bedGraph files using R.
Now, we have data which we would like to display along the genome. R offers a flexible infrastructure for visualisation of many types of genomics data. Here, we use the Gviz package for these purposes.
library(Gviz)
## Loading required package: grid
##
## Attaching package: 'Gviz'
## The following objects are masked from 'package:ShortRead':
##
## chromosome, position
The principle of working with Gviz relies on the generation of tracks which can be, for example ChIP-seq signal along the genome, ChIP-seq peaks, gene models or any kind of other data such as annotation of CpG islands in the genome. We start with loading the gene models for chromosome 6 starting at position 122,530,000 and ending at position 122,900,000. We focus on this region as it harbors the Nanog gene, which is stongly expressed in ES cells.
We obtain the annotation using biomaRt package. Work with biomaRt package relies on querying the biomart database. In the Appendix, we show how to obtain gene models for protein coding genes for the archive mouse genome assembly (mm9) and how to generate the bm object holding the annotation of all the RefSeq genes.
data(bm)
bm
## GeneRegionTrack 'RefSeq'
## | genome: mm9
## | active chromosome: chr6
## | annotation features: 102
We include the GenomeAxisTrack object which is a coordinate axis showing the genomic span of the analyzed region.
AT = GenomeAxisTrack( )
We plot the result using the plotTracks function. We choose the region to zoom into with the from and to arguments. The transcriptAnnotation argument allows to put the gene symbols in the plot.
plotTracks(c( bm, AT),
from=122530000, to=122900000,
transcriptAnnotation="symbol", window="auto",
cex.title=1, fontsize=10 )
We next add our two data tracks to the plot. We first generate DataTrack objects with DataTrack function. We include the information about how the track is be labaled and colored. We obtain input.track, rep1.track and rep2.track objects.
input.track = DataTrack(input.200bins,
strand="*", genome="mm9", col.histogram='gray',
fill.histogram='black', name="Input", col.axis="black",
cex.axis=0.4, ylim=c(0,150))
rep1.track = DataTrack(rep1.200bins,
strand="*", genome="mm9", col.histogram='steelblue',
fill.histogram='black', name="Rep. 1", col.axis='steelblue',
cex.axis=0.4, ylim=c(0,150))
rep2.track = DataTrack(rep2.200bins,
strand="*", genome="mm9", col.histogram='steelblue',
fill.histogram='black', name="Rep. 2", col.axis='steelblue',
cex.axis=0.4, ylim=c(0,150))
Finally, we plot these tracks along with the genomic features. We observe a uniform coverage in the case of the input track and pronounced peaks of enrichment H3K27ac in promoter and intergenic regions. Importantly, H3K27ac enriched regions are easily identified.
plotTracks(c(input.track, rep1.track, rep2.track, bm, AT),
from=122530000, to=122900000,
transcriptAnnotation="symbol", window="auto",
type="histogram", cex.title=0.7, fontsize=10 )
ChIP-seq experiments are designed to isolate regions enriched in a factor of interest. The identification of enriched regions, often refered to as peak finding, is an area of research by itself. There are many algorithms and tools used for peak finding. The choice of a method is strongly motivated by the kind of factor analyzed. For instance, transcription factor ChIP-seq yield well defined narrow peaks whereas histone modifications ChIP-seq experiments such as H3K36me3 yield extended regions of high coverage. Finally, ChIP-seq with antobodies recognizing polymerase II result in narrow peaks combined with extended regions of enrichment.
As we saw in the previous section of the tutorial, H3K27ac mark shows well defined peaks. In such a case, MACS is one of the most commonly used software for peak finding. ChIP-seq peak calling can also be done in R with the BayesPeak package. However, we stick here to the most common approach and use MACS. We ran MACS for you and provide the result in the data package. You can find the code necessary to obtain the peaks in the Appendix of the vignette.
We next import the .bed files of the isolated peaks from the data package.
peaks.rep1 = import.bed(file.path(dataDirectory,'Rep1_peaks_ucsc_chr6.bed'))
peaks.rep2 = import.bed(file.path(dataDirectory,'Rep2_peaks_ucsc_chr6.bed'))
First step in the analysis of the identified peaks is to simply display them in the browser, along with the ChIP-seq and input tracks. To this end, we use AnnotationTrack function. We display peaks as boxes colored in blue.
peaks1.track = AnnotationTrack(peaks.rep1,
genome="mm9", name='Peaks Rep. 1',
chromosome='chr6',
shape='box',fill='blue3',size=2)
peaks2.track = AnnotationTrack(peaks.rep2,
genome="mm9", name='Peaks Rep. 2',
chromosome='chr6',
shape='box',fill='blue3',size=2)
We visualise the Nanog locus.
plotTracks(c(input.track, rep1.track, peaks1.track,
rep2.track, peaks2.track, bm, AT),
from=122630000, to=122700000,
transcriptAnnotation="symbol", window="auto",
type="histogram", cex.title=0.7, fontsize=10 )
We can see that MACS has succesfully identified regions showing high H3K27ac signal. We see that both biological replicates agree well, however, in some cases peaks are called only in one sample. In the next section, we will analyse how often do we see the overlap between peaks and isolate reproducible peaks.
We first find the overlap between the peak sets of the two replicates.
ovlp = findOverlaps( peaks.rep1, peaks.rep2 )
ovlp
## Hits object with 2528 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 2 2
## [3] 3 3
## [4] 4 4
## [5] 5 5
## ... ... ...
## [2524] 3025 3025
## [2525] 3026 3026
## [2526] 3029 3027
## [2527] 3030 3028
## [2528] 3031 3030
## -------
## queryLength: 3032 / subjectLength: 3032
If a peak in one replicate overlaps with mutiple peaks in the other replicate, it will appear multiple times in ovlp. To see, how many peaks overlap with something in the other replicate, we count the number of unique peaks in each of the two columns of ovlp and take the smaller of these two counts to as the number of common peaks.
ov = min( length(unique( queryHits(ovlp) )), length(unique( subjectHits(ovlp) ) ) )
We draw this as a Venn diagram, using the draw.pairwise.venn function from the VennDiagram package.
library(VennDiagram)
## Loading required package: futile.logger
draw.pairwise.venn(
area1=length(peaks.rep1),
area2=length(peaks.rep2),
cross.area=ov,
category=c("rep1", "rep2"),
fill=c("steelblue", "blue3"),
cat.cex=0.7)
## (polygon[GRID.polygon.100], polygon[GRID.polygon.101], polygon[GRID.polygon.102], polygon[GRID.polygon.103], text[GRID.text.104], text[GRID.text.105], text[GRID.text.106], text[GRID.text.107], text[GRID.text.108])
We will focus only on peaks identified in both replicates (hereafter refered to as enriched areas). The enriched areas are colored in green.
enriched.regions = Reduce(subsetByOverlaps, list(peaks.rep1, peaks.rep2))
enr.reg.track = AnnotationTrack(enriched.regions,
genome="mm9", name='Enriched regions',
chromosome='chr6',
shape='box',fill='green3',size=2)
plotTracks(c(input.track, rep1.track, peaks1.track,
rep2.track, peaks2.track, enr.reg.track,
bm, AT),
from=122630000, to=122700000,
transcriptAnnotation="symbol", window="auto",
type="histogram", cex.title=0.5, fontsize=10 )