In this manual, we will show how to use the methylKit package. methylKit is an R package for analysis and annotation of DNA methylation information obtained by high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants. But, it can potentially handle whole-genome bisulfite sequencing data if proper input format is provided.
DNA methylation in vertebrates typically occurs at CpG dinucleotides, however non-CpG Cs are also methylated in certain tissues such as embryonic stem cells. DNA methylation can act as an epigenetic control mechanism for gene regulation. Methylation can hinder binding of transcription factors and/or methylated bases can be bound by methyl-binding-domain proteins which can recruit chromatin remodeling factors. In both cases, the transcription of the regulated gene will be effected. In addition, aberrant DNA methylation patterns have been associated with many human malignancies and can be used in a predictive manner. In malignant tissues, DNA is either hypo-methylated or hyper-methylated compared to the normal tissue. The location of hyper- and hypo-methylated sites gives a distinct signature to many diseases. Traditionally, hypo-methylation is associated with gene transcription (if it is on a regulatory region such as promoters) and hyper-methylation is associated with gene repression.
Bisulfite sequencing is a technique that can determine DNA methylation patterns. The major difference from regular sequencing experiments is that, in bisulfite sequencing DNA is treated with bisulfite which converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. By sequencing and aligning those converted DNA fragments it is possible to call methylation status of a base. Usually, the methylation status of a base determined by a high-throughput bisulfite sequencing will not be a binary score, but it will be a percentage. The percentage simply determines how many of the bases that are aligning to a given cytosine location in the genome have actual C bases in the reads. Since bisulfite treatment leaves methylated Cs intact, that percentage will give us percent methylation score on that base. The reasons why we will not get a binary response are:
We start by reading in the methylation call data from bisulfite
sequencing with methRead
function. Reading in the data this
way will return a methylRawList
object which stores
methylation information per sample for each covered base. By default
methRead
requires a minimum coverage of 10 reads per base
to ensure good quality of the data and a high confidence methylation
percentage.
The methylation call files are basically text files that contain
percent methylation score per base. Such input files may be obtained
from AMP pipeline
developed for aligning RRBS reads or from processBismarkAln
function. However, “cytosineReport” and “coverage” files from Bismark
aligner can be read in to methylKit as well.
A typical methylation call file looks like this:
## chrBase chr base strand coverage freqC freqT
## 1 chr21.9764539 chr21 9764539 R 12 25.00 75.00
## 2 chr21.9764513 chr21 9764513 R 12 0.00 100.00
## 3 chr21.9820622 chr21 9820622 F 13 0.00 100.00
## 4 chr21.9837545 chr21 9837545 F 11 0.00 100.00
## 5 chr21.9849022 chr21 9849022 F 124 72.58 27.42
Most of the time bisulfite sequencing experiments have test and
control samples. The test samples can be from a disease tissue while the
control samples can be from a healthy tissue. You can read a set of
methylation call files that have test/control conditions giving
treatment
vector option. For sake of subsequent analysis,
file.list, sample.id and treatment option should have the same order. In
the following example, first two files have the sample ids “test1” and
“test2” and as determined by treatment vector they belong to the same
group. The third and fourth files have sample ids “ctrl1” and “ctrl2”
and they belong to the same group as indicated by the treatment
vector.
NOTE: Throughout the vignette we will be accessing files that are part of the methylKit package with the
system.file("extdata", "xyz", package = "methylKit")
notation. However when analyzing your own data you should not use this notation, rather write full file paths yourself or use functions likefile.path
,list.files
etc. to create relative or absolute paths.
library(methylKit)
file.list=list( system.file("extdata",
"test1.myCpG.txt", package = "methylKit"),
system.file("extdata",
"test2.myCpG.txt", package = "methylKit"),
system.file("extdata",
"control1.myCpG.txt", package = "methylKit"),
system.file("extdata",
"control2.myCpG.txt", package = "methylKit") )
# read the files to a methylRawList object: myobj
myobj=methRead(file.list,
sample.id=list("test1","test2","ctrl1","ctrl2"),
assembly="hg18",
treatment=c(1,1,0,0),
context="CpG",
mincov = 10
)
In addition to the options we mentioned above, any tab separated text file with a generic format can be read in using methylKit, such as methylation ratio files from BSMAP. See here for an example.
Sometimes, when dealing with multiple samples and increased sample sizes coming from genome wide bisulfite sequencing experiments, the memory of your computer might not be sufficient enough.
Therefore methylKit offers a new group of classes, that are basically
pendants to the original methylKit classes with one important
difference: The methylation information, which normally is internally
stored as data.frame, is stored in an external bgzipped file and is
indexed by tabix (Li 2011), to enable fast
retrieval of records or regions. This group contains
methylRawListDB
, methylRawDB
,
methylBaseDB
and methylDiffDB
, let us call
them methylDB objects.
We can now create a methylRawListDB
object, which stores
the same content as myobj from above. But the single
methylRaw
objects retrieve their data from the tabix-file
linked under dbpath
.
library(methylKit)
file.list=list( system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
system.file("extdata", "control2.myCpG.txt", package = "methylKit") )
# read the files to a methylRawListDB object: myobjDB
# and save in databases in folder methylDB
myobjDB=methRead(file.list,
sample.id=list("test1","test2","ctrl1","ctrl2"),
assembly="hg18",
treatment=c(1,1,0,0),
context="CpG",
dbtype = "tabix",
dbdir = "methylDB"
)
print(myobjDB[[1]]@dbpath)
## [1] "/tmp/RtmpLizqE7/Rbuild2b70b07b45ff5e/methylKit/vignettes/methylDB/test1.txt.bgz"
Most if not all functions in this package will work with methylDB
objects the same way as it does with normal methylKit objects. Functions
that return methylKit objects, will return a methylDB object if
provided, but there are a few exceptions such as the
select
, the [
and the
selectByOverlap
functions.
Alternatively, methylation percentage calls can be calculated from
sorted SAM or BAM file(s) from Bismark aligner and read-in to the
memory. Bismark is a popular aligner for bisulfite sequencing reads,
available here
(Krueger and Andrews 2011).
processBismarkAln
function is designed to read-in Bismark
SAM/BAM files as methylRaw
or methylRawList
objects which store per base methylation calls. SAM files must be sorted
by chromosome and read position columns, using ‘sort’ command in
unix-like machines will accomplish such a sort easily. BAM files should
be sorted and indexed. This could be achieved with samtools (http://www.htslib.org/doc/samtools.html).
The following command reads a sorted SAM file and creates a
methylRaw
object for CpG methylation. The user has the
option to save the methylation call files to a folder given by
save.folder
option. The saved files can be read-in using
the methRead
function when needed.
my.methRaw=processBismarkAln( location =
system.file("extdata",
"test.fastq_bismark.sorted.min.sam",
package = "methylKit"),
sample.id="test1", assembly="hg18",
read.context="CpG", save.folder=getwd())
It is also possible to read multiple SAM files at the same time,
check processBismarkAln
documentation.
Since we read the methylation data now, we can check the basic stats
about the methylation data such as coverage and percent methylation. We
now have a methylRawList
object which contains methylation
information per sample. The following command prints out percent
methylation statistics for second sample: “test2”
getMethylationStats(myobj[[2]],plot=FALSE,both.strands=FALSE)
## methylation statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 20.00 82.79 63.17 94.74 100.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.00000 0.00000 0.00000 48.38710 70.00000 82.78556 90.00000 93.33333
## 80% 90% 95% 99% 99.5% 99.9% 100%
## 96.42857 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
The following command plots the histogram for percent methylation distribution.The figure below is the histogram and numbers on bars denote what percentage of locations are contained in that bin. Typically, percent methylation histogram should have two peaks on both ends. In any given cell, any given base are either methylated or not. Therefore, looking at many cells should yield a similar pattern where we see lots of locations with high methylation and lots of locations with low methylation.
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
We can also plot the read coverage per base information in a similar way, again numbers on bars denote what percentage of locations are contained in that bin. Experiments that are highly suffering from PCR duplication bias will have a secondary peak towards the right hand side of the histogram.
getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
It might be useful to filter samples based on coverage. Particularly,
if our samples are suffering from PCR bias it would be useful to discard
bases with very high read coverage. Furthermore, we would also like to
discard bases that have low read coverage, a high enough read coverage
will increase the power of the statistical tests. The code below filters
a methylRawList
and discards bases that have coverage below
10X and also discards the bases that have more than 99.9th percentile of
coverage in each sample.
filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,
hi.count=NULL,hi.perc=99.9)
In order to do further analysis, we will need to get the bases
covered in all samples. The following function will merge all samples to
one object for base-pair locations that are covered in all samples.
Setting destrand=TRUE
(the default is FALSE) will merge
reads on both strands of a CpG dinucleotide. This provides better
coverage, but only advised when looking at CpG methylation (for CpH
methylation this will cause wrong results in subsequent analyses). In
addition, setting destrand=TRUE
will only work when
operating on base-pair resolution, otherwise setting this option TRUE
will have no effect. The unite()
function will return a
methylBase
object which will be our main object for all
comparative analysis. The methylBase
object contains
methylation information for regions/bases that are covered in all
samples.
meth=unite(myobj, destrand=FALSE)
## uniting...
Let us take a look at the data content of methylBase object:
head(meth)
## chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1 chr21 9853296 9853296 + 17 10 7 333 268 65
## 2 chr21 9853326 9853326 + 17 12 5 329 249 79
## 3 chr21 9860126 9860126 + 39 38 1 83 78 5
## 4 chr21 9906604 9906604 + 68 42 26 111 97 14
## 5 chr21 9906616 9906616 + 68 52 16 111 104 7
## 6 chr21 9906619 9906619 + 68 59 9 111 109 2
## coverage3 numCs3 numTs3 coverage4 numCs4 numTs4
## 1 18 16 2 395 341 54
## 2 16 14 2 379 284 95
## 3 83 83 0 41 40 1
## 4 23 18 5 37 33 4
## 5 23 14 9 37 27 10
## 6 22 18 4 37 29 8
By default, unite
function produces bases/regions
covered in all samples. That requirement can be relaxed using
“min.per.group” option in unite
function.
# creates a methylBase object,
# where only CpGs covered with at least 1 sample per group will be returned
# there were two groups defined by the treatment vector,
# given during the creation of myobj: treatment=c(1,1,0,0)
meth.min=unite(myobj,min.per.group=1L)
We can check the correlation between samples using
getCorrelation
. This function will either plot scatter plot
and correlation coefficients or just print a correlation matrix.
getCorrelation(meth,plot=TRUE)
## test1 test2 ctrl1 ctrl2
## test1 1.0000000 0.9252530 0.8767865 0.8737509
## test2 0.9252530 1.0000000 0.8791864 0.8801669
## ctrl1 0.8767865 0.8791864 1.0000000 0.9465369
## ctrl2 0.8737509 0.8801669 0.9465369 1.0000000
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical paramete