1 Introduction

The goal of this tutorial is to present a standard analysis workflow of 450K data with the package minfi, incorporating the functions recently added to the package. We invite you to read the software paper recently published (Martin J Aryee et al. 2014) and the online package vignette on the Bioconductor project website for more details.

We will start from the very beginning by reading input raw data (IDAT files) from an example dataset, and ending with a list of candidate genes for differential methylation. Among others, we will cover quality control assessments, different within-array and between-array normalizations, SNPs removal, sex prediction, differentially methylated positions (DMPs) analysis and bump hunting for differentially methylated regions (DMRs).

If time permits, we will introduce a complementary interactive visualization tool package available through Bioconductor, shinyMethyl, that allows interactive quality control assessment.

1.1 Dependencies

library(minfi)
library(minfiData)
library(sva)

These dependencies can be installed from Bioconductor using

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("minfiData", "sva"))

This tutorial can be installed from GitHub using devtools by

library(devtools)
install_github("hansenlab/tutorial.450k")

1.2 450k Array design and terminology

In this section, we briefly introduce the 450K array as well as the terminology used throughout the minfi package. Each sample is measured on a single array, in two different color channels (red and green). As the name of the platform indicates, each array measures more than 450,000 CpG positions. For each CpG, we have two measurements: a methylated intensity and an unmethylated intensity. Depending on the probe design, the signals are reported in different colors:

alt tag

For Type I design, both signals are measured in the same color: one probe for the methylated signal and one probe for the unmethylated signal. For Type II design, only one probe is used. The intensity read in the green channel measures the methylated signal, and the intensity read in the red channel measures the unmethylated signal.

Two commonly measures are used to report the methylation levels: Beta values and M values.

Beta value: \[\frac{M}{M + U + 100}\]

where \(M\) and \(U\) denote the methylated and unmethylated signals respectively.

MValue: \[Mval = \log{\biggl(\frac{M}{U}\biggr)}\]

DMP: Differentially methylated position: single genomic position that has a different methylated level in two different groups of samples (or conditions)

DMR: Differentially methylated region: when consecutive genomic locations are differentially methylated in the same direction.

Array: One sample

Slide: Physical slide containing 12 arrays (\(6 \times 2\) grid)

Plate Physical plate containing at most 8 slides (96 arrays). For this tutorial, we use batch and plate interchangeably.

2 Reading Data

The starting point of minfi is reading the .IDAT files with the built-in function read.450k.exp. Several options are available: the user can specify the sample filenames to be read in along with the directory path, or can specify the directory that contains the files. In the latter case, all the files with the extension .IDAT located in the directory will be loaded into R. The user can also read in a sample sheet, and then use the sample sheet to load the data into a RGChannelSet. For more information, see the minfi vignette. Here, we will load the dataset containing 6 samples from the minfiData package using the sample sheet provided within the package:

baseDir <- system.file("extdata", package="minfiData")
targets <- read.450k.sheet(baseDir)

## [read.450k.sheet] Found the following CSV files:
## [1] "/usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/SampleSheet.csv"

targets

##   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID person age sex
## 1    GroupA_3          H5           NA       GroupA      NA    id3  83   M
## 2    GroupA_2          D5           NA       GroupA      NA    id2  58   F
## 3    GroupB_3          C6           NA       GroupB      NA    id3  83   M
## 4    GroupB_1          F7           NA       GroupB      NA    id1  75   F
## 5    GroupA_1          G7           NA       GroupA      NA    id1  75   F
## 6    GroupB_2          H7           NA       GroupB      NA    id2  58   F
##   status  Array      Slide
## 1 normal R02C02 5723646052
## 2 normal R04C01 5723646052
## 3 cancer R05C02 5723646052
## 4 cancer R04C02 5723646053
## 5 normal R05C02 5723646053
## 6 cancer R06C02 5723646053
##                                                                                 Basename
## 1 /usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/5723646052/5723646052_R02C02
## 2 /usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/5723646052/5723646052_R04C01
## 3 /usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/5723646052/5723646052_R05C02
## 4 /usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/5723646053/5723646053_R04C02
## 5 /usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/5723646053/5723646053_R05C02
## 6 /usr/local/R/R-3.2.x/lib/R/site-library/minfiData/extdata/5723646053/5723646053_R06C02

RGSet <- read.450k.exp(targets = targets)

The class of RGSet is a RGChannelSet object. This is the initial object of a minfi analysis that contains the raw intensities in the green and red channels. Note that this object contains the intensities of the internal control probes as well. Because we read the data from a data sheet experiment, the phenotype data is also stored in the RGChannelSet and can be accessed via the accessor command pData:

phenoData <- pData(RGSet)
phenoData[,1:6]

##                   Sample_Name Sample_Well Sample_Plate Sample_Group
## 5723646052_R02C02    GroupA_3          H5           NA       GroupA
## 5723646052_R04C01    GroupA_2          D5           NA       GroupA
## 5723646052_R05C02    GroupB_3          C6           NA       GroupB
## 5723646053_R04C02    GroupB_1          F7           NA       GroupB
## 5723646053_R05C02    GroupA_1          G7           NA       GroupA
## 5723646053_R06C02    GroupB_2          H7           NA       GroupB
##                   Pool_ID person
## 5723646052_R02C02      NA    id3
## 5723646052_R04C01      NA    id2
## 5723646052_R05C02      NA    id3
## 5723646053_R04C02      NA    id1
## 5723646053_R05C02      NA    id1
## 5723646053_R06C02      NA    id2

The RGChannelSet stores also a manifest object that contains the probe design information of the array:

manifest <- getManifest(RGSet)
manifest

## IlluminaMethylationManifest object
## Annotation
##   array: IlluminaHumanMethylation450k
## Number of type I probes: 135476 
## Number of type II probes: 350036 
## Number of control probes: 850 
## Number of SNP type I probes: 25 
## Number of SNP type II probes: 40

head(getProbeInfo(manifest))

## DataFrame with 6 rows and 8 columns
##          Name    AddressA    AddressB       Color       NextBase
##   <character> <character> <character> <character> <DNAStringSet>
## 1  cg00050873    32735311    31717405         Red              A
## 2  cg00212031    29674443    38703326         Red              T
## 3  cg00213748    30703409    36767301         Red              A
## 4  cg00214611    69792329    46723459         Red              A
## 5  cg00455876    27653438    69732350         Red              A
## 6  cg01707559    45652402    64689504         Red              A
##                                            ProbeSeqA
##                                       <DNAStringSet>
## 1 ACAAAAAAACAACACACAACTATAATAATTTTTAAAATAAATAAACCCCA
## 2 CCCAATTAACCACAAAAACTAAACAAATTATACAATCAAAAAAACATACA
## 3 TTTTAACACCTAACACCATTTTAACAATAAAAATTCTACAAAAAAAAACA
## 4 CTAACTTCCAAACCACACTTTATATACTAAACTACAATATAACACAAACA
## 5 AACTCTAAACTACCCAACACAAACTCCAAAAACTTCTCAAAAAAAACTCA
## 6 ACAAATTAAAAACACTAAAACAAACACAACAACTACAACAACAAAAAACA
##                                            ProbeSeqB      nCpG
##                                       <DNAStringSet> <integer>
## 1 ACGAAAAAACAACGCACAACTATAATAATTTTTAAAATAAATAAACCCCG         2
## 2 CCCAATTAACCGCAAAAACTAAACAAATTATACGATCGAAAAAACGTACG         4
## 3 TTTTAACGCCTAACACCGTTTTAACGATAAAAATTCTACAAAAAAAAACG         3
## 4 CTAACTTCCGAACCGCGCTTTATATACTAAACTACAATATAACGCGAACG         5
## 5 AACTCTAAACTACCCGACACAAACTCCAAAAACTTCTCGAAAAAAACTCG         2
## 6 GCGAATTAAAAACACTAAAACGAACGCGACGACTACAACGACAAAAAACG         6

See the minfi vignette for more information.

3 Classes

alt tag

3.1 MethylSet and RatioSet

A MethylSet objects contains only the methylated and unmethylated signals. You create this by

MSet <- preprocessRaw(RGSet) 
MSet

## MethylSet (storageMode: lockedEnvironment)
## assayData: 485512 features, 6 samples 
##   element names: Meth, Unmeth 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: 5723646052_R02C02 5723646052_R04C01 ...
##     5723646053_R06C02 (6 total)
##   varLabels: Sample_Name Sample_Well ... filenames (13 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.15.11
##   Manifest version: 0.4.0

This function matches up the different probes and color channels. Note that the dimension of this object is much smaller than for the RGChannelSet; this is because CpGs measured by type I probes are measured by 2 probes.

The accessors getMeth and getUnmeth can be used to get the methylated and unmethylated intensities matrices:

head(getMeth(MSet)[,1:3])

##            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## cg00050873             22041               588             20505
## cg00212031               679               569               439
## cg00213748              1620               421               707
## cg00214611               449               614               343
## cg00455876              5921               398              3257
## cg01707559              1238               646               637

head(getUnmeth(MSet)[,1:3])

##            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## cg00050873              1945               433              1012
## cg00212031              6567               300              2689
## cg00213748               384               461               295
## cg00214611              4869               183              1655
## cg00455876              1655               792              1060
## cg01707559             12227              1009              7414

Other preprocessing functions exists; see below for an extensive discussion.

A RatioSet object is a class designed to store Beta values and/or M values instead of the methylated and unmethylated signals. An optional copy number matrix, CN, the sum of the methylated and unmethylated signals, can be also stored. Mapping a MethylSet to a RatioSet may be irreversible, i.e. one cannot be guranteed to retrieve the methylated and unmethylated signals from a RatioSet. A RatioSet can be created with the function ratioConvert:

RSet <- ratioConvert(MSet, what = "both", keepCN = TRUE)
RSet

## RatioSet (storageMode: lockedEnvironment)
## assayData: 485512 features, 6 samples 
##   element names: Beta, CN, M 
## An object of class 'AnnotatedDataFrame'
##   sampleNames: 5723646052_R02C02 5723646052_R04C01 ...
##     5723646053_R06C02 (6 total)
##   varLabels: Sample_Name Sample_Well ... filenames (13 total)
##   varMetadata: labelDescription
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.15.11
##   Manifest version: 0.4.0

The functions getBeta, getM and getCN return respectively the Beta value matrix, M value matrix and the Copy Number matrix.

beta <- getBeta(RSet)

3.2 GenomicRatioSet

The function mapToGenome applied to a RatioSet object will add genomic coordinates to each probe together with some additional annotation information. The output object is a GenomicRatioSet (class holding M or/and Beta values together with associated genomic coordinates). It is possible to merge the manifest object with the genomic locations by setting the option mergeManifest to TRUE.

GRset <- mapToGenome(RSet)
GRset

## class: GenomicRatioSet 
## dim: 485512 6 
## metadata(0):
## assays(3): Beta M CN
## rownames(485512): cg13869341 cg14008030 ... cg08265308 cg14273923
## rowRanges metadata column names(0):
## colnames(6): 5723646052_R02C02 5723646052_R04C01 ...
##   5723646053_R05C02 5723646053_R06C02
## colData names(13): Sample_Name Sample_Well ... Basename filenames
## Annotation
##   array: IlluminaHumanMethylation450k
##   annotation: ilmn12.hg19
## Preprocessing
##   Method: Raw (no normalization or bg correction)
##   minfi version: 1.15.11
##   Manifest version: 0.4.0

Note that the GenomicRatioSet extends the class SummarizedExperiment. Here are the main accessors functions to access the data:

beta <- getBeta(GRset)
M <- getM(GRset)
CN <- getCN(GRset)

sampleNames <- sampleNames(GRset)
probeNames <- featureNames(GRset)
pheno <- pData(GRset)

To return the probe locations as a GenomicRanges objects, one can use the accessor granges:

gr <- granges(GRset)
head(gr, n= 3)

## GRanges object with 3 ranges and 0 metadata columns:
##              seqnames         ranges strand
##                 <Rle>      <IRanges>  <Rle>
##   cg13869341     chr1 [15865, 15865]      *
##   cg14008030     chr1 [18827, 18827]      *
##   cg12045430     chr1 [29407, 29407]      *
##   -------
##   seqinfo: 24 sequences from hg19 genome; no seqlengths

To access the full annotation, one can use the command getAnnotation:

annotation <- getAnnotation(GRset)
names(annotation)

##  [1] "chr"                      "pos"                     
##  [3] "strand"                   "Name"                    
##  [5] "AddressA"                 "AddressB"                
##  [7] "ProbeSeqA"                "ProbeSeqB"               
##  [9] "Type"                     "NextBase"                
## [11] "Color"                    "Probe_rs"                
## [13] "Probe_maf"                "CpG_rs"                  
## [15] "CpG_maf"                  "SBE_rs"                  
## [17] "SBE_maf"                  "Islands_Name"            
## [19] "Relation_to_Island"       "Forward_Sequence"        
## [21] "SourceSeq"                "Random_Loci"             
## [23] "Methyl27_Loci"            "UCSC_RefGene_Name"       
## [25] "UCSC_RefGene_Accession"   "UCSC_RefGene_Group"      
## [27] "Phantom"                  "DMR"                     
## [29] "Enhancer"                 "HMM_Island"              
## [31] "Regulatory_Feature_Name"  "Regulatory_Feature_Group"
## [33] "DHS"

3.3 GenomicMethylSet

The mapToGenome function can also be used on MethylSet to get a GenomicMethylSet. The ratioConvert and mapToGenome functions commute, see the diagram above.

4 Quality control

4.1 QC plot

minfi provides a simple quality control plot that uses the log median intensity in both the methylated (M) and unmethylated (U) channels. When plotting these two medians against each other, it has been observed that good samples cluster together, while failed samples tend to separate and have lower median intensities. In order to obtain the methylated and unmethylated signals, we need to convert the RGChannelSet to an object containing the methylated and unmethylated signals using the function preprocessRaw. It takes as input a RGChannelSet and converts the red and green intensities to methylated and unmethylated signals according to the special 450K probe design, and returns the converted signals in a new object of class MethylSet. It does not perform any normalization.

The accessors getMeth and getUnmeth can be used to get the methylated and unmethylated intensities matrices:

head(getMeth(MSet)[,1:3])

##            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## cg00050873             22041               588             20505
## cg00212031               679               569               439
## cg00213748              1620               421               707
## cg00214611               449               614               343
## cg00455876              5921               398              3257
## cg01707559              1238               646               637

head(getUnmeth(MSet)[,1:3])

##            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## cg00050873              1945               433              1012
## cg00212031              6567               300              2689
## cg00213748               384               461               295
## cg00214611              4869               183              1655
## cg00455876              1655               792              1060
## cg01707559             12227              1009              7414

The functions getQC and plotQC are designed to extract and plot the quality control information from the MethylSet:

qc <- getQC(MSet)
head(qc)

## DataFrame with 6 rows and 2 columns
##                        mMed      uMed
##                   <numeric> <numeric>
## 5723646052_R02C02  11.69566  11.82058
## 5723646052_R04C01  11.99046  11.95274
## 5723646052_R05C02  11.55603  12.05393
## 5723646053_R04C02  12.06609  12.09276
## 5723646053_R05C02  12.23332  12.08448
## 5723646053_R06C02  11.36851  11.60594

plotQC(qc)

Moreover, the function addQC applied to the MethylSet will add the QC information to the phenotype data.

To further explore the quality of the samples, it is useful to look at the Beta value densities of the samples, with the option to color the densities by group:

densityPlot(MSet, sampGroups = phenoData$Sample_Group)

or density bean plots:

densityBeanPlot(MSet, sampGroups = phenoData$Sample_Group)

shinyMethyl is particularly useful to visualize all plots at the same time in an interactive fashion.

4.2 Control probes plot

The 450k array contains several internal control probes that can be used to assess the quality control of different sample preparation steps (bisulfite conversion, hybridization, etc.). The values of these control probes are stored in the initial RGChannelSet and can be plotted by using the function controlStripPlot and by specifying the control probe type:

controlStripPlot(RGSet, controls="BISULFITE CONVERSION II")

All the plots above can be exported into a pdf file in one step using the function qcReport:

qcReport(RGSet, pdf= "qcReport.pdf")

4.3 Sex prediction

By looking at the median total intensity of the X chromosome-mapped probes, denoted \(med(X)\), and the median total intensity of the Y-chromosome-mapped probes, denoted \(med(Y)\), one can observe two different clusters of points corresponding to which gender the samples belong to. To predict the gender, minfi separates the points by using a cutoff on \(\log_2{med(Y)}-\log_2{med(Y)}\). The default cutoff is \(-2\). Since the algorithm needs to map probes to the X-chr and to the Y-chr, the input of the function getSex needs to be a GenomicMethylSet or a GenomicRatioSet.

predictedSex <- getSex(GRset, cutoff = -2)$predictedSex
head(predictedSex)

## [1] "M" "F" "M" "F" "F" "F"

To choose the cutoff to separate the two gender clusters, one can plot \(med(Y)\) against \(med(Y)\) with the function plotSex:

plotSex(getSex(GRset, cutoff = -2))

Finally, the function addSex applied to the GenomicRatioSet will add the predicted sex to the phenotype data.

Note: the function does not handle datasets with only females or only males

5 Preprocessing and normalization

So far, we did not use any normalization to process the data. Different normalization procedures are available in minfi. Many people have asked us which normalization they should apply to their dataset. Our rule of thumb is the following. If there exist global biological methylation differences between your samples, as for instance a dataset with cancer and normal samples, or a dataset with different tissues/cell types, use the preprocessFunnorm function as it is aimed for such datasets. On the other hand, if you do not expect global differences between your samples, for instance a blood dataset, or one-tissue dataset, use the preprocessQuantile function. In our experience, these two normalization procedures perform always better than the functions preprocessRaw, preprocessIllumina and preprocessSWAN discussed below. For convenience, these functions are still implemented in the minfi package.

5.1 preprocessRaw

As seen before, it converts a RGChannelSet to a MethylSet by converting the Red and Green channels into a matrix of methylated signals and a matrix of unmethylated signals. No normalization is performed.

Input: RGChannelSet
Output: MethylSet

5.2 preprocessIllumina

Convert a RGChannelSet to a MethylSet by implementing the preprocessing choices as available in Genome Studio: background subtraction and control normalization. Both of them are optional and turning them off is equivalent to raw preprocessing (preprocessRaw):

MSet.illumina <- preprocessIllumina(RGSet, bg.correct = TRUE,
                               normalize = "controls")

Input: RGChannelSet
Output: MethylSet

5.3 preprocessSWAN

Perform Subset-quantile within array normalization (SWAN) (Jovana Maksimovic, Lavinia Gordon, and Alicia Oshlack 2012), a within-array normalization correction for the technical differences between the Type I and Type II array designs. The algorithm matches the Beta-value distributions of the Type I and Type II probes by applying a within-array quantile normalization separately for different subsets of probes (divided by CpG content). The input of SWAN is a MethylSet, and the function returns a MethylSet as well. If an RGChannelSet is provided instead, the function will first call preprocessRaw on the RGChannelSet, and then apply the SWAN normalization. We recommend setting a seed (using set.seed) before using preprocessSWAN to ensure that the normalized intensities will be reproducible.

MSet.swan <- preprocessSWAN(RGSet)

Input: RGChannelSet or MethylSet
Output: MethylSet

5.4 preprocessQuantile

This function implements stratified quantile normalization preprocessing. The normalization procedure is applied to the Meth and Unmeth intensities separately. The distribution of type I and type II signals is forced to be the same by first quantile normalizing the type II probes across samples and then interpolating a reference distribution to which we normalize the type I probes. Since probe types and probe regions are confounded and we know that DNAm distributions vary across regions we stratify the probes by region before applying this interpolation. Note that this algorithm relies on the assumptions necessary for quantile normalization to be applicable and thus is not recommended for cases where global changes are expected such as in cancer-normal comparisons. Note that this normalization procedure is essentially similar to one previously presented (Nizar Touleimat and Jörg Tost 2012). The different options can be summarized into the following list:

If fixMethOutlier is TRUE, the functions fixes outliers of both the methylated and unmethylated channels when small intensities are close to zero.
If removeBadSamples is TRUE, it removes bad samples using the QC criterion discussed previously
Performs stratified subset quantile normalization if quantileNormalize=TRUE and stratified=TRUE
Predicts the sex (if not provided in the sex argument) using the function getSex and normalizes males and females separately for the probes on the X and Y chromosomes

GRset.quantile <- preprocessQuantile(RGSet, fixOutliers = TRUE,
  removeBadSamples = TRUE, badSampleCutoff = 10.5,
  quantileNormalize = TRUE, stratified = TRUE, 
  mergeManifest = FALSE, sex = NULL)

## [preprocessQuantile] Mapping to genome.
## [preprocessQuantile] Fixing outliers.
## [preprocessQuantile] Quantile normalizing.

Input: RGChannelSet
Output: GenomicRatioSet

Note: The function returns a GenomicRatioSet object ready for downstream analysis.

5.5 preprocessNoob

The function preprocessNoob implements the noob background subtraction method with dye-bias normalization discussed in (Timothy J Triche et al. 2013). Briefly, the background subtraction method estimates background noise from the out-of-band probes and remove it for each sample separately, while the dye-bias normalization utilizes a subset of the control probes to estimate the dye bias. By default, the function will perform both procedures:

MSet.noob <- preprocessNoob(RGSet)

## [preprocessNoob] Using sample number 2 as reference level...

Input: RGChannelSet
Output: MethylSet

5.6 preprocessFunnorm

The function preprocessFunnorm implements the functional normalization algorithm developed in (Jean-Philippe Fortin et al. 2014). Briefly, it uses the internal control probes present on the array to infer between-array technical variation. It is particularly useful for studies comparing conditions with known large-scale differences, such as cancer/normal studies, or between-tissue studies. It has been shown that for such studies, functional normalization outperforms other existing approaches (Jean-Philippe Fortin et al. 2014). By default, the function applies the preprocessNoob function as a first step for background substraction, and uses the first two principal components of the control probes to infer the unwanted variation.

GRset.funnorm <- preprocessFunnorm(RGSet)

## [preprocessFunnorm] Background and dye bias correction with noob 
## [preprocessNoob] Using sample number 2 as reference level...
## [preprocessFunnorm] Mapping to genome
## [preprocessFunnorm] Quantile extraction
## [preprocessFunnorm] Normalization

Input: RGChannelSet
Output: GenomicRatioSet

As the preprocessQuantile function, it returns a GenomicRatioSet object.

6 Genetic variants and cell type composition

6.1 SNPs

Because the presence of SNPs inside the probe body or at the nucleotide extension can have important consequences on the downstream analysis, minfi offers the possibility to remove such probes. The function getSnpInfo, applied to a GenomicRatioSet, returns a data frame with 6 columns containing the SNP information of the probes:

snps <- getSnpInfo(GRset)
head(snps,10)

## DataFrame with 10 rows and 6 columns
##               Probe_rs Probe_maf      CpG_rs   CpG_maf      SBE_rs
##            <character> <numeric> <character> <numeric> <character>
## cg13869341          NA        NA          NA        NA          NA
## cg14008030          NA        NA          NA        NA          NA
## cg12045430          NA        NA          NA        NA          NA
## cg20826792          NA        NA          NA        NA          NA
## cg00381604          NA        NA          NA        NA          NA
## cg20253340          NA        NA          NA        NA          NA
## cg21870274          NA        NA          NA        NA          NA
## cg03130891  rs77418980  0.305556          NA        NA          NA
## cg24335620 rs147502335  0.012800          NA        NA          NA
## cg16162899          NA        NA          NA        NA          NA
##              SBE_maf
##            <numeric>
## cg13869341        NA
## cg14008030        NA
## cg12045430        NA
## cg20826792        NA
## cg00381604        NA
## cg20253340        NA
## cg21870274        NA
## cg03130891        NA
## cg24335620        NA
## cg16162899        NA

Probe, CpG and SBE correspond the SNPs present inside the probe body, at the CpG interrogation and at the single nucleotide extension respectively. The columns with rs give the names of the SNPs while the columns with maf gives the minor allele frequency of the SNPs based on the dbSnp database. The function addSnpInfo will add to the GenomicRanges of the GenomicRatioSet the 6 columns:

GRset <- addSnpInfo(GRset)

We strongly recommend to drop the probes that contain either a SNP at the CpG interrogation or at the single nucleotide extension.The function dropLociWithSnps allows to drop the corresponding probes. Here is an example where we drop the probes containing a SNP at the CpG interrogation and/or at the single nucleotide extension, for any minor allele frequency:

GRset <- dropLociWithSnps(GRset, snps=c("SBE","CpG"), maf=0)

6.2 Cell type composition

As shown in (Andrew E Jaffe and Rafael A Irizarry 2014), biological findings in blood samples can often be confounded with cell type composition. In order to estimate the confounding levels between phenotype and cell type composition, the function estimateCellCounts depending on the package FlowSorted.Blood.450kestimates the cell type composition of blood samples by using a modified version of the algorithm described in (E Andres Houseman et al. 2012). The function takes as input a RGChannelSet and returns a cell counts vector for each samples:

library(FlowSorted.Blood.450k)
cellCounts <- estimateCellCounts(RGset)

7 Identifying DMRs and DMPs

7.1 dmpFinder: to find differentially methylated positions (DMPs)

While we do not particularly encourage a single position differential methylation analysis, the function dmpFinder finds differentially methylated positions (DMPs) with respect to a phenotype covariate. The phenotype may be categorical (e.g. cancer vs. normal) or continuous (e.g. blood pressure). Below is an example of a DMP analysis for age using the GRset.funnorm object created above:

beta <- getBeta(GRset.funnorm)
age  <- pData(GRset.funnorm)$age
dmp <- dmpFinder(beta, pheno = age  , type = "continuous")
head(dmp)

##                intercept          beta         t         pval      qval
## cg19995126  1.990494e+00 -0.0235606444 -69.25397 2.604769e-07 0.1264647
## cg10467968  3.320006e-01  0.0069860611  39.02973 2.574370e-06 0.5113251
## cg11425201  8.452487e-01  0.0009700050  37.07724 3.159500e-06 0.5113251
## cg24814628  1.020733e+00 -0.0025065546 -28.23109 9.367345e-06 0.9583333
## cg01854459  6.101549e-02 -0.0003851095 -25.47377 1.410360e-05 0.9583333
## cg26634885 -7.791584e-05  0.0002179788  24.33811 1.690953e-05 0.9583333

The function is essentially a wrapper around lmFit from limma.

7.2 bumphunter: to find differentially methylated regions (DMRs)

The bumphunter function in minfi is a version of the bump hunting algorithm (Andrew E Jaffe et al. 2012) adapted to the 450k array, relying on the bumphunter function implemented in the eponym package bumphunter

Instead of looking for association between a single genomic location and a phenotype of interest, bumphunter looks for genomic regions that are differentially methylated between two conditions. In the context of the 450k array, the algorithm first defines clusters of probes. Clusters are simply groups of probes such that two consecutive probe locations in the cluster are not separated by more than some distance mapGap.

Briefly, the algorithm first computes a t-statistic at each genomic location, with optional smoothing. Then, it defines a candidate region to be a cluster of probes for which all the t-statistics exceed a predefined threshold. To test for significance of the candidate regions, the algorithm uses permutations (defined by the parameter B). The permutation scheme is expensive, and can take a few days when the number of candidate bumps is large. To avoid wasting time, we propose the following guideline:

Define your phenotype of interest

pheno <- pData(GRset.funnorm)$status
designMatrix <- model.matrix(~ pheno)

Run the algorithm with B=0 permutation on the Beta-values, with a medium difference cutoff, say 0.2 (which corresponds to 20\% difference on the Beta-values):

dmrs <- bumphunter(GRset.funnorm, design = designMatrix, 
             cutoff = 0.2, B=0, type="Beta")

If the number of candidate bumps is large, say \(>30000\), increase the cutoff to reduce the number of candidate bumps. The rationale behind this is that the most of the additional candidate regions found by lowering the cutoff will be found to be non-significant after the permutation scheme, and therefore time can be saved by being more stringent on the cutoff (high cutoff).
Once you have decided on the cutoff, run the algorithm with a large number of permutations, say B=1000:

dmrs <- bumphunter(GRset.funnorm, design = designMatrix, 
             cutoff = 0.2, B=1000, type="Beta")

Since the permutation scheme can be expensive, parallel computation is implemented in the bumphunter function. The foreach package allows different parallel back-ends that will distribute the computation across multiple cores in a single machine, or across machines in a cluster. For instance, if one wished to use 3 cores, the two following commands have to be run before running bumphunter:

library(doParallel)
registerDoParallel(cores = 3)

The results of bumphunter are stored in a data frame with the rows being the different differentially methylated regions (DMRs):

names(dmrs)
head(dmrs$table, n=3)

As an example, we have run the bump hunting algorithm to find DMRs between colon and kidney (20 samples each from TCGA), with \(B=1000\) permutations, and a cutoff of 0.2 on the Beta values:

data("dmrs_B1000_c02")
head(dmrs$table)

The start and end columns indicate the limiting genomic locations of the DMR; the value column indicates the average difference in methylation in the bump, and the area column indicates the area of the bump with respect to the 0 line. The fwer column returns the family-wise error rate (FWER) of the regions estimated by the permeation scheme. One can filter the results by picking a cutoff on the FWER.

7.3 Block finder.

The approximately 170,000 open sea probes on the 450K array can be used to detect long-range changes in methylation status. These large scale changes that can range up to several Mb have typically been identified only through whole-genome bisulfite sequencing. The function blockFinder groups the average methylation values in open-sea probe cluster (via cpgCollapse) into large regions, and then run the bumphunter algorithm with a large (250KB+) smoothing window (see the bump hunting section for DMRs above).

7.4 Batch effects correction with SVA

Surrogate variable analysis (SVA) (Jeffrey T Leek and John D Storey 2007,Jeffrey T Leek and John D Storey (2008)) is a useful tool to identified surrogate variables for unwanted variation while protecting for a phenotype of interest. In our experience, running SVA after normalizing the 450K data with preprocessFunnorm or preprocessQuantile increases the statistical power of the downstream analysis. For instance, to run SVA on the M-values, protecting for case-control status, the following code can be used to estimate the surrogate variables (this can take a few hours to run):

library(sva)
mval <- getM(GRset)[1:5000,]
pheno <- pData(GRset)
mod <- model.matrix(~as.factor(status), data=pheno)
mod0 <- model.matrix(~1, data=pheno)
sva.results <- sva(mval, mod, mod0)

## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

Once the surrogate variables are computed, one can include them in the downstream analysis to adjust for unknown unwanted variation. See the sva vignette for a more comprehensive use of sva.

8 Other tasks

8.1 A/B compartments prediction.

The function compartments extracts A/B compartments from Illumina methylation microarrays, as described in (Fortin and Hansen 2015). Analysis of Hi-C data has shown that the genome can be divided into two compartments (A/B compartments) that are cell-type specific and are associated with open and closed chromatin respectively. The approximately 170,000 open sea probes on the 450k array can be used to estimate these compartments by computing the first eigenvector on a binned correlation matrix. The binning resolution can be specified by resolution, and by default is set to a 100 kb. We do not recommend higher resolutions because of the low-resolution probe design of the 450k array. For instance, to extract the compartments for chromosome 12, one can use

ab <- compartments(grset.quantile, chr="chr14", resolution=100`1000)

8.2 Potentially useful functions

8.2.1 getSnpBeta

The array contains by design 65 probes that are not meant to interrogate methylation status, but instead are designed to interrogate SNPs. By default, minfi drops these probes. The function getSnpBeta allows the user to extract the Beta values for those probes from an RGChannelSet. The return object is a matrix with the columns being the samples and the rows being the different SNP probes:

snps <- getSnpBeta(RGSet)
head(snps)

##            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
## rs1019916         0.44141800        0.50740172         0.4854442
## rs10457834        0.50760422        0.55113785         0.6654530
## rs1416770         0.06412214        0.04665314         0.1041461
## rs1941955         0.94617711        0.49905060         0.9499797
## rs2125573         0.54285459        0.56833273         0.6112521
## rs2235751         0.47123623        0.47824994         0.3911962
##            5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
## rs1019916         0.57308474        0.52069297         0.8387841
## rs10457834        0.42238892        0.55499709         0.5689888
## rs1416770         0.91858483        0.93748229         0.1125396
## rs1941955         0.90839010        0.93859649         0.7683307
## rs2125573         0.11020630        0.11503778         0.8059526
## rs2235751         0.06187767        0.06915974         0.5231096

These SNP probes are intended to be used for sample tracking and sample mixups. Each SNP probe ought to have values clustered around 3 distinct values corresponding to homo-, and hetero-zygotes.

8.2.2 Out-of-band probes

The function getOOB applied to an RGChannelSet retrieves the so-called out-of-band (OOB) probes. These are the measurements of Type I probes in the “wrong” color channel. They are used in the preprocessNoob and preprocessFunnorm functions to estimate background noise. The function returns a list with two matrices, named Red and Grn.

oob <- getOOB(RGSet)

9 Session information

As last part of this document, we call the function sessionInfo, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record as it will help to trace down what has happened in case that an R script ceases to work because the functions have been changed in a newer version of a package. The session information should also always be included in any emails to the Bioconductor support site along with all code used in the analysis.

sessionInfo()

## R version 3.2.1 Patched (2015-07-12 r68650)
## Platform: x86_64-apple-darwin14.4.0 (64-bit)
## Running under: OS X 10.10.4 (Yosemite)
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] sva_3.15.0                                        
##  [2] genefilter_1.51.0                                 
##  [3] mgcv_1.8-6                                        
##  [4] nlme_3.1-121                                      
##  [5] minfiData_0.11.0                                  
##  [6] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
##  [7] IlluminaHumanMethylation450kmanifest_0.4.0        
##  [8] minfi_1.15.11                                     
##  [9] bumphunter_1.9.1                                  
## [10] locfit_1.5-9.1                                    
## [11] iterators_1.0.7                                   
## [12] foreach_1.4.2                                     
## [13] Biostrings_2.37.2                                 
## [14] XVector_0.9.1                                     
## [15] SummarizedExperiment_0.3.2                        
## [16] GenomicRanges_1.21.16                             
## [17] GenomeInfoDb_1.5.8                                
## [18] IRanges_2.3.14                                    
## [19] S4Vectors_0.7.10                                  
## [20] lattice_0.20-31                                   
## [21] Biobase_2.29.1                                    
## [22] BiocGenerics_0.15.3                               
## [23] BiocStyle_1.7.4                                   
## 
## loaded via a namespace (and not attached):
##  [1] nor1mix_1.2-0            splines_3.2.1           
##  [3] doRNG_1.6                Rsamtools_1.21.13       
##  [5] yaml_2.1.13              RSQLite_1.0.0           
##  [7] limma_3.25.13            quadprog_1.5-5          
##  [9] RGCCA_2.0                digest_0.6.8            
## [11] RColorBrewer_1.1-2       colorspace_1.2-6        
## [13] Matrix_1.2-2             htmltools_0.2.6         
## [15] preprocessCore_1.31.0    plyr_1.8.3              
## [17] GEOquery_2.35.4          siggenes_1.43.0         
## [19] XML_3.98-1.3             mixOmics_5.0-4          
## [21] pheatmap_1.0.7           biomaRt_2.25.1          
## [23] zlibbioc_1.15.0          xtable_1.7-4            
## [25] scales_0.2.5             BiocParallel_1.3.34     
## [27] annotate_1.47.1          beanplot_1.2            
## [29] pkgmaker_0.22            GenomicFeatures_1.21.13 
## [31] survival_2.38-3          magrittr_1.5            
## [33] mclust_5.0.2             evaluate_0.7            
## [35] MASS_7.3-42              tools_3.2.1             
## [37] registry_0.3             formatR_1.2             
## [39] matrixStats_0.14.2       stringr_1.0.0           
## [41] munsell_0.4.2            rngtools_1.2.4          
## [43] AnnotationDbi_1.31.17    lambda.r_1.1.7          
## [45] base64_1.1               futile.logger_1.4.1     
## [47] grid_3.2.1               RCurl_1.95-4.7          
## [49] igraph_1.0.1             bitops_1.0-6            
## [51] rmarkdown_0.7            gtable_0.1.2            
## [53] codetools_0.2-11         multtest_2.25.2         
## [55] DBI_0.3.1                reshape_0.8.5           
## [57] illuminaio_0.11.1        GenomicAlignments_1.5.11
## [59] knitr_1.10.5             rtracklayer_1.29.12     
## [61] futile.options_1.0.0     stringi_0.5-5           
## [63] Rcpp_0.11.6              rgl_0.95.1247

References

Andrew E Jaffe, and Rafael A Irizarry. 2014. “Accounting for cellular heterogeneity is critical in epigenome-wide association studies.” Genome Biology 15 (2): R31. doi:10.1186/gb-2014-15-2-r31.

Andrew E Jaffe, Peter Murakami, Hwajin Lee, Jeffrey T Leek, M Daniele Fallin, Andrew P Feinberg, and Rafael A Irizarry. 2012. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology 41 (1): 200–209. doi:10.1093/ije/dyr238.

E Andres Houseman, William P Accomando, Devin C Koestler, Brock C Christensen, Carmen J Marsit, Heather H Nelson, John K Wiencke, and Karl T Kelsey. 2012. “DNA methylation arrays as surrogate measures of cell mixture distribution.” BMC Bioinformatics 13 (1): 86. doi:10.1186/1471-2105-13-86.

Fortin, Jean-Philippe, and Kasper D. Hansen. 2015. “Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data.” BioRxiv. doi:10.1101/019000.

Jean-Philippe Fortin, Aurélie Labbe, Mathieu Lemire, Brent W Zanke, Thomas J Hudson, Elana J Fertig, Celia MT Greenwood, and Kasper D Hansen. 2014. “Functional normalization of 450k methylation array data improves replication in large cancer studies.” Genome Biology 15 (11): 503. doi:10.1186/s13059-014-0503-2.

Jeffrey T Leek, and John D Storey. 2007. “Capturing heterogeneity in gene expression studies by surrogate variable analysis.” PLoS Genetics 3 (9): 1724–35. doi:10.1371/journal.pgen.0030161.

———. 2008. “A general framework for multiple testing dependence.” Proceedings of the National Academy of Sciences 105 (48): 18718–23. doi:10.1073/pnas.0808709105.

Jovana Maksimovic, Lavinia Gordon, and Alicia Oshlack. 2012. “SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips.” Genome Biology 13 (6): R44. doi:10.1186/gb-2012-13-6-r44.

Martin J Aryee, Andrew E Jaffe, Hector Corrada Bravo, Christine Ladd-Acosta, Andrew P Feinberg, Kasper D Hansen, and Rafael A Irizarry. 2014. “Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.” Bioinformatics 30 (10): 1363–69. doi:10.1093/bioinformatics/btu049.

Nizar Touleimat, and Jörg Tost. 2012. “Complete pipeline for Infinium Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation.” Epigenomics 4 (3): 325–41. doi:10.2217/epi.12.21.

Timothy J Triche, Daniel J Weisenberger, David Van Den Berg, Peter W Laird, and Kimberly D Siegmund. 2013. “Low-level processing of Illumina Infinium DNA Methylation BeadArrays.” Nucleic Acids Research 41 (7): e90. doi:10.1093/nar/gkt090.

Analysis of 450k data using minfi

Jean-Philippe Fortin, Kasper Daniel Hansen

Contents

1 Introduction

1.1 Dependencies

1.2 450k Array design and terminology

2 Reading Data

3 Classes

3.1 MethylSet and RatioSet

3.2 GenomicRatioSet

3.3 GenomicMethylSet

4 Quality control

4.1 QC plot

4.2 Control probes plot

4.3 Sex prediction

5 Preprocessing and normalization

5.1 preprocessRaw

5.2 preprocessIllumina

5.3 preprocessSWAN

5.4 preprocessQuantile

5.5 preprocessNoob

5.6 preprocessFunnorm

6 Genetic variants and cell type composition

6.1 SNPs

6.2 Cell type composition

7 Identifying DMRs and DMPs

7.1 dmpFinder: to find differentially methylated positions (DMPs)

7.2 bumphunter: to find differentially methylated regions (DMRs)

7.3 Block finder.

7.4 Batch effects correction with SVA

8 Other tasks

8.1 A/B compartments prediction.

8.2 Potentially useful functions

8.2.1 getSnpBeta

8.2.2 Out-of-band probes

9 Session information

References