# 1 Introduction

## 1.1 Background

Epimutations are rare alterations in the methylation pattern at specific loci. Have been demonstrated that epimutations could be the causative factor of some genetic diseases. For example, epimutations can lead to cancers, such as Lynch syndrome, rare diseases such as Prader-Willi syndrome, and are associated with common disorders, such as autism. Nonetheless, no standard methods are available to detect and quantify these alterations. Two methods for epimutations detection on methylation microarray data have been reported: (1) based on identifying CpGs with outlier values and then cluster them in epimutations (Barbosa et al. 2018); (2) define candidate regions with bumphunter and test their statistical significance with a MANOVA (Aref-Eshghi et al. 2019). However, the implementation of these methods is not publicly available, and these approaches have not been compared.

We have developed epimutacions R/Bioconductor package. We have implemented those two methods (called quantile and manova, respectively). We implemented four additional approaches, using a different distribution to detect CpG outliers (beta), or a different model to assess region significance (mlm, mahdist, and iForest).

The package epimutacions provides tools to raw DNA methylation microarray intensities normalization and epimutations identification, visualization and annotation.
The full epimutacions user´s guide is available in this vignette.
The main function to estimate the epimutations is called epimutations().

The name of the package is epimutacions (pronounced ɛ pi mu ta 'sj ons) which means epimutations in Catalan, a language from the northeast of Spain.

## 1.2 Methodology

The epimutacions package computes a genome-wide DNA methylation analysis to identify the epimutations to be considered as biomarkers for case samples with a suspected genetic disease. The function epimutations() infers epimutations on a case-control design. It compares a case sample against a reference panel (healthy individuals). The package also includes leave-one-out approach (epimutations_one_leave_out()). It compares individual methylation profiles of a single sample (regardless if they are cases or controls) against all other samples from the same cohort.

The epimutacions package includes 6 outlier detection approaches (figure 1): (1) Multivariate Analysis of variance (manova), (2) Multivariate Linear Model (mlm), (3) isolation forest (iForest), (4) robust mahalanobis distance (mahdist) (5) quantile and (6) beta.

The approaches manova, mlm, iForest and mahdist define the candidate regions (differentially methylated regions (DMRs)) using bumphunter method (Jaffe et al. 2012). Then, those DMRs are tested to identify regions with CpGs being outliers when comparing with the reference panel. quantile uses the sliding window approach to individually compare the methylation value in each proband against the reference panel and then cluster them in epimutations.
Beta utilizes beta distribution to identify outlier CpGs.

We have defined an epimutation as a consecutive window of a minimum of 3 outliers CpGs with a maximum distance of 1kb between them (Barbosa et al. 2018).

# 2 Setup

## 2.1 Installing the packages

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("epimutacions")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("epimutacionsData")

library(epimutacions)

## 2.3 Quick start

The workflow in figure 2 explains the main analysis in the epimutacions package.

The package allows two different types of inputs:

1. Case samples IDAT files (raw microarray intensities) together with RGChannelSet object as reference panel. The reference panel can be supplied by the user or can be selected through the example datasets that the package provides (section 3).
1. GenomicRatioSet object containing case and control samples.

The normalization (epi_preprocess()) converts the raw microarray intensities into usable methylation measurement ($$\beta$$ values at CpG locus level). As a result, we obtain a GenomicRatioSet object, which can be used as epimutations() function input. The data should contain information about values of CpG sites, phenotype and feature data.

# 3 Datasets

The package contains 3 example datasets adapted from Gene Expression Omnibus (GEO):

We also included a dataset specifying the 40,408 candidate regions in Illumina 450K array which could be epimutations (see section 3.2).

We created the epimutacionsData package in ExperimentHub. It contains the reference panel, methy and the candidate epimutations datasets. The package includes the IDAT files as external data. To access the datasets we need to install the packages by running the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("ExperimentHub")

Then, we need to load the package and create an ExperimentHub object:

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, c("epimutacionsData"))
ExperimentHub with 3 records
# snapshotDate(): 2022-04-26
# $dataprovider: GEO, Illumina 450k array #$species: Homo sapiens
# $rdataclass: RGChannelSet, GenomicRatioSet, GRanges # additional mcols(): taxonomyid, genome, description, # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, # rdatapath, sourceurl, sourcetype # retrieve records with, e.g., 'object[["EH6690"]]' title EH6690 | Control and case samples EH6691 | Reference panel EH6692 | Candidate epimutations  ## 3.1 IDAT files and reference panel IDAT files directory in epimutacionsData package: baseDir <- system.file("extdata", package = "epimutacionsData") The reference panel and methy dataset can be found in EH6691 and EH6690 record of the eh object: reference_panel <- eh[["EH6691"]] methy <- eh[["EH6690"]] ## 3.2 Candidate regions In Illumina 450K array, probes are unequally distributed along the genome, limiting the number of regions that can fulfil the requirements to be considered an epimutation. Thus, we have computed a dataset containing all the regions that could be candidates to become an epimutation. We used the clustering approach from bumphunter to define the candidate epimutations. We defined a primary dataset with all the CpGs from the Illumina 450K array. Then, we run bumphunter and selected those regions with at least 3 CpGs and a maximum distance of 1kb between them. As a result, we found 40,408 candidate epimutations. The function epimutation() filters the identified epimutations using these candidate regions. The following is the code used to identify the candidate epimutations in Illumina 450K array: library(minfi) # Regions 450K library(IlluminaHumanMethylation450kanno.ilmn12.hg19) data(Locations) ### Select CpGs (names starting by cg) in autosomic chromosomes locs.450 <- subset(Locations, grepl("^cg", rownames(Locations)) & chr %in% paste0("chr", 1:22)) locs.450GR <- makeGRangesFromDataFrame(locs.450, start.field = "pos", end.field = "pos", strand = "*") locs.450GR <- sort(locs.450GR) mat <- matrix(0, nrow = length(locs.450GR), ncol = 2, dimnames = list(names(locs.450GR), c("A", "B"))) ## Set sample B to all 1 mat[, 2] <- 1 ## Define model matrix pheno <- data.frame(var = c(0, 1)) model <- model.matrix(~ var, pheno) ## Run bumphunter bumps <- bumphunter(mat, design = model, pos = start(locs.450GR), chr = as.character(seqnames(locs.450GR)), cutoff = 0.05)$table
bumps.fil <- subset(bumps, L >= 3)

The candidate regions can be found in EH6692 record of the eh object:

candRegsGR <- eh[["EH6692"]]

# 4 Preprocessing

The epi_preprocess() function allows calling the 6 preprocessing methods from minfi package:

Method Function Description
raw preprocessRaw Converts the Red/Green channel for an Illumina methylation array into methylation signal, without using any normalization
illumina preprocessIllumina Implements preprocessing for Illumina methylation microarrays as used in Genome Studio
swan preprocessSWAN Subset-quantile Within Array Normalisation (SWAN). It allows Infinium I and II type probes on a single array to be normalized together
quantile preprocessQuantile Implements stratified quantile normalization preprocessing for Illumina methylation microarrays
noob preprocessNoob Noob (normal-exponential out-of-band) is a background correction method with dye-bias normalization for Illumina Infinium methylation arrays
funnorm preprocessFunnorm Functional normalization (FunNorm) is a between-array normalization method for the Illumina Infinium HumanMethylation450 platform

Each normalization approach has some unique parameters which can be modified through norm_parameters() function:

parameters Description
illumina
bg.correct Performs background correction
normalize Performs controls normalization
reference The reference array for control normalization
quantile
fixOutliers Low outlier Meth and Unmeth signals will be fixed
quantileNormalize Performs quantile normalization
stratified Performs quantile normalization within region strata
mergeManifest Merged to the output the information in the associated manifest package
sex Sex of the samples
noob
offset Offset for the normexp background correct
dyeCorr Performs dye normalization
dyeMethod Dye bias correction to be done
funnorm
nPCs The number of principal components from the control probes
sex Sex of the samples
bgCorr Performs NOOB background correction before functional normalization
dyeCorr Performs dye normalization
keepCN Keeps copy number estimates

We can obtain the default settings for each method by invoking the function norm_parameters() with no arguments:

norm_parameters()
$illumina$illumina$bg.correct [1] TRUE$illumina$normalize [1] "controls" "no"$illumina$reference [1] 1$quantile
$quantile$fixOutliers
[1] TRUE

$quantile$removeBadSamples
[1] FALSE

$quantile$badSampleCutoff
[1] 10.5

$quantile$quantileNormalize
[1] TRUE

$quantile$stratified
[1] TRUE

$quantile$mergeManifest
[1] FALSE

$quantile$sex
NULL

$noob$noob$offset [1] 15$noob$dyeCorr [1] TRUE$noob$dyeMethod [1] "single" "reference"$funnorm
$funnorm$nPCs
[1] 2

$funnorm$sex
NULL

$funnorm$bgCorr
[1] TRUE

$funnorm$dyeCorr
[1] TRUE

$funnorm$keepCN
[1] FALSE

However, we can modify the parameter(s) as the following example for illumina approach:

parameters <- norm_parameters(illumina = list("bg.correct" = FALSE))
parameters$illumina$bg.correct
[1] FALSE

We are going to preprocess the IDAT files and reference panel (3). We need to specify the IDAT files directory and the reference panel in RGChannelSet format. As a result, we will obtain a GenomicRatioSet object containing the control and case samples:

GRset <- epi_preprocess(baseDir,
reference_panel,
pattern = "SampleSheet.csv")

# 5 Epimutations

## 5.1 Epimutations detection

The function epimutations() includes 6 methods for epimutation identification: (1) Multivariate Analysis of variance (manova), (2) Multivariate Linear Model (mlm), (3) isolation forest (iForest), (4) robust mahalanobis distance (mahdist) (5) quantile and (6) beta.

To illustrate the following examples we are going to use the dataset methy (section ??). We need to separate the case and control samples:

case_samples <- methy[,methy$status == "case"] control_samples <- methy[,methy$status == "control"]

We can specify the chromosome or region to analyze which helps to reduce the execution time:

epi_mvo <- epimutations(case_samples,
control_samples,
method = "manova")

The function epimutations_one_leave_out() compared individual methylation profiles of a single sample (regardless if they are cases or controls) against all other samples from the same cohort. To use this function we do not need to split the dataset. To ilustrate this example we are going to use the GRset dataset available in epimutacions package:

#manova (default method)
data(GRset)
epi_mva_one_leave_out<- epimutations_one_leave_out(GRset)

## 5.2 Unique parameters

The epi_parameters() function is useful to set the individual parameters for each outliers detection approach. The following table describes the arguments:

parameters Description
Manova, mlm
pvalue_cutoff The threshold p-value to select which CpG regions are outliers
iso.forest
outlier_score_cutoff The threshold to select which CpG regions are outliers
ntrees The number of binary trees to build for the model
mahdist.mcd
nsamp The number of subsets used for initial estimates in the MCD
quantile
window_sz The maximum distance between CpGs to be considered in the same DMR
offset_mean/offset_abs The upper and lower threshold to consider a CpG an outlier
beta
pvalue_cutoff The minimum p-value to consider a CpG an outlier
diff_threshold The minimum methylation difference between the CpG and the mean methylation to consider a position an outlier

epi_parameters() with no arguments, returns a list of the default settings for each method:

epi_parameters()
$manova$manova$pvalue_cutoff [1] 0.05$mlm
$mlm$pvalue_cutoff
[1] 0.05

$iForest$iForest$outlier_score_cutoff [1] 0.7$iForest$ntrees [1] 100$mahdist
$mahdist$nsamp
[1] "deterministic"

$quantile$quantile$window_sz [1] 1000$quantile$offset_abs [1] 0.15$quantile$qsup [1] 0.995$quantile$qinf [1] 0.005$beta
$beta$pvalue_cutoff
[1] 1e-06

$beta$diff_threshold
[1] 0.1

The set up of any parameter can be done as the following example for manova:

parameters <- epi_parameters(manova = list("pvalue_cutoff" = 0.01))
parameters$manova$pvalue_cutoff
[1] 0.01

## 5.3 Results description

The epimutations function returns a data frame (tibble) containing all the epimutations identified in the given case sample. If no epimutation is found, the function returns a row containing the case sample information and missing values for all other arguments. The following table describes each argument:

Column name Description
epi_id systematic name for each epimutation identified
sample The name of the sample containing that epimutation
chromosome start end The location of the epimutation
sz The window’s size of the event
cpg_ids The number of CpGs in the epimutation
cpg_n The names of CpGs in the epimutation
outlier_score For method manova it provides the approximation to F-test and the Pillai score, separated by /
For method mlm it provides the approximation to F-test and the R2 of the model, separated by /
For method iForest it provides the magnitude of the outlier score.
For method beta it provides the mean p-value of all GpGs in that DMR
For methods quantile and mahdist it is filled with NA.
pvalue For methods manova and mlm it provides the p-value obtained from the model.
For method quantile, iForest, beta and mahdist it is filled with NA.
outlier_direction Indicates the direction of the outlier with “hypomethylation” and “hypermethylation”.
For manova, mlm, iForest, and mahdist it is computed from the values obtained from bumphunter.
For beta is computed from the p value for each CpG using diff_threshold and pvalue_threshold arguments.
For quantile it is computed from the location of the sample in the reference distribution (left vs. right outlier).
adj_pvalue For methods manova and mlm it provides the adjusted p-value with Benjamini-Hochberg based on the total number of regions detected by Bumphunter.
For method quantile, iForest, mahdist and beta it is filled with NA.
epi_region_id Name of the epimutation region as defined in candRegsGR.
CRE cREs (cis-Regulatory Elements) as defined by ENCODE overlapping the epimutation region.
CRE_type Type of cREs (cis-Regulatory Elements) as defined by ENCODE.

As an example we are going to visualize the obtained results with MANOVA method (epi_mvo):

dim(epi_mvo)
[1] 51 16
class(epi_mvo)
[1] "tbl_df"     "tbl"        "data.frame"
head(as.data.frame(epi_mvo), 1)
        epi_id     sample chromosome    start      end  sz cpg_n
1 epi_manova_1 GSM2562699      chr19 12777736 12777903 167     7
cpg_ids
1 cg20791841,cg25267526,cg03641858,cg25441478,cg14132016,cg23954461,cg03143365
outlier_score outlier_direction       pvalue
1 302.558383959434/0.981008923520811  hypermethylation 3.441085e-33
1 2.236705e-31  0.2960902 chr19_12776725 EH38E1939817,EH38E1939818,EH38E1939819
CRE_type
1 pELS;pELS,CTCF-bound;PLS,CTCF-bound

## 5.4 Epimutations annotations

The annotate_epimutations() function enriches the identified epimutations. It includes information about GENECODE gene names, description of the regulatory feature provided by methylation consortium, the location of the CpG relative to the CpG island, OMIM accession and description number and Ensembl region id, coordinates, type and tissue:

rst_mvo <- annotate_epimutations(epi_mvo)
Column name Description
GencodeBasicV12_NAME Gene names from the basic GENECODE build
Regulatory_Feature_Group Description of the regulatory feature provided by the Methylation Consortium
Relation_to_Island The location of the CpG relative to the CpG island
OMIM_ACC
OMIM_DESC
OMIM accession and description number
ensembl_reg_id
ensembl_reg_coordinates
ensembl_reg_type
ensembl_reg_tissues
The Ensembl region id, coordinates, type and tissue

## 5.5 Epimutation visualization

The plot_epimutations() function locates the epimutations along the genome. It plots the methylation values of the case sample in red, the control samples in dashed black lines and population mean in blue:

plot_epimutations(as.data.frame(epi_mvo[1,]), methy)