1 Workflow

1.1 How to cite epigraHMM

If you use epigraHMM in published research for consensus peak calling of epigenomic data, please cite:

Baldoni, PL, Rashid, NU, Ibrahim, JG. Improved detection of epigenomic marks with mixed‐effects hidden Markov models. Biometrics. 2019; 75: 1401–1413. https://doi.org/10.1111/biom.13083

If epigraHMM is used in published research for differential peak calling of epigenomic data, please cite:

Baldoni, PL, Rashid, NU, Ibrahim, JG. Efficient Detection and Classification of Epigenomic Changes Under Multiple Conditions. Biometrics (in press). https://doi.org/10.1111/biom.13477

1.2 How to get help for epigraHMM

Exported functions from epigraHMM are fully documented. Users looking for help for a particular epigraHMM function can use the standard R help, such as help(epigraHMM). Questions, bug reports, and suggestions can be sent directly to the authors through the Bioconductor support site https://support.bioconductor.org. Guidelines for posting on the support site can be found at http://www.bioconductor.org/help/support/posting-guide. Users should not request support via direct email to the authors.

1.4 Data input

epigraHMM takes as data input either a matrix of non-negative counts or binary alignment map (BAM) files. Regardless of the choice of input format, epigraHMM allows the user to input data from both epigenomic experiments (e.g. ChIP-seq, ATAC-seq) and controls (e.g. controls without immunoprecipitation). To input data in the count matrix format, users should use the function epigraHMMDataSetFromMatrix. Alternatively, users may use the function epigraHMMDataSetFromBam to input data in the BAM format.

Either way, the output will be an epigraHMMDataSet object that is used to store the input data, the model offsets, and the results from the peak calling algorithms. Specifically, epigraHMMDataSet is a RangedSummarizedExperiment from which one can access the information about genomic coordinates and samples with the functions rowRanges and colData. Counts from epigenomic experiments are stored in the epigraHMMDataSet’s assay ‘counts’.

1.4.1 Count matrix input

For input matrices, counts should be organized in a ‘features by samples’ format, i.e., rows represent genomic coordinates and columns represent samples. One can use the function epigraHMMDataSetFromMatrix to create an epigraHMMDataSet from matrices of read counts, which takes as input a matrix (or list of matrices) of non-negative integers (argument countData), a data.frame with the information about the samples (argument colData), and an optional GRanges object with the genomic coordinates associated with the matrix of counts (argument rowRanges).

If countData is a list of matrices, it must be a named list and contain (at least) a matrix named counts with the read counts from the epigenomic experiment of interest (ChIP-seq, ATAC-seq, etc.). Additional matrices in the list can have any desired name, such as gc or controls.

Users interested in accounting for input control samples when calling consensus peaks should include a matrix of read counts named controls in the list countData. If a controls matrix is present in the input data, epigraHMM will include the input control counts in the linear model.

The input colData must contain variables named condition and replicate. The variable condition refers to the experimental condition identifier (e.g. cell line name). The variable replicate refers to the replicate identification number (unique within each condition). Additional columns included in the colData input will be passed to the resulting epigraHMMDataSet object and can be accessed via the colData function.
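A minimal sketch of the matrix input described above, using simulated Poisson counts. The object names (countData, colData, object_matrix) and the Poisson means are illustrative assumptions. The call to epigraHMMDataSetFromMatrix is guarded so the block still runs when epigraHMM is not installed.

```r
set.seed(1)

# Simulated counts: 100,000 genomic windows (rows) by 4 samples (columns).
# The Poisson means are arbitrary and chosen only for illustration.
countData <- list(
  counts   = matrix(rpois(4e5, lambda = 10), ncol = 4),
  controls = matrix(rpois(4e5, lambda = 5), ncol = 4)
)

# Sample information: two conditions with two replicates each
colData <- data.frame(
  condition = c("A", "A", "B", "B"),
  replicate = c(1, 2, 1, 2)
)

if (requireNamespace("epigraHMM", quietly = TRUE)) {
  # rowRanges is optional and omitted here
  object_matrix <- epigraHMM::epigraHMMDataSetFromMatrix(
    countData = countData,
    colData = colData
  )
  print(object_matrix)
}
```

Because the list contains a controls entry, the resulting object is expected to carry a controls assay next to counts.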

## class: RangedSummarizedExperiment 
## dim: 100000 4 
## metadata(0):
## assays(3): counts offsets controls
## rownames: NULL
## rowData names(0):
## colnames(4): A.1 A.2 B.1 B.2
## colData names(2): condition replicate

1.4.2 Alignment file input

One can use the function epigraHMMDataSetFromBam to create an epigraHMMDataSet from a set of alignment files in BAM format (argument bamFiles). Additional inputs include a data.frame with the information about the samples (argument colData), the reference genome of interest (argument genome), the size of the genomic windows where read counts will be computed (argument windowSize), and optional logicals indicating whether to exclude genomic coordinates associated with either gap or blacklisted regions (arguments gapTrack and blackList; Amemiya, Kundaje, and Boyle (2019)). The arguments gapTrack and blackList can also be GRanges objects (see below for details).

The input argument bamFiles specifies the paths to the experimental files in BAM format. bamFiles can be either a character vector or a named list of character vectors with the paths to the BAM files. If bamFiles is a list, it must be a named list and contain (at least) a character vector named counts with the paths to the BAM files of the epigenomic experiment of interest (ChIP-seq, ATAC-seq, etc.). Additional character vectors in the list can have any desired name, such as controls.

Users interested in accounting for input control samples when calling consensus peaks should include a character vector named controls in the list bamFiles with the paths to the input control BAM files. If a controls entry is present, epigraHMM will include the input control counts in the linear model.

The alignment index ‘.bai’ files must be stored in the same directory as their respective BAM files and must be named after them with the additional ‘.bai’ suffix. When computing read counts, the fragment length is estimated using csaw cross-correlation analysis with default parameters after discarding any gap or blacklisted regions.
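The requirements on bamFiles can be checked before any counting starts. The helper below is a hypothetical pre-flight check written in base R (check_bam_input is not part of epigraHMM), and the file paths are placeholders. It only verifies the list structure and looks for each BAM file and its ‘.bai’ index.

```r
# Placeholder paths; replace them with real BAM files
bamFiles <- list(
  counts   = c("sampleA_rep1.bam", "sampleA_rep2.bam"),
  controls = c("inputA_rep1.bam", "inputA_rep2.bam")
)

# Hypothetical helper: validates the structure of bamFiles and reports
# whether each BAM file and its index ('<file>.bam.bai') can be found
check_bam_input <- function(bamFiles) {
  if (is.character(bamFiles)) {
    bamFiles <- list(counts = bamFiles)  # a plain vector holds the experiment files
  }
  stopifnot(is.list(bamFiles), !is.null(names(bamFiles)),
            "counts" %in% names(bamFiles))
  paths <- unlist(bamFiles, use.names = FALSE)
  data.frame(
    entry = rep(names(bamFiles), lengths(bamFiles)),
    bam = paths,
    bam_exists = file.exists(paths),
    bai_exists = file.exists(paste0(paths, ".bai"))
  )
}

status <- check_bam_input(bamFiles)
print(status)
```

Any row with bam_exists or bai_exists equal to FALSE points to a file that should be fixed before calling epigraHMMDataSetFromBam.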

The input colData must contain variables named condition and replicate. The variable condition refers to the experimental condition identifier (e.g. cell line name). The variable replicate refers to the replicate identification number (unique within each condition). Additional columns included in the colData input will be passed to the resulting epigraHMMDataSet object and can be accessed via the colData function.

The input genome can be either a single string with the name of the reference genome (e.g. ‘hg19’) or a GRanges object with the chromosome lengths of the reference genome. By default, the function epigraHMMDataSetFromBam calls GenomeInfoDb::Seqinfo to fetch the chromosome lengths of the specified genome. See ?GenomeInfoDb::fetchExtendedChromInfoFromUCSC for the list of UCSC genomes that are currently supported. The input windowSize must be an integer value specifying the size of genomic windows where read counts will be computed.

If the optional gapTrack = TRUE and the name of a reference genome is passed as input through genome (e.g. ‘hg19’), epigraHMMDataSetFromBam will discard any genomic coordinate overlapping regions specified by the respective UCSC gap table. If gapTrack is a GRanges object, the function will discard any genomic coordinate overlapping regions of gapTrack.

If the optional blackList = TRUE and the name of a reference genome is passed as input through genome (e.g. ‘hg19’), epigraHMMDataSetFromBam will fetch the curated ENCODE blacklist tracks from the Bioconductor package GreyListChIP. Currently available genomes are ce10, dm3, hg19, hg38, and mm10. If blackList is a GRanges object, the function will discard any genomic coordinate overlapping regions from blackList.
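A hedged sketch of how the arguments described in this section fit together in one call. The BAM paths are placeholders and the window size is an arbitrary choice, so the call is skipped unless epigraHMM is installed and the files exist. The genome ‘rn4’ is used with gapTrack = TRUE and blackList = FALSE because rn4 is not among the genomes listed for the blacklist.

```r
# Placeholder paths; each BAM file needs a matching '.bai' index next to it
bamFiles <- c("path/to/SHR_rep1.bam", "path/to/SHR_rep2.bam")

# One condition (SHR) with two replicates
colData <- data.frame(condition = c("SHR", "SHR"), replicate = c(1, 2))

can_run <- requireNamespace("epigraHMM", quietly = TRUE) &&
  all(file.exists(bamFiles))

if (can_run) {
  object_bam <- epigraHMM::epigraHMMDataSetFromBam(
    bamFiles = bamFiles,
    colData = colData,
    genome = "rn4",
    windowSize = 1000,   # illustrative window size in base pairs
    gapTrack = TRUE,     # drop windows overlapping the UCSC gap table
    blackList = FALSE    # no curated blacklist is listed for rn4
  )
  print(object_bam)
} else {
  message("Skipping: epigraHMM is not installed or the BAM files were not found.")
}
```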

## class: RangedSummarizedExperiment 
## dim: 162657 2 
## metadata(0):
## assays(2): counts offsets
## rownames: NULL
## rowData names(0):
## colnames(2): SHR.1 SHR.2
## colData names(3): condition replicate fragLength

1.5 Data normalization

The function normalizeCounts implements a non-linear normalization via model offsets. It takes as input either an epigraHMMDataSet object or a matrix of non-negative read counts (argument object). Specifically, the normalization method is based on a loess smoothing fit comparing the difference (M from the MA plot) and average (A from the MA plot) of each sample (log-transformed counts + 1) with a reference sample created as the row-wise log-transformed geometric mean. The resulting loess smoothing fit is used as an offset term in the epigraHMM model. We strongly recommend that users apply normalizeCounts in their analyses, as epigenomic data sets are often subject to non-linear biases. That is, local differences in read count distribution between samples vary with the overall local read abundance (Lun and Smyth 2015).

The current implementation in epigraHMM uses the function loessFit from limma in a similar fashion to csaw::normOffsets. Users may pass further arguments to loessFit through the ellipsis (...). For instance, users might find it useful to try different proportions of the data to be used in the local regression window (argument span, a positive value between 0 and 1, with larger numbers giving smoother fits). We find that span=1 (the default in normalizeCounts) gives the best results for both broad and short epigenomic marks (Baldoni, Rashid, and Ibrahim 2021).
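The idea behind the offsets can be sketched in base R. This is an approximation under stated assumptions and not the package code: stats::loess stands in for limma::loessFit, the counts are simulated, and all object names are illustrative. Each sample is compared with the reference on the M and A scales, and the fitted trend becomes its offset.

```r
set.seed(1)
n <- 2000

# Simulated counts for two samples. Sample S2 is sequenced more deeply and
# its excess over S1 grows with abundance (a non-linear bias).
mu <- rgamma(n, shape = 1, scale = 5)
counts <- cbind(S1 = rpois(n, mu), S2 = rpois(n, 1.5 * mu^1.1))

log_counts <- log(counts + 1)

# Reference sample: row-wise mean of the log counts, i.e. the log of the
# geometric mean of (counts + 1)
log_ref <- rowMeans(log_counts)

# For each sample, regress M (difference) on A (average) and keep the trend
offsets <- sapply(colnames(counts), function(s) {
  M <- log_counts[, s] - log_ref
  A <- (log_counts[, s] + log_ref) / 2
  fit <- loess(M ~ A, span = 1, degree = 1)  # stand-in for limma::loessFit
  fitted(fit)
})

# Normalized counts: raw counts divided by the exponentiated offsets
norm_counts <- counts / exp(offsets)

colMeans(counts)
colMeans(norm_counts)
```

Dividing by exp(offsets) mirrors how an offset enters a log-link count model: a positive offset marks windows where a sample is expected to show more reads than the reference for technical reasons.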

##      SHR.1             SHR.2        
##  Min.   :  0.000   Min.   :  0.000  
##  1st Qu.:  0.000   1st Qu.:  0.000  
##  Median :  0.000   Median :  1.000  
##  Mean   :  2.056   Mean   :  4.751  
##  3rd Qu.:  1.000   3rd Qu.:  2.000  
##  Max.   :208.000   Max.   :247.000
##   SHR.1   SHR.2 
## 3.34471 7.72821
##      SHR.1             SHR.2         
##  Min.   :  0.000   Min.   :  0.0000  
##  1st Qu.:  0.000   1st Qu.:  0.0000  
##  Median :  0.000   Median :  0.8041  
##  Mean   :  2.899   Mean   :  3.1285  
##  3rd Qu.:  1.289   3rd Qu.:  1.4816  
##  Max.   :276.867   Max.   :173.1898
##    SHR.1    SHR.2 
## 4.715501 5.088707

1.5.1 Other types of normalization (e.g. GC-content and control assays)

epigraHMM allows users to input their own normalization offsets via addOffsets, which simply adds an input matrix of normalizing offsets to a given epigraHMMDataSet. For users interested in adjusting their analyses with both addOffsets and normalizeCounts, we recommend using normalizeCounts as the last normalization step, just prior to peak calling. This is because normalizeCounts will normalize counts while considering any offsets already present in the epigraHMMDataSet object (such as GC-content normalizing offsets, see below).

The function addOffsets can be useful for users that may want to adjust their analyses for GC-content bias, for example. GC-content normalizing offsets could be obtained from Bioconductor packages such as gcapc.

Note that, in the example below, gcapc::refineSites will fetch data from the Bioconductor package BSgenome.Rnorvegicus.UCSC.rn4, which must be installed locally. We strongly recommend that users interested in gcapc for GC-content normalization follow the analysis steps suggested in the gcapc authors’ vignette.

We note that addOffsets will add offsets to any existing offset assay contained in epigraHMMDataSet. That is, in the example above, the resulting offsets from the output of addOffsets(object_gcExample,offsets = gcnorm_offsets) will be equal to assay(object_gcExample,'offsets') + gcnorm_offsets.
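The additive behavior can be illustrated with small matrices. In this sketch the offsets are made up (they do not come from gcapc), and the guarded part assumes that a freshly created epigraHMMDataSet starts with an offsets assay of zeros.

```r
set.seed(1)
counts <- matrix(rpois(20, lambda = 10), ncol = 2)

# Offsets assumed to be stored already (e.g. from an earlier step)
existing_offsets <- matrix(0.1, nrow = 10, ncol = 2)

# Hypothetical new offsets (e.g. from a GC-content correction)
new_offsets <- matrix(rnorm(20, sd = 0.05), ncol = 2)

# What addOffsets is described to produce: the element-wise sum
expected_offsets <- existing_offsets + new_offsets

if (requireNamespace("epigraHMM", quietly = TRUE)) {
  colData <- data.frame(condition = c("A", "A"), replicate = c(1, 2))
  obj <- epigraHMM::epigraHMMDataSetFromMatrix(counts, colData)
  obj <- epigraHMM::addOffsets(object = obj, offsets = existing_offsets)
  obj <- epigraHMM::addOffsets(object = obj, offsets = new_offsets)
  stored <- SummarizedExperiment::assay(obj, "offsets")
  # Should print TRUE if the stored offsets are the sum of both inputs
  print(all.equal(unname(as.matrix(stored)), expected_offsets))
}
```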

Alternatively, epigraHMM may account for input control experiments when calling significant regions of enrichment of a given condition. To this end, epigraHMM directly models the effect of input control read counts in its HMM-embedded generalized linear model. Users interested in utilizing input control experiments in their analyses should pass either matrices of read counts (in the ‘controls’ entry of the input list countData) or the paths to the control BAM files (in the ‘controls’ entry of the input list bamFiles), as below. To speed up the computing time, we use a window size of 1000 base pairs.

## class: RangedSummarizedExperiment 
## dim: 38778 1 
## metadata(0):
## assays(3): counts offsets controls
## rownames: NULL
## rowData names(0):
## colnames(1): BN.1
## colData names(3): condition replicate fragLength

1.6 Peak calling

Peak calling, either consensus or differential, is performed in epigraHMM through the function epigraHMM. It takes as input an epigraHMMDataSet (argument object), a list of control parameters (argument control), the type of peak calling (argument type), and the distributional assumption for the data (argument dist).

The argument object passes the epigraHMMDataSet to the peak calling algorithms. If either control experiments or normalization offsets are included in the input object, they will be used in the analysis. Specifically, epigraHMM directly models controls as a covariate in the count mean model in consensus peak calling.

Users can specify the type of peak calling, either consensus or differential, via the argument type. If type='consensus', epigraHMM will detect enrichment regions in consensus across technical or biological replicates. It assumes that all available data stored in the epigraHMMDataSet is generated under the same experimental conditions (e.g. unique cell line) and the genome can be segmented into either consensus background or consensus enrichment regions. If type='differential', epigraHMM will detect differential enrichment regions across technical or biological replicates from multiple conditions (e.g. several cell lines or knockout versus wild-type). In this case, it will assume that the genome can be segmented into regions of either consensus background, differential, or consensus enrichment.

The argument dist specifies the probability distribution for the experimental counts. The distribution can be either a zero-inflated negative binomial (ZINB; dist='zinb') or a negative binomial (NB; dist='nb'). If dist='zinb', counts from the consensus background hidden Markov model (HMM) state will be modeled with a ZINB and counts from the consensus enrichment state with a NB. If dist='nb', both consensus enrichment and background will be modeled with a NB distribution. For specific details of the model, we refer users to our publications (Baldoni, Rashid, and Ibrahim 2019) and (Baldoni, Rashid, and Ibrahim 2021). We recommend that users specify dist='zinb' if consensus peak calling is of interest, as we found the ZINB to give better results in this setting. Minor differences between ZINB and NB models were observed in differential peak calling. No significant differences in computing time were observed between the ZINB and NB model specifications.
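To make the difference between the two choices concrete, the sketch below evaluates a standard ZINB probability mass function next to the NB one. The function dzinb and all parameter values are illustrative assumptions. The exact parameterization used by epigraHMM is given in the cited publications.

```r
# Zero-inflated negative binomial: with probability pi0 the count is a
# structural zero, otherwise it follows a negative binomial (mean mu,
# dispersion parameter size)
dzinb <- function(y, mu, size, pi0) {
  pi0 * (y == 0) + (1 - pi0) * dnbinom(y, mu = mu, size = size)
}

y <- 0:5

# Illustrative parameter values, not estimates from any data set
background_nb   <- dnbinom(y, mu = 1, size = 2)
background_zinb <- dzinb(y, mu = 1, size = 2, pi0 = 0.3)
enrichment_nb   <- dnbinom(y, mu = 15, size = 2)

round(rbind(background_nb, background_zinb, enrichment_nb), 3)
```

The ZINB puts extra mass on zero for the background state, which is the feature that distinguishes dist='zinb' from dist='nb'.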

The control argument should be a list of parameter specifications generated from the function controlEM. Possible tuning parameters from controlEM include the maximum number of EM algorithm iterations, the convergence criteria, the option to print log messages during the EM algorithm, etc. We recommend that users read the manual via ?controlEM for all parameter specifications. For any standard analysis, either consensus or differential peak calling, we recommend that users simply pass control=controlEM() to epigraHMM.
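The arguments above can be combined as in the following sketch, which fits a consensus model to simulated single-sample counts. Several points are assumptions: the data are simulated negative binomial counts, dist='nb' is used because the simulation contains no zero inflation, and some releases of the package appear to require an initializer step before fitting, so it is called only if the installed version exports it. The block does nothing beyond simulating data when epigraHMM is not installed.

```r
set.seed(1)

# Simulated counts for one sample: background, an enriched block, background
countData <- rbind(
  matrix(rnbinom(1e3, mu = 2, size = 10), ncol = 1),
  matrix(rnbinom(2e3, mu = 7.5, size = 5), ncol = 1),
  matrix(rnbinom(1e3, mu = 2, size = 10), ncol = 1)
)
colData <- data.frame(condition = "A", replicate = 1)

if (requireNamespace("epigraHMM", quietly = TRUE)) {
  ctrl <- epigraHMM::controlEM()  # default EM settings
  object <- epigraHMM::epigraHMMDataSetFromMatrix(countData, colData)

  # Assumption: run the initialization step only if this release exports it
  if ("initializer" %in% getNamespaceExports("epigraHMM")) {
    object <- epigraHMM::initializer(object, ctrl)
  }

  object <- epigraHMM::epigraHMM(
    object = object,
    control = ctrl,
    type = "consensus",
    dist = "nb"
  )
}
```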

1.6.2 Multi-sample, multiple-condition analysis

If one is interested in detecting differential peaks across multiple samples collected under different conditions, epigraHMM can be used with the option type = 'differential'. Note that experimental conditions are not required to have more than one technical or biological replicate. In principle, epigraHMM is able to call differential peaks under single-sample multi-condition designs. However, users are strongly encouraged to use as many technical or biological replicates per condition as possible. epigraHMM provides better sensitivity and false discovery rate (FDR) control when more replicates are used (see Web Figures 13-15 in Baldoni, Rashid, and Ibrahim (2021)). To speed up the computing time, we use a window size of 1000 base pairs.

1.6.3 Calling peaks

Consensus or differential peaks can be called with epigraHMM’s callPeaks function, which takes as input the epigraHMM object output (argument object). By default, callPeaks reports the most likely (consensus or differential) peak regions, which are determined with the Viterbi algorithm. Alternatively, users may specify an FDR threshold through the argument method. For example, method = 0.05 requests callPeaks to define peak regions while controlling the FDR at 0.05 at the window level. Neighboring significant windows that pass a given FDR threshold are merged to form consensus or differential regions of enrichment.
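The thresholding and merging steps can be sketched in base R. This is an illustration under stated assumptions and not the code used by callPeaks: the posterior probabilities are made up, and the FDR rule shown (keep the largest set of windows whose average value of one minus the posterior probability stays at or below the threshold) is one common posterior-probability approach that the source does not confirm as the package's rule.

```r
# Hypothetical posterior probabilities of enrichment for 10 adjacent windows
post <- c(0.01, 0.02, 0.97, 0.99, 0.95, 0.10, 0.05, 0.98, 0.96, 0.03)
window_size <- 1000L
win_start <- (seq_along(post) - 1L) * window_size + 1L
win_end <- win_start + window_size - 1L

# Window-level calls: sort windows by 1 - posterior and keep the largest
# set whose running mean does not exceed alpha
call_windows_fdr <- function(post, alpha) {
  lfdr <- 1 - post
  ord <- order(lfdr)
  running_fdr <- cumsum(lfdr[ord]) / seq_along(ord)
  n_keep <- sum(running_fdr <= alpha)  # running mean is non-decreasing
  called <- logical(length(post))
  called[ord[seq_len(n_keep)]] <- TRUE
  called
}

# Merge runs of neighboring significant windows into regions
merge_windows <- function(called, win_start, win_end) {
  runs <- rle(called)
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1L
  keep <- runs$values
  data.frame(start = win_start[run_start[keep]],
             end = win_end[run_end[keep]])
}

called <- call_windows_fdr(post, alpha = 0.05)
peaks <- merge_windows(called, win_start, win_end)
print(peaks)
```

With these inputs, windows 3 to 5 and windows 8 to 9 pass the threshold and are merged into two regions, 2001-5000 and 7001-9000.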

## GRanges object with 1886 ranges and 1 metadata column:
##          seqnames            ranges strand |        name
##             <Rle>         <IRanges>  <Rle> | <character>
##      [1]    chr12       42001-48000      * |       peak1
##      [2]    chr12       49001-52000      * |       peak2
##      [3]    chr12       57001-65000      * |       peak3
##      [4]    chr12       66001-96000      * |       peak4
##      [5]    chr12      97001-100000      * |       peak5
##      ...      ...               ...    ... .         ...
##   [1882]    chr12 46239001-46376000      * |    peak1882
##   [1883]    chr12 46467001-46490000      * |    peak1883
##   [1884]    chr12 46688001-46696000      * |    peak1884
##   [1885]    chr12 46723001-46755000      * |    peak1885
##   [1886]    chr12 46760001-46782294      * |    peak1886
##   -------
##   seqinfo: 45 sequences (1 circular) from rn4 genome
## GRanges object with 7 ranges and 1 metadata column:
##       seqnames            ranges strand |        name
##          <Rle>         <IRanges>  <Rle> | <character>
##   [1]    chr12   1756001-1774000      * |       peak1
##   [2]    chr12 19888001-19892000      * |       peak2
##   [3]    chr12 19899001-19909000      * |       peak3
##   [4]    chr12 19976001-19991000      * |       peak4
##   [5]    chr12 20041001-20050000      * |       peak5
##   [6]    chr12 20057001-20061000      * |       peak6
##   [7]    chr12 20571001-20576000      * |       peak7
##   -------
##   seqinfo: 45 sequences (1 circular) from rn4 genome

1.7 Data visualization

1.7.1 Plotting peak tracks

Peaks can be visualized with the epigraHMM function plotCounts, which accepts annotation tracks to be included in the output plot. Below, we show an example of how to visualize peak calls.

First, we will fetch the UCSC gene bodies from the rn4 genome. They will then be transformed into a GRanges object to be used by epigraHMM as an annotation track.
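The transformation step can be sketched as follows. The gene table here is made up and only mimics the column layout of a UCSC gene table (chrom, txStart, txEnd, strand, name2). Fetching the real table requires network access and is not shown. The conversion is guarded so the block runs without GenomicRanges installed.

```r
# Hypothetical gene bodies in a UCSC-style layout (0-based start coordinates)
genes <- data.frame(
  chrom   = c("chr12", "chr12"),
  txStart = c(19500000, 19800000),
  txEnd   = c(19550000, 19900000),
  strand  = c("+", "-"),
  name2   = c("GeneA", "GeneB")
)

if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  annotation <- GenomicRanges::makeGRangesFromDataFrame(
    genes,
    keep.extra.columns = TRUE,       # keep the gene names as metadata
    seqnames.field = "chrom",
    start.field = "txStart",
    end.field = "txEnd",
    strand.field = "strand",
    starts.in.df.are.0based = TRUE   # UCSC tables store 0-based starts
  )
  print(annotation)
}
```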

To visualize the consensus or differential peak calls, one needs to provide the output of epigraHMM (argument object), the genomic coordinates to be visualized (argument ranges), the set of peak calls (argument peaks, optional), and the annotation track (argument annotation, optional). The resulting plot will display the peak track, the annotation track, the normalized read counts, and the posterior probabilities associated with either consensus enrichment (for consensus peaks) or differential enrichment (for differential peaks). Read counts are normalized with the normalizing offsets contained in the epigraHMM output object.

Below, we have an example of differential peak calls between conditions BN and SHR for the histone modification mark H3K27me3.