High-throughput chromosome conformation capture (Hi-C) technologies have revolutionized our understanding of 3D genome organization by mapping interactions between genomic loci. However, Hi-C data are inherently noisy and affected by experimental biases such as GC content, transposable elements, and DNA accessibility, which complicate the identification of biologically significant interactions.
The HiCPotts package provides a framework for Bayesian analysis of Hi-C interaction data using a Hidden Markov Random Field (HMRF) model. The package models interaction counts using a mixture of distributions (Poisson, Negative Binomial, Zero-Inflated Poisson, or Zero-Inflated Negative Binomial) while accounting for covariates: genomic distance, GC content, accessibility, and transposable element (TE) counts. The HMRF framework incorporates spatial dependencies via a Potts model, and the package employs Markov Chain Monte Carlo (MCMC) methods for parameter estimation.
An HMRF models spatial dependencies in a lattice (e.g., a matrix of Hi-C interactions) by assigning each site to one of several latent states (mixture components). The Potts model governs spatial interactions, encouraging neighboring sites to share the same state, controlled by an interaction parameter (\(\gamma\)).
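As a rough illustration of how the Potts term rewards neighbouring sites that share a state, the sketch below counts matching neighbour pairs on a small lattice and multiplies the count by \(\gamma\). The helper names `count_matching_pairs()` and `potts_log_prior()` are hypothetical and the 4-neighbour structure is an assumption; this is not the package's `Neighbours_combined()` implementation.

```r
# Hypothetical helper (not part of HiCPotts): count neighbouring pairs that
# share a state on a 4-neighbour lattice, counting each pair once.
count_matching_pairs <- function(z) {
  n_row <- nrow(z)
  n_col <- ncol(z)
  # Pairs of vertically adjacent sites
  vertical <- sum(z[-1, , drop = FALSE] == z[-n_row, , drop = FALSE])
  # Pairs of horizontally adjacent sites
  horizontal <- sum(z[, -1, drop = FALSE] == z[, -n_col, drop = FALSE])
  vertical + horizontal
}

# Unnormalised Potts log-prior: gamma times the number of matching pairs
potts_log_prior <- function(z, gamma) {
  gamma * count_matching_pairs(z)
}

# One state everywhere versus a checkerboard of two alternating states
z_smooth <- matrix(1L, nrow = 4, ncol = 4)
z_checker <- matrix((row(z_smooth) + col(z_smooth)) %% 2L + 1L, nrow = 4)

potts_log_prior(z_smooth, gamma = 0.5)   # 24 matching pairs * 0.5 = 12
potts_log_prior(z_checker, gamma = 0.5)  # no matching pairs, so 0
```

With a positive \(\gamma\), the smooth configuration receives the higher prior weight, which is the sense in which the Potts model encourages neighbouring sites to share a state.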
The package uses MCMC to estimate the model parameters by sampling from their posterior distributions: the regression coefficients (\(\beta\)), the zero-inflation parameter (\(\theta\)), the dispersion parameter for Negative Binomial distributions, and the Potts interaction parameter (\(\gamma\)). Covariates such as genomic distance and GC content adjust for biases in interaction counts. Each Hi-C interaction is modeled as belonging to one of three mixture components, each following the specified distribution.
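To make the roles of \(\theta\) and the dispersion parameter concrete, here is a minimal sketch of a Zero-Inflated Negative Binomial probability mass function in base R. The helper `dzinb()`, the log-linear mean, and the coefficient values are assumptions made for illustration; they are not taken from the package.

```r
# Hypothetical helper (not a HiCPotts function): ZINB probability mass function.
# theta = probability of a structural zero, mu = NB mean, size = NB dispersion.
dzinb <- function(y, mu, size, theta) {
  nb <- dnbinom(y, size = size, mu = mu)
  ifelse(y == 0, theta + (1 - theta) * nb, (1 - theta) * nb)
}

# Illustrative log-linear mean; the beta values are made up for this sketch
beta <- c(intercept = 2.0, log_distance = -0.8)
distance <- c(1, 5, 20)
mu <- exp(beta[["intercept"]] + beta[["log_distance"]] * log(distance))

# Probability of observing a zero count at each distance
dzinb(0, mu = mu, size = 2, theta = 0.3)
```

Setting `theta = 0` recovers the ordinary Negative Binomial, so zero inflation adds extra mass at zero on top of the overdispersion already captured by `size`.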
Performance-critical routines are implemented in C++ using Rcpp and RcppArmadillo. The package also supports parallel processing and flexible distribution choices, making it suitable for large-scale genomic analyses.
Most existing computational methods fail to adequately model the spatial dependencies and overdispersion in Hi-C contact matrices, limiting their ability to distinguish true signals from other components such as noise. The HiCPotts package addresses these challenges by providing a novel Bayesian framework to detect enriched interactions while accounting for experimental biases in Hi-C data. Its integration into Bioconductor is motivated by its robust statistical approach, computational efficiency, and ability to provide insights into bias sources, making it a valuable tool for researchers studying chromatin architecture in diverse biological contexts.
HiCPotts also extends the classification of interacting loci from two components to three (true signal, false signal, and noise), and its focus on bias correction (DNA accessibility, transposable elements) enhances its utility for integrative genomic studies.
Several Bioconductor packages address Hi-C data analysis, each with distinct functionality and scope. We highlight how HiCPotts differs from these existing packages:
diffHic: The diffHic package focuses on detecting differential interactions between biological conditions using the edgeR framework for statistical modeling. diffHic provides methods for read pair alignment, binning, filtering, and normalization of biases (e.g., trended or CNV-driven). While diffHic excels at differential analysis, it does not explicitly model spatial dependencies in Hi-C data or account for overdispersion as HiCPotts does. HiCPotts is better suited for identifying enriched interactions within a Hi-C experiment and exploring bias sources, whereas diffHic is ideal for comparative studies across conditions.
HiCcompare and multiHiCcompare: HiCcompare offers joint normalization and difference detection for multiple Hi-C datasets, operating on sparse chromatin interaction matrices. multiHiCcompare extends this to handle multiple groups and replicates using cyclic loess normalization and a general linear model (GLM) based on edgeR. Both packages emphasize comparative analysis and normalization but do not focus on detecting enriched interactions within a dataset or modeling spatial dependencies. HiCPotts’s Bayesian approach and bias correction make it complementary, as it prioritizes significant interaction detection and bias insight over differential analysis.
HiCDCPlus: HiCDCPlus enables significant interaction calling and differential analysis for Hi-C and HiChIP data using a negative binomial generalized linear model. It includes tools for topologically associating domain (TAD) and A/B compartment calling, integrating with visualization tools. Like HiCPotts, it calls significant interactions, but HiCPotts’s HMRF-based model explicitly accounts for spatial dependencies, and its Approximate Bayesian Computation (ABC) approach keeps chromosome-wide analysis computationally tractable. HiCDCPlus requires GC content information (computable internally), while HiCPotts additionally corrects for transposable elements and DNA accessibility biases.
scHiCcompare: Designed for single-cell Hi-C data, scHiCcompare supports imputation, normalization, and differential interaction analysis across single-cell datasets. Its focus on single-cell data makes it distinct from HiCPotts, which targets bulk Hi-C data. HiCPotts’s ability to model overdispersion and spatial dependencies is not replicated in scHiCcompare, which prioritizes single-cell-specific challenges.
Other Tools: Packages like HiCdat and HiCExperiment provide preprocessing, visualization, or data manipulation for Hi-C data but lack the statistical rigor of HiCPotts for interaction detection. HiCdat offers a graphical interface for preprocessing and integrative analysis with other omics data, while HiCExperiment provides data structures for 3C-related experiments. Neither focuses on enriched interaction detection or bias correction like HiCPotts.
HiCPotts allows researchers to identify significant intra-chromosomal interactions in Hi-C data while correcting for experimental biases. Its key features include:
Use of Negative Binomial and zero-inflated distributions to handle overdispersion and excess zeros.
An HMRF-based Bayesian framework with the Potts model for spatial dependency.
ABC for computationally efficient handling of the Potts model’s normalizing constant.
Bias correction for GC content, transposable elements, and DNA accessibility.
Unlike diffHic, HiCcompare, and multiHiCcompare, which focus on differential analysis, or HiCDCPlus, which balances interaction calling and differential analysis, HiCPotts prioritizes enriched interaction detection. Its bias correction and spatial modeling make it a powerful complement to existing tools, enhancing Bioconductor’s suite for 3D genome analysis. HiCPotts can be used for initial interaction detection, followed by diffHic or HiCcompare for differential studies, creating a comprehensive Hi-C analysis workflow.
HiCPotts depends on several CRAN and Bioconductor packages.
Install them as follows:
# Install CRAN dependencies (parallel ships with base R and needs no installation)
install.packages(c("Rcpp", "RcppArmadillo"))
# Install Bioconductor dependencies
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("rhdf5", "strawr", "rtracklayer", "GenomicRanges", "BSgenome", "Biostrings"))
# Install (if needed) and load the package:
# BiocManager::install("HiCPotts")
library(HiCPotts)
The HiCPotts workflow involves four main steps:
Data Loading: Use get_data() to read Hi-C contact matrices (.hic, .cool, .h5) and annotate bins with GC content, accessibility, and TE counts.
Data Processing: Use process_data() to convert the data into (N \(\times\) N) matrices of interactions and covariates.
MCMC Simulation: Use run_chain_betas() to run MCMC chains, estimating parameters and latent state assignments.
Probability Computation: Use compute_HMRFHiC_probabilities() to calculate posterior probabilities of component assignments.
We’ll demonstrate this workflow using synthetic Hi-C data for simplicity.
The get_data() function reads Hi-C contact matrices and annotates bins with covariates. For this example, we simulate a small dataset instead of using real Hi-C files, which require specific file formats and genome annotations.
Simulate a 10x10 Hi-C dataset
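A minimal sketch of such a simulated dataset in base R is given here. The column names (`bin_i`, `bin_j`, `distance`, `GC`, `TE`, `ACC`, `interactions`), the bin width, and the simulation settings are assumptions chosen for illustration; they may not match the columns that get_data() returns or that process_data() expects.

```r
set.seed(123)
n_bins <- 10
bin_size <- 1000  # hypothetical bin width in base pairs

# One row per pair of bins (i, j); column names are illustrative only
sim_data <- expand.grid(bin_i = seq_len(n_bins), bin_j = seq_len(n_bins))
n_pairs <- nrow(sim_data)

# Covariates: genomic distance plus randomly drawn bias sources
sim_data$distance <- abs(sim_data$bin_i - sim_data$bin_j) * bin_size
sim_data$GC <- runif(n_pairs, min = 0.3, max = 0.6)
sim_data$TE <- rpois(n_pairs, lambda = 2)
sim_data$ACC <- runif(n_pairs)

# Counts decay with distance; about 30% of pairs are forced to zero
expected <- 50 / (1 + abs(sim_data$bin_i - sim_data$bin_j))
structural_zero <- rbinom(n_pairs, size = 1, prob = 0.3)
sim_data$interactions <- ifelse(structural_zero == 1, 0L, rpois(n_pairs, expected))

head(sim_data)
```

The forced zeros mimic the sparsity of real contact matrices, which is the situation the zero-inflated distributions are meant for.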
In practice, the get_data() function is used to load real Hi-C data from .cool, .mcool, or .hic files, similar to the HiCExperiment package. Both convert these files into structured outputs, but the outputs differ. HiCExperiment generates a HiCExperiment object that includes a contact matrix, genomic regions, metadata (e.g., resolution, chromosome), and pairwise interactions. In contrast, get_data() produces a data frame from the same file types, optionally adding covariates (when available) that capture sources of experimental bias, such as GC content or DNA accessibility. The function also supports loading profiles from bigWig or bedGraph files, which are imported as GRanges objects.
The process_data() function converts the data frame generated by the get_data() function into a list of (N \(\times\) N) matrices for interactions (y) and covariates (x_vars), optionally scaling interaction counts. If a HiCExperiment object is already available and the sources of bias are known and organized in a separate data frame, the two can be combined into a single data frame using base R functions and then passed to process_data() to convert them into a list for analysis.
This produces matrices for distance, GC, TEs, ACC, and interaction counts, ready for MCMC.
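The reshaping idea can be sketched in base R as follows. The helper `to_square_matrix()` and the column names `bin_i` and `bin_j` are hypothetical; this illustrates the long-to-matrix conversion in general and is not the implementation of process_data().

```r
# Hypothetical helper (not the package's process_data()): place one column of a
# bin-pair data frame into an N x N matrix indexed by (bin_i, bin_j).
to_square_matrix <- function(df, value_col, n_bins) {
  m <- matrix(0, nrow = n_bins, ncol = n_bins)
  m[cbind(df$bin_i, df$bin_j)] <- df[[value_col]]
  m
}

# Toy long-format input: three bin pairs on a 3 x 3 grid
pairs <- data.frame(
  bin_i = c(1, 2, 3),
  bin_j = c(1, 3, 2),
  interactions = c(10, 4, 4),
  GC = c(0.40, 0.50, 0.50)
)

# Interaction counts and one covariate, each as a 3 x 3 matrix
y <- to_square_matrix(pairs, "interactions", n_bins = 3)
x_vars <- list(GC = to_square_matrix(pairs, "GC", n_bins = 3))
y
```

Bin pairs that are absent from the data frame stay at zero in this sketch; a real pipeline would need to decide whether missing pairs mean zero contacts or unobserved data.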
The run_chain_betas() function runs MCMC simulations to estimate the parameters (\(\beta\), \(\gamma\), \(\theta\) for zero-inflated distributions, and size for Negative Binomial distributions) and the latent state assignments (\(z\)). In the example below, we use the Zero-Inflated Negative Binomial (ZINB) distribution to model the simulated sparse Hi-C data.
The output includes chains for regression parameters (chains), the Potts interaction parameter (gamma), zero-inflation parameter (theta), and dispersion parameters (size).
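For readers unfamiliar with how such chains arise, the sketch below runs a random-walk Metropolis sampler for a single Poisson log-mean coefficient on toy data. It is a deliberately simplified stand-in, assuming a flat prior and one parameter, and it is not the sampler implemented in run_chain_betas() or run_metropolis_MCMC_betas().

```r
set.seed(1)
y <- rpois(200, lambda = exp(1))   # toy counts with true beta = 1

# Log-posterior of beta under a flat prior: Poisson log-likelihood only
log_post <- function(beta) sum(dpois(y, lambda = exp(beta), log = TRUE))

n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0                      # arbitrary starting value
for (t in 2:n_iter) {
  proposal <- chain[t - 1] + rnorm(1, sd = 0.1)   # symmetric random walk
  log_ratio <- log_post(proposal) - log_post(chain[t - 1])
  # Accept with probability min(1, ratio); otherwise keep the current value
  chain[t] <- if (log(runif(1)) < log_ratio) proposal else chain[t - 1]
}

burn_in <- 1000
mean(chain[-seq_len(burn_in)])     # posterior mean after discarding burn-in
```

Each stored value of `chain` is one draw from the posterior, so summaries such as the mean are taken over the draws that remain after the burn-in period.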
The compute_HMRFHiC_probabilities() function calculates posterior probabilities for each interaction belonging to one of three mixture components, using the MCMC chains.
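One common way to obtain such component probabilities is Bayes' rule applied to the per-component likelihoods, as in this sketch. The helper `component_probabilities()`, the Poisson components, and the fixed weights are assumptions for illustration; compute_HMRFHiC_probabilities() may compute its probabilities differently, for example by using the Potts neighbourhood and covariate-dependent means.

```r
# Hypothetical helper (not compute_HMRFHiC_probabilities()): posterior
# probability of each component for every count, via Bayes' rule.
component_probabilities <- function(y, means, weights) {
  # One column per component: prior weight times Poisson likelihood
  lik <- sapply(seq_along(means), function(k) weights[k] * dpois(y, means[k]))
  lik <- matrix(lik, nrow = length(y))  # keep matrix shape when length(y) == 1
  lik / rowSums(lik)                    # normalise each row to sum to one
}

y <- c(0, 3, 12, 40)
probs <- component_probabilities(
  y,
  means = c(1, 10, 35),        # made-up component means
  weights = c(0.6, 0.3, 0.1)   # made-up prior weights
)

round(probs, 3)
apply(probs, 1, which.max)     # most probable component for each count
```

Each row of `probs` sums to one, and the largest entry in a row gives a hard assignment when one is needed.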
User-specified priors can be supplied instead of the default data-driven priors.
For multiple Hi-C matrices, use mc_cores to parallelize the computation.
The package supports Poisson, NB, ZIP, and ZINB distributions, so a Poisson model, for example, can be fitted in place of the ZINB model used above.
Performance: The functions pz_123(), run_metropolis_MCMC_betas(), and Neighbours_combined() are implemented in C++ using Rcpp and RcppArmadillo for efficiency, which is crucial for large Hi-C matrices.
Utility Functions: The functions proposaldensity_combined(), likelihood_gamma(), gamma_prior_value(), posterior_combined(), prior_combined(), likelihood_combined(), proposalfunction(), and size_prior() support the MCMC process and are typically not called directly by users.
The HiCPotts package offers a powerful tool for Bayesian analysis of Hi-C data, integrating spatial dependencies, information from sources of bias associated with Hi-C data, and flexible mixture models. This page covered the core workflow, but the package’s functions can be customized for specific research needs, such as different genomic regions or distribution assumptions.
Users who want to include extra covariates beyond genomic distance, GC content, ACC score, and TEs can do so by classifying each new factor according to how it relates to those four (for example, as a distance-like term, a sequence-composition term, or an interaction-score term) and then supplying it to the same model framework. The package currently supports only intrachromosomal analyses; we plan to extend it to interchromosomal contacts in the future.
For further details, consult the package documentation (?HiCPotts) or contact the package maintainers. We hope HiCPotts facilitates your genomic research!
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] HiCPotts_0.99.5 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4
## [3] SparseArray_1.9.1 bitops_1.0-9
## [5] lattice_0.22-7 digest_0.6.37
## [7] evaluate_1.0.4 grid_4.5.1
## [9] bookdown_0.43 fastmap_1.2.0
## [11] jsonlite_2.0.0 Matrix_1.7-3
## [13] restfulr_0.0.16 BiocManager_1.30.26
## [15] httr_1.4.7 BSgenome_1.77.1
## [17] XML_3.99-0.18 Biostrings_2.77.2
## [19] codetools_0.2-20 jquerylib_0.1.4
## [21] abind_1.4-8 cli_3.6.5
## [23] rlang_1.1.6 crayon_1.5.3
## [25] XVector_0.49.0 Biobase_2.69.0
## [27] cachem_1.1.0 DelayedArray_0.35.2
## [29] yaml_2.3.10 S4Arrays_1.9.1
## [31] tools_4.5.1 parallel_4.5.1
## [33] BiocParallel_1.43.4 Rhdf5lib_1.31.0
## [35] Rsamtools_2.25.2 SummarizedExperiment_1.39.1
## [37] BiocGenerics_0.55.1 curl_6.4.0
## [39] R6_2.6.1 rhdf5_2.53.4
## [41] BiocIO_1.19.0 matrixStats_1.5.0
## [43] stats4_4.5.1 lifecycle_1.0.4
## [45] rtracklayer_1.69.1 Seqinfo_0.99.2
## [47] S4Vectors_0.47.0 IRanges_2.43.0
## [49] strawr_0.0.92 bslib_0.9.0
## [51] Rcpp_1.1.0 xfun_0.52
## [53] GenomicRanges_1.61.1 GenomicAlignments_1.45.2
## [55] rhdf5filters_1.21.0 MatrixGenerics_1.21.0
## [57] knitr_1.50 rjson_0.2.23
## [59] htmltools_0.5.8.1 rmarkdown_2.29
## [61] compiler_4.5.1 BSgenome.Dmelanogaster.UCSC.dm6_1.4.1
## [63] RCurl_1.98-1.17