YY1 ChIA-PET motif analysis (step-by-step)

Jennifer Hammelman, Konstantin Krismer

2021-07-12

set.seed(17)
library(spatzie)

This vignette describes how to use spatzie to identify pairs of transcription factors whose sequence motifs (that describe their binding sites) are co-enriched in enhancers and promoters that interact with each other. ChIA-PET (Fullwood and Ruan 2009), HiChIP (Mumbach et al. 2016) or Hi-C(Lieberman-Aiden et al. 2009) are molecular biology assays commonly used to investigate long-range genomic interactions and the data they generate, once properly processed (BEDPE format), serves as input to spatzie co-enrichment analyses. In this case we used interactions data in BEDPE format based on a ChIA-PET assay targeting the transcription factor YY1.

Interactions data in BEDPE format is a tab-separated file, where each line describes one interaction between two anchors, i.e., two regions of the genome that are potentially far away from each other.

In order to find motif pairs co-enriched in enhancers and promoters that interact with each other, we first need to annotate all interaction anchors and discard interactions that are not between enhancers and promoters.

Genome assembly specific configuration

The interactions data was aligned to the mouse genome assembly mm9 (i.e., the coordinates in the BEDPE file are mm9 genome coordinates). The annotation package BSgenome.Mmusculus.UCSC.mm9 contains the genomic sequence, which we need in order to detect transcription factor motifs in the interaction anchor regions. TxDb.Mmusculus.UCSC.mm9.knownGene contains promoter annotations.

genome_id <- "BSgenome.Mmusculus.UCSC.mm9"
if (!(genome_id %in% rownames(utils::installed.packages()))) {
  BiocManager::install(genome_id, update = FALSE, ask = FALSE)
}
genome <- BSgenome::getBSgenome(genome_id)


txdb_id <- "TxDb.Mmusculus.UCSC.mm9.knownGene"
if (!(txdb_id %in% rownames(utils::installed.packages()))) {
  BiocManager::install(txdb_id, update = FALSE, ask = FALSE)
}
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene::TxDb.Mmusculus.UCSC.mm9.knownGene

ensembl_data_set <- "mmusculus_gene_ensembl"
gene_symbol <- "mgi_symbol"

Load data

For this vignette we load toy BEDPE example data from a ChIA-PET experiment in murine embryonic stem cells.

yy1_interactions_file <- system.file("extdata/yy1_interactions.bedpe.gz",
                                     package = "spatzie")
yy1_interactions <- GenomicInteractions::makeGenomicInteractionsFromFile(
  yy1_interactions_file,
  type = "bedpe",
  experiment_name = "yy1",
  description = "mESC yy1 chr1")
length(yy1_interactions)
## [1] 40000

Load mouse promoter information

Unless the interactions data was preprocessed in such a way to include only enhancers in one anchor and only promoters in the other (or any other meaningful grouping), it is imperative to filter out non-enhancer-promoter interactions. To do this, we annotate interaction anchors as either close to a promoter or promoter-distal (i.e., putative enhancer).

Here, promoter regions are loaded, including the surrounding 2500 base pairs.

promoter_ranges <- GenomicFeatures::promoters(txdb,
                                              upstream = 2500,
                                              downstream = 2500,
                                              columns = c("tx_name", "gene_id"))
# trims out-of-bound ranges located on non-circular sequences
promoter_ranges <- GenomicRanges::trim(promoter_ranges)
# remove duplicate promoters from transcript isoforms
promoter_ranges <- BiocGenerics::unique(promoter_ranges)

promoters_df <- as.data.frame(promoter_ranges)
promoters_df$gene_id <- as.character(promoters_df$gene_id)

Annotate interactions with promoters

We use annotateInteractions from the GenomicInteractions package to annotate the interactions with the promoters obtained in the previous step, thus splitting the interactions into three groups: (1) interactions between promoter regions and regions far away from promoters (i.e., putative enhancers), (2) interactions between two promoter regions, and (3) interactions between two promoter-distal regions.

annotation_features <- list(promoter = promoter_ranges)
GenomicInteractions::annotateInteractions(yy1_interactions, annotation_features)
GenomicInteractions::plotInteractionAnnotations(yy1_interactions)

Limit and sort interactions to enhancer:promoter

As we have seen above, the interactions data contains not only interactions between enhancers and promoters. Here all interactions that are not between enhancers and promoters are discarded and enhancer-promoter interactions are sorted in a way that the promoters are always in anchor 1 and the enhancers are always in anchor 2.

distal_promoter_idx <- GenomicInteractions::isInteractionType(
  yy1_interactions, "distal", "promoter")
yy1_pd <- yy1_interactions[distal_promoter_idx]
anchor1 <- GenomicInteractions::anchorOne(yy1_pd)
anchor2 <- GenomicInteractions::anchorTwo(yy1_pd)

promoter_left <- S4Vectors::elementMetadata(anchor1)[, "node.class"] == "promoter"
promoter_right <- S4Vectors::elementMetadata(anchor2)[, "node.class"] == "promoter"
promoter_ranges <- c(anchor1[promoter_left],
                     anchor2[promoter_right])
enhancer_ranges <- c(anchor2[promoter_left],
                     anchor1[promoter_right])
yy1_pd <- GenomicInteractions::GenomicInteractions(promoter_ranges,
                                                   enhancer_ranges)

Scan motifs

For the purpose of this vignette, we use a toy motif database. The HOCOMOCO motif database (Kulakovskiy et al. 2018) is commonly used, but any motif file compatible with TFBSTools::readJASPARMatrix() can be used.

spatzie::scan_motifs() takes the filtered interactions data and scans the anchors of each interaction with the motifs of the provided motif database, usually a set of transcription factor sequence motifs.

spatzie::filter_motifs() operates on the object returned by spatzie::scan_motifs() and removes motifs that are present in - in this case - fewer than 40% of the interactions.

motifs_file <- system.file("extdata/motifs_subset.txt.gz",
                           package = "spatzie")
motifs <- TFBSTools::readJASPARMatrix(motifs_file, matrixClass = "PFM")
yy1_pd_interaction <- scan_motifs(yy1_pd, motifs, genome)
yy1_pd_interaction <- filter_motifs(yy1_pd_interaction, 0.4)

Compute motif co-significance

spatzie::anchor_pair_enrich() operates on the object returned by spatzie::filter_motifs() and calculates the co-enrichment of all motif pairs. It supports three methods to determine co-enrichment, count-based correlation, score-based correlation, and match-based assocation. For details, see the help page (?spatzie::anchor_pair_enrich) or the spatzie paper (citation("spatzie")).

Correlation between motif counts

plot of chunk is_compute_cosignificance_count_save

Correlation between max motif scores

plot of chunk is_compute_cosignificance_score_save

YY1 binds enhancer and promoter sites providing scaffolding that forms enhancer-promoter interactions in mouse stem cells (Weintraub et al. 2017). As expected, spatzie identified a statistically significant co-occurrence of YY1 motifs indicating this dependency.

When interpreting spatzie results, keep in mind that motif databases such as HOCOMOCO often include groups of transcription factors with highly similar DNA-binding motifs (in this example YY1 and ZF.5), and the putative co-enrichment of one pair of transcription factor binding sites might be explained by another pair with highly similar motifs.

Please note that the motifs and the interactions data used in this vignette are dummy data used for demonstration purposes only.

Additional information

Most of the functionality of the spatzie package is also offered through the website at https://spatzie.mit.edu.

For a more detailed discussion on spatzie, please have a look at the paper:

spatzie: An R package for identifying significant transcription factor motif co-enrichment from enhancer-promoter interactions Jennifer Hammelman, Konstantin Krismer, and David K. Gifford bioRxiv 2021.05.25.445606; DOI: https://doi.org/10.1101/2021.05.25.445606

Session info

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.60.0                   rtracklayer_1.52.0               
##  [4] Biostrings_2.60.1                 XVector_0.32.0                    GenomicRanges_1.44.0             
##  [7] GenomeInfoDb_1.28.1               IRanges_2.26.0                    S4Vectors_0.30.0                 
## [10] BiocGenerics_0.38.0               spatzie_0.99.6                    usethis_2.0.1                    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-2                        rjson_0.2.20                            ellipsis_0.3.2                         
##   [4] biovizBase_1.40.0                       htmlTable_2.2.1                         base64enc_0.1-3                        
##   [7] fs_1.5.0                                dichromat_2.0-0                         rstudioapi_0.13                        
##  [10] farver_2.1.0                            bit64_4.0.5                             AnnotationDbi_1.54.1                   
##  [13] fansi_0.5.0                             xml2_1.3.2                              splines_4.1.0                          
##  [16] R.methodsS3_1.8.1                       cachem_1.0.5                            knitr_1.33                             
##  [19] Formula_1.2-4                           Rsamtools_2.8.0                         seqLogo_1.58.0                         
##  [22] annotate_1.70.0                         cluster_2.1.2                           GO.db_3.13.0                           
##  [25] dbplyr_2.1.1                            png_0.1-7                               R.oo_1.24.0                            
##  [28] pheatmap_1.0.12                         readr_1.4.0                             compiler_4.1.0                         
##  [31] httr_1.4.2                              backports_1.2.1                         lazyeval_0.2.2                         
##  [34] assertthat_0.2.1                        Matrix_1.3-4                            fastmap_1.1.0                          
##  [37] htmltools_0.5.1.1                       prettyunits_1.1.1                       tools_4.1.0                            
##  [40] igraph_1.2.6                            gtable_0.3.0                            glue_1.4.2                             
##  [43] TFMPvalue_0.0.8                         GenomeInfoDbData_1.2.6                  reshape2_1.4.4                         
##  [46] dplyr_1.0.7                             rappdirs_0.3.3                          Rcpp_1.0.7                             
##  [49] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2 Biobase_2.52.0                          vctrs_0.3.8                            
##  [52] xfun_0.24                               CNEr_1.28.0                             stringr_1.4.0                          
##  [55] lifecycle_1.0.0                         ensembldb_2.16.2                        restfulr_0.0.13                        
##  [58] poweRlaw_0.70.6                         gtools_3.9.2                            InteractionSet_1.20.0                  
##  [61] XML_3.99-0.6                            zlibbioc_1.38.0                         scales_1.1.1                           
##  [64] VariantAnnotation_1.38.0                ProtGenerics_1.24.0                     GenomicInteractions_1.26.0             
##  [67] hms_1.1.0                               MatrixGenerics_1.4.0                    SummarizedExperiment_1.22.0            
##  [70] AnnotationFilter_1.16.0                 RColorBrewer_1.1-2                      yaml_2.2.1                             
##  [73] curl_4.3.2                              memoise_2.0.0                           gridExtra_2.3                          
##  [76] ggplot2_3.3.5                           biomaRt_2.48.2                          rpart_4.1-15                           
##  [79] latticeExtra_0.6-29                     stringi_1.6.2                           RSQLite_2.2.7                          
##  [82] highr_0.9                               BiocIO_1.2.0                            checkmate_2.0.0                        
##  [85] GenomicFeatures_1.44.0                  caTools_1.18.2                          filelock_1.0.2                         
##  [88] BiocParallel_1.26.1                     rlang_0.4.11                            pkgconfig_2.0.3                        
##  [91] matrixStats_0.59.0                      bitops_1.0-7                            evaluate_0.14                          
##  [94] pracma_2.3.3                            lattice_0.20-44                         purrr_0.3.4                            
##  [97] labeling_0.4.2                          htmlwidgets_1.5.3                       GenomicAlignments_1.28.0               
## [100] bit_4.0.4                               tidyselect_1.1.1                        plyr_1.8.6                             
## [103] magrittr_2.0.1                          R6_2.5.0                                generics_0.1.0                         
## [106] Hmisc_4.5-0                             DelayedArray_0.18.0                     DBI_1.1.1                              
## [109] pillar_1.6.1                            foreign_0.8-81                          survival_3.2-11                        
## [112] KEGGREST_1.32.0                         RCurl_1.98-1.3                          nnet_7.3-16                            
## [115] tibble_3.1.2                            crayon_1.4.1                            utf8_1.2.1                             
## [118] BiocFileCache_2.0.0                     rmarkdown_2.9                           jpeg_0.1-8.1                           
## [121] progress_1.2.2                          TFBSTools_1.30.0                        grid_4.1.0                             
## [124] data.table_1.14.0                       blob_1.2.1                              digest_0.6.27                          
## [127] xtable_1.8-4                            R.utils_2.10.1                          munsell_0.5.0                          
## [130] DirichletMultinomial_1.34.0             motifmatchr_1.14.0                      Gviz_1.36.2

References

Fullwood, M. J., and Y. Ruan. 2009. “ChIP-based methods for the identification of long-range chromatin interactions.” J. Cell. Biochem. 107 (1): 30–39.

Kulakovskiy, Ivan V, Ilya E Vorontsov, Ivan S Yevshin, Ruslan N Sharipov, Alla D Fedorova, Eugene I Rumynskiy, Yulia A Medvedeva, et al. 2018. “HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis.” Nucleic Acids Research 46 (D1): D252–D259. https://doi.org/10.1093/nar/gkx1106.

Lieberman-Aiden, E., N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, et al. 2009. “Comprehensive mapping of long-range interactions reveals folding principles of the human genome.” Science 326 (5950): 289–93.

Mumbach, M. R., A. J. Rubin, R. A. Flynn, C. Dai, P. A. Khavari, W. J. Greenleaf, and H. Y. Chang. 2016. “HiChIP: efficient and sensitive analysis of protein-directed genome architecture.” Nat. Methods 13 (11): 919–22.

Weintraub, Abraham S, Charles H Li, Alicia V Zamudio, Alla A Sigova, Nancy M Hannett, Daniel S Day, Brian J Abraham, Malkiel A Cohen, Behnam Nabet, and Dennis L Buckley. 2017. “YY1 is a structural regulator of enhancer-promoter loops.” Cell 171 (7): 1573–88.