dinoR-vignette

Michaela Schwaiger

01 May, 2024

Introduction

The objective of dinoR is to detect potential differences in nucleosome binding/accessibility patterns between two groups of NOMe-seq experiments. We recently developed a method, called guided NOMe-seq, which allows for obtaining high coverage NOMe-seq data from more than 1000 regions of interest (ROIs) in the genome at low cost, thereby allowing us to profile multiple samples at once (manuscript submitted). For example, as shown below, we generated NOMe-seq data targeting ROIs spanning CTCF or REST motifs from wild-type (WT) and ADNP knock-out mouse ES cells, in duplicates. To facilitate processing of these samples, with a focus on detecting differences in nucleosome positioning between WT and ADNP KO cells, we developed dinoR. dinoR focuses on comparing ROIs which are grouped together based on similar features (for example, which transcription factor binds the motif in the ROI), and statistical analysis of NOMe-seq pattern differences between samples. For quality control and plotting methylation across a single ROI, please refer to the Bioconductor package SingleMoleculeFootprinting.

Installation

dinoR is available on Bioconductor and can be installed via:

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("dinoR")

then we can load dinoR and other necessary packages:

suppressPackageStartupMessages({
    library(dinoR)
    library(ggplot2)
    library(dplyr)
})

Load the NOMe-seq data for ADNP Knock-Out and

WT mouse ES cells (two replicates each)

To obtain the example data shown below, we used biscuit to map 300bp paired-end reads to the genome, UMI-tools to remove duplicated UMIs, and the fetch-NOMe package to get the protection from GCH methylation calls for each read pair (fragment) overlapping a region of interest (ROI). We then use the R package NOMeConverteR to convert the resulting tibble into the RangedSummarizedExperiment object that is loaded below. This represents an efficient way of sharing NOMe-seq data.

Alternatively, mapping and obtaining the GCH methylation calls can be done using QuasR, or any other bisulfite aware mapping/quantification tool. The results then need to be converted into a RangedSummarizedExperiment. For an example NOMe-seq RangedSummarizedExperiment see the example data loaded below, or the data generated by using dinoR::createExampleData.

Importantly, the ROIs should all be the same length and centered around a transcription factor motif, with the strand of the motif indicated. That will ensure that the genomic positions around the motif are sorted according to motif strand, which will allow the user to observe potential asymmetries in protection from methylation relative to the TF motif.

data(NomeData)
NomeData
#> class: RangedSummarizedExperiment 
#> dim: 219 4 
#> metadata(0):
#> assays(5): nFragsFetched nFragsNonUnique nFragsBisFailed nFragsAnalyzed
#>   reads
#> rownames(219): Adnp_chr8_47978653_47979275
#>   Adnp_chr6_119394879_119395501 ... Rest_chr4_140283342_140283964
#>   Rest_chr7_64704080_64704702
#> rowData names(1): motif
#> colnames(4): AdnpKO_1 AdnpKO_2 WT_1 WT_2
#> colData names(2): samples group

The reads assay contains GPos objects with the GCH methylation data in two sparse logical matrices, one for protection from methylation, and one for methylation.

assays(NomeData)[["reads"]][1, 1]
#> [[1]]
#> UnstitchedGPos object with 623 positions and 2 metadata columns:
#>         seqnames       pos strand |            protection           methylation
#>            <Rle> <integer>  <Rle> |           <lgCMatrix>           <lgCMatrix>
#>     [1]     chr8  47978653      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>     [2]     chr8  47978654      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>     [3]     chr8  47978655      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>     [4]     chr8  47978656      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>     [5]     chr8  47978657      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>     ...      ...       ...    ... .                   ...                   ...
#>   [619]     chr8  47979271      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>   [620]     chr8  47979272      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>   [621]     chr8  47979273      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>   [622]     chr8  47979274      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>   [623]     chr8  47979275      + | FALSE:FALSE:FALSE:... FALSE:FALSE:FALSE:...
#>   -------
#>   seqinfo: 53 sequences from an unspecified genome; no seqlengths

Meta plots across ROIs with common TF motifs in the center

We generate metaplots, grouping our ROIs into those that have REST, CTCF, or ADNP bound to the motifs in their center. We use 2 samples from WT mouse ES cells, and two samples from ADNP KO mouse ES cells. We exclude any ROI - sample combinations which contain less than 10 reads (nr=10).

avePlotData <- metaPlots(NomeData = NomeData, nr = 10, ROIgroup = "motif")

# plot average plots
ggplot(avePlotData, aes(x = position, y = protection)) +
    geom_point(alpha = 0.5) +
    geom_line(aes(x = position, y = loess), col = "darkblue", lwd = 2) +
    theme_classic() +
    facet_grid(rows = vars(type), cols = vars(sample), scales = "free") +
    ylim(c(0, 100)) +
    geom_hline(
        yintercept = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
        alpha = 0.5, color = "grey", linetype = "dashed"
    )

We can already see that while the NOMe footprints around REST and CTCF bound motifs don’t change, there are clear differences between WT and ADNP KO cells around the ADNP bound motifs.

Determine fragment counts for five chromatin patterns:

TF, open, upNuc, downNuc, Nuc

To quantify the differences visible in above meta plots, we adopted and slightly modified the approach of Sönmezer et al., 2021. We classify each fragment according to five types of footprints: transcription factor bound (TF), open chromatin, and nucleosome (we distinguish also upstream positioned nucleosome (upNuc), downstream positioned nucleosome (downNuc), and all other nucleosome (Nuc) footprints). To do this we use three windows (-50:-25, -8:8, 25:50) around the motif center (which should correspond to the ROI center of the provided ROIs).

NOMe patterns

NomeData <- footprintCalc(NomeData)
NomeData
#> class: RangedSummarizedExperiment 
#> dim: 219 4 
#> metadata(0):
#> assays(6): nFragsFetched nFragsNonUnique ... reads footprints
#> rownames(219): Adnp_chr8_47978653_47979275
#>   Adnp_chr6_119394879_119395501 ... Rest_chr4_140283342_140283964
#>   Rest_chr7_64704080_64704702
#> rowData names(1): motif
#> colnames(4): AdnpKO_1 AdnpKO_2 WT_1 WT_2
#> colData names(2): samples group

This results in the addition of an assay called footprints, which assigns each fragment one of the five footprint classes described above.

assays(NomeData)[["footprints"]][[1, 1]][1:10]
#> f1:M04764:177:000000000-JPNT5:1:2119:13500:10508_TGAGCTAG_CATCGACTC 
#>                                                                "tf" 
#>  f1:M04764:177:000000000-JPNT5:1:2115:9271:12803_TGAGCTAG_AGGCACCTG 
#>                                                                "tf" 
#>  f1:M04764:177:000000000-JPNT5:1:2114:19141:6070_TGAGCTAA_ATCTACCCA 
#>                                                                "tf" 
#> f1:M04764:177:000000000-JPNT5:1:2106:11802:15963_TGAGCTAG_CCACTGTTA 
#>                                                                "tf" 
#>  f1:M04764:177:000000000-JPNT5:1:2106:9228:15151_TGAGCTAG_TCAATCAAG 
#>                                                                "tf" 
#> f1:M04764:177:000000000-JPNT5:1:2104:15018:10388_TGAGCTAG_CCAGACGCC 
#>                                                                "tf" 
#>  f1:M04764:177:000000000-JPNT5:1:2104:13159:6640_TGAGCTAG_TGATGGACA 
#>                                                           "downNuc" 
#>  f1:M04764:177:000000000-JPNT5:1:2113:21587:8114_TGAGCTAG_TCTATCAGC 
#>                                                                "tf" 
#>  f1:M04764:177:000000000-JPNT5:1:2110:18921:3356_TGAGCTAG_TTAAAGTCT 
#>                                                                "tf" 
#>  f1:M04764:177:000000000-JPNT5:1:2104:27164:6456_TGAGCTAG_TATTTCTAC 
#>                                                                "tf"

Then we count the number of fragments in each sample-ROI combination supporting each footprint class. This results in the addition of an assay for each footprint category, and one for the combined counts across all footprint categories (“all”).

NomeData <- footprintQuant(NomeData)
assays(NomeData)[7:12]
#> List of length 6
#> names(6): tf open upNuc Nuc downNuc all

Note that if a fragment does not have methylation protection data in all three windows needed for classification, the fragment will not be used.

Next we can test for differential abundance of footprints between ADNP KO and WT samples.

Calculate differential NOMe-seq footprint abundance between ADNP KO and WT

We use edgeR to check for differences in abundance between wild type and ADNP KO samples for each footprint type fragment count compared to the total fragment counts. Library sizes for TMM normalization are calculated on the total fragment counts.

res <- diNOMeTest(NomeData,
    WTsamples = c("WT_1", "WT_2"),
    KOsamples = c("AdnpKO_1", "AdnpKO_2")
)
res
#> # A tibble: 1,040 × 10
#>     logFC logCPM     F  PValue    FDR contrasts ROI   motif logadjPval regulated
#>     <dbl>  <dbl> <dbl>   <dbl>  <dbl> <chr>     <chr> <chr>      <dbl> <chr>    
#>  1  1.69   10.9  15.8  2.10e-4 0.0436 open_vs_… Adnp… Adnp       1.36  up       
#>  2  1.53   10.4  10.2  2.35e-3 0.242  open_vs_… Adnp… Adnp       0.617 no       
#>  3  1.22    9.34  9.12 3.75e-3 0.242  open_vs_… Adnp… Adnp       0.617 no       
#>  4  1.05   10.6   8.73 4.65e-3 0.242  open_vs_… Adnp… Adnp       0.617 no       
#>  5 -1.46   10.5   7.71 7.44e-3 0.310  open_vs_… Ctcf… Adnp       0.509 no       
#>  6  1.13   10.2   7.28 9.13e-3 0.317  open_vs_… Adnp… Adnp       0.500 no       
#>  7  0.757  10.5   5.01 2.91e-2 0.666  open_vs_… Adnp… Adnp       0.177 no       
#>  8  1.00   10.8   4.87 3.16e-2 0.666  open_vs_… Adnp… Adnp       0.177 no       
#>  9 -1.00    9.62  4.54 3.76e-2 0.666  open_vs_… Rest… Adnp       0.177 no       
#> 10 -1.12   10.9   4.50 3.82e-2 0.666  open_vs_… Ctcf… Adnp       0.177 no       
#> # ℹ 1,030 more rows

We can then simply plot the number of regulated ROIs within each ROI type…

res %>%
    group_by(contrasts, motif, regulated) %>%
    summarize(n = n()) %>%
    ggplot(aes(x = motif, y = n, fill = regulated)) +
    geom_bar(stat = "identity") +
    facet_grid(~contrasts) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    scale_fill_manual(values = c("orange", "grey", "blue3"))
#> `summarise()` has grouped output by 'contrasts', 'motif'. You can override
#> using the `.groups` argument.

…or display the results in MA plots.

ggplot(res, aes(y = logFC, x = logCPM, col = regulated)) +
    geom_point() +
    facet_grid(~contrasts) +
    theme_bw() +
    scale_color_manual(values = c("orange", "grey", "blue3"))

Calculate the percentage of fragments in each footprint type and

plot a (clustered) heatmap comparing percentages in WT and ADNP KO

footprint_percentages <- footprintPerc(NomeData)
fpPercHeatmap(footprint_percentages)

Compare the footprint percentages and significance testing

results for ADNP KO and WT

compareFootprints(footprint_percentages, res,
    WTsamples = c("WT_1", "WT_2"),
    KOsamples = c("AdnpKO_1", "AdnpKO_2"), plotcols = 
        c("#f03b20", "#a8ddb5", "#bdbdbd")
)

We can see that in ADNP KO samples, transcription factor footprints significantly increase around ADNP motifs, while nucleosome footprints decrease.

Combining the nucleosome patterns

In case we are not interested in the upstream and downstream nucleosome patterns, but would rather keep all nucleosome pattern fragments within the nucleosome group, we can do that using the option combineNucCounts=TRUE.

res <- diNOMeTest(NomeData,
    WTsamples = c("WT_1", "WT_2"),
    KOsamples = c("AdnpKO_1", "AdnpKO_2"), combineNucCounts = TRUE
)
footprint_percentages <- footprintPerc(NomeData, combineNucCounts = TRUE)
compareFootprints(footprint_percentages, res,
    WTsamples = c("WT_1", "WT_2"),
    KOsamples = c("AdnpKO_1", "AdnpKO_2"), plotcols = 
        c("#f03b20", "#a8ddb5", "#bdbdbd")
)

Session information

sessionInfo()
#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4                 ggplot2_3.5.1              
#>  [3] dinoR_1.1.0                 SummarizedExperiment_1.35.0
#>  [5] Biobase_2.65.0              GenomicRanges_1.57.0       
#>  [7] GenomeInfoDb_1.41.0         IRanges_2.39.0             
#>  [9] S4Vectors_0.43.0            BiocGenerics_0.51.0        
#> [11] MatrixGenerics_1.17.0       matrixStats_1.3.0          
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        farver_2.1.1            fastmap_1.1.1          
#>  [4] digest_0.6.35           lifecycle_1.0.4         cluster_2.1.6          
#>  [7] Cairo_1.6-2             statmod_1.5.0           magrittr_2.0.3         
#> [10] compiler_4.4.0          rlang_1.1.3             sass_0.4.9             
#> [13] tools_4.4.0             utf8_1.2.4              yaml_2.3.8             
#> [16] knitr_1.46              S4Arrays_1.5.0          labeling_0.4.3         
#> [19] DelayedArray_0.31.0     RColorBrewer_1.1-3      abind_1.4-5            
#> [22] withr_3.0.0             purrr_1.0.2             grid_4.4.0             
#> [25] fansi_1.0.6             colorspace_2.1-0        edgeR_4.3.0            
#> [28] scales_1.3.0            iterators_1.0.14        cli_3.6.2              
#> [31] rmarkdown_2.26          crayon_1.5.2            generics_0.1.3         
#> [34] httr_1.4.7              rjson_0.2.21            cachem_1.0.8           
#> [37] stringr_1.5.1           zlibbioc_1.51.0         splines_4.4.0          
#> [40] parallel_4.4.0          XVector_0.45.0          vctrs_0.6.5            
#> [43] Matrix_1.7-0            jsonlite_1.8.8          GetoptLong_1.0.5       
#> [46] clue_0.3-65             magick_2.8.3            locfit_1.5-9.9         
#> [49] foreach_1.5.2           limma_3.61.0            jquerylib_0.1.4        
#> [52] tidyr_1.3.1             glue_1.7.0              codetools_0.2-20       
#> [55] cowplot_1.1.3           stringi_1.8.3           shape_1.4.6.1          
#> [58] gtable_0.3.5            UCSC.utils_1.1.0        ComplexHeatmap_2.21.0  
#> [61] munsell_0.5.1           tibble_3.2.1            pillar_1.9.0           
#> [64] htmltools_0.5.8.1       GenomeInfoDbData_1.2.12 circlize_0.4.16        
#> [67] R6_2.5.1                doParallel_1.0.17       evaluate_0.23          
#> [70] lattice_0.22-6          highr_0.10              png_0.1-8              
#> [73] bslib_0.7.0             Rcpp_1.0.12             SparseArray_1.5.0      
#> [76] xfun_0.43               pkgconfig_2.0.3         GlobalOptions_0.1.2