SingleMoleculeFootprinting

Introduction

SingleMoleculeFootprinting is an R package providing functions to analyze Single Molecule Footprinting (SMF) data. Following the workflow exemplified in this vignette, the user will be able to perform basic data analysis of SMF data with minimal coding effort.

Starting from an aligned bam file, we show how to

perform quality controls over sequencing libraries
extract methylation information at the single molecule level accounting for the two possible kind of SMF experiments (single enzyme or double enzyme)
classify single molecules based on their patterns of molecular occupancy
plot SMF information at a given genomic location

For installation, the user can use the following:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SingleMoleculeFootprinting")

For compatibility with our analysis tools, we recommend the user to perform genomic alignments using the qAlign function from QuasR as exemplified in the chunk below.

library(parallel)
library(QuasR)
library(BSgenome.Mmusculus.UCSC.mm10) 

cl <- makeCluster(40)
prj <- QuasR::qAlign(sampleFile = Qinput,
                     genome = "BSgenome.Mmusculus.UCSC.mm10",
                     aligner = "Rbowtie",
                     projectName = "prj", 
                     paired = "fr",
                     bisulfite = "undir", 
                     alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
                     alignmentsDir = "./", 
                     cacheDir = tempdir(),
                     clObj = cl)

Loading

library(BSgenome.Mmusculus.UCSC.mm10)
library(SingleMoleculeFootprinting)
library(SingleMoleculeFootprintingData)
library(parallel)

Define arguments

SingleMoleculeFootprinting inherits QuasR’s philosophy of working with pointer files. Briefly, a pointer is a tab-delimited file with two or three fields indicating the location of the input data files. For more details, please check the qAlign documentation. Here we show how our pointer file looks like.

N.b.: The execution of the example code in this vignette, as well as some functions of this software package, depend on accessory data that can be downloaded and cached through the data package SingleMoleculeFootprintingData. Under the hood, we download this data for the user and create the QuasR sampleSheet file for the example data at tempdir() under the name NRF1Pair_Qinput.txt. We just ask the user to make sure that their default cache directory has enough storage capacity (~1 Gb). The user can check this via ExperimentHub::getExperimentHubOption(arg = "CACHE") and eventually change this value via ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location").

Qinput = paste0(CacheDir, "/NRF1Pair_Qinput.txt")
suppressMessages(readr::read_delim(Qinput, delim = "\t"))
## # A tibble: 1 × 2
##   FileName                                            SampleName  
##   <chr>                                               <chr>       
## 1 /home/biocbuild/.cache/R/ExperimentHub/NRF1pair.bam NRF1pair_DE_

Library QCs

Before investing in a deep sequencing run for a SMF experiment, it is advisable to first perform a shallow sequencing run and to perform quality controls on the sequencing libraries.

QuasR QC report

It is always a good idea to examine canonical quality measures after aligning. We advice the user to employ the qQCreport function from QuasR.

Bait capture efficiency

If SMF was performed on an large genome (e.g Mouse) it is possible that bait capture was performed to focus the sequencing efforts: here we check how efficient that process was by essentially computing the ratio of genomic alignments inside bait regions over the total ones.

BaitRegions <- SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
## see ?SingleMoleculeFootprintingData and browseVignettes('SingleMoleculeFootprintingData') for documentation
## loading from cache
clObj = makeCluster(5)
BaitCaptureEfficiency <- BaitCapture(sampleSheet = Qinput, 
                                     genome = BSgenome.Mmusculus.UCSC.mm10, 
                                     baits = BaitRegions, clObj = clObj)
## all necessary alignment files found
## preparing to run on 5 nodes...done
## counting alignments...done
## preparing to run on 5 nodes...done
## counting alignments...done
stopCluster(clObj)
BaitCaptureEfficiency
## [1] 1

In this case the capture efficiency equals to 1 because the example data was purposefully subset for an interesting region which falls entirely within baits. Under normal circumstances, one expects this value to be lower than 1 for the mouse genome for example we observe values around 0.7.

Conversion rate precision

Here we ask how much of the observed Cytosine methylation falls in the expected contexts (CpG or GpC). Beware, this is a much slower computation than the above.

ConversionRateValue <- ConversionRate(sampleSheet = Qinput, 
                                          genome = BSgenome.Mmusculus.UCSC.mm10, 
                                          chr = 6)

For this sample, the observed conversion rate is 92.4%. This value happens to be slightly below the expected value of ~95%

Analysis of single site

Methylation extraction

Methylation values at the single molecule level can be extracted for the region of interest from aligned data using the CallContextMethylation function. Under the hood, the function performs the following operations:

Identification of the methylation status for the Cytosines in the genomic context relevant for the experiment (Single enzyme, double enzyme, etc.). The type of the experiment is inferred based on which user-provided substring is found in the SampleName field of the QuasR pointer file:

ExperimentType	substring	RelevanContext	Notes
Single enzyme SMF	_NO_	DGCHN & NWCGW	Methylation info is reported separately for each context
Double enzyme SMF	_DE_	GCH + HCG	Methylation information is aggregated across the contexts
No enzyme (endogenous mCpG only)	_SS_	CG	-

Filtering reads based on conversion rate
Collapsing strands to “-”
Filtering Cytosines for coverage

MySample <- suppressMessages(readr::read_delim(Qinput, delim = "\t")[[2]])
Region_of_interest <- GRanges(seqnames = "chr6", 
                              ranges = IRanges(start = 88106000, 
                                               end = 88106500), 
                              strand = "*")

Methylation <- CallContextMethylation(sampleSheet = Qinput, 
                                      sample = MySample, 
                                      genome = BSgenome.Mmusculus.UCSC.mm10, 
                                      range = Region_of_interest, 
                                      coverage = 20, 
                                      ConvRate.thr = 0.2)
## Setting QuasR project
## all necessary alignment files found
## Calling methylation at all Cytosines
## 0.9% of reads found with conversion rate above 0.2
## Subsetting Cytosines by permissive genomic context (NGCNN, NNCGN)
## Collapsing strands
## 0 reads found mapping to the + strand, collapsing to -
## 0 reads found mapping to the + strand, collapsing to -
## Filtering Cs for coverage
## Detected experiment type: DE
## Subsetting Cytosines by strict genomic context (GCH, HCG) based on the detected experiment type: DE
## Merging matrixes
Methylation[[1]]
## GRanges object with 38 ranges and 1 metadata column:
##        seqnames    ranges strand | NRF1pair_DE__M
##           <Rle> <IRanges>  <Rle> |      <numeric>
##    [1]     chr6  88106112      - |       0.741117
##    [2]     chr6  88106116      - |       0.589141
##    [3]     chr6  88106118      - |       0.838407
##    [4]     chr6  88106127      - |       0.543319
##    [5]     chr6  88106138      - |       0.400601
##    ...      ...       ...    ... .            ...
##   [34]     chr6  88106434      - |       0.554072
##   [35]     chr6  88106444      - |       0.607526
##   [36]     chr6  88106455      - |       0.679593
##   [37]     chr6  88106492      - |       0.720641
##   [38]     chr6  88106495      - |       0.809129
##   -------
##   seqinfo: 239 sequences (1 circular) from mm10 genome
Methylation[[2]][1:10, 1:10]
##                                               88106112 88106116 88106118
## M00758:819:000000000-CB7NB:1:1101:10081:9865         0        1        1
## M00758:819:000000000-CB7NB:1:1101:10119:12887        0        0        0
## M00758:819:000000000-CB7NB:1:1101:10172:5248         1        1        1
## M00758:819:000000000-CB7NB:1:1101:10214:24193        0        0        0
## M00758:819:000000000-CB7NB:1:1101:10219:24481        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10408:27375        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10428:10873        0        1        1
## M00758:819:000000000-CB7NB:1:1101:10428:22900        1        1        1
## M00758:819:000000000-CB7NB:1:1101:10453:11606        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10489:13730        1        1        1
##                                               88106127 88106138 88106153
## M00758:819:000000000-CB7NB:1:1101:10081:9865         1        0        1
## M00758:819:000000000-CB7NB:1:1101:10119:12887        0        1        1
## M00758:819:000000000-CB7NB:1:1101:10172:5248         1        0        1
## M00758:819:000000000-CB7NB:1:1101:10214:24193        0        0        0
## M00758:819:000000000-CB7NB:1:1101:10219:24481        0        0        0
## M00758:819:000000000-CB7NB:1:1101:10408:27375        0        0        0
## M00758:819:000000000-CB7NB:1:1101:10428:10873        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10428:22900        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10453:11606        0        1        0
## M00758:819:000000000-CB7NB:1:1101:10489:13730        1        0        1
##                                               88106155 88106158 88106164
## M00758:819:000000000-CB7NB:1:1101:10081:9865         1        1        1
## M00758:819:000000000-CB7NB:1:1101:10119:12887        1        1        1
## M00758:819:000000000-CB7NB:1:1101:10172:5248         1        1        1
## M00758:819:000000000-CB7NB:1:1101:10214:24193        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10219:24481        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10408:27375        1        0        1
## M00758:819:000000000-CB7NB:1:1101:10428:10873        1        1        1
## M00758:819:000000000-CB7NB:1:1101:10428:22900        1        1        1
## M00758:819:000000000-CB7NB:1:1101:10453:11606        1        1        1
## M00758:819:000000000-CB7NB:1:1101:10489:13730        1        1        1
##                                               88106172
## M00758:819:000000000-CB7NB:1:1101:10081:9865         1
## M00758:819:000000000-CB7NB:1:1101:10119:12887        1
## M00758:819:000000000-CB7NB:1:1101:10172:5248         1
## M00758:819:000000000-CB7NB:1:1101:10214:24193        1
## M00758:819:000000000-CB7NB:1:1101:10219:24481        1
## M00758:819:000000000-CB7NB:1:1101:10408:27375        1
## M00758:819:000000000-CB7NB:1:1101:10428:10873        1
## M00758:819:000000000-CB7NB:1:1101:10428:22900        1
## M00758:819:000000000-CB7NB:1:1101:10453:11606        1
## M00758:819:000000000-CB7NB:1:1101:10489:13730        1

The following chunk has the sole scope of downsampling reads to make the vignette lighter. The user should skip this.

n = 1000
readsSubset <- sample(seq(nrow(Methylation[[2]])), n)
Methylation[[2]] <- Methylation[[2]][readsSubset,]

Plotting single sites

Already at this stage it is possible to visually inspect results. To that end, we provide the function PlotAvgSMF to plot the average SMF signal, defined as 1 - average methylation, at a genomic locus of interest. Optionally, the user can pass a GRanges object of genomic coordinates for the transcription factor binding sites (TFBSs) of interest to include in the plot, we show an example of such an object.

TFBSs <- GenomicRanges::GRanges("chr6", 
                               IRanges(c(88106216, 88106253),
                                       c(88106226, 88106263)), 
                               strand = "-")
elementMetadata(TFBSs)$name <- c("NRF1", "NRF1")
names(TFBSs) <- c(paste0("TFBS_", c(4305215, 4305216)))
TFBSs
## GRanges object with 2 ranges and 1 metadata column:
##                seqnames            ranges strand |        name
##                   <Rle>         <IRanges>  <Rle> | <character>
##   TFBS_4305215     chr6 88106216-88106226      - |        NRF1
##   TFBS_4305216     chr6 88106253-88106263      - |        NRF1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

PlotAvgSMF(MethGR = Methylation[[1]], 
           range = Region_of_interest, 
           TFBSs = TFBSs)

Furthermore, the function PlotSM can be used to plot the single molecule SMF information at a given site.

PlotSM(MethSM = Methylation[[2]], 
       range = Region_of_interest)
## No sorting passed or specified, will plot unsorted reads

Optionally, hierarchical clustering can be performed by setting the parameter SortedReads = "HC". This can be useful to aggregate reads visually in order to spot PCR artifacts. N.b. reads will be downsampled to 500.

PlotSM(MethSM = Methylation[[2]], 
       range = Region_of_interest, 
       SortedReads = "HC")
## Perfoming hierarchical clustering on single molecules before plotting

Single Molecule Sorting

Ultimately though, we want to classify reads based on their patterns of molecular occupancy. To that end we provide the functions SortReadsBySingleTF and SortReadsByTFCluster to classify reads based either on the occupancy patterns of one or multiple transcription factors.

Under the hood, the classification is based on the definition of \(n+2\) bins (with \(n\) being the number of TFs). The \(n\) bins are each centered around one of the TFBSs of interest, while the 2 extra bins are located upstream and downstream of the two outmost TFBSs.

For SortReadsBySingleTF, the coordinates of the bins relative to the center of the TFBS are [-35;-25], [-15;+15], [+25,+35]. Instead, the function SortReadsByTFCluster draws a bin with coordinates [-7;+7] around the center of each TFBS, a bin with coordinates [-35;-25] relative to center of the most upstream TFBS and a bin with coordinates [+25,+35] relative to the center of the most downstream TFBS. The user can also employ custom coordinates by specifying them explicitly using the function SortReads.

For each read, the binary methylation status of the cytosines contained in each bin is averaged to either a 0 or a 1 such that each read is eventually represented as sequence of \(n+2\) binary digits, for a total of \(2^{(n+2)}\) possible sequences.

Here, we show a usage case for the SortReadsByTFCluster function, as we have already identified the double binding of NRF1 at the genomic site under scrutiny. Usage and exploration of the output is identical for the other function, except for the the format of the TFBSs argument which should consist of a GRanges object of length 1 for SortReadsBySingleTF and of length \(>\) 1 for SortReadsByTFCluster.

SortedReads_TFCluster <- SortReadsByTFCluster(MethSM = Methylation[[2]], 
                                             TFBSs = TFBSs)
## Collecting summarized methylation for bins
## TF cluster mode

## Number of retrieved states: 13
## States population:
## 0000 0001 0011 0101 0111 1000 1001 1010 1011 1100 1101 1110 1111 
##    3    7   11    3   13   28  306    6  168   29  131   18  275

N.b. non-populated states are not returned.

The output of each of these sorting functions can be passed directly as the SortedReads argument of the PlotSM function.

PlotSM(MethSM = Methylation[[2]], 
       range = Region_of_interest, 
       SortedReads = SortedReads_TFCluster)
## Sorting reads according to passed values before plotting
## Inferring sorting was performed by TF pair

N.b. despite sorting reads by a TF cluster is in principle possible with clusters of any size, as of now the PlotSM function can only deal with TF pairs.

In order to be quantitative about these observations, the user can employ the StateQuantificationPlot. The function outputs a bar plot annotated with the percentage of reads found in each state. The function takes, as argument, the output of either of the two sorting functions.

StateQuantificationPlot(SortedReads = SortedReads_TFCluster)
## Inferring sorting was performed by TF pair

Finally, we provide the wrapper function PlotSingleSiteSMF to plot at once the three kinds of information detailed above and to export results as a pdf.

PlotSingleSiteSMF(ContextMethylation = Methylation, 
                  sample = MySample, 
                  range = Region_of_interest, 
                  SortedReads = SortedReads_TFCluster, 
                  TFBSs = TFBSs, 
                  saveAs = NULL)
## Subsetting data by range (extended)
## Sorting reads according to passed values before plotting
## Inferring sorting was performed by TF pair

## Inferring sorting was performed by TF pair

Session information

This vignette was produced under:

## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SingleMoleculeFootprintingData_1.11.0 SingleMoleculeFootprinting_1.12.0    
##  [3] BSgenome.Mmusculus.UCSC.mm10_1.4.3    BSgenome_1.72.0                      
##  [5] rtracklayer_1.64.0                    BiocIO_1.14.0                        
##  [7] Biostrings_2.72.0                     XVector_0.44.0                       
##  [9] GenomicRanges_1.56.0                  GenomeInfoDb_1.40.0                  
## [11] IRanges_2.38.0                        S4Vectors_0.42.0                     
## [13] BiocGenerics_0.50.0                  
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.2                   bitops_1.0-7               
##   [3] QuasR_1.44.0                deldir_2.0-4               
##   [5] httr2_1.0.1                 biomaRt_2.60.0             
##   [7] rlang_1.1.3                 magrittr_2.0.3             
##   [9] Rbowtie_1.44.0              matrixStats_1.3.0          
##  [11] compiler_4.4.0              RSQLite_2.3.6              
##  [13] GenomicFeatures_1.56.0      png_0.1-8                  
##  [15] vctrs_0.6.5                 txdbmaker_1.0.0            
##  [17] stringr_1.5.1               pwalign_1.0.0              
##  [19] pkgconfig_2.0.3             crayon_1.5.2               
##  [21] fastmap_1.1.1               dbplyr_2.5.0               
##  [23] utf8_1.2.4                  Rsamtools_2.20.0           
##  [25] rmarkdown_2.26              tzdb_0.4.0                 
##  [27] UCSC.utils_1.0.0            purrr_1.0.2                
##  [29] bit_4.0.5                   xfun_0.43                  
##  [31] zlibbioc_1.50.0             cachem_1.0.8               
##  [33] jsonlite_1.8.8              progress_1.2.3             
##  [35] blob_1.2.4                  highr_0.10                 
##  [37] DelayedArray_0.30.0         BiocParallel_1.38.0        
##  [39] jpeg_0.1-10                 prettyunits_1.2.0          
##  [41] R6_2.5.1                    VariantAnnotation_1.50.0   
##  [43] bslib_0.7.0                 stringi_1.8.3              
##  [45] RColorBrewer_1.1-3          jquerylib_0.1.4            
##  [47] Rcpp_1.0.12                 SummarizedExperiment_1.34.0
##  [49] knitr_1.46                  readr_2.1.5                
##  [51] Matrix_1.7-0                tidyselect_1.2.1           
##  [53] abind_1.4-5                 yaml_2.3.8                 
##  [55] codetools_0.2-20            hwriter_1.3.2.1            
##  [57] curl_5.2.1                  lattice_0.22-6             
##  [59] tibble_3.2.1                plyr_1.8.9                 
##  [61] withr_3.0.0                 Biobase_2.64.0             
##  [63] ShortRead_1.62.0            KEGGREST_1.44.0            
##  [65] evaluate_0.23               archive_1.1.8              
##  [67] BiocFileCache_2.12.0        xml2_1.3.6                 
##  [69] ExperimentHub_2.12.0        BiocManager_1.30.22        
##  [71] pillar_1.9.0                filelock_1.0.3             
##  [73] MatrixGenerics_1.16.0       generics_0.1.3             
##  [75] vroom_1.6.5                 RCurl_1.98-1.14            
##  [77] BiocVersion_3.19.1          hms_1.1.3                  
##  [79] GenomicFiles_1.40.0         glue_1.7.0                 
##  [81] tools_4.4.0                 interp_1.1-6               
##  [83] AnnotationHub_3.12.0        data.table_1.15.4          
##  [85] GenomicAlignments_1.40.0    XML_3.99-0.16.1            
##  [87] grid_4.4.0                  latticeExtra_0.6-30        
##  [89] AnnotationDbi_1.66.0        GenomeInfoDbData_1.2.12    
##  [91] restfulr_0.0.15             cli_3.6.2                  
##  [93] rappdirs_0.3.3              fansi_1.0.6                
##  [95] S4Arrays_1.4.0              dplyr_1.1.4                
##  [97] sass_0.4.9                  digest_0.6.35              
##  [99] SparseArray_1.4.0           rjson_0.2.21               
## [101] memoise_2.0.1               htmltools_0.5.8.1          
## [103] lifecycle_1.0.4             httr_1.4.7                 
## [105] mime_0.12                   bit64_4.0.5