0.1 Introduction

methodical includes a method for identifying genomic regions where DNA methylation is strongly correlated with the transcriptional activity of nearby transcription start sites. We refer to these regions as TSS-proximal methylation-controlled regulatory regions - TMRs for short. This vignette walks through the identification of TMRs for the TUBB6 gene in normal prostate samples. The process is easily scaled-up to search for TMRs associated with a large number of TSS simultaneously using parallel computing.

We start by loading the objects we need for this analysis: The location of the TUBB6 TSS (based on the ENST00000591909 isoform), a RangedSummarizedExperiment object housing methylation values for all CpGs within +/- 5,000 base pairs of the TSS and normalized counts for the TUBB6 transcript.

# Installing Methodical
if (!require("BiocManager"))
# Load required objects
data("tubb6_tss", package = "methodical")
data("tubb6_meth_rse", package = "methodical"); tubb6_meth_rse <- eval(tubb6_meth_rse)
tubb6_meth_rse <- eval(tubb6_meth_rse)
data("tubb6_transcript_counts", package = "methodical")

0.2 Calculation of correlation values between DNA methylation and transcription

The first step in the identification of TMRs is to calculate the correlation values between methylation of CpG sites close to a TSS and the expression of the associated transcript. We do this using the calculateMethSiteTranscriptCors function which takes a RangedSummarizedExperiment with DNA methylation data, a GRanges object with TSS of interest and a table with counts for the transcripts associated with each TSS. We can choose the maximum distance of CpGs we want to consider upstream and downstream of the TSS. Here we use 5,000 bp upstream and downstream.

calculateMethSiteTranscriptCors returns a list of tables with the correlation results for each TSS, which is just a single TSS in our case. Each table has an attribute tss_range which is a GRanges object with the location of the associated TSS.

# Calculate correlation values between methylation values and transcript values for TUBB6
cpg_meth_transcript_cors <- calculateMethSiteTranscriptCors(meth_rse = tubb6_meth_rse, 
  transcript_expression_table = tubb6_transcript_counts, tss_gr = tubb6_tss, 
  expand_upstream = 5000, expand_downstream = 5000, cor_method = "spearman")
#> There are 1 genes/transcripts in common between tss_gr and transcript_expression_table
#> Using spearman correlation method
#> Calculating correlations for chunk 1 of 1

# Since cpg_meth_transcript_cors is just a list of with 1 table, we'll extract this table 
# from the list 
tubb6_cpg_meth_transcript_cors <- cpg_meth_transcript_cors$ENST00000591909

# Take a look at the results
#>        meth_site transcript_name          cor      p_val distance_to_tss
#> 1 chr18:12303394 ENST00000591909  0.055177462 0.53943969           -4840
#> 2 chr18:12303408 ENST00000591909  0.208891300 0.01890552           -4826
#> 3 chr18:12303428 ENST00000591909  0.043386526 0.62953167           -4806
#> 4 chr18:12303472 ENST00000591909  0.050422445 0.57499756           -4762
#> 5 chr18:12303480 ENST00000591909  0.005252191 0.95345508           -4754
#> 6 chr18:12303484 ENST00000591909 -0.052151327 0.56194303           -4750

# Extract the location of the TSS from the results
#> GRanges object with 1 range and 6 metadata columns:
#>                   seqnames    ranges strand |            gene_id   gene_name
#>                      <Rle> <IRanges>  <Rle> |        <character> <character>
#>   ENST00000591909    chr18  12308234      + | ENSG00000176014.13       TUBB6
#>                       hgnc_id      gene_type   transcript_id transcript_type
#>                   <character>    <character>     <character>     <character>
#>   ENST00000591909  HGNC:20776 protein_coding ENST00000591909  protein_coding
#>   -------
#>   seqinfo: 25 sequences from an unspecified genome; no seqlengths

0.3 Plotting CpG Values

We’ll next create a plot of the correlation values using plotMethSiteCorCoefs so that we can visually inspect if there are any interesting patterns. On the x-axis we can show the chromosomal coordinates of the CpG sites or alternatively their distance to the TSS if we provide a GRanges object with the TSS location to the argument reference_region.

# Plot methylation-transcription correlation values showing chromosomal coordinates of CpG sites.
tubb6_correlation_plot_chrom_coords <- plotMethSiteCorCoefs(tubb6_cpg_meth_transcript_cors, 
  ylabel = "Spearman Correlation", value_colours = "set2") + 
geom_hline(yintercept = 0, linetype = "dashed")

# Plot methylation-transcription correlation values showing distance of CpG sites to TSS. 
tubb6_correlation_plot_tss_dist <- plotMethSiteCorCoefs(tubb6_cpg_meth_transcript_cors, 
  ylabel = "Spearman Correlation", value_colours = "set2", 
  reference_tss = attributes(tubb6_cpg_meth_transcript_cors)$tss_range) +
geom_hline(yintercept = 0, linetype = "dashed")

0.4 Identification of TMRs

We’ll now identify TMRs using the correlation values for CpG methylation and the TUBB6 transcript counts and the statistical significance of the correlations. This is achieved by using the correlation values and their associated p-values to calculate a Methodical score which is the the log10 of the p-value multiplied by the opposite if the sign of the correlation (i.e. -1 for a positive correlation and 1 for a negative correlation) Thus, negative correlations will have a negative Methodical score and positive correlations will have a positive Methodical score, with the significance of the correlations determining their magnitude.

We’ll show the Methodical scores for the TUBB6 CpG correlation values using the plotMethodicalScores function.

# Plot methodical scores for CpGs near TUBB6 TSS
tubb6_methodical_scores_plot <- plotMethodicalScores(
  meth_site_values = tubb6_cpg_meth_transcript_cors, 
  p_value_threshold = 0.005, smooth_scores = FALSE) +
geom_hline(yintercept = 0, linetype = "dashed")

We smooth the Methodical scores using an exponential moving average to avoid single CpGs having too much influence during the identification of TMRs. This smoothing involves taking the mean methodical score of a central CpG and an equal number of CpGs upstream and downstream of it, with weights that decay exponentially moving away from the central CpG.

We set the smooth_scores parameter to TRUE to add a curve going through the smoothed values to the plot. The arguments offset_length and smoothing_factor control the number of CpGs considered upstream and downstream of the central CpG and the exponential decay rate, respectively.

# Smooth Methodical scores
tubb6_smoothed_methodical_scores_plot <- plotMethodicalScores(
  meth_site_values = tubb6_cpg_meth_transcript_cors, p_value_threshold = 0.005, 
  smooth_scores = TRUE, smoothed_curve_colour = "hotpink2", offset_length = 10, 
  smoothing_factor = 0.75) +
geom_hline(yintercept = 0, linetype = "dashed")
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_line()`).

We use two significance thresholds to identify TMRs: one for TMRs where DNA methylation is negatively associated with transcription (negative TMRs) and another for TMRs where DNA methylation is positively associated with transcription (positive TMRs). These thresholds are defined with a p-value which is converted into negative and positive Methodical scores thresholds. We’ll use a p-value of 0.005 for the thresholds here which results in TMR thresholds of 2.30 and -2.30 for positive and negative TMRs respectively. We can then visualize if the smoothed Methodical scores breach these thresholds anywhere and identify TMRs.

# Add significance thresholds to plot
tubb6_smoothed_methodical_scores_plot <- plotMethodicalScores(meth_site_values = 
  tubb6_cpg_meth_transcript_cors, p_value_threshold = 0.005, smooth_scores = TRUE, 
  smoothed_curve_colour = "hotpink2") +
  geom_hline(yintercept = 0, linetype = "dashed")
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_line()`).

We can see two regions where the Methodical scores breach the thresholds: one region around 12,305,000 where the positive threshold is breached and another just before 12,307,500 where the negative threshold is breached. Thus, we should find one negative and one positive TMR for TUBB6.

The calculation of Methodical scores, their smoothing and identification of TMRs is all done using the findTMRs function which takes the correlation results from calculateMethSiteTranscriptCors as input. We can set the p-value threshold as well as the parameters for the smoothing, offset_length and smoothing_factor, which behave the same as they do in plotMethodicalScores.

We can also specify the minimum number of CpG sites a TMR contain using min_meth_sites and merge TMRs with the same direction within a minimum distance using min_gapwidth. There are no close TMRs with the same direction for TUBB6 so this argument has no effect, but it can prove useful since it would generally be desirable to have one large TMR rather than several shorter TMRs occurring in the same region since the intervening CpGs will likely also display strong correlations.

findTMRs returns a GRanges with metadata columns giving the TMR direction, the name of the TMRs, the number of CpG sites they overlap (meth_site_count), their distance to the TSS and the location of the TSS as a character which can be converted to a GRanges by calling GRanges on it.

# Identify TMRs for TUBB6
tubb6_tmrs <- findTMRs(correlation_df = tubb6_cpg_meth_transcript_cors, offset_length = 10, 
  smoothing_factor = 0.75, min_meth_sites = 5, min_gapwidth = 150)
#> GRanges object with 2 ranges and 11 metadata columns:
#>       seqnames            ranges strand | direction              tmr_name
#>          <Rle>         <IRanges>  <Rle> |  <factor>           <character>
#>   [1]    chr18 12304719-12305243      * |  Positive ENST00000591909_tmr_1
#>   [2]    chr18 12306668-12307631      * |  Negative ENST00000591909_tmr_2
#>       meth_site_count distance_to_tss     tss_location            gene_id
#>             <integer>       <numeric>      <character>        <character>
#>   [1]              13           -2991 chr18:12308234:+ ENSG00000176014.13
#>   [2]              38            -603 chr18:12308234:+ ENSG00000176014.13
#>         gene_name     hgnc_id      gene_type   transcript_id transcript_type
#>       <character> <character>    <character>     <character>     <character>
#>   [1]       TUBB6  HGNC:20776 protein_coding ENST00000591909  protein_coding
#>   [2]       TUBB6  HGNC:20776 protein_coding ENST00000591909  protein_coding
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can add these TMRs to either the plot of correlation values or Methodical scores using plotTMRs to visualize where they are located in relation to these values. We’ll plot the TMRs on the Methodical scores first and then the correlation values. As with plotMethSiteCorCoefs, we can add a reference_positive so that we can see where the TMRs are located relative to the TSS, provided the plot used the same reference_position. We can clearly see that the TMRs overlap regions where many CpG sites display strong correlation values between their methylation and expression of TUBB6.

# Show location of TMRs on Methodical scores plot
plotTMRs(meth_site_plot = tubb6_smoothed_methodical_scores_plot, tmrs_gr = tubb6_tmrs)
#> Warning: Removed 20 rows containing missing values or values outside the scale range
#> (`geom_line()`).

# Show location of TMRs on correlation value plot
plotTMRs(meth_site_plot = tubb6_correlation_plot_tss_dist, tmrs_gr = tubb6_tmrs,
  reference_tss = GRanges(tubb6_tmrs$tss_location[1]))

We can see that the negative TMR is centred around 1,000 base pairs upstream of the TSS and that the positive TMR is centred around 3,000 base pairs upstream.

0.5 SessionInfo

#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/ 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> other attached packages:
#>  [1] HDF5Array_1.33.0            DelayedArray_0.31.0        
#>  [3] SparseArray_1.5.0           S4Arrays_1.5.0             
#>  [5] abind_1.4-5                 Matrix_1.7-0               
#>  [7] rtracklayer_1.65.0          AnnotationHub_3.13.0       
#>  [9] BiocFileCache_2.13.0        dbplyr_2.5.0               
#> [11] rhdf5_2.49.0                TumourMethData_1.1.0       
#> [13] DESeq2_1.45.0               methodical_1.1.0           
#> [15] SummarizedExperiment_1.35.0 Biobase_2.65.0             
#> [17] MatrixGenerics_1.17.0       matrixStats_1.3.0          
#> [19] ggplot2_3.5.1               GenomicRanges_1.57.0       
#> [21] GenomeInfoDb_1.41.0         IRanges_2.39.0             
#> [23] S4Vectors_0.43.0            BiocGenerics_0.51.0        
#> [25] BiocStyle_2.33.0           
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.2                bitops_1.0-7             rlang_1.1.3             
#>  [4] magrittr_2.0.3           compiler_4.4.0           RSQLite_2.3.6           
#>  [7] png_0.1-8                vctrs_0.6.5              pkgconfig_2.0.3         
#> [10] crayon_1.5.2             fastmap_1.1.1            magick_2.8.3            
#> [13] XVector_0.45.0           labeling_0.4.3           utf8_1.2.4              
#> [16] Rsamtools_2.21.0         rmarkdown_2.26           UCSC.utils_1.1.0        
#> [19] tinytex_0.50             purrr_1.0.2              bit_4.0.5               
#> [22] xfun_0.43                zlibbioc_1.51.0          cachem_1.0.8            
#> [25] jsonlite_1.8.8           blob_1.2.4               rhdf5filters_1.17.0     
#> [28] Rhdf5lib_1.27.0          BiocParallel_1.39.0      parallel_4.4.0          
#> [31] R6_2.5.1                 bslib_0.7.0              jquerylib_0.1.4         
#> [34] RcppRoll_0.3.0           iterators_1.0.14         Rcpp_1.0.12             
#> [37] bookdown_0.39            knitr_1.46               R.utils_2.12.3          
#> [40] tidyselect_1.2.1         yaml_2.3.8               codetools_0.2-20        
#> [43] curl_5.2.1               lattice_0.22-6           tibble_3.2.1            
#> [46] withr_3.0.0              KEGGREST_1.45.0          evaluate_0.23           
#> [49] ExperimentHub_2.13.0     Biostrings_2.73.0        pillar_1.9.0            
#> [52] BiocManager_1.30.22      filelock_1.0.3           foreach_1.5.2           
#> [55] generics_0.1.3           RCurl_1.98-1.14          BiocVersion_3.20.0      
#> [58] munsell_0.5.1            scales_1.3.0             glue_1.7.0              
#> [61] tools_4.4.0              BiocIO_1.15.0            data.table_1.15.4       
#> [64] GenomicAlignments_1.41.0 locfit_1.5-9.9           XML_3.99-0.16.1         
#> [67] grid_4.4.0               AnnotationDbi_1.67.0     colorspace_2.1-0        
#> [70] GenomeInfoDbData_1.2.12  restfulr_0.0.15          cli_3.6.2               
#> [73] rappdirs_0.3.3           fansi_1.0.6              dplyr_1.1.4             
#> [76] gtable_0.3.5             R.methodsS3_1.8.2        sass_0.4.9              
#> [79] digest_0.6.35            farver_2.1.1             rjson_0.2.21            
#> [82] memoise_2.0.1            htmltools_0.5.8.1        R.oo_1.26.0             
#> [85] lifecycle_1.0.4          httr_1.4.7               mime_0.12               
#> [88] bit64_4.0.5