Contents

1 Introduction

exomePeak2 provides bias-aware quantification and peak detection for Methylated RNA immunoprecipitation sequencing data (MeRIP-Seq). MeRIP-Seq is a commonly applied sequencing technology that can measure the location and abundance of RNA modification sites under given cell line conditions. However, quantification and peak calling in MeRIP-Seq are sensitive to PCR amplification biases, which generally present in next-generation sequencing (NGS) technologies. In addition, the count data generated by RNA-Seq exhibits significant biological variations between biological replicates. exomePeak2 collectively address the challenges by introducing a series of robust data science tools tailored for MeRIP-Seq. Using exomePeak2, users can perform peak calling, modification site quantification and differential analysis through a straightforward single-step function. Alternatively, multi-step functions can be used to generate diagnostic plots and perform customized analyses.

2 Installation

To install exomePeak2 from bioconductor, start R (version >“3.6”) and enter:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("exomePeak2")

For order versions of R, please refer to the appropriate [Bioconductor release](https://www.bioconductor.org/about/release-announcements/}.

3 Peak Calling

For the peak calling of the MeRIP-Seq experiment, exomePeak2 requires that the alignment result be stored in BAM format. User can specify the corresponding BAM directories of IP and input samples at the arguments bam_ip and bam_input respectively.

The following example demonstrates the peak calling from BAM and GFF files. Besides GFF file, transcript annotation can also be provided by the bioconductor TxDb object. The TxDb will be automatically downloaded if the corresponding UCSC genome name is filled in the genome argument.

The genome sequence is necessary for the correction of GC content bias. If both the genome and the bsgenome arguments are missing ( = NULL ), exomPeak2 will perform peak calling without correcting PCR amplification bias.

library(exomePeak2)

set.seed(1)

GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")

f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)

f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           gff_dir = GENE_ANNO_GTF,
           genome = "hg19",
           paired_end = FALSE)
## class: SummarizedExomePeak 
## dim: 31 7 
## metadata(0):
## assays(2): counts GCsizeFactors
## rownames(31): peak_11 peak_13 ... control_13 control_14
## rowData names(2): GC_content feature_length
## colnames(7): IP1.bam IP2.bam ... Input2.bam Input3.bam
## colData names(3): design_IP design_Treatment sizeFactor

exomePeak2 will export the significant peaks in the format of BED file and CSV table, and the output results will be automatically saved in a folder named exomePeak2_output.

If only IP and input samples are provided, the peak statistics are calculated from the \({\beta}_{i,1}\) terms in the following linear regression design:

\[log2(Q_{i,j}) = {\beta}_{i,0} + {\beta}_{i,1}1(\rho(j)=\text{IP}) + t_{i,j}\]

\(Q_{i,j}\) is the expected value of read abundance of peak \(i\) under sample \(j\). \({\beta}_{i,0}\) is the intercept coefficient, \({\beta}_{i,1}\) is the coefficient of IP/input log2 fold change, \(1(\rho(j)=\text{IP})\) is the regression covariate, which is the indicator variable of the sample \(j\) being IP sample. \(t_{i,j}\) is the peak specific offsets account for the sequencing depth variation and the GC content biases. \(t{i,j}\) is estimated by the median regression method implemented in the package cqn[1]; the median regression is smoothed using the natural cubic splines with the knots defined in function cqn::cqn(), and the threshold for read count for regression analysis is set at 50. By default, the quantile normalization in cqn is turned off, so the peak specific offsets are calculated only based on the sequencing depth and GC content biases.

In the default setting, the fitted linear models are the regularized GLM (Generalized Linear Model) of NB implemented in DESeq2[2]. If there is no biological variation in one of the IP and input groups, Poisson GLM will be fitted for peak detection. Both the size factors used in cqn and DESeq2 are estimated by the robust size factor estimator defined in the function DESeq2::estimateSizeFactors() with type = “ratio”.

For peak calling, 2 rounds of GLM will be fitted. The first round of GLM fitting will identify the significantly modified sliding windows; the detection criteria is Wald test p-values < 1e-05 under the alternative hypothesis of \({\beta}_{i,1} > 0\). The second round of GLM fitting will re-calculate the peak statistics from the merged sliding windows (peaks). For the two rounds of GLM fitting, the correction offsets \(t_{i,j}\) will be recalculated from the GC contents of the sliding windows and the peaks, respectively.

Annotations on the columns of the output table:

4 Differential Modification Analysis

The following example performs differential modification analysis (comparison of two biological conditions) on the exon regions defined by the transcript annotation.

In the differential modification mode, exomePeak2 will first perform Peak calling on the exon regions using both the control and treated samples. Next, it will conduct the differential modification analysis on the significant peaks using GLMs of interactive designs.

f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           bam_treated_input = TREATED_INPUT_BAM,
           bam_treated_ip = TREATED_IP_BAM,
           gff_dir = GENE_ANNO_GTF,
           genome = "hg19",
           paired_end = FALSE)
## Warning in glmDM(sep, LFC_shrinkage = LFC_shrinkage, glm_type = glm_type): At least one of the IP or input group has no replicates. Quantification method changed to Poisson GLM.
## class: SummarizedExomePeak 
## dim: 23 9 
## metadata(0):
## assays(2): counts GCsizeFactors
## rownames(23): peak_10 peak_11 ... control_5 control_6
## rowData names(2): GC_content feature_length
## colnames(9): IP1.bam IP2.bam ... treated_IP1.bam treated_Input1.bam
## colData names(3): design_IP design_Treatment sizeFactor

In the differential modification mode, exomePeak2 will export the differentially modified peaks in the format of BED file and CSV table, and the outputs will also be saved in a folder named exomePeak2_output.

The peak statistics calculated in the differential modification setting are based on the interactive coefficient \({\beta}_{i,3}\) in the following regression design:

\[log2(Q_{i,j}) = {\beta}_{i,0} + {\beta}_{i,1}1(\rho(j)=\text{IP}) + {\beta}_{i,2}1(\rho(j)=\text{Treatment}) + {\beta}_{i,3}1(\rho(j)=\text{IP&Treatment}) + t_{i,j}\]

Annotations on the additional columns of the output table for differential analysis:

By the default setup of exomePeak2(), the sequencing depth for the interactive GLM will be estimated from the background features, which are the disjoint regions of the detected peaks on exons.

5 Reference based Quantification and Differencial Analysis

exomePeak2 supports modification quantification and differential modification analysis based on single nucleotides modification annotation. Some data sets in epitranscriptomics have single-base resolution, e.x. Data generated by the m6A-CLIP-Seq or m6A-miCLIP-Seq technologies. Compared with the peaks called from MeRIP-Seq, sites identified by single-base resolution techniques often have higher precision.

By eliminating of the technical variation introduced by the peak lengths, the read counting at the single-base modification sites can provide more accurate and reliable quantification for the MeRIP-Seq experiments. exomePeak2 will automatically initiate the mode of single-base modification quantification/differential analysis when user provide a reference object in the argument mod_annot.

The GLM designs used in the reference based mode are identical with the quantification and differential analysis under the peak calling mode. The difference is that only one round of GLM fitting is performed in the reference mode by treating the the single-base modification site as the detected peaks.

The single-base annotation information should be provided to the exomePeak2 function in GRanges object, the GRanges object can be imported directly from the BED files of single-base annotation files using dplyr::import().

f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")

MOD_ANNO_GRANGE <- readRDS(f2)

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           gff_dir = GENE_ANNO_GTF,
           genome = "hg19",
           paired_end = FALSE,
           mod_annot = MOD_ANNO_GRANGE)
## class: SummarizedExomePeak 
## dim: 171 7 
## metadata(0):
## assays(2): '' GCsizeFactors
## rownames(171): peak_1 peak_2 ... control_83 control_84
## rowData names(2): GC_content feature_length
## colnames(7): IP1.bam IP2.bam ... Input2.bam Input3.bam
## colData names(3): design_IP design_Treatment sizeFactor

Under the reference based mode, exomePeak2 will export the analysis results in the format of BED and CSV files, while the order of the rows in the files will match the corresponding ranges of the single-base GRanges.

6 Multi-step Functions

The exomePeak2 package can achieve peak calling and peak statistics calculation with multiple functions.

1. Check the BAM files of MeRIP-seq experiments

The scanMeripBAM() is used to organize the BAM files used in a MeRIP-Seq analysis, it also specify the important BAM reading parameters such as the paired end library type and the FLAG filters.

MeRIP_Seq_Alignment <- scanMeripBAM(
                         bam_ip = IP_BAM,
                         bam_input = INPUT_BAM,
                         paired_end = FALSE
                        )

For MeRIP-seq experiment with interactive design (contain control and treatment groups), we can add the treated input and IP files with the arguments bam_treated_input and bam_treated_ip:

MeRIP_Seq_Alignment <- scanMeripBAM(
    bam_ip = IP_BAM,
    bam_input = INPUT_BAM,
    bam_treated_input = TREATED_INPUT_BAM,
    bam_treated_ip = TREATED_IP_BAM,
    paired_end = FALSE
  ) 

2. Perform peak calling on exons

After checking the MeRIP-seq BAMs, peak calling can be performed with exomePeakCalling(). The function will first generate sliding windows on exons of the annotation file, and it will then count the read ans it GLMs to detect the significantly modified peaks. If either genome or bsgenome arguments are provided, the GC content bias correction will be performed while peak calling.

SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
                                         gff_dir = GENE_ANNO_GTF,
                                         genome = "hg19") 

Alternatively, user can provide single-base RNA modification annotation at mod_annot, so that the quantification will be performed on the reference sites.

SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
                                         gff_dir = GENE_ANNO_GTF,
                                         genome = "hg19",
                                         mod_annot = MOD_ANNO_GRANGE) 

3. Estimate correction factors

Before performing modification level quantification or differential analysis on the peaks or sites, the normalization methods used will be controlled by the functions estimateSeqDepth() and normalizeGC(). User can combine multiple normalization approaches, such as adding the conditional quantile normalization introduced in [2].

SummarizedExomePeaks <- estimateSeqDepth(SummarizedExomePeaks)
SummarizedExomePeaks <- normalizeGC(SummarizedExomePeaks)

P.S. The GC content bias factors can still be estimated by normalizeGC() if no genome sequence information are provided in the previous steps, but genome information need to be provided at normalizeGC() via the bsgenome argument.

In addition, the sequencing depth and GC content offset can be calculated according to 3 types of feature ranges: “All” (default), “Background” and “Modification”. The impact of different estimation scopes are highly data dependent, and reasonable changes of these may significantly improve the biological outcomes of the downstream analysis. For example, in differential analysis, it is recommended to use estimateSeqDepth(SummarizedExomePeaks, from='Background') so that the direction of differential modification directions can be more consistent with the expectation of the perturbed protein factors.

4. Report the GLM statistics

Using the normalization factors calculated above, the GLM statistics of the IP/input design can be calculated with the function glmM():

SummarizedExomePeaks <- glmM(SummarizedExomePeaks) 

If the treated IP and input BAM are provided, glmDM() function can be used to conduct differential modification analysis on modification Peaks with interactive GLM:

SummarizedExomePeaks <- glmDM(SummarizedExomePeaks)

All of the GLM statistics and normalization factors calculated will be contained in the SummarizedExomePeaks object, which can be saved (ex. with saveRDS()) to store the intermediate results in a given pipeline of analysis.

5. Scatter plot between GC content and log2 fold changes (LFCs).

A scatter plot can be produced with plotLfcGC() to visualize the trend of LFCs with GC. It is recommended to plot it twice before and after conducting the GC content normalization using normalizeGC().

plotLfcGC(SummarizedExomePeaks, point_size = 1, xlim = c(0.4, 0.85)) 

6. Bar plot of sequencing depth estimates

The library size factors can be visualized and compared using plotSizeFactors(). The label will allow comparison of 3 types of feature ranges for size factor calculations, including the background, modification and both regions (All).

plotSizeFactors(SummarizedExomePeaks)

7. Export the modification peaks and their statistics

The analysis results can be saved on the system in formats of CSV, BED, or RDS.

exportResults(SummarizedExomePeaks, format = "BED") 

Please note that exportResults() can also specify the filtering thresholds and style of the result table.

7 Contact

If you encounter any problems during use, please contact the maintainer of exomePeak2:

Zhen Wei :

8 References

  1. KD Hansen, RA Irizarry, and Z Wu, Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 2012 vol. 13(2) pp. 204-216.

  2. Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8

  3. Zhu A, Ibrahim JG, Love MI (2018). “Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences.” Bioinformatics. doi: 10.1093/bioinformatics/bty895.

9 Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.62.0                  
##  [3] rtracklayer_1.54.0                Biostrings_2.62.0                
##  [5] XVector_0.34.0                    exomePeak2_1.6.0                 
##  [7] cqn_1.40.0                        quantreg_5.86                    
##  [9] SparseM_1.81                      preprocessCore_1.56.0            
## [11] nor1mix_1.3-0                     mclust_5.4.7                     
## [13] SummarizedExperiment_1.24.0       Biobase_2.54.0                   
## [15] GenomicRanges_1.46.0              GenomeInfoDb_1.30.0              
## [17] IRanges_2.28.0                    S4Vectors_0.32.0                 
## [19] BiocGenerics_0.40.0               MatrixGenerics_1.6.0             
## [21] matrixStats_0.61.0                BiocStyle_2.22.0                 
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-2         rjson_0.2.20             ellipsis_0.3.2          
##   [4] farver_2.1.0             MatrixModels_0.5-0       bit64_4.0.5             
##   [7] mvtnorm_1.1-3            AnnotationDbi_1.56.0     fansi_0.5.0             
##  [10] apeglm_1.16.0            xml2_1.3.2               cachem_1.0.6            
##  [13] geneplotter_1.72.0       knitr_1.36               jsonlite_1.7.2          
##  [16] Rsamtools_2.10.0         annotate_1.72.0          dbplyr_2.1.1            
##  [19] png_0.1-7                BiocManager_1.30.16      compiler_4.1.1          
##  [22] httr_1.4.2               assertthat_0.2.1         Matrix_1.3-4            
##  [25] fastmap_1.1.0            htmltools_0.5.2          prettyunits_1.1.1       
##  [28] tools_4.1.1              coda_0.19-4              gtable_0.3.0            
##  [31] glue_1.4.2               GenomeInfoDbData_1.2.7   reshape2_1.4.4          
##  [34] dplyr_1.0.7              rappdirs_0.3.3           Rcpp_1.0.7              
##  [37] bbmle_1.0.24             jquerylib_0.1.4          vctrs_0.3.8             
##  [40] conquer_1.0.2            xfun_0.27                stringr_1.4.0           
##  [43] lifecycle_1.0.1          restfulr_0.0.13          XML_3.99-0.8            
##  [46] zlibbioc_1.40.0          MASS_7.3-54              scales_1.1.1            
##  [49] ragg_1.1.3               hms_1.1.1                parallel_4.1.1          
##  [52] RColorBrewer_1.1-2       yaml_2.2.1               curl_4.3.2              
##  [55] memoise_2.0.0            ggplot2_3.3.5            emdbook_1.3.12          
##  [58] sass_0.4.0               bdsmatrix_1.3-4          biomaRt_2.50.0          
##  [61] stringi_1.7.5            RSQLite_2.2.8            highr_0.9               
##  [64] genefilter_1.76.0        BiocIO_1.4.0             GenomicFeatures_1.46.0  
##  [67] filelock_1.0.2           BiocParallel_1.28.0      systemfonts_1.0.3       
##  [70] rlang_0.4.12             pkgconfig_2.0.3          bitops_1.0-7            
##  [73] evaluate_0.14            lattice_0.20-45          purrr_0.3.4             
##  [76] labeling_0.4.2           GenomicAlignments_1.30.0 bit_4.0.4               
##  [79] tidyselect_1.1.1         plyr_1.8.6               magrittr_2.0.1          
##  [82] bookdown_0.24            DESeq2_1.34.0            R6_2.5.1                
##  [85] magick_2.7.3             generics_0.1.1           DelayedArray_0.20.0     
##  [88] DBI_1.1.1                pillar_1.6.4             survival_3.2-13         
##  [91] KEGGREST_1.34.0          RCurl_1.98-1.5           tibble_3.1.5            
##  [94] crayon_1.4.1             utf8_1.2.2               BiocFileCache_2.2.0     
##  [97] rmarkdown_2.11           progress_1.2.2           locfit_1.5-9.4          
## [100] grid_4.1.1               blob_1.2.2               digest_0.6.28           
## [103] xtable_1.8-4             numDeriv_2016.8-1.1      textshaping_0.3.6       
## [106] munsell_0.5.0            bslib_0.3.1