Cancer is a genetic disease caused by somatic mutations in genes controlling key biological functions such as cellular growth and division. Such mutations may arise both through cell-intrinsic and exogenous processes, generating characteristic mutational patterns over the genome named mutational signatures. The study of mutational signatures have become a standard component of modern genomics studies, since it can reveal which (environmental and endogenous) mutagenic processes are active in a tumor, and may highlight markers for therapeutic response.

Mutational signatures computational analyses fall mostly within two categories: (i) de novo signatures extraction and (ii) signatures exposure estimation. In the first case, the presence of mutational processes is first assessed from the data, signatures are identified and extracted and finally assigned to samples. This task is typically performed by Non-Negative Matrix Factorization (NMF). While other approaches have been proposed, NMF-based methods are by far the most used. The estimation of signatures exposures is performed by holding a set of signatures fixed (see, e.g., COSMIC mutational signatures catalogue) and assigning them to samples by minimizing, e.g., mean squared error between observed and estimated mutational patterns for each sample.

However, available mutational signatures computational tools presents many pitfalls. First, the task of determining the number of signatures is very complex and depends on heuristics. Second, several signatures have no clear etiology, casting doubt on them being computational artifacts rather than due to mutagenic processes. Last, approaches for signatures assignment are greatly influenced by the set of signatures used for the analysis. To overcome these limitations, we developed RESOLVE (Robust EStimation Of mutationaL signatures Via rEgularization), a framework that allows the efficient extraction and assignment of mutational signatures.

The RESOLVE R package implements a new de novo signatures extraction algorithm, which employs a regularized Non-Negative Matrix Factorization procedure. The method incorporates a background signature during the inference step and adopts elastic net regression to reduce the impact of overfitting. The estimation of the optimal number of signatures is performed by bi-cross-validation. Furthermore, the package also provide a procedure for confidence estimation of signatures activities in samples.

As such, RESOLVE represent an addition to other Bioconductor packages, such as, e.g., SparseSignatures, MutationalPatterns, musicatk among others, that implements a novel approach for detecting mutational signatures.

In this vignette, we give an overview of the package by presenting some of its main functions.

# 1 Installing the RESOLVE R package

The RESOLVE package can be installed from Bioconductor as follow.

if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("RESOLVE")

# 2 Changelog

• 1.0.0 Package released in Bioconductor 3.16.

# 3 Using the RESOLVE R package

We now present some of the main features of the package. We notice that the package supports different types of mutational signatures such as: SBS (single base substitutions) and MNV (multi-nucleotide variant) (see Degasperi, Andrea, et al. “Substitution mutational signatures in whole-genome–sequenced cancers in the UK population.” Science 376.6591 (2022): abl9283), CX (chromosomal instability) (see Drews, Ruben M., et al. “A pan-cancer compendium of chromosomal instability.” Nature 606.7916 (2022): 976-983) and CN (copy number) signatures (see Steele, Christopher D., et al. “Signatures of copy number alterations in human cancer.” Nature 606.7916 (2022): 984-991). But, for the sake of this vignette, we present only results on the classical SBS signatures. We refer to the manual for details.

First, we show how to load example data and import them into a count matrix to perform the signatures analysis.

library("RESOLVE")
data(ssm560_reduced)

These data are a reduced version of the 560 breast tumors provided by Nik-Zainal, Serena, et al. (2016) comprising only 3 patients. We notice that these data are provided purely as an example, and, as they are a reduced and partial version of the original dataset, they should not be used to draw any biological conclusion.

We now import such data into a count matrix to perform the signatures discovery. To do so, we also need to specify the reference genome as a BSgenome object to be considered. This can be done as follows, where in the example we used hs37d5 as reference genome as provided as data object within the package.

library("BSgenome.Hsapiens.1000genomes.hs37d5")
imported_data = getSBSCounts(data = ssm560_reduced, reference = BSgenome.Hsapiens.1000genomes.hs37d5)
head(imported_data)
##          A[C>A]A A[C>A]C A[C>A]G A[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T
## PD10010a      37      25       8      24      35       5      16      25
## PD10011a     103      59      16      73     113      54      31     102
## PD10014a     235     241      37     234     158      71      26     180
##          A[C>T]A A[C>T]C A[C>T]G A[C>T]T A[T>A]A A[T>A]C A[T>A]G A[T>A]T
## PD10010a      49      31     100      42      21      15      17      30
## PD10011a     116      73     228     109      61      70      56     165
## PD10014a     229      89     178     186     105      90     126     174
##          A[T>C]A A[T>C]C A[T>C]G A[T>C]T A[T>G]A A[T>G]C A[T>G]G A[T>G]T
## PD10010a      48      20      29      44       8       6      10      23
## PD10011a     184     116     113     169      77      41      73     105
## PD10014a     261     122     167     211      76      27      84      59
##          C[C>A]A C[C>A]C C[C>A]G C[C>A]T C[C>G]A C[C>G]C C[C>G]G C[C>G]T
## PD10010a      34      28       8      23      15      19      20      26
## PD10011a     105      75      30     102      60      37      22      65
## PD10014a     244     238      35     243     107     105      40     144
##          C[C>T]A C[C>T]C C[C>T]G C[C>T]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T
## PD10010a      48      37      55      43      12       7      18      16
## PD10011a      71      52     108     103     116      80      89     103
## PD10014a     136     124     144     197     116     139     145     217
##          C[T>C]A C[T>C]C C[T>C]G C[T>C]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T
## PD10010a      14      17      20      30       6       8       5      13
## PD10011a     103      78     102     158      40      65      55     188
## PD10014a     103     144     112     129      47      54      70     107
##          G[C>A]A G[C>A]C G[C>A]G G[C>A]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T
## PD10010a      31      22      11      22       6      12       9      14
## PD10011a      78      50      14      55      55      66      13      87
## PD10014a     146     126      24     160      63      70      25     120
##          G[C>T]A G[C>T]C G[C>T]G G[C>T]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T
## PD10010a      40      32      82      25       6       6       6      13
## PD10011a      76      63     118      81      69      41      56      86
## PD10014a     141      99     180     163      62      66      83     126
##          G[T>C]A G[T>C]C G[T>C]G G[T>C]T G[T>G]A G[T>G]C G[T>G]G G[T>G]T
## PD10010a      22       9      16      24       7       1       8      10
## PD10011a      96      62      82      93      56      46      35      99
## PD10014a     110      81     102     135      32      18      61      78
##          T[C>A]A T[C>A]C T[C>A]G T[C>A]T T[C>G]A T[C>G]C T[C>G]G T[C>G]T
## PD10010a      40      40      12      48      54      37      12      85
## PD10011a      78      80      12      83     116     104      29     194
## PD10014a     202     191      17     253     198     159      33     325
##          T[C>T]A T[C>T]C T[C>T]G T[C>T]T T[T>A]A T[T>A]C T[T>A]G T[T>A]T
## PD10010a      67      55      53      71      39      13       3      35
## PD10011a     119      94      78     126     121      43      64      91
## PD10014a     188     153      93     184     124      89      73     221
##          T[T>C]A T[T>C]C T[T>C]G T[T>C]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T
## PD10010a      19      13      11      25      18      11      11      35
## PD10011a     125      79      83     113      68      90     140     251
## PD10014a     143     118      75     148      71      54      76     160

Now, we present an example of visualization feature provided by the package, showing the counts for the first patient, i.e., PD10010a, in the following plot.

patientsSBSPlot(trinucleotides_counts=imported_data,samples="PD10010a")

After the data are loaded, we can perform signatures de novo extraction. To do so, we need to define a range for the number of signatures (variable K) to be considered. We now show how to perform the inference on the first 3 patients of the dataset from Nik-Zainal, Serena, et al. (2016), whose counts are provided within the package.

data(background)
data(patients)
set.seed(12345)
res_denovo = signaturesDecomposition(x = patients[1:3,],
K = 3:4,
background_signature = background,
nmf_runs = 2,
sparsify = FALSE,
num_processes = 1)
## Performing signatures discovery and rank estimation...
## Performing inference for K=3...
## Performing inference for K=4...

We notice that this function can be also used to perform de novo estimation for other types of mutational signatures, such as SBS, MNV, CX and CN.

Now that we have performed the de novo inferece, we need to decide the optimal number of signatures to be extracted from our dataset. To do so, we provide a procedure based on cross-validation.

set.seed(12345)
res_cv = signaturesCV(x = patients[1:3,],
beta = res_denovo$beta, cross_validation_iterations = 2, cross_validation_repetitions = 2, num_processes = 1) ## Estimating the optimal number of signatures with a total of 2 cross validation repetitions... ## Performing repetition 1 out of 2... ## Performing estimation for K=3... ## Performing cross validation iteration 1 out of 2... ## Performing cross validation iteration 2 out of 2... ## Progress 50%... ## Performing estimation for K=4... ## Performing cross validation iteration 1 out of 2... ## Performing cross validation iteration 2 out of 2... ## Progress 100%... ## Performing repetition 2 out of 2... ## Performing estimation for K=3... ## Performing cross validation iteration 1 out of 2... ## Performing cross validation iteration 2 out of 2... ## Progress 50%... ## Performing estimation for K=4... ## Performing cross validation iteration 1 out of 2... ## Performing cross validation iteration 2 out of 2... ## Progress 100%... We notice that the computations for this task can be very time consuming, expecially when many iterations of cross validations are performed (see manual) and a large set of configurations of the parameters are tested. We conclude this vignette by plotting the discovered signatures for the best configuration, i.e., K = 4. signatures = res_denovo$beta[[2]]
signaturesSBSPlot(beta=signatures, xlabels=FALSE)

We refer to the manual for a detailed description of each parameter and to the RESOLVE manuscript for details on the method.

# 4 Current R Session

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
##  [2] BSgenome_1.66.0
##  [3] rtracklayer_1.58.0
##  [4] Biostrings_2.66.0
##  [5] XVector_0.38.0
##  [6] GenomicRanges_1.50.0
##  [7] GenomeInfoDb_1.34.0
##  [8] IRanges_2.32.0
##  [9] S4Vectors_0.36.0
## [10] RESOLVE_1.0.0
## [11] NMF_0.24.0
## [12] bigmemory_4.6.1
## [13] Biobase_2.58.0
## [14] BiocGenerics_0.44.0
## [15] cluster_2.1.4
## [16] rngtools_1.5.2
## [17] pkgmaker_0.32.2
## [18] registry_0.5-1
## [19] BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
##  [1] lsa_0.73.3                  bitops_1.0-7
##  [3] matrixStats_0.62.0          doParallel_1.0.17
##  [5] RColorBrewer_1.1-3          SnowballC_0.7.0
##  [7] tools_4.2.1                 bslib_0.4.0
##  [9] utf8_1.2.2                  R6_2.5.1
## [11] DBI_1.1.3                   colorspace_2.0-3
## [13] withr_2.5.0                 tidyselect_1.2.0
## [15] gridExtra_2.3               compiler_4.2.1
## [17] glmnet_4.1-4                cli_3.4.1
## [19] DelayedArray_0.24.0         labeling_0.4.2
## [21] bookdown_0.29               sass_0.4.2
## [23] scales_1.2.1                nnls_1.4
## [25] stringr_1.4.1               digest_0.6.30
## [27] Rsamtools_2.14.0            rmarkdown_2.17
## [29] pkgconfig_2.0.3             htmltools_0.5.3
## [31] MatrixGenerics_1.10.0       highr_0.9
## [33] fastmap_1.1.0               rlang_1.0.6
## [35] farver_2.1.1                shape_1.4.6
## [37] jquerylib_0.1.4             BiocIO_1.8.0
## [39] generics_0.1.3              jsonlite_1.8.3
## [41] BiocParallel_1.32.0         dplyr_1.0.10
## [43] RCurl_1.98-1.9              magrittr_2.0.3
## [45] GenomeInfoDbData_1.2.9      Matrix_1.5-1
## [47] Rcpp_1.0.9                  munsell_0.5.0
## [49] fansi_1.0.3                 lifecycle_1.0.3
## [51] stringi_1.7.8               yaml_2.3.6
## [53] SummarizedExperiment_1.28.0 zlibbioc_1.44.0
## [55] plyr_1.8.7                  grid_4.2.1
## [57] parallel_4.2.1              bigmemory.sri_0.1.3
## [59] crayon_1.5.2                lattice_0.20-45
## [61] splines_4.2.1               magick_2.7.3
## [63] knitr_1.40                  pillar_1.8.1
## [65] uuid_1.1-0                  rjson_0.2.21
## [67] reshape2_1.4.4              codetools_0.2-18
## [69] XML_3.99-0.12               glue_1.6.2
## [71] evaluate_0.17               data.table_1.14.4
## [73] BiocManager_1.30.19         vctrs_0.5.0
## [75] foreach_1.5.2               gtable_0.3.1
## [77] assertthat_0.2.1            cachem_1.0.6
## [79] ggplot2_3.3.6               xfun_0.34
## [81] gridBase_0.4-7              xtable_1.8-4
## [83] restfulr_0.0.15             survival_3.4-0
## [85] tibble_3.1.8                iterators_1.0.14
## [87] GenomicAlignments_1.34.0