Contents

1 Introduction

This document gives an introduction to and overview of inferring the clone identity of cells using the cardelino package using a given clonal structure.

cardelino can use variant information extracted from single-cell RNA-seq reads to probabilistically assign single-cell transcriptomes to individual clones.

Briefly, cardelino is based on a Bayesian mixture model with a beta-binomial error model to account for sequencing errors as well as a gene-specific model for allelic imbalance between haplotypes and associated bias in variant detection. Bayesian inference allows the model to account for uncertainty in model parameters and cell assignments.

We assume that clones are tagged by somatic mutations, and that these mutations are known (e.g. from exome sequencing or equivalent). Given a set of known mutations, these sites can be interrogated in scRNA-seq reads to obtain evidence for the presence or absence of each mutation in each cell. As input, the model requires the count of reads supporting the alternative (mutant) allele at each mutation site, the total number of reads overlapping the mutation site (“coverage”).

Typically, coverage of somatic mutations in scRNA-seq data is very sparse (most mutation sites in a given cell have no read coverage), but the cardelino model accounts for this sparsity and aggregates information across all available mutation sites to infer clonal identity.

2 Clone ID with a clonal tree provided

In many clone ID scenarios, a clonal tree is known. That is, we have been able to infer the clones present in the sampled cell population, for example using bulk or single-cell DNA-seq data, and we know which mutations are expected to be present in which clones.

To infer the clonal identity of cells when a clonal tree is provided, cardelino requires the following input data:

The configuration matrix, Config, can be provided by other tools used to infer the clonal structure of the cell population. For example, the package Canopy can be used to infer a clonal tree from DNA-seq data and the “Z” element of its output is the configuration matrix.

Here, we demonstrate the use of cardelino to assign 77 cells to clones identified with Canopy using 112 somatic mutations.

We load the package and the example clone ID dataset distributed with the package in VCF (variant call format) format, which is mostly used for storing genotype data. Here, the cellSNP.cells.vcf.gz is computed by using cellsnp-lite on a list pre-identified somatic variants from bulk WES.

library(ggplot2)
library(cardelino)

There are many possible ways to extract the data required by cardelino from a VCF file, here we show just one approach using convenience functions in cardelino:

vcf_file <- system.file("extdata", "cellSNP.cells.vcf.gz", 
                        package = "cardelino")
input_data <- load_cellSNP_vcf(vcf_file)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 37
##   header_line: 38
##   variant count: 112
##   column count: 86
## 
Meta line 37 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 112
##   Character matrix gt cols: 86
##   skip: 0
##   nrows: 112
##   row_num: 0
## 
Processed variant: 112
## All variants processed
## [1] "112 out of 112 SNPs passed."

Alternatively you can load the A and D matrices yourself and combine them into a list, for example input_data = list('A' = A, 'D' = D).

We can visualize the allele frequency of the mutation allele. As expected, the majority of entries are missing (in grey) due to the high sparsity in scRNA-seq data. For the same reason, even for the non-missing entries, the estimate of allele frequency is of high uncertainty. For this reason, it is crucial to probabilistic clustering with accounting the uncertainty, ideally with guide clonal tree structure from external data.

AF <- as.matrix(input_data$A / input_data$D)

p = pheatmap::pheatmap(AF, cluster_rows=FALSE, cluster_cols=FALSE,
                   show_rownames = TRUE, show_colnames = TRUE,
                   labels_row='77 cells', labels_col='112 SNVs',
                   angle_col=0, angle_row=0)

Next, we will load the Canopy tree results for the same individual. The clonal tree inferred by Canopy for this donor consists of three clones, including a “base” clone (“clone1”) that has no subclonal somatic mutations present.

canopy_res <- readRDS(system.file("extdata", "canopy_results.coveraged.rds", 
                                  package = "cardelino"))
Config <- canopy_res$tree$Z

Be careful to ensure that the same variant IDs are used in both data sources.

rownames(Config) <- gsub(":", "_", rownames(Config))

We can visualize the clonal tree structure obtained from Canopy:

plot_tree(canopy_res$tree, orient = "v")

The included dataset contains the A and D matrices, so combined with the Canopy tree object provided, we have the necessary input to probabilistically assign cells to clones. Note, min_iter = 800, max_iter = 1200 is used only for quick illustration. Please remove them for the default values or set higher number of iterations to ensure convergence of the Gibbs sampling. Convergence is checked automatically in clone_id(), using the Geweke z-statistic.

set.seed(7)
assignments <- clone_id(input_data$A, input_data$D, Config = Config,
                        min_iter = 800, max_iter = 1200)
## 100% converged.
## [1] "Converged in 800 iterations."
## DIC: 1398.26 D_mean: 1232.3 D_post: 1201.67 logLik_var: 49.15
names(assignments)
##  [1] "theta0"         "theta1"         "theta0_all"     "theta1_all"    
##  [5] "element"        "logLik"         "prob_all"       "prob"          
##  [9] "prob_variant"   "relax_rate"     "Config_prob"    "Config_all"    
## [13] "relax_rate_all" "DIC"            "n_chain"

We can visualise the cell-clone assignment probabilities as a heatmap.

prob_heatmap(assignments$prob)

We recommend assigning a cell to the highest-probability clone if the highest posterior probability is greater than 0.5 and leaving cells “unassigned” if they do not reach this threshold. The assign_cells_to_clones function conveniently assigns cells to clones based on a threshold and returns a data.frame with the cell-clone assignments.

df <- assign_cells_to_clones(assignments$prob)
head(df)
##         cell  clone  prob_max
## 1 ERR2806034 clone2 1.0000000
## 2 ERR2806035 clone1 0.9999962
## 3 ERR2806036 clone1 0.9998902
## 4 ERR2806037 clone1 0.9996968
## 5 ERR2806038 clone1 0.9999999
## 6 ERR2806039 clone2 1.0000000
table(df$clone)
## 
##     clone1     clone2     clone3 unassigned 
##         44         24          7          2

Also, Cardelino will update the guide clonal tree Config matrix (as a prior) and return a posterior estimate. In the figure below, negative value means the probability of a certain variant presents in a certain clone is reduced in posterior compared to prior (i.e., the input Config). Vice verse for the positive values.

heat_matrix(t(assignments$Config_prob - Config)) + 
  scale_fill_gradient2() +
  ggtitle('Changes of clonal Config') + 
  labs(x='Clones', y='112 SNVs') +
  theme(axis.text.y = element_blank(), legend.position = "right")

Finally, we can visualize the results cell assignment and updated mutations clonal configuration at the raw allele frequency matrix:

AF <- as.matrix(input_data$A / input_data$D)
cardelino::vc_heatmap(AF, assignments$prob, Config, show_legend=TRUE)

3 Clone ID without input clonal tree

As shown above, the Config can be updated by the observations from scRNA-seq data. A step further is to not including the Config entirely. This can be possible, as there could be no clonal tree available. In this case, you can use cardelino in its de-novo by set Config=NULL and set a number for n_clones. Note, by default we will keep the first clone as base clone, i.e., no mutations. You can turn it off by set keep_base_clone=FALSE.

assignments <- clone_id(input_data$A, input_data$D, Config=NULL, n_clone = 3)

3.1 Clone ID on mitochondrial variations

As an further illustration, we will show how cardelino can be used to infer the clonal structure from mitochondrial variations, that called from MQuad. Again, we have included the MQuad output in .mtx format in the cardelino package. Note, this mitochondrial data is from the same SMART-seq data set above.

First, let’s import these two AD and DP matrices and together with their variant names.

AD_file <- system.file("extdata", "passed_ad.mtx", package = "cardelino")
DP_file <- system.file("extdata", "passed_dp.mtx", package = "cardelino")
id_file <- system.file("extdata", "passed_variant_names.txt", 
                       package = "cardelino")

AD <- Matrix::readMM(AD_file)
DP <- Matrix::readMM(DP_file)
var_ids <- read.table(id_file, )
rownames(AD) <- rownames(DP) <- var_ids[, 1]
colnames(AD) <- colnames(DP) <- paste0('Cell', seq(ncol(DP)))

Same as above, AD and DP are matrices with variants as rows and cells as column. In this case we have 25 variants (rows) and 77 cells (columns). As expected, the coverage of mtDNA is a lot higher than the nuclear genome above (much fewer missing values).

AF_mt <- as.matrix(AD / DP)
pheatmap::pheatmap(AF_mt)

Now, we can run cardelino on the mitochondrial variations. Note, as there is no prior clonal tree, the model is easier to return a local optima. Generally, we recommend running it multiple time (with different random seed or initializations) and pick the one with highest DIC.

set.seed(7)
assign_mtClones <- clone_id(AD, DP, Config=NULL, n_clone = 3, 
                            keep_base_clone=FALSE)
## Config is NULL: de-novo mode is in use.
## 100% converged.
## [1] "Converged in 5000 iterations."
## DIC: NaN D_mean: Inf D_post: Inf logLik_var: NaN

Then visualise allele frequency along with the clustering of cells and variants:

Config_mt <- assign_mtClones$Config_prob
Config_mt[Config_mt >= 0.5] = 1
Config_mt[Config_mt <  0.5] = 0
cardelino::vc_heatmap(AF_mt, assign_mtClones$prob, Config_mt, show_legend=TRUE)

Session information

sessionInfo()
## R Under development (unstable) (2023-10-22 r85388)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] cardelino_1.5.0  ggplot2_3.4.4    knitr_1.44       BiocStyle_2.31.0
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          jsonlite_1.8.7             
##   [3] magrittr_2.0.3              magick_2.8.1               
##   [5] GenomicFeatures_1.55.0      farver_2.1.1               
##   [7] rmarkdown_2.25              fs_1.6.3                   
##   [9] BiocIO_1.13.0               zlibbioc_1.49.0            
##  [11] vctrs_0.6.4                 memoise_2.0.1              
##  [13] Rsamtools_2.19.0            RCurl_1.98-1.12            
##  [15] ggtree_3.11.0               htmltools_0.5.6.1          
##  [17] S4Arrays_1.3.0              progress_1.2.2             
##  [19] curl_5.1.0                  SparseArray_1.3.0          
##  [21] gridGraphics_0.5-1          sass_0.4.7                 
##  [23] bslib_0.5.1                 cachem_1.0.8               
##  [25] GenomicAlignments_1.39.0    lifecycle_1.0.3            
##  [27] pkgconfig_2.0.3             Matrix_1.6-1.1             
##  [29] R6_2.5.1                    fastmap_1.1.1              
##  [31] GenomeInfoDbData_1.2.11     MatrixGenerics_1.15.0      
##  [33] digest_0.6.33               aplot_0.2.2                
##  [35] colorspace_2.1-0            patchwork_1.1.3            
##  [37] AnnotationDbi_1.65.0        S4Vectors_0.41.0           
##  [39] GenomicRanges_1.55.0        RSQLite_2.3.1              
##  [41] vegan_2.6-4                 labeling_0.4.3             
##  [43] filelock_1.0.2              fansi_1.0.5                
##  [45] mgcv_1.9-0                  httr_1.4.7                 
##  [47] abind_1.4-5                 compiler_4.4.0             
##  [49] bit64_4.0.5                 withr_2.5.1                
##  [51] BiocParallel_1.37.0         DBI_1.1.3                  
##  [53] biomaRt_2.59.0              MASS_7.3-60.1              
##  [55] rappdirs_0.3.3              DelayedArray_0.29.0        
##  [57] rjson_0.2.21                permute_0.9-7              
##  [59] tools_4.4.0                 ape_5.7-1                  
##  [61] glue_1.6.2                  restfulr_0.0.15            
##  [63] nlme_3.1-163                grid_4.4.0                 
##  [65] cluster_2.1.4               memuse_4.2-3               
##  [67] generics_0.1.3              gtable_0.3.4               
##  [69] BSgenome_1.71.0             tidyr_1.3.0                
##  [71] pinfsc50_1.2.0              hms_1.1.3                  
##  [73] xml2_1.3.5                  utf8_1.2.4                 
##  [75] XVector_0.43.0              BiocGenerics_0.49.0        
##  [77] pillar_1.9.0                stringr_1.5.0              
##  [79] yulab.utils_0.1.0           splines_4.4.0              
##  [81] dplyr_1.1.3                 BiocFileCache_2.11.0       
##  [83] treeio_1.27.0               lattice_0.22-5             
##  [85] survival_3.5-7              rtracklayer_1.63.0         
##  [87] bit_4.0.5                   tidyselect_1.2.0           
##  [89] Biostrings_2.71.1           bookdown_0.36              
##  [91] IRanges_2.37.0              SummarizedExperiment_1.33.0
##  [93] snpStats_1.53.0             stats4_4.4.0               
##  [95] xfun_0.40                   Biobase_2.63.0             
##  [97] matrixStats_1.0.0           pheatmap_1.0.12            
##  [99] stringi_1.7.12              lazyeval_0.2.2             
## [101] ggfun_0.1.3                 yaml_2.3.7                 
## [103] evaluate_0.22               codetools_0.2-19           
## [105] tibble_3.2.1                BiocManager_1.30.22        
## [107] ggplotify_0.1.2             cli_3.6.1                  
## [109] munsell_0.5.0               jquerylib_0.1.4            
## [111] Rcpp_1.0.11                 GenomeInfoDb_1.39.0        
## [113] vcfR_1.14.0                 dbplyr_2.3.4               
## [115] png_0.1-8                   XML_3.99-0.14              
## [117] parallel_4.4.0              blob_1.2.4                 
## [119] prettyunits_1.2.0           bitops_1.0-7               
## [121] viridisLite_0.4.2           tidytree_0.4.5             
## [123] VariantAnnotation_1.49.0    scales_1.2.1               
## [125] purrr_1.0.2                 crayon_1.5.2               
## [127] combinat_0.0-8              rlang_1.1.1                
## [129] KEGGREST_1.43.0