DecontPro assess and decontaminate single-cell protein expression data, such as those generated from CITE-seq or Total-seq. The count matrix is decomposed into three matrices, the native, the ambient and the background that represent the contribution from the true protein expression on cells, the ambient material and other non-specific background contamination.
DecontX Package can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("decontX")
Then the package can be loaded in R using the following command:
library(decontX)
To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/decontX.
Here we use an example dataset from SingleCellMultiModal
package.
library(SingleCellMultiModal)
dat <- CITEseq("cord_blood", dry.run = FALSE)
#> Warning: 'ExperimentList' contains 'data.frame' or 'DataFrame',
#> potential for errors with mixed data types
counts <- experiments(dat)$scADT
For this tutorial, we sample only 1000 droplets from the dataset to demonstrate the use of functions. When analyzing your dataset, sub-sampling should be done with caution, as decontPro
approximates contamination profile using the dataset. A biased sampling may introduce bias to the contamination profile approximation.
set.seed(42)
sample_id <- sample(dim(counts)[2], 1000, replace = FALSE)
counts_sample <- counts[, sample_id]
decontPro
requires a vector indicating the cell types of each droplet. Here we use Seurat
for clustering.
library(Seurat)
library(dplyr)
adt_seurat <- CreateSeuratObject(counts_sample, assay = "ADT")
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
adt_seurat <- NormalizeData(adt_seurat, normalization.method = "CLR", margin = 2) %>%
ScaleData(assay = "ADT") %>%
RunPCA(assay = "ADT", features = rownames(adt_seurat), npcs = 10,
reduction.name = "pca_adt") %>%
FindNeighbors(dims = 1:10, assay = "ADT", reduction = "pca_adt") %>%
FindClusters(resolution = 0.5)
#> Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
#> a percentage of total singular values, use a standard svd instead.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 1000
#> Number of edges: 32498
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8567
#> Number of communities: 9
#> Elapsed time: 0 seconds
adt_seurat <- RunUMAP(adt_seurat,
dims = 1:10,
assay = "ADT",
reduction = "pca_adt",
reduction.name = "adtUMAP",
verbose = FALSE)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session
DimPlot(adt_seurat, reduction = "adtUMAP", label = TRUE)
FeaturePlot(adt_seurat,
features = c("CD3", "CD4", "CD8", "CD19", "CD14", "CD16", "CD56"))
clusters <- as.integer(Idents(adt_seurat))
You can run DecontPro
by simply:
counts <- as.matrix(counts_sample)
out <- deconPro(counts,
clusters)
Priors (delta_sd
and background_sd
) may need tuning with the help of plotting the decontaminated results. The two priors encode belief if a more spread-out count should be considered contamination vs. native. We tested the default values on datasets ranging 5k to 10k droplets and 10 to 30 ADTs and the results are reasonable. A more contaminated or a smaller dataset may need larger priors. In this tutorial, since we sampled only 1k droplets from the original 10k droplets, we use slightly larger priors:
counts <- as.matrix(counts_sample)
out <- decontPro(counts,
clusters,
delta_sd = 2e-4,
background_sd = 2e-5)
#> Chain 1: ------------------------------------------------------------
#> Chain 1: EXPERIMENTAL ALGORITHM:
#> Chain 1: This procedure has not been thoroughly tested and may be unstable
#> Chain 1: or buggy. The interface is subject to change.
#> Chain 1: ------------------------------------------------------------
#> Chain 1:
#> Chain 1:
#> Chain 1:
#> Chain 1: Gradient evaluation took 0.017134 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 171.34 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Begin eta adaptation.
#> Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
#> Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
#> Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
#> Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
#> Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
#> Chain 1: Success! Found best value [eta = 1] earlier than expected.
#> Chain 1:
#> Chain 1: Begin stochastic gradient ascent.
#> Chain 1: iter ELBO delta_ELBO_mean delta_ELBO_med notes
#> Chain 1: 100 -513803659.665 1.000 1.000
#> Chain 1: 200 -7177548.141 35.792 70.585
#> Chain 1: 300 -342758.348 30.508 19.941
#> Chain 1: 400 -253644.497 22.969 19.941
#> Chain 1: 500 -223785.346 18.402 1.000
#> Chain 1: 600 -212411.740 15.344 1.000
#> Chain 1: 700 -206476.954 13.156 0.351
#> Chain 1: 800 -202104.000 11.514 0.351
#> Chain 1: 900 -199781.145 10.236 0.133
#> Chain 1: 1000 -198758.154 9.213 0.133
#> Chain 1: 1100 -198020.264 8.376 0.054 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1200 -197397.014 7.678 0.054 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1300 -196906.436 7.088 0.029 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1400 -196442.570 6.582 0.029 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1500 -196230.683 6.143 0.022 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1600 -195981.145 5.759 0.022 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1700 -195830.350 5.420 0.012 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1800 -195732.312 5.119 0.012 MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 1900 -195651.552 4.850 0.005 MEDIAN ELBO CONVERGED MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:
#> Chain 1: Drawing a sample of size 1000 from the approximate posterior...
#> Chain 1: COMPLETED.
#> Warning: Pareto k diagnostic value is 41.49. Resampling is disabled. Decreasing
#> tol_rel_obj may help if variational algorithm has terminated prematurely.
#> Otherwise consider using sampling instead.
The output contains three matrices, and model parameters after inference.
decontaminated_counts
represent the true protein expression on cells.
decontaminated_counts <- out$decontaminated_counts
Plot ADT density before and after decontamination. For bimodal ADTs, the background peak should be removed. Note CD4 is tri-modal with the intermediate mode corresponding to monocytes. This can be used as a QC metric for decontamination as only the lowest mode should be removed.
plotDensity(counts,
decontaminated_counts,
c("CD3", "CD4", "CD8", "CD14", "CD16", "CD19"))
We can also visualize the decontamination by each cell cluster.
plotBoxByCluster(counts,
decontaminated_counts,
clusters,
c("CD3", "CD4", "CD8", "CD16"))
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 Seurat_5.1.0
#> [3] SeuratObject_5.0.2 sp_2.1-4
#> [5] SingleCellMultiModal_1.17.4 MultiAssayExperiment_1.32.0
#> [7] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [9] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [11] IRanges_2.40.0 S4Vectors_0.44.0
#> [13] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [15] matrixStats_1.4.1 decontX_1.4.0
#> [17] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.4.1
#> [3] later_1.3.2 filelock_1.0.3
#> [5] tibble_3.2.1 polyclip_1.10-7
#> [7] fastDummies_1.7.4 lifecycle_1.0.4
#> [9] StanHeaders_2.32.10 globals_0.16.3
#> [11] lattice_0.22-6 MASS_7.3-61
#> [13] magrittr_2.0.3 plotly_4.10.4
#> [15] sass_0.4.9 rmarkdown_2.28
#> [17] jquerylib_0.1.4 yaml_2.3.10
#> [19] httpuv_1.6.15 sctransform_0.4.1
#> [21] spam_2.11-0 pkgbuild_1.4.5
#> [23] spatstat.sparse_3.1-0 reticulate_1.39.0
#> [25] DBI_1.2.3 cowplot_1.1.3
#> [27] pbapply_1.7-2 RColorBrewer_1.1-3
#> [29] abind_1.4-8 zlibbioc_1.52.0
#> [31] Rtsne_0.17 purrr_1.0.2
#> [33] rappdirs_0.3.3 GenomeInfoDbData_1.2.13
#> [35] ggrepel_0.9.6 inline_0.3.19
#> [37] irlba_2.3.5.1 listenv_0.9.1
#> [39] spatstat.utils_3.1-0 goftest_1.2-3
#> [41] RSpectra_0.16-2 spatstat.random_3.3-2
#> [43] fitdistrplus_1.2-1 parallelly_1.38.0
#> [45] leiden_0.4.3.1 codetools_0.2-20
#> [47] DelayedArray_0.32.0 tidyselect_1.2.1
#> [49] UCSC.utils_1.2.0 farver_2.1.2
#> [51] BiocFileCache_2.14.0 spatstat.explore_3.3-3
#> [53] jsonlite_1.8.9 progressr_0.15.0
#> [55] ggridges_0.5.6 survival_3.7-0
#> [57] tools_4.4.1 ica_1.0-3
#> [59] Rcpp_1.0.13 glue_1.8.0
#> [61] BiocBaseUtils_1.8.0 gridExtra_2.3
#> [63] SparseArray_1.6.0 xfun_0.48
#> [65] loo_2.8.0 withr_3.0.2
#> [67] combinat_0.0-8 BiocManager_1.30.25
#> [69] fastmap_1.2.0 MCMCprecision_0.4.0
#> [71] fansi_1.0.6 digest_0.6.37
#> [73] R6_2.5.1 mime_0.12
#> [75] colorspace_2.1-1 scattermore_1.2
#> [77] tensor_1.5 RSQLite_2.3.7
#> [79] spatstat.data_3.1-2 utf8_1.2.4
#> [81] tidyr_1.3.1 generics_0.1.3
#> [83] data.table_1.16.2 httr_1.4.7
#> [85] htmlwidgets_1.6.4 S4Arrays_1.6.0
#> [87] uwot_0.2.2 pkgconfig_2.0.3
#> [89] gtable_0.3.6 blob_1.2.4
#> [91] lmtest_0.9-40 SingleCellExperiment_1.28.0
#> [93] XVector_0.46.0 htmltools_0.5.8.1
#> [95] dotCall64_1.2 bookdown_0.41
#> [97] scales_1.3.0 png_0.1-8
#> [99] SpatialExperiment_1.16.0 spatstat.univar_3.0-1
#> [101] knitr_1.48 rjson_0.2.23
#> [103] reshape2_1.4.4 nlme_3.1-166
#> [105] curl_5.2.3 cachem_1.1.0
#> [107] zoo_1.8-12 stringr_1.5.1
#> [109] BiocVersion_3.20.0 KernSmooth_2.23-24
#> [111] parallel_4.4.1 miniUI_0.1.1.1
#> [113] AnnotationDbi_1.68.0 pillar_1.9.0
#> [115] grid_4.4.1 vctrs_0.6.5
#> [117] RANN_2.6.2 promises_1.3.0
#> [119] dbplyr_2.5.0 xtable_1.8-4
#> [121] cluster_2.1.6 evaluate_1.0.1
#> [123] tinytex_0.53 magick_2.8.5
#> [125] cli_3.6.3 compiler_4.4.1
#> [127] rlang_1.1.4 crayon_1.5.3
#> [129] rstantools_2.4.0 future.apply_1.11.3
#> [131] labeling_0.4.3 plyr_1.8.9
#> [133] stringi_1.8.4 rstan_2.32.6
#> [135] viridisLite_0.4.2 deldir_2.0-4
#> [137] QuickJSR_1.4.0 Biostrings_2.74.0
#> [139] munsell_0.5.1 lazyeval_0.2.2
#> [141] spatstat.geom_3.3-3 V8_6.0.0
#> [143] Matrix_1.7-1 ExperimentHub_2.14.0
#> [145] RcppHNSW_0.6.0 patchwork_1.3.0
#> [147] bit64_4.5.2 future_1.34.0
#> [149] ggplot2_3.5.1 KEGGREST_1.46.0
#> [151] shiny_1.9.1 highr_0.11
#> [153] AnnotationHub_3.14.0 ROCR_1.0-11
#> [155] memoise_2.0.1 igraph_2.1.1
#> [157] RcppParallel_5.1.9 bslib_0.8.0
#> [159] bit_4.5.0