1 Introduction

Sanity (SAmpling-Noise-corrected Inference of Transcription activitY) provides a biophysics-inspired normalization method for single-cell RNA-seq data. It uses a Bayesian framework with minimal number of assumptions in order to model physically meaningful quantities and their corresponding uncertainties, facilitating comparisons between cells and experiments. The full details of the method are described in the original publication and its supplementary (Breda, Zavolan, and Nimwegen 2021).

2 Installation

To install this package, start R (version “4.5”) and enter:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
    
BiocManager::install("SanityR")

3 Quick start

The basic functionality of the SanityR package is to compute the logcounts assay from UMI counts of a SingleCellExperiment object. It thus acts as a replacement for the logNormCounts from scuttle on any other function with similar functionality.

library(scuttle)
library(SanityR)

sce <- mockSCE()
sce <- computeLibraryFactors(sce) # or any other method to compute size factors
sce <- Sanity(sce)
logcounts(sce)

4 Simulate single-cell data

To demonstrate the functionality of SanityR, we first simulate a single-cell dataset using the method described in the Sanity paper. In short, the data are generated by sampling from the Bayesian prior of the model and the correlation between the cells’ gene expression is generated by simulating a random walk process in the gene expression space so that the mean value of transcriptional activity in one cell is close to that of its predecessor in the walk. The number of cells, genes and the other prior parameters are tuned to match the dataset generated in (Baron et al. 2016). Here we down-sample the number of genes and cells for speed.

SanityR is designed to work with the SingleCellExperiment class and thus the simulation methods return a SingleCellExperiment object. The latent (“true”) log normalized counts are stored in the logFC assay so as to not interfere with the logcounts assay.

library(SingleCellExperiment)
library(SanityR)

set.seed(42)
sce <- simulate_branched_random_walk(
    N_gene=1000,  
    N_path=10, 
    length_path=12
)
sce
#> class: SingleCellExperiment 
#> dim: 981 120 
#> metadata(0):
#> assays(2): counts logFC
#> rownames(981): Gene_1 Gene_2 ... Gene_999 Gene_1000
#> rowData names(2): ltq_mean ltq_var
#> colnames(120): Cell_1 Cell_2 ... Cell_119 Cell_120
#> colData names(2): cell_size predecessor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Genes with no UMI in any of the cells are removed that is why the number of rows may be less than 1000.

Besides the logFC assay, the following parameters are stored in the sce to fully specify the model:

ltq_mean: the mean of the log transcriptional activity, defined as the quotient of the number of molecules transcribed from each gene versus the total number of molecules in the cell.
ltq_var: the variance of the log transcriptional activity.
cell_size: the total number of molecules in the cell.
predecessor: the index of the cell that precedes of the current cell in the random walk. Thus if predecessor[n] == m, then \(m \to n\) in the random walk.

5 Running Sanity normalization

The Sanity function is the main function of the package. When called on a SingleCellExperiment object, it will perform normalization for each gene and independently and store the results in the logcounts assay.

sf <- colSums(counts(sce))
sizeFactors(sce) <- sf / mean(sf)
sce <- Sanity(sce)
sce
#> class: SingleCellExperiment 
#> dim: 981 120 
#> metadata(0):
#> assays(4): counts logFC logcounts logcounts_sd
#> rownames(981): Gene_1 Gene_2 ... Gene_999 Gene_1000
#> rowData names(6): ltq_mean ltq_var ... sanity_activity_sd
#>   sanity_entropy
#> colnames(120): Cell_1 Cell_2 ... Cell_119 Cell_120
#> colData names(3): cell_size predecessor sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Below we visualize the accuracy of the Sanity normalization method by partially reproducing Figure 3A of the original publication. First, at estimating the overall mean and variance of a gene expression.

par(mfrow=c(1, 2))
plot(
    sanity_log_activity_mean ~ ltq_mean, 
    data=rowData(sce),
    frame=FALSE, 
    pch=16, 
    col="#00000044", 
    xlab="Mean UMI counts",  
    ylab="Sanity Estimate", 
    main="Mean Log Transcriptional Activity"
)
abline(0, 1, col=2, lwd=2, lty=2)
legend("top", legend="y=x", col=2, lwd=2, lty=2, bty="n")

plot(
    sanity_activity_sd^2  ~ ltq_var, data=rowData(sce), 
    log="xy",
    frame=FALSE, 
    pch=16, 
    col="#00000044", 
    xlab="True LTQ Variance",  
    ylab="Sanity Estimate", 
    main="Variance of Log Transcriptional Activity"
)
abline(0, 1, col=2, lwd=2, lty=2)
legend("top", legend="y=x", col=2, lwd=2, lty=2, bty="n")

(#fig:sanity_accuracy)Gene Level Parameters. Comparison of Sanity estimates with latent values used to generate the data.

Second, at a cell-level by calculating the correlation of latent and inferred logcounts using different using genes of different expression level.

umi_mean <- rowMeans(counts(sce))
umi_quantiles <- quantile(umi_mean, seq(0, 1, .2))
umi_quantiles <- cut( 
    umi_mean, 
    umi_quantiles, 
    include.lowest=TRUE, 
    labels=paste0("Q", 1:5)
)
rho <- sapply(
    split(1:nrow(sce), umi_quantiles), 
    function(ix) { 
        sce_q <- sce[ix,] 
        diag(cor(assay(sce_q, "logFC"), logcounts(sce_q))) 
    } 
)

boxplot(
    rho, 
    frame=FALSE, 
    main="Correlation between latent and inferred logcounts", 
    xlab="Quantiles of Mean UMI Count", 
    ylab="Pearson Correlation"
)

Cell Level Estimates. Correlation of Sanity inferred logcounts with latent logcounts using genes at different quantiles of expression.

(#fig:sanity_correlation)Cell Level Estimates. Correlation of Sanity inferred logcounts with latent logcounts using genes at different quantiles of expression.

6 Compute cell distances

The logcounts calculated by Sanity can be used for downstream analyses. Typically, these analyses involve computing distances between cells which can be done by using the functions that BiocNeighbors provides. However, SanityR provides a more principled approach that takes into account the uncertainty of the estimated log normalized counts. This is implemented in the calculateSanityDistance function.

Below is an example of using the distances calculated by calculateSanityDistance as input to Rtsne to embed the cells in a 2D space which captures the underlying random walk structure that generated the data.

cell_dist <- calculateSanityDistance(sce)
reducedDim(sce, "TSNE") <- Rtsne::Rtsne(cell_dist, is_distance=TRUE)$Y

# Visualize the t-SNE embedding
sce$pseudotime <- 0
for (i in 2:ncol(sce))  # cell 1 is the stem cell
    sce$pseudotime[i] <- sce$pseudotime[sce$predecessor[i]] + 1
scater::plotTSNE(sce, color_by="pseudotime")

t-SNE visualization using the Sanity calculated distances. Pseudotime is computed based the predecessor relations between cells.

(#fig:calculate_distance)t-SNE visualization using the Sanity calculated distances. Pseudotime is computed based the predecessor relations between cells.

Unlike many functions of BiocNeighbors111 which compute only the k nearest neighbors, calculateSanityDistance computes all the pairwise distances and thus is more resource intensive, both in terms of memory and time. To ease the computational burden, the user can limit the number of genes involved in the calculations by filtering only the highly variable ones. calculateSanityDistance uses the signal-to-noise ratio (snr_cutoff) internally to filter the genes.

References

Baron, Maayan, Adrian Veres, Samuel L Wolock, Aubrey L Faust, Renaud Gaujoux, Amedeo Vetere, Jennifer Hyoje Ryu, et al. 2016. “A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter-and Intra-Cell Population Structure.” Cell Systems 3 (4): 346–60.

Breda, Jérémie, Mihaela Zavolan, and Erik van Nimwegen. 2021. “Bayesian Inference of Gene Expression States from Single-Cell Rna-Seq Data.” Nature Biotechnology 39 (8): 1008–16.

Appendix

A Session info

sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SanityR_0.99.3              scuttle_1.19.0             
#>  [3] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.2
#>  [5] Biobase_2.69.1              GenomicRanges_1.61.5       
#>  [7] Seqinfo_0.99.2              IRanges_2.43.5             
#>  [9] S4Vectors_0.47.4            BiocGenerics_0.55.1        
#> [11] generics_0.1.4              MatrixGenerics_1.21.0      
#> [13] matrixStats_1.5.0           BiocStyle_2.37.1           
#> 
#> loaded via a namespace (and not attached):
#>  [1] beeswarm_0.4.0      gtable_0.3.6        xfun_0.53          
#>  [4] bslib_0.9.0         ggplot2_4.0.0       ggrepel_0.9.6      
#>  [7] lattice_0.22-7      vctrs_0.6.5         tools_4.5.1        
#> [10] parallel_4.5.1      tibble_3.3.0        pkgconfig_2.0.3    
#> [13] BiocNeighbors_2.3.1 Matrix_1.7-4        RColorBrewer_1.1-3 
#> [16] S7_0.2.0            lifecycle_1.0.4     compiler_4.5.1     
#> [19] farver_2.1.2        tinytex_0.57        codetools_0.2-20   
#> [22] vipor_0.4.7         htmltools_0.5.8.1   sass_0.4.10        
#> [25] yaml_2.3.10         pillar_1.11.1       crayon_1.5.3       
#> [28] jquerylib_0.1.4     BiocParallel_1.43.4 DelayedArray_0.35.3
#> [31] cachem_1.1.0        viridis_0.6.5       magick_2.9.0       
#> [34] abind_1.4-8         tidyselect_1.2.1    rsvd_1.0.5         
#> [37] digest_0.6.37       Rtsne_0.17          BiocSingular_1.25.0
#> [40] dplyr_1.1.4         bookdown_0.45       labeling_0.4.3     
#> [43] cowplot_1.2.0       fastmap_1.2.0       grid_4.5.1         
#> [46] cli_3.6.5           SparseArray_1.9.1   scater_1.37.0      
#> [49] magrittr_2.0.4      S4Arrays_1.9.1      dichromat_2.0-0.1  
#> [52] withr_3.0.2         scales_1.4.0        ggbeeswarm_0.7.2   
#> [55] rmarkdown_2.30      XVector_0.49.1      gridExtra_2.3      
#> [58] ScaledMatrix_1.17.0 beachmat_2.25.5     evaluate_1.0.5     
#> [61] knitr_1.50          viridisLite_0.4.2   irlba_2.3.5.1      
#> [64] rlang_1.1.6         Rcpp_1.1.0          glue_1.8.0         
#> [67] BiocManager_1.30.26 jsonlite_2.0.0      R6_2.6.1

Normalizing scRNA-seq data with Sanity

2025-10-07

Package