Chapter 2 Normalization, redux
2.1 Overview
Basic Chapter 2 introduced the principles and methodology for scaling normalization of scRNA-seq data. This chapter provides some commentary on some miscellaneous theoretical aspects including the motivation for the pseudo-count, the use and benefits of downsampling instead of scaling, and some discussion of alternative transformations.
2.2 Scaling and the pseudo-count
When log-transforming, logNormCounts() will add a pseudo-count to avoid undefined values at zero.
Larger pseudo-counts will shrink the log-fold changes between cells towards zero for low-abundance genes, meaning that downstream high-dimensional analyses will be driven more by differences in expression for high-abundance genes.
Conversely, smaller pseudo-counts will increase the relative contribution of low-abundance genes.
Common practice is to use a pseudo-count of 1, for the simple pragmatic reason that it preserves sparsity in the original matrix (i.e., zeroes in the input remain zeroes after transformation).
This works well in all but the most pathological scenarios (A. Lun 2018).
An interesting subtlety of logNormCounts() is that it will center the size factors at unity, if they were not already.
This puts the normalized expression values on roughly the same scale as the original counts for easier interpretation.
For example, Figure 2.1 shows that interneurons have a median Snap25 log-expression from 5-6;
this roughly translates to an original count of 30-60 UMIs in each cell, which gives us some confidence that it is actually expressed.
This relationship to the original data would be less obvious - or indeed, lost altogether - if the centering were not performed.
#--- loading ---#
library(scRNAseq)
sce.zeisel <- ZeiselBrainData()
library(scater)
sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, 
    id=sub("_loc[0-9]+$", "", rownames(sce.zeisel)))
#--- gene-annotation ---#
library(org.Mm.eg.db)
rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db, 
    keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL")
#--- quality-control ---#
stats <- perCellQCMetrics(sce.zeisel, subsets=list(
    Mt=rowData(sce.zeisel)$featureType=="mito"))
qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", 
    "subsets_Mt_percent"))
sce.zeisel <- sce.zeisel[,!qc$discard]library(scuttle)
library(scater)
sce.zeisel <- logNormCounts(sce.zeisel)
plotExpression(sce.zeisel, x="level1class", features="Snap25", colour="level1class") 
Figure 2.1: Distribution of log-expression values for Snap25 in each cell type of the Zeisel brain dataset.
Centering also allows us to interpret a pseudo-count of 1 as an extra read or UMI for each gene. In practical terms, this means that the shrinkage effect of the pseudo-count diminishes as read/UMI counts increase. As a result, any estimates of log-fold changes in expression (e.g., from differences in the log-values between groups of cells) become increasingly accurate with deeper coverage. Conversely, at lower counts, stronger shrinkage avoids inflated differences due to sampling noise, which might otherwise mask interesting features in downstream analyses like clustering. In some sense, the pseudo-count aims to protect later analyses from the lack of information at low counts while trying to miminize its own effect at high counts.
For comparison, consider the situation where we applied a constant pseudo-count to some count-per-million-like measure. It is easy to see that the accuracy of the subsequent log-fold changes would never improve regardless of how much additional sequencing was performed; scaling to a constant library size of a million means that the pseudo-count will have the same effect for all datasets. This is ironic given that the whole intention of sequencing more deeply is to improve quantification of these differences between cell subpopulations. The same criticism applies to popular metrics like the “counts-per-10K” used in, e.g., seurat.
2.3 Downsampling instead of scaling
In rare cases, direct scaling of the counts is not appropriate due to the effect described by A. Lun (2018). Briefly, this is caused by the fact that the mean of the log-normalized counts is not the same as the log-transformed mean of the normalized counts. The difference between them depends on the mean and variance of the original counts, such that there is a systematic trend in the mean of the log-counts with respect to the count size. This typically manifests as trajectories correlated strongly with library size even after library size normalization, as shown in Figure 2.2 for synthetic scRNA-seq data generated with a pool-and-split approach (Tian et al. 2019).
# TODO: move to scRNAseq.
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")
env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc
# Library size normalization and log-transformation.
sce.8qc <- logNormCounts(sce.8qc)
sce.8qc <- runPCA(sce.8qc)
gridExtra::grid.arrange(
    plotPCA(sce.8qc, colour_by=I(factor(sce.8qc$mix))),
    plotPCA(sce.8qc, colour_by=I(librarySizeFactors(sce.8qc))),
    ncol=2
) 
Figure 2.2: PCA plot of all pool-and-split libraries in the SORT-seq CellBench data, computed from the log-normalized expression values with library size-derived size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).
As the problem arises from differences in the sizes of the counts, the most straightforward solution is to downsample the counts of the high-coverage cells to match those of low-coverage cells. This uses the size factors to determine the amount of downsampling for each cell required to reach the 1st percentile of size factors. (The small minority of cells with smaller size factors are simply scaled up. We do not attempt to downsample to the smallest size factor, as this would result in excessive loss of information for one aberrant cell with very low size factors.) We can see that this eliminates the library size factor-associated trajectories from the first two PCs, improving resolution of the known differences based on mixing ratios (Figure 2.3). The log-transformation is still necessary but no longer introduces a shift in the means when the sizes of the counts are similar across cells.
sce.8qc2 <- logNormCounts(sce.8qc, downsample=TRUE)
sce.8qc2 <- runPCA(sce.8qc2)
gridExtra::grid.arrange(
    plotPCA(sce.8qc2, colour_by=I(factor(sce.8qc2$mix))),
    plotPCA(sce.8qc2, colour_by=I(librarySizeFactors(sce.8qc2))),
    ncol=2
) 
Figure 2.3: PCA plot of pool-and-split libraries in the SORT-seq CellBench data, computed from the log-transformed counts after downsampling in proportion to the library size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).
While downsampling is an expedient solution, it is statistically inefficient as it needs to increase the noise of high-coverage cells in order to avoid differences with low-coverage cells. It is also slower than simple scaling. Thus, we would only recommend using this approach after an initial analysis with scaled counts reveals suspicious trajectories that are strongly correlated with the size factors. In such cases, it is a simple matter to re-normalize by downsampling to determine whether the trajectory is an artifact of the log-transformation.
2.5 Normalization versus batch correction
It is worth noting the difference between normalization and batch correction (Multi-sample Chapter 1). Normalization typically refers to removal of technical biases between cells, while batch correction involves removal of both technical biases and biological differences between batches. Technical biases are relatively simple and straightforward to remove, whereas biological differences between batches can be highly unpredictable. On the other hand, batch correction algorithms can share information between cells in the same batch, as all cells in the same batch are assumed to be subject to the same batch effect, whereas most normalization strategies tend to operate on a cell-by-cell basis with less information sharing.
The key point here is that normalization and batch correction are different tasks, involve different assumptions and generally require different computational methods (though some packages aim to perform both steps at once, e.g., zinbwave). Thus, it is important to distinguish between “normalized” and “batch-corrected” data, as these usually refer to different stages of processing. Of course, these processes are not exclusive, and most workflows will perform normalization within each batch followed by correction between batches. Interested readers are directed to Multi-sample Chapter 1 for more details.
Session Info
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     
other attached packages:
 [1] BiocFileCache_2.14.0        dbplyr_2.5.0               
 [3] scater_1.34.0               ggplot2_3.5.1              
 [5] scuttle_1.16.0              SingleCellExperiment_1.28.1
 [7] SummarizedExperiment_1.36.0 Biobase_2.66.0             
 [9] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1        
[11] IRanges_2.40.1              S4Vectors_0.44.0           
[13] BiocGenerics_0.52.0         MatrixGenerics_1.18.1      
[15] matrixStats_1.5.0           BiocStyle_2.34.0           
[17] rebook_1.16.0              
loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        viridisLite_0.4.2       blob_1.2.4             
 [4] dplyr_1.1.4             vipor_0.4.7             farver_2.1.2           
 [7] filelock_1.0.3          viridis_0.6.5           fastmap_1.2.0          
[10] XML_3.99-0.18           digest_0.6.37           rsvd_1.0.5             
[13] lifecycle_1.0.4         RSQLite_2.3.9           magrittr_2.0.3         
[16] compiler_4.4.2          rlang_1.1.5             sass_0.4.9             
[19] tools_4.4.2             yaml_2.3.10             knitr_1.49             
[22] S4Arrays_1.6.0          labeling_0.4.3          curl_6.1.0             
[25] bit_4.5.0.1             DelayedArray_0.32.0     abind_1.4-8            
[28] BiocParallel_1.40.0     purrr_1.0.2             withr_3.0.2            
[31] CodeDepends_0.6.6       grid_4.4.2              beachmat_2.22.0        
[34] colorspace_2.1-1        scales_1.3.0            cli_3.6.3              
[37] rmarkdown_2.29          crayon_1.5.3            generics_0.1.3         
[40] httr_1.4.7              DBI_1.2.3               ggbeeswarm_0.7.2       
[43] cachem_1.1.0            zlibbioc_1.52.0         parallel_4.4.2         
[46] BiocManager_1.30.25     XVector_0.46.0          vctrs_0.6.5            
[49] Matrix_1.7-1            jsonlite_1.8.9          dir.expiry_1.14.0      
[52] bookdown_0.42           BiocSingular_1.22.0     BiocNeighbors_2.0.1    
[55] bit64_4.6.0-1           ggrepel_0.9.6           irlba_2.3.5.1          
[58] beeswarm_0.4.0          jquerylib_0.1.4         glue_1.8.0             
[61] codetools_0.2-20        cowplot_1.1.3           gtable_0.3.6           
[64] UCSC.utils_1.2.0        ScaledMatrix_1.14.0     munsell_0.5.1          
[67] tibble_3.2.1            pillar_1.10.1           rappdirs_0.3.3         
[70] htmltools_0.5.8.1       graph_1.84.1            GenomeInfoDbData_1.2.13
[73] R6_2.5.1                evaluate_1.0.3          lattice_0.22-6         
[76] memoise_2.0.1           bslib_0.8.0             Rcpp_1.0.14            
[79] gridExtra_2.3           SparseArray_1.6.1       xfun_0.50              
[82] pkgconfig_2.0.3        References
Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.
Tian, L., X. Dong, S. Freytag, K. A. Le Cao, S. Su, A. JalalAbadi, D. Amann-Zalcenstein, et al. 2019. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nat. Methods 16 (6): 479–87.
2.4 Comments on other transformations
Of course, the log-transformation is not the only possible transformation. Another somewhat common choice is the square root, motivated by the fact that it is the variance stabilizing transformation for Poisson-distributed counts. This assumes that counts are actually Poisson-distributed, which is true enough from the perspective of sequencing noise in UMI counts but ignores biological overdispersion. One may also see the inverse hyperbolic sine (a.k.a, arcsinh) transformation being used on occasion, which is very similar to the log-transformation when considering non-negative values. The main practical difference for scRNA-seq applications is a larger initial jump from zero to non-zero values.
Alternatively, we may use more sophisticated approaches for variance stabilizing transformations in genomics data, e.g., DESeq2 or sctransform. These aim to remove the mean-variance trend more effectively than the simpler transformations mentioned above, though it could be argued whether this is actually desirable. For low-coverage scRNA-seq data, there will always be a mean-variance trend under any transformation, for the simple reason that the variance must be zero when the mean count is zero. These methods also face the challenge of removing the mean-variance trend while preserving the interesting component of variation, i.e., the log-fold changes between subpopulations; this may or may not be done adequately, depending on the aggressiveness of the algorithm.
In practice, the log-transformation is a good default choice due to its simplicity and interpretability, and is what we will be using for all downstream analyses.