Basic Chapter 2 introduced the principles and methodology for scaling normalization of scRNA-seq data. This chapter provides some commentary on some miscellaneous theoretical aspects including the motivation for the pseudo-count, the use and benefits of downsampling instead of scaling, and some discussion of alternative transformations.
logNormCounts() will add a pseudo-count to avoid undefined values at zero.
Larger pseudo-counts will shrink the log-fold changes between cells towards zero for low-abundance genes, meaning that downstream high-dimensional analyses will be driven more by differences in expression for high-abundance genes.
Conversely, smaller pseudo-counts will increase the relative contribution of low-abundance genes.
Common practice is to use a pseudo-count of 1, for the simple pragmatic reason that it preserves sparsity in the original matrix (i.e., zeroes in the input remain zeroes after transformation).
This works well in all but the most pathological scenarios (A. Lun 2018).
An interesting subtlety of
logNormCounts() is that it will center the size factors at unity, if they were not already.
This puts the normalized expression values on roughly the same scale as the original counts for easier interpretation.
For example, Figure 2.1 shows that interneurons have a median Snap25 log-expression from 5-6;
this roughly translates to an original count of 30-60 UMIs in each cell, which gives us some confidence that it is actually expressed.
This relationship to the original data would be less obvious - or indeed, lost altogether - if the centering were not performed.
#--- loading ---# library(scRNAseq) sce.zeisel <- ZeiselBrainData() library(scater) sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, id=sub("_loc[0-9]+$", "", rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL") #--- quality-control ---# stats <- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType=="mito")) qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", "subsets_Mt_percent")) sce.zeisel <- sce.zeisel[,!qc$discard]
Centering also allows us to interpret a pseudo-count of 1 as an extra read or UMI for each gene. In practical terms, this means that the shrinkage effect of the pseudo-count diminishes as read/UMI counts increase. As a result, any estimates of log-fold changes in expression (e.g., from differences in the log-values between groups of cells) become increasingly accurate with deeper coverage. Conversely, at lower counts, stronger shrinkage avoids inflated differences due to sampling noise, which might otherwise mask interesting features in downstream analyses like clustering. In some sense, the pseudo-count aims to protect later analyses from the lack of information at low counts while trying to miminize its own effect at high counts.
For comparison, consider the situation where we applied a constant pseudo-count to some count-per-million-like measure. It is easy to see that the accuracy of the subsequent log-fold changes would never improve regardless of how much additional sequencing was performed; scaling to a constant library size of a million means that the pseudo-count will have the same effect for all datasets. This is ironic given that the whole intention of sequencing more deeply is to improve quantification of these differences between cell subpopulations. The same criticism applies to popular metrics like the “counts-per-10K” used in, e.g., seurat.
In rare cases, direct scaling of the counts is not appropriate due to the effect described by A. Lun (2018). Briefly, this is caused by the fact that the mean of the log-normalized counts is not the same as the log-transformed mean of the normalized counts. The difference between them depends on the mean and variance of the original counts, such that there is a systematic trend in the mean of the log-counts with respect to the count size. This typically manifests as trajectories correlated strongly with library size even after library size normalization, as shown in Figure 2.2 for synthetic scRNA-seq data generated with a pool-and-split approach (Tian et al. 2019).
# TODO: move to scRNAseq. library(BiocFileCache) bfc <- BiocFileCache(ask=FALSE) qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true") env <- new.env() load(qcdata, envir=env) sce.8qc <- env$sce8_qc # Library size normalization and log-transformation. sce.8qc <- logNormCounts(sce.8qc) sce.8qc <- runPCA(sce.8qc) gridExtra::grid.arrange( plotPCA(sce.8qc, colour_by=I(factor(sce.8qc$mix))), plotPCA(sce.8qc, colour_by=I(librarySizeFactors(sce.8qc))), ncol=2 )
As the problem arises from differences in the sizes of the counts, the most straightforward solution is to downsample the counts of the high-coverage cells to match those of low-coverage cells. This uses the size factors to determine the amount of downsampling for each cell required to reach the 1st percentile of size factors. (The small minority of cells with smaller size factors are simply scaled up. We do not attempt to downsample to the smallest size factor, as this would result in excessive loss of information for one aberrant cell with very low size factors.) We can see that this eliminates the library size factor-associated trajectories from the first two PCs, improving resolution of the known differences based on mixing ratios (Figure 2.3). The log-transformation is still necessary but no longer introduces a shift in the means when the sizes of the counts are similar across cells.
While downsampling is an expedient solution, it is statistically inefficient as it needs to increase the noise of high-coverage cells in order to avoid differences with low-coverage cells. It is also slower than simple scaling. Thus, we would only recommend using this approach after an initial analysis with scaled counts reveals suspicious trajectories that are strongly correlated with the size factors. In such cases, it is a simple matter to re-normalize by downsampling to determine whether the trajectory is an artifact of the log-transformation.
It is worth noting the difference between normalization and batch correction (Multi-sample Chapter 1). Normalization typically refers to removal of technical biases between cells, while batch correction involves removal of both technical biases and biological differences between batches. Technical biases are relatively simple and straightforward to remove, whereas biological differences between batches can be highly unpredictable. On the other hand, batch correction algorithms can share information between cells in the same batch, as all cells in the same batch are assumed to be subject to the same batch effect, whereas most normalization strategies tend to operate on a cell-by-cell basis with less information sharing.
The key point here is that normalization and batch correction are different tasks, involve different assumptions and generally require different computational methods (though some packages aim to perform both steps at once, e.g., zinbwave). Thus, it is important to distinguish between “normalized” and “batch-corrected” data, as these usually refer to different stages of processing. Of course, these processes are not exclusive, and most workflows will perform normalization within each batch followed by correction between batches. Interested readers are directed to Multi-sample Chapter 1 for more details.
R version 4.2.0 RC (2022-04-21 r82226) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.4 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C  LC_TIME=en_GB LC_COLLATE=C  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  stats4 stats graphics grDevices utils datasets methods  base other attached packages:  BiocFileCache_2.5.0 dbplyr_2.1.1  scater_1.25.1 ggplot2_3.3.6  scuttle_1.7.2 SingleCellExperiment_1.19.0  SummarizedExperiment_1.27.1 Biobase_2.57.1  GenomicRanges_1.49.0 GenomeInfoDb_1.33.3  IRanges_2.31.0 S4Vectors_0.35.0  BiocGenerics_0.43.0 MatrixGenerics_1.9.0  matrixStats_0.62.0 BiocStyle_2.25.0  rebook_1.7.0 loaded via a namespace (and not attached):  bitops_1.0-7 bit64_4.0.5  httr_1.4.3 filelock_1.0.2  tools_4.2.0 bslib_0.3.1  utf8_1.2.2 R6_2.5.1  irlba_2.3.5 vipor_0.4.5  DBI_1.1.2 colorspace_2.0-3  withr_2.5.0 tidyselect_1.1.2  gridExtra_2.3 curl_4.3.2  bit_4.0.4 compiler_4.2.0  graph_1.75.0 cli_3.3.0  BiocNeighbors_1.15.0 DelayedArray_0.23.0  labeling_0.4.2 bookdown_0.26  sass_0.4.1 scales_1.2.0  rappdirs_0.3.3 stringr_1.4.0  digest_0.6.29 rmarkdown_2.14  XVector_0.37.0 pkgconfig_2.0.3  htmltools_0.5.2 sparseMatrixStats_1.9.0  highr_0.9 fastmap_1.1.0  rlang_1.0.2 RSQLite_2.2.14  DelayedMatrixStats_1.19.0 farver_2.1.0  jquerylib_0.1.4 generics_0.1.2  jsonlite_1.8.0 BiocParallel_1.31.4  dplyr_1.0.9 RCurl_1.98-1.6  magrittr_2.0.3 BiocSingular_1.13.0  GenomeInfoDbData_1.2.8 Matrix_1.4-1  Rcpp_188.8.131.52 ggbeeswarm_0.6.0  munsell_0.5.0 fansi_1.0.3  viridis_0.6.2 lifecycle_1.0.1  stringi_1.7.6 yaml_2.3.5  zlibbioc_1.43.0 blob_1.2.3  grid_4.2.0 parallel_4.2.0  ggrepel_0.9.1 crayon_1.5.1  dir.expiry_1.5.0 lattice_0.20-45  cowplot_1.1.1 beachmat_2.13.0  CodeDepends_0.6.5 knitr_1.39  pillar_1.7.0 codetools_0.2-18  ScaledMatrix_1.5.0 XML_3.99-0.9  glue_1.6.2 evaluate_0.15  BiocManager_1.30.18 vctrs_0.4.1  gtable_0.3.0 purrr_0.3.4  assertthat_0.2.1 cachem_1.0.6  xfun_0.31 rsvd_1.0.5  viridisLite_0.4.0 tibble_3.1.7  memoise_2.0.1 beeswarm_0.4.0  ellipsis_0.3.2
Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.
Tian, L., X. Dong, S. Freytag, K. A. Le Cao, S. Su, A. JalalAbadi, D. Amann-Zalcenstein, et al. 2019. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nat. Methods 16 (6): 479–87.