Chapter 7 Normalization
Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data (Stegle, Teichmann, and Marioni 2015). They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.
At this point, it is worth noting the difference between normalization and batch correction (Chapter 28.8). Normalization occurs regardless of the batch structure and only considers technical biases, while batch correction - as the name suggests - only occurs across batches and must consider both technical biases and biological differences. Technical biases tend to affect genes in a similar manner, or at least in a manner related to their biophysical properties (e.g., length, GC content), while biological differences between batches can be highly unpredictable. As such, these two tasks involve different assumptions and generally involve different computational methods (though some packages aim to perform both steps at once, e.g., zinbwave). Thus, it is important to avoid conflating “normalized” and “batch-corrected” data, as these usually refer to different things.
We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. To demonstrate, we will use the Zeisel et al. (2015) dataset from the scRNAseq package.
#--- loading ---# library(scRNAseq) sce.zeisel <- ZeiselBrainData() library(scater) sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, id=sub("_loc[0-9]+$", "", rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL") #--- quality-control ---# stats <- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType=="mito")) qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", "subsets_Mt_percent")) sce.zeisel <- sce.zeisel[,!qc$discard]
## class: SingleCellExperiment ## dim: 19839 2816 ## metadata(0): ## assays(1): counts ## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty ## rowData names(2): featureType Ensembl ## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01 ## 1772066098_A12 ## colData names(10): tissue group # ... level1class level2class ## reducedDimNames(0): ## altExpNames(2): ERCC repeat
7.2 Library size normalization
Library size normalization is the simplest strategy for performing scaling normalization. We define the library size as the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases. The “library size factor” for each cell is then directly proportional to its library size where the proportionality constant is defined such that the mean size factor across all cells is equal to 1. This definition ensures that the normalized expression values are on the same scale as the original counts, which is useful for interpretation - especially when dealing with transformed data (see Section 7.5.1).
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.176 0.568 0.868 1.000 1.278 4.084
In the Zeisel brain data, the library size factors differ by up to 10-fold across cells (Figure 7.1). This is typical of the variability in coverage in scRNA-seq data.
Strictly speaking, the use of library size factors assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells. That is, any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes. This ensures that the library size is an unbiased estimate of the relative cell-specific bias by avoiding composition effects (Robinson and Oshlack 2010). However, balanced DE is not generally present in scRNA-seq applications, which means that library size normalization may not yield accurate normalized expression values for downstream analyses.
In practice, normalization accuracy is not a major consideration for exploratory scRNA-seq data analyses. Composition biases do not usually affect the separation of clusters, only the magnitude - and to a lesser extent, direction - of the log-fold changes between clusters or cell types. As such, library size normalization is usually sufficient in many applications where the aim is to identify clusters and the top markers that define each cluster.
7.3 Normalization by deconvolution
As previously mentioned, composition biases will be present when any unbalanced differential expression exists between samples. Consider the simple example of two cells where a single gene \(X\) is upregulated in one cell \(A\) compared to the other cell \(B\). This upregulation means that either (i) more sequencing resources are devoted to \(X\) in \(A\), thus decreasing coverage of all other non-DE genes when the total library size of each cell is experimentally fixed (e.g., due to library quantification); or (ii) the library size of \(A\) increases when \(X\) is assigned more reads or UMIs, increasing the library size factor and yielding smaller normalized expression values for all non-DE genes. In both cases, the net effect is that non-DE genes in \(A\) will incorrectly appear to be downregulated compared to \(B\).
The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis.
Normalization can be performed with the
estimateSizeFactorsFromMatrix() function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014) or with the
calcNormFactors() function (Robinson and Oshlack 2010) in the edgeR package.
These assume that most genes are not DE between cells.
Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal.
However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts.
To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation (Lun, Bach, and Marioni 2016).
Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile.
This is performed using the
calculateSumFactors() function from scran, as shown below.
## clust.zeisel ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 170 254 441 178 393 148 219 240 189 123 112 103 135 111
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.119 0.486 0.831 1.000 1.321 4.509
We use a pre-clustering step with
quickCluster() where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters.
This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations.
quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package.
The approximation relies on stochastic initialization so we need to set the random seed (via
set.seed()) for reproducibility.
We see that the deconvolution size factors exhibit cell type-specific deviations from the library size factors in Figure 7.2. This is consistent with the presence of composition biases that are introduced by strong differential expression between cell types. Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications.
Accurate normalization is most important for procedures that involve estimation and interpretation of per-gene statistics. For example, composition biases can compromise DE analyses by systematically shifting the log-fold changes in one direction or another. However, it tends to provide less benefit over simple library size normalization for cell-based analyses such as clustering. The presence of composition biases already implies strong differences in expression profiles, so changing the normalization strategy is unlikely to affect the outcome of a clustering procedure.
7.4 Normalization by spike-ins
Spike-in normalization is based on the assumption that the same amount of spike-in RNA was added to each cell (A. T. L. Lun et al. 2017). Systematic differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. To remove these biases, we equalize spike-in coverage across cells by scaling with “spike-in size factors”. Compared to the previous methods, spike-in normalization requires no assumption about the biology of the system (i.e., the absence of many DE genes). Instead, it assumes that the spike-in transcripts were (i) added at a constant level to each cell, and (ii) respond to biases in the same relative manner as endogenous genes.
Practically, spike-in normalization should be used if differences in the total RNA content of individual cells are of interest and must be preserved in downstream analyses. For a given cell, an increase in its overall amount of endogenous RNA will not increase its spike-in size factor. This ensures that the effects of total RNA content on expression across the population will not be removed upon scaling. By comparison, the other normalization methods described above will simply interpret any change in total RNA content as part of the bias and remove it.
We demonstrate the use of spike-in normalization on a different dataset involving T cell activation after stimulation with T cell recepter ligands of varying affinity (Richard et al. 2018).
## class: SingleCellExperiment ## dim: 46603 528 ## metadata(0): ## assays(1): counts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ... ## SLX-12612.i712_i522. SLX-12612.i714_i522. ## colData names(13): age individual ... stimulus time ## reducedDimNames(0): ## altExpNames(1): ERCC
We apply the
computeSpikeFactors() method to estimate spike-in size factors for all cells.
This is defined by converting the total spike-in count per cell into a size factor, using the same reasoning as in
Scaling will subsequently remove any differences in spike-in coverage across cells.
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.125 0.428 0.627 1.000 1.070 23.316
We observe a positive correlation between the spike-in size factors and deconvolution size factors within each treatment condition (Figure 7.3), indicating that they are capturing similar technical biases in sequencing depth and capture efficiency. However, we also observe that increasing stimulation of the T cell receptor - in terms of increasing affinity or time - results in a decrease in the spike-in factors relative to the library size factors. This is consistent with an increase in biosynthetic activity and total RNA content during stimulation, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the library size factors).
to.plot <- data.frame( DeconvFactor=calculateSumFactors(sce.richard), SpikeFactor=sizeFactors(sce.richard), Stimulus=sce.richard$stimulus, Time=sce.richard$time ) ggplot(to.plot, aes(x=DeconvFactor, y=SpikeFactor, color=Time)) + geom_point() + facet_wrap(~Stimulus) + scale_x_log10() + scale_y_log10() + geom_abline(intercept=0, slope=1, color="red")
The differences between these two sets of size factors have real consequences for downstream interpretation. If the spike-in size factors were applied to the counts, the expression values in unstimulated cells would be scaled up while expression in stimulated cells would be scaled down. However, the opposite would occur if the deconvolution size factors were used. This can manifest as shifts in the magnitude and direction of DE between conditions when we switch between normalization strategies, as shown below for Malat1 (Figure 7.4).
# See below for explanation of logNormCounts(). sce.richard.deconv <- logNormCounts(sce.richard, size_factors=to.plot$DeconvFactor) sce.richard.spike <- logNormCounts(sce.richard, size_factors=to.plot$SpikeFactor) gridExtra::grid.arrange( plotExpression(sce.richard.deconv, x="stimulus", colour_by="time", features="ENSMUSG00000092341") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("After deconvolution"), plotExpression(sce.richard.spike, x="stimulus", colour_by="time", features="ENSMUSG00000092341") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("After spike-in normalization"), ncol=2 )
Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In most cases, changes in total RNA content are not interesting and can be normalized out by applying the library size or deconvolution factors. However, this may not always be appropriate if differences in total RNA are associated with a biological process of interest, e.g., cell cycle activity or T cell activation. Spike-in normalization will preserve these differences such that any changes in expression between biological groups have the correct sign.
Regardless of whether we care about total RNA content, it is critical that the spike-in transcripts are normalized using the spike-in size factors.
Size factors computed from the counts for endogenous genes should not be applied to the spike-in transcripts, precisely because the former captures differences in total RNA content that are not experienced by the latter.
Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification.
Thus, if normalized spike-in data is required, we must compute a separate set of size factors for the spike-in transcripts; this is automatically performed by functions such as
7.5 Applying the size factors
7.5.1 Scaling and log-transforming
Once we have computed the size factors, we use the
logNormCounts() function from scater to compute normalized expression values for each cell.
This is done by dividing the count for each gene/spike-in transcript with the appropriate size factor for that cell.
The function also log-transforms the normalized values, creating a new assay called
(Technically, these are “log-transformed normalized expression values”, but that’s too much of a mouthful to fit into the assay name.)
These log-values will be the basis of our downstream analyses in the following chapters.
##  "counts" "logcounts"
The log-transformation is useful as differences in the log-values represent log-fold changes in expression. This is important in downstream procedures based on Euclidean distances, which includes many forms of clustering and dimensionality reduction. By operating on log-transformed data, we ensure that these procedures are measuring distances between cells based on log-fold changes in expression. Or in other words, which is more interesting - a gene that is expressed at an average count of 50 in cell type \(A\) and 10 in cell type \(B\), or a gene that is expressed at an average count of 1100 in \(A\) and 1000 in \(B\)? Log-transformation focuses on the former by promoting contributions from genes with strong relative differences.
When log-transforming, we typically add a pseudo-count to avoid undefined values at zero. Larger pseudo-counts will effectively shrink the log-fold changes between cells towards zero for low-abundance genes, meaning that downstream high-dimensional analyses will be driven more by differences in expression for high-abundance genes. Conversely, smaller pseudo-counts will increase the relative contribution of low-abundance genes. Common practice is to use a pseudo-count of 1, for the simple pragmatic reason that it preserves sparsity in the original matrix (i.e., zeroes in the input remain zeroes after transformation). This works well in all but the most pathological scenarios (A. Lun 2018).
Incidentally, the addition of the pseudo-count is the motivation for the centering of the size factors at unity. This ensures that both the pseudo-count and the normalized expression values are on the same scale; a pseudo-count of 1 can be interpreted as an extra read or UMI for each gene. In practical terms, centering means that the shrinkage effect of the pseudo-count diminishes as sequencing depth improves. This correctly ensures that estimates of the log-fold change in expression (e.g., from differences in the log-values between groups of cells) become increasingly accurate with deeper coverage. In contrast, if we applied a constant pseudo-count to some count-per-million-like measure, accuracy of the subsequent log-fold changes would never improve regardless of how much additional sequencing we performed.
7.5.2 Downsampling and log-transforming
In rare cases, direct scaling of the counts is not appropriate due to the effect described by A. Lun (2018). Briefly, this is caused by the fact that the mean of the log-normalized counts is not the same as the log-transformed mean of the normalized counts. The difference between them depends on the mean and variance of the original counts, such that there is a systematic trend in the mean of the log-counts with respect to the count size. This typically manifests as trajectories correlated strongly with library size even after library size normalization, as shown in Figure 7.5 for synthetic scRNA-seq data generated with a pool-and-split approach (Tian et al. 2019).
# TODO: move to scRNAseq. library(BiocFileCache) bfc <- BiocFileCache(ask=FALSE) qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true") env <- new.env() load(qcdata, envir=env) sce.8qc <- env$sce8_qc # Library size normalization and log-transformation. sce.8qc <- logNormCounts(sce.8qc) sce.8qc <- runPCA(sce.8qc) gridExtra::grid.arrange( plotPCA(sce.8qc, colour_by=I(factor(sce.8qc$mix))), plotPCA(sce.8qc, colour_by=I(librarySizeFactors(sce.8qc))), ncol=2 )
As the problem arises from differences in the sizes of the counts, the most straightforward solution is to downsample the counts of the high-coverage cells to match those of low-coverage cells. This uses the size factors to determine the amount of downsampling for each cell required to reach the 1st percentile of size factors. (The small minority of cells with smaller size factors are simply scaled up. We do not attempt to downsample to the smallest size factor, as this would result in excessive loss of information for one aberrant cell with very low size factors.) We can see that this eliminates the library size factor-associated trajectories from the first two PCs, improving resolution of the known differences based on mixing ratios (Figure 7.6). The log-transformation is still necessary but no longer introduces a shift in the means when the sizes of the counts are similar across cells.
While downsampling is an expedient solution, it is statistically inefficient as it needs to increase the noise of high-coverage cells in order to avoid differences with low-coverage cells. It is also slower than simple scaling. Thus, we would only recommend using this approach after an initial analysis with scaled counts reveals suspicious trajectories that are strongly correlated with the size factors. In such cases, it is a simple matter to re-normalize by downsampling to determine whether the trajectory is an artifact of the log-transformation.
7.5.3 Other options
Of course, log-transformation is not the only possible transformation. More sophisticated approaches can be used such as dedicated variance stabilizing transformations (e.g., from the DESeq2 or sctransform packages), which out-perform the log-transformation for removal of the mean-variance trend. In practice, though, the log-transformation is a good default choice due to its simplicity (a.k.a., reliability, predictability and computational efficiency) and interpretability.
R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C  LC_TIME=en_US.UTF-8 LC_COLLATE=C  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  parallel stats4 stats graphics grDevices utils datasets  methods base other attached packages:  BiocFileCache_1.14.0 dbplyr_2.0.0  ensembldb_2.14.0 AnnotationFilter_1.14.0  GenomicFeatures_1.42.1 AnnotationDbi_1.52.0  scRNAseq_2.4.0 scran_1.18.1  scater_1.18.3 ggplot2_3.3.2  SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0  Biobase_2.50.0 GenomicRanges_1.42.0  GenomeInfoDb_1.26.1 IRanges_2.24.0  S4Vectors_0.28.0 BiocGenerics_0.36.0  MatrixGenerics_1.2.0 matrixStats_0.57.0  BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached):  ggbeeswarm_0.6.0 colorspace_2.0-0  ellipsis_0.3.1 scuttle_1.0.3  bluster_1.0.0 XVector_0.30.0  BiocNeighbors_1.8.2 farver_2.0.3  bit64_4.0.5 interactiveDisplayBase_1.28.0  xml2_1.3.2 codetools_0.2-18  sparseMatrixStats_1.2.0 knitr_1.30  Rsamtools_2.6.0 graph_1.68.0  shiny_1.5.0 BiocManager_1.30.10  compiler_4.0.3 httr_1.4.2  dqrng_0.2.1 lazyeval_0.2.2  assertthat_0.2.1 Matrix_1.2-18  fastmap_1.0.1 limma_3.46.0  later_22.214.171.124 BiocSingular_1.6.0  prettyunits_1.1.1 htmltools_0.5.0  tools_4.0.3 rsvd_1.0.3  igraph_1.2.6 gtable_0.3.0  glue_1.4.2 GenomeInfoDbData_1.2.4  dplyr_1.0.2 rappdirs_0.3.1  Rcpp_1.0.5 Biostrings_2.58.0  vctrs_0.3.5 rtracklayer_1.50.0  ExperimentHub_1.16.0 DelayedMatrixStats_1.12.1  xfun_0.19 stringr_1.4.0  ps_1.5.0 beachmat_2.6.2  mime_0.9 lifecycle_0.2.0  irlba_2.3.3 statmod_1.4.35  XML_3.99-0.5 AnnotationHub_2.22.0  edgeR_3.32.0 zlibbioc_1.36.0  scales_1.1.1 ProtGenerics_1.22.0  hms_0.5.3 promises_1.1.1  yaml_2.2.1 curl_4.3  memoise_1.1.0 gridExtra_2.3  biomaRt_2.46.0 stringi_1.5.3  RSQLite_2.2.1 BiocVersion_3.12.0  highr_0.8 BiocParallel_1.24.1  rlang_0.4.9 pkgconfig_2.0.3  bitops_1.0-6 evaluate_0.14  lattice_0.20-41 purrr_0.3.4  labeling_0.4.2 GenomicAlignments_1.26.0  CodeDepends_0.6.5 cowplot_1.1.0  bit_4.0.4 processx_3.4.5  tidyselect_1.1.0 magrittr_2.0.1  bookdown_0.21 R6_2.5.0  generics_0.1.0 DelayedArray_0.16.0  DBI_1.1.0 pillar_1.4.7  withr_2.3.0 RCurl_1.98-1.2  tibble_3.0.4 crayon_1.3.4  rmarkdown_2.5 progress_1.2.2  viridis_0.5.1 locfit_1.5-9.4  grid_4.0.3 blob_1.2.1  callr_3.5.1 digest_0.6.27  xtable_1.8-4 httpuv_1.5.4  openssl_1.4.3 munsell_0.5.0  beeswarm_0.2.3 viridisLite_0.3.0  vipor_0.4.5 askpass_1.1
Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10): R106.
Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.
Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April): 75.
Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11): 1795–1806.
Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8): 849–58.
Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3): R25.
Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nat. Rev. Genet. 16 (3): 133–45.
Tian, L., X. Dong, S. Freytag, K. A. Le Cao, S. Su, A. JalalAbadi, D. Amann-Zalcenstein, et al. 2019. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nat. Methods 16 (6): 479–87.
Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.