# Chapter 2 Normalization

## 2.1 Motivation

Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data (Stegle, Teichmann, and Marioni 2015). They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.

We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. To demonstrate, we will use the Zeisel et al. (2015) dataset from the scRNAseq package.

#--- loading ---#
library(scRNAseq)
sce.zeisel <- ZeiselBrainData()

library(scater)
sce.zeisel <- aggregateAcrossFeatures(sce.zeisel,
id=sub("_loc[0-9]+$", "", rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db,
keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL")

#--- quality-control ---#
stats <- perCellQCMetrics(sce.zeisel, subsets=list(
Mt=rowData(sce.zeisel)$featureType=="mito")) qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", "subsets_Mt_percent")) sce.zeisel <- sce.zeisel[,!qc$discard]
sce.zeisel 
## class: SingleCellExperiment
## dim: 19839 2816
## assays(1): counts
## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty
## rowData names(2): featureType Ensembl
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(2): ERCC repeat

## 2.2 Library size normalization

Library size normalization is the simplest strategy for performing scaling normalization. We define the library size as the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases. The “library size factor” for each cell is then directly proportional to its library size where the proportionality constant is defined such that the mean size factor across all cells is equal to 1. This definition ensures that the normalized expression values are on the same scale as the original counts, which is useful for interpretation - especially when dealing with transformed data (see Section 2.5).

library(scater)
lib.sf.zeisel <- librarySizeFactors(sce.zeisel)
summary(lib.sf.zeisel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.176   0.568   0.868   1.000   1.278   4.084

In the Zeisel brain data, the library size factors differ by up to 10-fold across cells (Figure 2.1). This is typical of the variability in coverage in scRNA-seq data.

hist(log10(lib.sf.zeisel), xlab="Log10[Size factor]", col='grey80')

Strictly speaking, the use of library size factors assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells. That is, any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes. This ensures that the library size is an unbiased estimate of the relative cell-specific bias by avoiding composition effects (Robinson and Oshlack 2010). However, balanced DE is not generally present in scRNA-seq applications, which means that library size normalization may not yield accurate normalized expression values for downstream analyses.

In practice, normalization accuracy is not a major consideration for exploratory scRNA-seq data analyses. Composition biases do not usually affect the separation of clusters, only the magnitude - and to a lesser extent, direction - of the log-fold changes between clusters or cell types. As such, library size normalization is usually sufficient in many applications where the aim is to identify clusters and the top markers that define each cluster.

## 2.3 Normalization by deconvolution

As previously mentioned, composition biases will be present when any unbalanced differential expression exists between samples. Consider the simple example of two cells where a single gene $$X$$ is upregulated in one cell $$A$$ compared to the other cell $$B$$. This upregulation means that either (i) more sequencing resources are devoted to $$X$$ in $$A$$, thus decreasing coverage of all other non-DE genes when the total library size of each cell is experimentally fixed (e.g., due to library quantification); or (ii) the library size of $$A$$ increases when $$X$$ is assigned more reads or UMIs, increasing the library size factor and yielding smaller normalized expression values for all non-DE genes. In both cases, the net effect is that non-DE genes in $$A$$ will incorrectly appear to be downregulated compared to $$B$$.

The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. Normalization can be performed with the estimateSizeFactorsFromMatrix() function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014) or with the calcNormFactors() function (Robinson and Oshlack 2010) in the edgeR package. These assume that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal.

However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile. This is performed using the calculateSumFactors() function from scran, as shown below.

library(scran)
set.seed(100)
clust.zeisel <- quickCluster(sce.zeisel)
table(clust.zeisel)
## clust.zeisel
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14
## 170 254 441 178 393 148 219 240 189 123 112 103 135 111
deconv.sf.zeisel <- calculateSumFactors(sce.zeisel, cluster=clust.zeisel)
summary(deconv.sf.zeisel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.119   0.486   0.831   1.000   1.321   4.509

We use a pre-clustering step with quickCluster() where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations. By default, quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package. The approximation relies on stochastic initialization so we need to set the random seed (via set.seed()) for reproducibility.

We see that the deconvolution size factors exhibit cell type-specific deviations from the library size factors in Figure 2.2. This is consistent with the presence of composition biases that are introduced by strong differential expression between cell types. Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications.

plot(lib.sf.zeisel, deconv.sf.zeisel, xlab="Library size factor",
ylab="Deconvolution size factor", log='xy', pch=16,
col=as.integer(factor(sce.zeisel$level1class))) abline(a=0, b=1, col="red") Accurate normalization is most important for procedures that involve estimation and interpretation of per-gene statistics. For example, composition biases can compromise DE analyses by systematically shifting the log-fold changes in one direction or another. However, it tends to provide less benefit over simple library size normalization for cell-based analyses such as clustering. The presence of composition biases already implies strong differences in expression profiles, so changing the normalization strategy is unlikely to affect the outcome of a clustering procedure. ## 2.4 Normalization by spike-ins Spike-in normalization is based on the assumption that the same amount of spike-in RNA was added to each cell (Lun et al. 2017). Systematic differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. To remove these biases, we equalize spike-in coverage across cells by scaling with “spike-in size factors”. Compared to the previous methods, spike-in normalization requires no assumption about the biology of the system (i.e., the absence of many DE genes). Instead, it assumes that the spike-in transcripts were (i) added at a constant level to each cell, and (ii) respond to biases in the same relative manner as endogenous genes. Practically, spike-in normalization should be used if differences in the total RNA content of individual cells are of interest and must be preserved in downstream analyses. For a given cell, an increase in its overall amount of endogenous RNA will not increase its spike-in size factor. This ensures that the effects of total RNA content on expression across the population will not be removed upon scaling. By comparison, the other normalization methods described above will simply interpret any change in total RNA content as part of the bias and remove it. We demonstrate the use of spike-in normalization on a different dataset involving T cell activation after stimulation with T cell recepter ligands of varying affinity (Richard et al. 2018). library(scRNAseq) sce.richard <- RichardTCellData() sce.richard <- sce.richard[,sce.richard$single cell quality=="OK"]
sce.richard
## class: SingleCellExperiment
## dim: 46603 528
## assays(1): counts
## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(0):
## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ...
##   SLX-12612.i712_i522. SLX-12612.i714_i522.
## colData names(13): age individual ... stimulus time
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC

We apply the computeSpikeFactors() method to estimate spike-in size factors for all cells. This is defined by converting the total spike-in count per cell into a size factor, using the same reasoning as in librarySizeFactors(). Scaling will subsequently remove any differences in spike-in coverage across cells.

sce.richard <- computeSpikeFactors(sce.richard, "ERCC")
summary(sizeFactors(sce.richard))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.125   0.428   0.627   1.000   1.070  23.316

We observe a positive correlation between the spike-in size factors and deconvolution size factors within each treatment condition (Figure 2.3), indicating that they are capturing similar technical biases in sequencing depth and capture efficiency. However, we also observe that increasing stimulation of the T cell receptor - in terms of increasing affinity or time - results in a decrease in the spike-in factors relative to the library size factors. This is consistent with an increase in biosynthetic activity and total RNA content during stimulation, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the library size factors).

to.plot <- data.frame(
DeconvFactor=calculateSumFactors(sce.richard),
SpikeFactor=sizeFactors(sce.richard),
Stimulus=sce.richard$stimulus, Time=sce.richard$time
)

ggplot(to.plot, aes(x=DeconvFactor, y=SpikeFactor, color=Time)) +
geom_point() + facet_wrap(~Stimulus) + scale_x_log10() +
scale_y_log10() + geom_abline(intercept=0, slope=1, color="red")

The differences between these two sets of size factors have real consequences for downstream interpretation. If the spike-in size factors were applied to the counts, the expression values in unstimulated cells would be scaled up while expression in stimulated cells would be scaled down. However, the opposite would occur if the deconvolution size factors were used. This can manifest as shifts in the magnitude and direction of DE between conditions when we switch between normalization strategies, as shown below for Malat1 (Figure 2.4).

# See below for explanation of logNormCounts().
sce.richard.deconv <- logNormCounts(sce.richard, size_factors=to.plot$DeconvFactor) sce.richard.spike <- logNormCounts(sce.richard, size_factors=to.plot$SpikeFactor)

gridExtra::grid.arrange(
plotExpression(sce.richard.deconv, x="stimulus",
colour_by="time", features="ENSMUSG00000092341") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("After deconvolution"),
plotExpression(sce.richard.spike, x="stimulus",
colour_by="time", features="ENSMUSG00000092341") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("After spike-in normalization"),
ncol=2
)

Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In most cases, changes in total RNA content are not interesting and can be normalized out by applying the library size or deconvolution factors. However, this may not always be appropriate if differences in total RNA are associated with a biological process of interest, e.g., cell cycle activity or T cell activation. Spike-in normalization will preserve these differences such that any changes in expression between biological groups have the correct sign.

However! Regardless of whether we care about total RNA content, it is critical that the spike-in transcripts are normalized using the spike-in size factors. Size factors computed from the counts for endogenous genes should not be applied to the spike-in transcripts, precisely because the former captures differences in total RNA content that are not experienced by the latter. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification. Thus, if normalized spike-in data is required, we must compute a separate set of size factors for the spike-in transcripts; this is automatically performed by functions such as modelGeneVarWithSpikes().

## 2.5 Scaling and log-transforming

Once we have computed the size factors, we use the logNormCounts() function from scater to compute normalized expression values for each cell. This is done by dividing the count for each gene/spike-in transcript with the appropriate size factor for that cell. The function also log-transforms the normalized values, creating a new assay called "logcounts". (Technically, these are “log-transformed normalized expression values”, but that’s too much of a mouthful to fit into the assay name.) These log-values will be the basis of our downstream analyses in the following chapters.

set.seed(100)
clust.zeisel <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clust.zeisel, min.mean=0.1)
sce.zeisel <- logNormCounts(sce.zeisel)
assayNames(sce.zeisel)
## [1] "counts"    "logcounts"

The log-transformation is useful as differences in the log-values represent log-fold changes in expression. This is important in downstream procedures based on Euclidean distances, which includes many forms of clustering and dimensionality reduction. By operating on log-transformed data, we ensure that these procedures are measuring distances between cells based on log-fold changes in expression. Or in other words, which is more interesting - a gene that is expressed at an average count of 50 in cell type $$A$$ and 10 in cell type $$B$$, or a gene that is expressed at an average count of 1100 in $$A$$ and 1000 in $$B$$? Log-transformation focuses on the former by promoting contributions from genes with strong relative differences.

## Session Info

R Under development (unstable) (2021-11-10 r81171)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB              LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ensembldb_2.19.2            AnnotationFilter_1.19.0
[3] GenomicFeatures_1.47.2      AnnotationDbi_1.57.1
[5] scRNAseq_2.9.1              scran_1.23.0
[7] scater_1.23.1               ggplot2_3.3.5
[9] scuttle_1.5.0               SingleCellExperiment_1.17.1
[11] SummarizedExperiment_1.25.2 Biobase_2.55.0
[13] GenomicRanges_1.47.3        GenomeInfoDb_1.31.1
[15] IRanges_2.29.0              S4Vectors_0.33.2
[17] BiocGenerics_0.41.1         MatrixGenerics_1.7.0
[19] matrixStats_0.61.0          BiocStyle_2.23.0
[21] rebook_1.5.0

loaded via a namespace (and not attached):
[1] AnnotationHub_3.3.3           BiocFileCache_2.3.0
[3] igraph_1.2.8                  lazyeval_0.2.2
[5] BiocParallel_1.29.1           digest_0.6.28
[7] htmltools_0.5.2               viridis_0.6.2
[9] fansi_0.5.0                   magrittr_2.0.1
[11] memoise_2.0.0                 ScaledMatrix_1.3.0
[13] cluster_2.1.2                 limma_3.51.0
[15] Biostrings_2.63.0             prettyunits_1.1.1
[17] colorspace_2.0-2              blob_1.2.2
[19] rappdirs_0.3.3                ggrepel_0.9.1
[21] xfun_0.28                     dplyr_1.0.7
[23] crayon_1.4.2                  RCurl_1.98-1.5
[25] jsonlite_1.7.2                graph_1.73.0
[27] glue_1.5.0                    gtable_0.3.0
[29] zlibbioc_1.41.0               XVector_0.35.0
[31] DelayedArray_0.21.1           BiocSingular_1.11.0
[33] scales_1.1.1                  DBI_1.1.1
[35] edgeR_3.37.0                  Rcpp_1.0.7
[37] viridisLite_0.4.0             xtable_1.8-4
[39] progress_1.2.2                dqrng_0.3.0
[41] bit_4.0.4                     rsvd_1.0.5
[43] metapod_1.3.0                 httr_1.4.2
[45] dir.expiry_1.3.0              ellipsis_0.3.2
[47] farver_2.1.0                  pkgconfig_2.0.3
[49] XML_3.99-0.8                  CodeDepends_0.6.5
[51] sass_0.4.0                    dbplyr_2.1.1
[53] locfit_1.5-9.4                utf8_1.2.2
[55] labeling_0.4.2                tidyselect_1.1.1
[57] rlang_0.4.12                  later_1.3.0
[59] munsell_0.5.0                 BiocVersion_3.15.0
[61] tools_4.2.0                   cachem_1.0.6
[63] generics_0.1.1                RSQLite_2.2.8
[65] ExperimentHub_2.3.1           evaluate_0.14
[67] stringr_1.4.0                 fastmap_1.1.0
[69] yaml_2.2.1                    knitr_1.36
[71] bit64_4.0.5                   purrr_0.3.4
[73] KEGGREST_1.35.0               sparseMatrixStats_1.7.0
[75] mime_0.12                     xml2_1.3.2
[77] biomaRt_2.51.0                compiler_4.2.0
[79] beeswarm_0.4.0                filelock_1.0.2
[81] curl_4.3.2                    png_0.1-7
[83] interactiveDisplayBase_1.33.0 tibble_3.1.6
[85] statmod_1.4.36                bslib_0.3.1
[87] stringi_1.7.5                 highr_0.9
[89] lattice_0.20-45               bluster_1.5.0
[91] ProtGenerics_1.27.0           Matrix_1.3-4
[93] vctrs_0.3.8                   pillar_1.6.4
[95] lifecycle_1.0.1               BiocManager_1.30.16
[97] jquerylib_0.1.4               BiocNeighbors_1.13.0
[99] cowplot_1.1.1                 bitops_1.0-7
[101] irlba_2.3.3                   httpuv_1.6.3
[103] rtracklayer_1.55.0            R6_2.5.1
[105] BiocIO_1.5.0                  bookdown_0.24
[107] promises_1.2.0.1              gridExtra_2.3
[109] vipor_0.4.5                   codetools_0.2-18
[111] assertthat_0.2.1              rjson_0.2.20
[113] withr_2.4.2                   GenomicAlignments_1.31.2
[115] Rsamtools_2.11.0              GenomeInfoDbData_1.2.7
[117] parallel_4.2.0                hms_1.1.1
[119] grid_4.2.0                    beachmat_2.11.0
[121] rmarkdown_2.11                DelayedMatrixStats_1.17.0
[123] shiny_1.7.1                   ggbeeswarm_0.6.0
[125] restfulr_0.0.13              

### References

Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10): R106.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April): 75.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11): 1795–1806.

Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8): 849–58.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3): R25.

Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nat. Rev. Genet. 16 (3): 133–45.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.