Chapter 4 Dimensionality reduction, redux

4.1 Overview

Basic Chapter 4 introduced the key concepts for dimensionality reduction of scRNA-seq data. Here, we describe some data-driven strategies for picking an appropriate number of top PCs for downstream analyses. We also demonstrate some other dimensionality reduction strategies that operate on the raw counts. For the most part, we will again be using the Zeisel et al. (2015) dataset.

4.2 More choices for the number of PCs

4.2.1 Using the elbow point

A simple heuristic for choosing a suitable number of PCs $$d$$ involves identifying the elbow point in the percentage of variance explained by successive PCs. This refers to the “elbow” in the curve of a scree plot as shown in Figure 4.1.

##  7

Figure 4.1: Percentage of variance explained by successive PCs in the Zeisel brain data. The identified elbow point is marked with a red line.

Our assumption is that each of the top PCs capturing biological signal should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained when we move past the last “biological” PC. This manifests as an elbow in the scree plot, the location of which serves as a natural choice for $$d$$. Once this is identified, we can subset the reducedDims() entry to only retain the first $$d$$ PCs of interest.

##  "PCA"       "PCA.elbow"
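To make the elbow heuristic concrete, the following is a minimal sketch in plain Python of one common geometric definition: the elbow is the point on the scree curve that lies furthest from the straight line joining the first and last points. The function name `find_elbow` and the example percentages are hypothetical, and this is not necessarily the rule used by any particular package.

```python
import math

def find_elbow(variance_explained):
    """Return the 1-based index of the PC lying furthest from the chord
    that joins the first and last points of the scree curve."""
    n = len(variance_explained)
    if n < 3:
        return n
    x0, y0 = 1.0, float(variance_explained[0])
    x1, y1 = float(n), float(variance_explained[-1])
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    best_idx, best_dist = 1, -1.0
    for i, y in enumerate(variance_explained, start=1):
        # Perpendicular distance from the point (i, y) to the chord.
        dist = abs(dy * (i - x0) - dx * (y - y0)) / norm
        if dist > best_dist:
            best_idx, best_dist = i, dist
    return best_idx

# Hypothetical percentages of variance explained by the first 10 PCs.
pct_var = [40.0, 20.0, 10.0, 3.0, 2.5, 2.0, 1.8, 1.6, 1.5, 1.4]
elbow = find_elbow(pct_var)  # 4

# Retaining the first `elbow` coordinates of each cell (rows are cells).
pcs = [[0.1 * (i + j) for j in range(10)] for i in range(3)]
pcs_elbow = [row[:elbow] for row in pcs]
```

Note that the distance depends on the relative scaling of the two axes, so this definition is itself a heuristic; whether the elbow PC is kept or dropped is also a matter of convention.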

From a practical perspective, the use of the elbow point tends to retain fewer PCs compared to other methods. The definition of “much more variance” is relative so, in order to be retained, later PCs must explain an amount of variance that is comparable to that explained by the first few PCs. Strong biological variation in the early PCs will shift the elbow to the left, potentially excluding weaker (but still interesting) variation in the next PCs immediately following the elbow.

4.2.2 Using the technical noise

Another strategy is to retain all PCs until the percentage of total variation explained reaches some threshold $$T$$. For example, we might retain the top set of PCs that explains 80% of the total variation in the data. Of course, it would be pointless to swap one arbitrary parameter $$d$$ for another $$T$$. Instead, we derive a suitable value for $$T$$ by calculating the proportion of variance in the data that is attributed to the biological component. This is done using the denoisePCA() function with the variance modelling results from modelGeneVarWithSpikes() or related functions, where $$T$$ is defined as the ratio of the sum of the biological components to the sum of total variances. To illustrate, we use this strategy to pick the number of PCs in the 10X PBMC dataset.

##  9

The dimensionality of the output represents the lower bound on the number of PCs required to retain all biological variation. This choice of $$d$$ is motivated by the fact that any fewer PCs will definitely discard some aspect of biological signal. (Of course, the converse is not true; there is no guarantee that the retained PCs capture all of the signal, which is generally only possible if no dimensionality reduction is performed at all.) From a practical perspective, the denoisePCA() approach usually retains more PCs than the elbow point method as the former does not compare PCs to each other and is less likely to discard PCs corresponding to secondary factors of variation. The downside is that many minor aspects of variation may not be interesting (e.g., transcriptional bursting) and their retention would only add irrelevant noise.

Note that denoisePCA() imposes internal caps on the number of PCs that can be chosen in this manner. By default, the number is bounded within the “reasonable” limits of 5 and 50 to avoid selection of too few PCs (when technical noise is high relative to biological variation) or too many PCs (when technical noise is very low). For example, applying this function to the Zeisel brain data hits the upper limit:

##  50
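The thresholding logic described in this section can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the implementation of denoisePCA(): the function name `choose_num_pcs`, its arguments and the toy numbers are all hypothetical. It computes $$T$$ as the ratio of summed biological components to summed total variances, finds the smallest number of PCs whose cumulative proportion of variance reaches $$T$$, and then clamps the result to the limits of 5 and 50.

```python
def choose_num_pcs(prop_var, total_var, bio_var, min_rank=5, max_rank=50):
    """Pick the smallest d whose cumulative variance explained reaches T,
    where T = sum(biological components) / sum(total variances).

    prop_var:  proportion of total variance explained by each PC, in order.
    total_var: per-gene total variances from the variance model.
    bio_var:   per-gene biological components from the variance model.
    """
    threshold = sum(bio_var) / sum(total_var)
    d = len(prop_var)  # fall back to all PCs if T is never reached
    cumulative = 0.0
    for i, p in enumerate(prop_var, start=1):
        cumulative += p
        if cumulative >= threshold:
            d = i
            break
    # Clamp to the "reasonable" limits described in the text.
    return max(min_rank, min(max_rank, d)), threshold

# Toy inputs: four genes and seven PCs.
total_var = [2.0, 1.5, 1.0, 0.5]
bio_var = [1.2, 0.9, 0.3, 0.1]
prop_var = [0.30, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05]

d_uncapped, T = choose_num_pcs(prop_var, total_var, bio_var, min_rank=1)
d_default, _ = choose_num_pcs(prop_var, total_var, bio_var)
```

Here $$T$$ is 0.5 and three PCs are enough to reach it, but the default lower limit raises the choice to 5. This mirrors how the caps can override the data-driven value.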

This method also tends to perform best when the mean-variance trend reflects the actual technical noise, i.e., estimated by modelGeneVarByPoisson() or modelGeneVarWithSpikes() instead of modelGeneVar() (Basic Section 3.3). Variance modelling results from modelGeneVar() tend to understate the actual biological variation, especially in highly heterogeneous datasets where secondary factors of variation inflate the fitted values of the trend. Fewer PCs are subsequently retained because $$T$$ is artificially lowered, as evidenced by denoisePCA() returning the lower limit of 5 PCs for the PBMC dataset:

##  5

4.2.3 Based on population structure

Yet another method to choose $$d$$ uses information about the number of subpopulations in the data. Consider a situation where each subpopulation differs from the others along a different axis in the high-dimensional space (e.g., because it is defined by a unique set of marker genes). This suggests that we should set $$d$$ to the number of unique subpopulations minus 1, which guarantees separation of all subpopulations while retaining as few dimensions (and noise) as possible. We can use this reasoning to loosely motivate an a priori choice for $$d$$ - for example, if we expect around 10 different cell types in our population, we would set $$d \approx 10$$.

In practice, the number of subpopulations is usually not known in advance. Rather, we take a heuristic approach that uses the number of clusters as a proxy for the number of subpopulations. We perform clustering (graph-based by default, see Basic Section 5.2) on the first $$d^*$$ PCs and only consider the values of $$d^*$$ that yield no more than $$d^*+1$$ clusters. If we detect more clusters with fewer dimensions, we consider this to represent overclustering rather than distinct subpopulations, assuming that multiple subpopulations should not be distinguishable on the same axes. We test a range of $$d^*$$ and set $$d$$ to the value that maximizes the number of clusters while satisfying the above condition. This attempts to capture as many distinct (putative) subpopulations as possible by retaining biological signal in later PCs, up until the point that the additional noise reduces resolution.

Figure 4.2: Number of clusters detected in the Zeisel brain dataset as a function of the number of PCs. The red unbroken line represents the theoretical upper constraint on the number of clusters, while the grey dashed line is the number of PCs suggested by getClusteredPCs().

We subset the PC matrix by column to retain the first $$d$$ PCs and assign the subsetted matrix back into our SingleCellExperiment object. Downstream applications that use the "PCA.clust" results in sce.zeisel will subsequently operate on the chosen PCs only.
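The selection rule in this section can be written down compactly once the clustering has been run for each candidate $$d^*$$. The sketch below, in plain Python, takes a hypothetical mapping from $$d^*$$ to the number of clusters detected and applies the constraint of at most $$d^*+1$$ clusters. The function name `choose_by_clusters`, the example cluster counts and the tie-breaking rule (smallest $$d^*$$ wins) are assumptions made for illustration.

```python
def choose_by_clusters(n_clusters_by_d):
    """Given {d*: number of clusters found using the first d* PCs},
    return the d* that maximizes the number of clusters subject to
    n_clusters <= d* + 1. Returns None if no candidate qualifies."""
    valid = {d: k for d, k in n_clusters_by_d.items() if k <= d + 1}
    if not valid:
        return None
    best_k = max(valid.values())
    # Assumed tie-break: prefer the fewest PCs that achieve the maximum.
    return min(d for d, k in valid.items() if k == best_k)

# Hypothetical clustering results over a range of d*.
results = {5: 9, 10: 12, 15: 14, 20: 16, 25: 15, 30: 13}
chosen = choose_by_clusters(results)  # 20
```

In this toy example, the candidates at 5 and 10 PCs are rejected as overclustered, and the number of clusters peaks at 20 PCs before declining as later PCs add noise.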

This strategy is pragmatic as it directly addresses the role of the bias-variance trade-off in downstream analyses, specifically clustering. There is no need to preserve biological signal beyond what is distinguishable in later steps. However, it involves strong assumptions about the nature of the biological differences between subpopulations - and indeed, discrete subpopulations may not even exist in studies of continuous processes like differentiation. It also requires repeated applications of the clustering procedure on an increasing number of PCs, which may be computationally expensive.

4.2.4 Using random matrix theory

We consider the observed (log-)expression matrix to be the sum of (i) a low-rank matrix containing the true biological signal for each cell and (ii) a random matrix representing the technical noise in the data. Under this interpretation, we can use random matrix theory to guide the choice of the number of PCs based on the properties of the noise matrix.

The Marchenko-Pastur (MP) distribution defines an upper bound on the singular values of a matrix with random i.i.d. entries. Thus, all PCs associated with larger singular values are likely to contain real biological structure - or at least, signal beyond that expected by noise - and should be retained (Shekhar et al. 2016). We can implement this scheme using the chooseMarchenkoPastur() function from the PCAtools package, given the dimensionality of the matrix used for the PCA (noting that we only used the HVG subset); the variance explained by each PC (not the percentage); and the variance of the noise matrix derived from our previous variance decomposition results.

##  144
## attr(,"limit")
##  2.336
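As a rough illustration of the thresholding step, the sketch below uses the textbook upper edge of the Marchenko-Pastur distribution for the eigenvalues of a sample covariance matrix, $$\sigma^2 (1 + \sqrt{p/n})^2$$, where $$p$$ is the number of genes, $$n$$ is the number of cells and $$\sigma^2$$ is the noise variance. The function names and toy values are hypothetical, and the exact conventions used inside chooseMarchenkoPastur() (for example, degrees of freedom) may differ.

```python
import math

def marchenko_pastur_limit(n_genes, n_cells, noise_var):
    """Upper edge of the MP distribution for the eigenvalues of the sample
    covariance matrix of pure i.i.d. noise with variance `noise_var`."""
    ratio = n_genes / n_cells
    return noise_var * (1.0 + math.sqrt(ratio)) ** 2

def choose_by_mp(var_explained, n_genes, n_cells, noise_var):
    """Count the PCs whose variance explained exceeds the MP limit."""
    limit = marchenko_pastur_limit(n_genes, n_cells, noise_var)
    return sum(v > limit for v in var_explained), limit

# Toy example: 100 genes, 400 cells, unit noise variance.
var_explained = [10.0, 5.0, 3.0, 2.3, 2.2, 1.0]
n_keep, limit = choose_by_mp(var_explained, 100, 400, 1.0)
```

With these inputs the limit is 2.25 and four PCs exceed it. Note that the comparison is made on the variance explained by each PC, not on the percentage.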

We can then subset the PC coordinate matrix by the first mp.choice columns as previously demonstrated. It is best to treat this as a guideline only; PCs below the MP limit are not necessarily uninteresting, especially in noisy datasets where the higher noise drives a more aggressive choice of $$d$$. Conversely, many PCs above the limit may not be relevant if they are driven by uninteresting biological processes like transcriptional bursting, cell cycle or metabolic variation. Moreover, the use of the MP distribution is not entirely justified here as the noise distribution differs by abundance for each gene and by sequencing depth for each cell.

In a similar vein, Horn’s parallel analysis is commonly used to pick the number of PCs to retain in factor analysis. This involves randomizing the input matrix, repeating the PCA and creating a scree plot of the PCs of the randomized matrix. The desired number of PCs is then chosen based on the intersection of the randomized scree plot with that of the original matrix (Figure 4.3). Here, the reasoning is that PCs are unlikely to be interesting if they explain less variance than that of the corresponding PC of a random matrix. Note that this differs from the MP approach as we are not using the upper bound of randomized singular values to threshold the original PCs.

##  26

Figure 4.3: Percentage of variance explained by each PC in the original matrix (black) and the PCs in the randomized matrix (grey) across several randomization iterations. The red line marks the chosen number of PCs.
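The comparison step of parallel analysis can be sketched as follows, assuming that the PCA has already been run on the original matrix and on several randomized copies. The function name `parallel_analysis`, the `threshold` argument and the numbers are hypothetical. The rule used here (stop at the first PC for which more than a given fraction of randomizations explain at least as much variance as the original) is one reasonable way to define the intersection and is not claimed to match parallelPCA() exactly.

```python
def parallel_analysis(original, permuted, threshold=0.1):
    """Count the leading PCs of the original matrix that explain more
    variance than the corresponding PCs of the randomized matrices.

    original: variance explained per PC for the original matrix.
    permuted: list of such vectors, one per randomization iteration.
    threshold: maximum tolerated fraction of iterations in which the
               randomized PC explains at least as much as the original.
    """
    n = 0
    for i, observed in enumerate(original):
        exceed = sum(1 for run in permuted if run[i] >= observed)
        if exceed / len(permuted) > threshold:
            break  # the curves have crossed; stop at the first failure
        n += 1
    return n

# Toy scree curves: one original and three randomizations.
original = [30.0, 15.0, 8.0, 4.0, 2.0, 1.5]
permuted = [
    [5.0, 4.8, 4.5, 4.2, 4.0, 3.8],
    [5.2, 4.9, 4.6, 4.3, 4.1, 3.9],
    [4.9, 4.7, 4.4, 4.1, 3.9, 3.7],
]
n_pcs = parallel_analysis(original, permuted)  # 3
```

The randomized curves are flatter and sit above the tail of the original curve, so the original falls below them from the fourth PC onwards and three PCs are retained.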

The parallelPCA() function helpfully emits the PC coordinates in horn$original$rotated, which we can subset by horn$n and add to the reducedDims() of our SingleCellExperiment. Parallel analysis is reasonably intuitive (as random matrix methods go) and avoids any i.i.d. assumption across genes. However, its obvious disadvantage is the not-insignificant computational cost of randomizing and repeating the PCA. One can also debate whether the scree plot of the randomized matrix is even comparable to that of the original, given that the former includes biological variation and thus cannot be interpreted as purely technical noise. This manifests in Figure 4.3 as a consistently higher curve for the randomized matrix due to the redistribution of biological variation to the later PCs.

Another approach is based on optimizing the reconstruction error of the low-rank representation (Gavish and Donoho 2014). Recall that PCA produces both the matrix of per-cell coordinates and a rotation matrix of per-gene loadings, the product of which recovers the original log-expression matrix. If we subset these two matrices to the first $$d$$ dimensions, the product of the resulting submatrices serves as an approximation of the original matrix. Under certain conditions, the difference between this approximation and the true low-rank signal (i.e., sans the noise matrix) has a defined minimum at a certain number of dimensions. This minimum can be defined using the chooseGavishDonoho() function from PCAtools as shown below.

##  59
## attr(,"limit")
##  3.121
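For reference, the hard threshold from Gavish and Donoho (2014) for a matrix with known noise level can be computed directly. For an $$m \times n$$ matrix with $$m \le n$$, aspect ratio $$\beta = m/n$$ and i.i.d. noise with standard deviation $$\sigma$$, singular values are retained if they exceed $$\lambda(\beta) \sqrt{n} \sigma$$, where $$\lambda(1) = 4/\sqrt{3}$$. The sketch below applies this to singular values; the function names and toy inputs are hypothetical, and chooseGavishDonoho() may parametrize the same quantity differently (for example, in terms of variance explained).

```python
import math

def gd_coefficient(beta):
    """Optimal hard-threshold coefficient lambda(beta) for known noise,
    where 0 < beta <= 1 is the aspect ratio of the matrix."""
    return math.sqrt(
        2.0 * (beta + 1.0)
        + 8.0 * beta / ((beta + 1.0) + math.sqrt(beta ** 2 + 14.0 * beta + 1.0))
    )

def gd_threshold(n_rows, n_cols, noise_sd):
    """Threshold on the singular values of an n_rows x n_cols matrix."""
    small, big = sorted((n_rows, n_cols))
    beta = small / big
    return gd_coefficient(beta) * math.sqrt(big) * noise_sd

# Toy example: a square 100 x 100 matrix with unit noise.
tau = gd_threshold(100, 100, 1.0)
singular_values = [60.0, 40.0, 25.0, 22.0, 10.0]
n_keep = sum(s > tau for s in singular_values)  # 3
```

For the square case the coefficient reduces to $$4/\sqrt{3} \approx 2.309$$, giving a threshold of about 23.1 here, so the first three singular values are kept.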

The Gavish-Donoho method is appealing as, unlike the other approaches for choosing $$d$$, the concept of the optimum is rigorously defined. By minimizing the reconstruction error, we can most accurately represent the true biological variation in terms of the distances between cells in PC space. However, there remains some room for difference between “optimal” and “useful”; for example, noisy datasets may find themselves with very low $$d$$ as including more PCs will only ever increase reconstruction error, regardless of whether they contain relevant biological variation. This approach is also dependent on some strong i.i.d. assumptions about the noise matrix.

4.3 Count-based dimensionality reduction

For count matrices, correspondence analysis (CA) is a natural approach to dimensionality reduction. In this procedure, we compute an expected value for each entry in the matrix based on the per-gene abundance and size factors. Each count is converted into a standardized residual in a manner analogous to the calculation of the statistic in Pearson’s chi-squared tests, i.e., subtraction of the expected value and division by its square root. An SVD is then applied on this matrix of residuals to obtain the necessary low-dimensional coordinates for each cell. To demonstrate, we use the corral package to compute CA factors for the Zeisel dataset.

##  2816   30

The major advantage of CA is that it avoids difficulties with the mean-variance relationship upon transformation (Figure 2.2). If two cells have the same expression profile but differences in their total counts, CA will return the same expected location for both cells; this avoids artifacts observed in PCA on log-transformed counts (Figure 4.4). However, CA is more sensitive to overdispersion in the random noise due to the nature of its standardization. This can be problematic in datasets where the CA factors are driven by a few genes with random expression rather than by the underlying biological structure.
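The standardization step and the depth-invariance property can be illustrated on a tiny count matrix. The sketch below follows classical correspondence analysis, where the expected value of each entry is the product of its row and column sums divided by the grand total; this is an assumption for illustration and is not necessarily how the corral package computes expectations. The function name `ca_residuals` and the toy counts are hypothetical, and the SVD step is omitted.

```python
import math

def ca_residuals(counts):
    """Standardized (Pearson) residuals for a genes x cells count matrix.
    Every gene and cell is assumed to have a non-zero total."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    residuals = []
    for i, row in enumerate(counts):
        out = []
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            out.append((observed - expected) / math.sqrt(expected))
        residuals.append(out)
    return residuals, col_sums

# Three genes (rows) by three cells (columns). The second cell has the
# same expression profile as the first but twice the total count.
counts = [
    [10, 20, 1],
    [5, 10, 8],
    [1, 2, 6],
]
residuals, col_sums = ca_residuals(counts)

# Dividing each cell's residuals by the square root of its total count
# gives a profile that no longer depends on sequencing depth.
profiles = [
    [residuals[i][j] / math.sqrt(col_sums[j]) for i in range(len(counts))]
    for j in range(len(col_sums))
]
```

In this example, `profiles[0]` and `profiles[1]` are identical up to floating-point error, which is why the two cells end up at the same location after the decomposition, while `profiles[2]` differs.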

## class: SingleCellExperiment
## dim: 15571 296
## assays(1): counts
## rownames(15571): ENSG00000245025 ENSG00000257433 ... ENSG00000233117
##   ENSG00000115687
## rowData names(0):
## colnames(296): L19 A10 ... P8 P9
## colData names(21): unaligned aligned_unmapped ... HCC827_prop
##   mRNA_amount
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Figure 4.4: Dimensionality reduction results of all pool-and-split libraries in the SORT-seq CellBench data, computed by a PCA on the log-normalized expression values (left) or using the corral package (right). Each point represents a library and is colored by the mixing ratio used to construct it.

4.4 More visualization methods

4.4.1 Fast interpolation-based $$t$$-SNE

Conventional $$t$$-SNE algorithms scale poorly with the number of cells. Fast interpolation-based $$t$$-SNE (FIt-SNE) (Linderman et al. 2019) is an alternative algorithm that reduces the computational complexity of the calculations from $$N\log N$$ to $$\sim 2 p N$$. This is achieved by using interpolation nodes in the high-dimensional space; the bulk of the calculations are performed on the nodes and the embedding of individual cells around each node is determined by interpolation. To use this method, we can simply set use_fitsne=TRUE when calling runTSNE() with scater - this calls the snifter package, which in turn wraps the Python library openTSNE using basilisk.

As Figure 4.5 shows, the embeddings produced by this method are qualitatively similar to those produced by other algorithms, supported by some theoretical results from Linderman et al. (2019) showing that any difference from conventional $$t$$-SNE implementations is low and bounded.

Figure 4.5: FIt-SNE and Barnes-Hut $$t$$-SNE embeddings for the Zeisel brain data.

By using snifter directly, we can also take advantage of openTSNE’s ability to project new points into an existing embedding. In this process, the existing points remain static while new points are inserted based on their affinities with each other and the points in the existing embedding. For example, cells are generally projected near to cells of a similar type in Figure 4.6. This may be useful as an exploratory step when combining datasets, though the projection may not be sensible for cell types that are not present in the existing embedding.

Figure 4.6: $$t$$-SNE embedding created with snifter, using 80% of the cells in the Zeisel brain data. The remaining 20% of the cells were projected into this pre-existing embedding.

4.4.2 Density-preserving $$t$$-SNE and UMAP

One downside of $$t$$-SNE and UMAP is that they preserve the neighbourhood structure of the data while neglecting the local density of the data. This can result in seemingly compact clusters on a $$t$$-SNE or UMAP plot that correspond to very heterogeneous groups in the original data. The dens-SNE and densMAP algorithms mitigate this effect by incorporating information about the average distance to the nearest neighbours when creating the embedding (Narayan 2021). We demonstrate below by applying these approaches on the PCs of the Zeisel dataset using the densvis wrapper package.
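To give a feel for what "local density" means here, the sketch below computes a deliberately simplified proxy: the mean Euclidean distance from each point to its $$k$$ nearest neighbours. This is only an illustration of the quantity described in the text. The function name `mean_knn_distance` and the toy coordinates are hypothetical, and the actual dens-SNE and densMAP algorithms use a more elaborate, affinity-weighted definition.

```python
import math

def mean_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbours,
    used here as a simple stand-in for the local radius of the data."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        radii.append(sum(dists[:k]) / k)
    return radii

# Two groups with identical neighbourhood structure but different spread.
tight = [(0, 0), (0, 1), (1, 0), (1, 1)]
loose = [(10, 10), (10, 14), (14, 10), (14, 14)]
radii = mean_knn_distance(tight + loose, k=2)
```

The second group has a local radius four times larger than the first, even though each point has the same nearest neighbours in relative terms. A density-preserving embedding aims to keep such differences visible.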

These methods provide more information about transcriptional heterogeneity within clusters (Figure 4.7), with the astrocyte cluster being less compact in the density-preserving versions. In contrast, the excessive compactness in the conventional embeddings can misleadingly imply a lower level of within-population heterogeneity.

Figure 4.7: $$t$$-SNE, UMAP, dens-SNE and densMAP embeddings for the Zeisel brain data.

Session Info

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so

locale:
 LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 LC_TIME=en_GB              LC_COLLATE=C
 LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 LC_PAPER=en_US.UTF-8       LC_NAME=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 stats4    stats     graphics  grDevices utils     datasets  methods
 base

other attached packages:
 densvis_1.4.0               snifter_1.4.0
 scater_1.22.0               BiocFileCache_2.2.0
 dbplyr_2.1.1                corral_1.4.0
 PCAtools_2.6.0              ggrepel_0.9.1
 ggplot2_3.3.5               scran_1.22.0
 scuttle_1.4.0               SingleCellExperiment_1.16.0
 SummarizedExperiment_1.24.0 Biobase_2.54.0
 GenomicRanges_1.46.0        GenomeInfoDb_1.30.0
 IRanges_2.28.0              S4Vectors_0.32.2
 BiocGenerics_0.40.0         MatrixGenerics_1.6.0
 matrixStats_0.61.0          BiocStyle_2.22.0
 rebook_1.4.0

loaded via a namespace (and not attached):
 plyr_1.8.6                  igraph_1.2.8
 BiocParallel_1.28.0         digest_0.6.28
 htmltools_0.5.2             viridis_0.6.2
 fansi_0.5.0                 magrittr_2.0.1
 RMTstat_0.3                 memoise_2.0.0
 ScaledMatrix_1.2.0          cluster_2.1.2
 limma_3.50.0                colorspace_2.0-2
 blob_1.2.2                  rappdirs_0.3.3
 xfun_0.28                   dplyr_1.0.7
 crayon_1.4.2                RCurl_1.98-1.5
 jsonlite_1.7.2              graph_1.72.0
 glue_1.5.0                  pals_1.7
 gtable_0.3.0                zlibbioc_1.40.0
 XVector_0.34.0              DelayedArray_0.20.0
 BiocSingular_1.10.0         maps_3.4.0
 scales_1.1.1                DBI_1.1.1
 edgeR_3.36.0                ggthemes_4.2.4
 Rcpp_1.0.7                  viridisLite_0.4.0
 reticulate_1.22             dqrng_0.3.0
 bit_4.0.4                   rsvd_1.0.5
 mapproj_1.2.7               metapod_1.2.0
 httr_1.4.2                  FNN_1.1.3
 RColorBrewer_1.1-2          dir.expiry_1.2.0
 ellipsis_0.3.2              pkgconfig_2.0.3
 XML_3.99-0.8                farver_2.1.0
 uwot_0.1.10                 CodeDepends_0.6.5
 sass_0.4.0                  here_1.0.1
 locfit_1.5-9.4              utf8_1.2.2
 tidyselect_1.1.1            labeling_0.4.2
 rlang_0.4.12                reshape2_1.4.4
 munsell_0.5.0               tools_4.1.2
 cachem_1.0.6                generics_0.1.1
 RSQLite_2.2.8               evaluate_0.14
 stringr_1.4.0               fastmap_1.1.0
 yaml_2.2.1                  transport_0.12-2
 knitr_1.36                  bit64_4.0.5
 purrr_0.3.4                 sparseMatrixStats_1.6.0
 compiler_4.1.2              beeswarm_0.4.0
 filelock_1.0.2              curl_4.3.2
 png_0.1-7                   tibble_3.1.6
 statmod_1.4.36              bslib_0.3.1
 stringi_1.7.5               highr_0.9
 basilisk.utils_1.6.0        RSpectra_0.16-0
 lattice_0.20-45             bluster_1.4.0
 Matrix_1.3-4                vctrs_0.3.8
 pillar_1.6.4                lifecycle_1.0.1
 BiocManager_1.30.16         jquerylib_0.1.4
 BiocNeighbors_1.12.0        data.table_1.14.2
 cowplot_1.1.1               bitops_1.0-7
 irlba_2.3.3                 R6_2.5.1
 bookdown_0.24               gridExtra_2.3
 vipor_0.4.5                 codetools_0.2-18
 dichromat_2.0-0             assertthat_0.2.1
 rprojroot_2.0.2             withr_2.4.2
 GenomeInfoDbData_1.2.7      parallel_4.1.2
 MultiAssayExperiment_1.20.0 grid_4.1.2
 beachmat_2.10.0             basilisk_1.6.0
 rmarkdown_2.11              DelayedMatrixStats_1.16.0
 Rtsne_0.15                  ggbeeswarm_0.6.0

References

Gavish, M., and D. L. Donoho. 2014. “The Optimal Hard Threshold for Singular Values Is $$4/\sqrt {3}$$.” IEEE Transactions on Information Theory 60 (8): 5040–53.

Linderman, George C., Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, and Yuval Kluger. 2019. “Fast Interpolation-Based T-SNE for Improved Visualization of Single-Cell RNA-Seq Data.” Nature Methods 16 (3): 243–45. https://doi.org/10.1038/s41592-018-0308-4.

Narayan, Ashwin. 2021. “Assessing Single-Cell Transcriptomic Variability Through Density-Preserving Data Visualization.” Nature Biotechnology, 19. https://doi.org/10.1038/s41587-020-00801-7.

Shekhar, K., S. W. Lapan, I. E. Whitney, N. M. Tran, E. Z. Macosko, M. Kowalczyk, X. Adiconis, et al. 2016. “Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptomics.” Cell 166 (5): 1308–23.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.