Chapter 10 Clustering

10.1 Motivation

Clustering is an unsupervised learning procedure that is used in scRNA-seq data analysis to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize the data in a digestible format for human interpretation. This allows us to describe population heterogeneity in terms of discrete labels that are easily understood, rather than attempting to comprehend the high-dimensional manifold on which the cells truly reside. After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states. Clustering is thus a critical step for extracting biological insights from scRNA-seq data. Here, we demonstrate the application of several commonly used methods with the 10X PBMC dataset.
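
The dataset is assumed to have been processed as described in the previous chapters - quality control, normalization, variance modelling and dimensionality reduction - yielding a SingleCellExperiment named sce.pbmc. A minimal sketch that produces the summary shown below:

# 'sce.pbmc' is the 10X PBMC SingleCellExperiment prepared in earlier chapters,
# with log-counts and PCA/t-SNE/UMAP coordinates already computed.
sce.pbmc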

## class: SingleCellExperiment 
## dim: 33694 3985 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(3): PCA TSNE UMAP
## altExpNames(0):

10.2 What is the “true clustering”?

At this point, it is worth stressing the distinction between clusters and cell types. The former is an empirical construct while the latter is a biological truth (albeit a vaguely defined one). For this reason, questions like “what is the true number of clusters?” are usually meaningless. We can define as many clusters as we like, with whatever algorithm we like - each clustering will represent its own partitioning of the high-dimensional expression space, and is as “real” as any other clustering.

A more relevant question is “how well do the clusters approximate the cell types?” Unfortunately, this is difficult to answer given the context-dependent interpretation of biological truth. Some analysts will be satisfied with resolution of the major cell types; other analysts may want resolution of subtypes; and others still may require resolution of different states (e.g., metabolic activity, stress) within those subtypes. Moreover, two clusterings can be highly inconsistent yet both valid, simply partitioning the cells based on different aspects of biology. Indeed, asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope without any context.

It is helpful to realize that clustering, like a microscope, is simply a tool to explore the data. We can zoom in and out by changing the resolution of the clustering parameters, and we can experiment with different clustering algorithms to obtain alternative perspectives of the data. This iterative approach is entirely permissible for data exploration, which constitutes the majority of all scRNA-seq data analyses.

10.3 Graph-based clustering

10.3.1 Background

Popularized by its use in Seurat, graph-based clustering is a flexible and scalable technique for clustering large scRNA-seq datasets. We first build a graph where each node is a cell that is connected to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related. We then apply algorithms to identify “communities” of cells that are more connected to cells in the same community than they are to cells of different communities. Each community represents a cluster that we can use for downstream interpretation.

The major advantage of graph-based clustering lies in its scalability. It only requires a \(k\)-nearest neighbor search that can be done in log-linear time on average, in contrast to hierarchical clustering methods with runtimes that are quadratic with respect to the number of cells. Graph construction avoids making strong assumptions about the shape of the clusters or the distribution of cells within each cluster, compared to other methods like \(k\)-means (that favor spherical clusters) or Gaussian mixture models (that require normality). From a practical perspective, each cell is forcibly connected to a minimum number of neighboring cells, which reduces the risk of generating many uninformative clusters consisting of one or two outlier cells.

The main drawback of graph-based methods is that, after graph construction, no information is retained about relationships beyond the neighboring cells1. This has some practical consequences in datasets that exhibit differences in cell density, as more steps through the graph are required to move the same distance through a region of higher cell density. From the perspective of community detection algorithms, this effect “inflates” the high-density regions such that any internal substructure or noise is more likely to cause formation of subclusters. The resolution of clustering thus becomes dependent on the density of cells, which can occasionally be misleading if it overstates the heterogeneity in the data.

10.3.2 Implementation

There are several considerations in the practical execution of a graph-based clustering method:

  • How many neighbors are considered when constructing the graph.
  • What scheme is used to weight the edges.
  • Which community detection algorithm is used to define the clusters.

For example, the following code uses the 10 nearest neighbors of each cell to construct a shared nearest neighbor graph. Two cells are connected by an edge if any of their nearest neighbors are shared, with the edge weight defined from the highest average rank of the shared neighbors (Xu and Su 2015). The Walktrap method from the igraph package is then used to identify communities. All calculations are performed using the top PCs to take advantage of data compression and denoising.
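
The code for this step might look like the following, assuming the sce.pbmc object from above (variable names other than clust are illustrative):

library(scran)
# Build a shared nearest-neighbor graph from the top PCs, using 10 neighbors
# per cell and the default rank-based edge weights.
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred="PCA")

# Identify communities with the Walktrap algorithm from igraph.
clust <- igraph::cluster_walktrap(g)$membership
table(clust)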

## clust
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
## 205 508 541  56 374 125  46 432 302 867  47 155 166  61  84  16

Alternatively, users may prefer to use the clusterRows() function from the bluster package. This calls the exact same series of functions when run in graph-based mode with the NNGraphParam() argument; however, it is often more convenient if we want to try out different clustering procedures, as we can simply change the second argument to use a different set of parameters or a different algorithm altogether.
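
A sketch of the equivalent clusterRows() call is shown below; the defaults of NNGraphParam() are assumed to mirror the parameters used above (k=10, rank-based weights, Walktrap).

library(bluster)
# Graph-based clustering on the top PCs through the bluster interface.
clust2 <- clusterRows(reducedDim(sce.pbmc, "PCA"), NNGraphParam())
table(clust2)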

## clust2
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
## 205 508 541  56 374 125  46 432 302 867  47 155 166  61  84  16

We assign the cluster assignments back into our SingleCellExperiment object as a factor in the column metadata. This allows us to conveniently visualize the distribution of clusters in a \(t\)-SNE plot (Figure 10.1).
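
A sketch of this step, assuming the clust vector from above:

library(scater)
# Store the cluster assignments as a factor in the column metadata.
colLabels(sce.pbmc) <- factor(clust)

# Visualize the clusters on the precomputed t-SNE coordinates.
plotReducedDim(sce.pbmc, "TSNE", colour_by="label")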


Figure 10.1: \(t\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from graph-based clustering.

One of the most important parameters is k, the number of nearest neighbors used to construct the graph. This controls the resolution of the clustering, where a higher k yields a more inter-connected graph and broader clusters. Users can exploit this by experimenting with different values of k to obtain a satisfactory resolution.
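
For example, the output below could be generated with code along these lines:

# Fewer nearest neighbors yield finer clusters; more neighbors yield broader ones.
g.5 <- buildSNNGraph(sce.pbmc, k=5, use.dimred="PCA")
clust.5 <- igraph::cluster_walktrap(g.5)$membership
table(clust.5)

g.50 <- buildSNNGraph(sce.pbmc, k=50, use.dimred="PCA")
clust.50 <- igraph::cluster_walktrap(g.50)$membership
table(clust.50)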

## clust.5
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 523 302 125  45 172 573 249 439 293  95 772 142  38  18  62  38  30  16  15   9 
##  21  22 
##  16  13
## clust.50
##   1   2   3   4   5   6   7   8   9  10 
## 869 514 194 478 539 944 138 175  89  45

The graph itself can be visualized using a force-directed layout (Figure 10.2). This yields a dimensionality reduction result that is closely related to \(t\)-SNE and UMAP, though which of these is the most aesthetically pleasing is left to the eye of the beholder.
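
One way to do so, assuming the graph g from before (the "force" name is arbitrary):

set.seed(2000)
# Compute a force-directed (Fruchterman-Reingold) layout of the SNN graph
# and store it as another set of reduced dimensions for plotting.
reducedDim(sce.pbmc, "force") <- igraph::layout_with_fr(g)
plotReducedDim(sce.pbmc, colour_by="label", dimred="force")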


Figure 10.2: Force-directed layout for the shared nearest-neighbor graph of the PBMC dataset. Each point represents a cell and is coloured according to its assigned cluster identity.

10.3.3 Other parameters

Further tweaking can be performed by changing the edge weighting scheme during graph construction. Setting type="number" will weight edges based on the number of nearest neighbors that are shared between two cells. Similarly, type="jaccard" will weight edges according to the Jaccard index of the two sets of neighbors. We can also disable weighting altogether by using buildKNNGraph(), which is occasionally useful for downstream graph operations that do not support weights.
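
A sketch of these alternatives (the g.* variable names are illustrative):

# Edge weights based on the number of shared neighbors.
g.num <- buildSNNGraph(sce.pbmc, use.dimred="PCA", type="number")

# Edge weights based on the Jaccard index of the two neighbor sets.
g.jaccard <- buildSNNGraph(sce.pbmc, use.dimred="PCA", type="jaccard")

# An unweighted k-nearest-neighbor graph.
g.none <- buildKNNGraph(sce.pbmc, use.dimred="PCA")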

All of these g variables are graph objects from the igraph package and can be used with any of the community detection algorithms provided by igraph. We have already mentioned the Walktrap approach, but many others are available to choose from:
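
A few possibilities are sketched below, applied to the rank-weighted graph g from before:

# Alternative community detection algorithms from igraph.
clust.louvain <- igraph::cluster_louvain(g)$membership
clust.infomap <- igraph::cluster_infomap(g)$membership
clust.fast <- igraph::cluster_fast_greedy(g)$membership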

It is then straightforward to compare two clustering strategies to see how they differ. For example, Figure 10.3 suggests that Infomap yields finer clusters than Walktrap while fast-greedy yields coarser clusters.
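
The comparison itself only requires cross-tabulating the labels, e.g., along the lines of:

library(pheatmap)
# Cross-tabulate the Walktrap labels against the other algorithms and
# plot the log-counts, in the spirit of Figure 10.3.
tab.inf <- table(Walktrap=clust, Infomap=clust.infomap)
pheatmap(log10(tab.inf+10), cluster_rows=FALSE, cluster_cols=FALSE)

tab.fast <- table(Walktrap=clust, Fast.greedy=clust.fast)
pheatmap(log10(tab.fast+10), cluster_rows=FALSE, cluster_cols=FALSE)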


Figure 10.3: Number of cells assigned to combinations of cluster labels with different community detection algorithms in the PBMC dataset. Each entry of each heatmap represents a pair of labels, coloured proportionally to the log-number of cells with those labels.

Pipelines involving scran default to rank-based weights followed by Walktrap clustering. In contrast, Seurat uses Jaccard-based weights followed by Louvain clustering. Both of these strategies work well, and it is likely that the same could be said for many other combinations of weighting schemes and community detection algorithms.

Some community detection algorithms operate by agglomeration and thus can be used to construct a hierarchical dendrogram based on the pattern of merges between clusters. The dendrogram itself is not particularly informative as it simply describes the order of merge steps performed by the algorithm; unlike the dendrograms produced by hierarchical clustering (Section 10.5), it does not capture the magnitude of differences between subpopulations. However, it does provide a convenient avenue for manually tuning the clustering resolution by generating nested clusterings using the cut_at() function, as shown below.
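
For example, using the Walktrap community structure on the graph g (the requested numbers of clusters are arbitrary):

# Keep the full communities object so that it can be cut at different levels.
community.walktrap <- igraph::cluster_walktrap(g)
table(igraph::cut_at(community.walktrap, n=5))
table(igraph::cut_at(community.walktrap, n=20))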

## 
##    1    2    3    4    5 
## 3546  221  125   46   47
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 462 374 125 437  46 432 302 173 867  47 155 166 104  40  61  84  46  32  16  16

If cut_at()-like functionality is desired for non-hierarchical methods, bluster provides a mergeCommunities() function to retrospectively tune the clustering resolution. This function will greedily merge pairs of clusters until a specified number of clusters is achieved, where pairs are chosen to maximize the modularity at each merge step.
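
A sketch of this procedure is shown below; the number= argument name is an assumption based on the description above.

# Louvain is not hierarchical, so cut_at() fails (hence the try())...
community.louvain <- igraph::cluster_louvain(g)
clust.louvain <- community.louvain$membership
table(clust.louvain)
try(igraph::cut_at(community.louvain, n=10))

# ...but we can greedily merge its clusters down to a target number.
merged <- mergeCommunities(g, clust.louvain, number=10)
table(merged)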

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 353 152 376 857  48  46  68 661 287  53 541 198 219 126
## Error in igraph::cut_at(community.louvain, n = 10) : 
##   Not a hierarchical communitity structure
## merged
##   1   3   4   7   8   9  10  11  12  14 
## 353 528 857 162 661 287 272 541 198 126

10.3.4 Assessing cluster separation

When dealing with graphs, the modularity is a natural metric for evaluating the separation between communities/clusters. This is defined as the (scaled) difference between the observed total weight of edges between nodes in the same cluster and the expected total weight if edge weights were randomly distributed across all pairs of nodes. Larger modularity values indicate that most edges occur within clusters, suggesting that the clusters are sufficiently well separated to avoid edges forming between neighboring cells in different clusters.

The standard approach is to report a single modularity value for a clustering on a given graph. This is useful for comparing different clusterings on the same graph - and indeed, some community detection algorithms are designed with the aim of maximizing the modularity - but it is less helpful for interpreting a given clustering. Rather, we use the pairwiseModularity() function from bluster with as.ratio=TRUE, which returns the ratio of the observed to expected sum of weights between each pair of clusters. We use the ratio instead of the difference as the former is less dependent on the number of cells in each cluster.
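
Continuing with the graph g and Walktrap clusters clust from before:

# Ratio of observed to expected edge weights between each pair of clusters.
ratio <- pairwiseModularity(g, clust, as.ratio=TRUE)
dim(ratio)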

## [1] 16 16

In this matrix, each row/column corresponds to a cluster and each entry contains the ratio of the observed to expected total weight of edges between cells in the respective clusters. A dataset containing well-separated clusters should concentrate most of the observed total weight on the diagonal entries, i.e., most edges occur between cells in the same cluster. Indeed, the concentration of weight on the diagonal (Figure 10.4) indicates that most of the clusters are well-separated, while some modest off-diagonal entries represent closely related clusters with more inter-connecting edges.
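
Figure 10.4 can be generated with something along these lines:

# Log-transform the ratios for visualization; well-separated clusters
# concentrate most of the weight on the diagonal.
pheatmap(log2(ratio+1), cluster_rows=FALSE, cluster_cols=FALSE,
    color=colorRampPalette(c("white", "blue"))(100))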


Figure 10.4: Heatmap of the log2-ratio of the total weight between nodes in the same cluster or in different clusters, relative to the total weight expected under a null model of random links.

One useful approach is to use the ratio matrix to form another graph where the nodes are clusters rather than cells. Edges between nodes are weighted according to the ratio of observed to expected edge weights between cells in those clusters. We can then repeat our graph operations on this new cluster-level graph to explore the relationships between clusters. For example, we could obtain clusters of clusters, or we could simply create a new cluster-based layout for visualization (Figure 10.5). This is analogous to the “graph abstraction” approach described by Wolf et al. (2017), which can be used to identify trajectories in the data based on high-weight paths between clusters.
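
A sketch of this abstraction, assuming the ratio matrix from above:

# Build a cluster-level graph, weighting edges by the log-ratios, and plot a
# force-directed layout with edge thickness proportional to the weight.
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
    mode="upper", weighted=TRUE, diag=FALSE)
set.seed(11001010)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5)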


Figure 10.5: Force-based layout showing the relationships between clusters based on the log-ratio of observed to expected total weights between nodes in different clusters. The thickness of the edge between a pair of clusters is proportional to the corresponding log-ratio.

Incidentally, some readers may have noticed that all igraph commands were prefixed with igraph::. We have done this deliberately to avoid bringing igraph::normalize into the global namespace. Rather unfortunately, this normalize function accepts any argument and returns NULL, which causes difficult-to-diagnose bugs when it overwrites normalize from BiocGenerics.

10.4 \(k\)-means clustering

10.4.1 Background

\(k\)-means clustering is a classic technique that aims to partition cells into \(k\) clusters. Each cell is assigned to the cluster with the closest centroid, which is done by minimizing the within-cluster sum of squares using a random starting configuration for the \(k\) centroids. The main advantage of this approach lies in its speed, given the simplicity and ease of implementation of the algorithm. However, it suffers from a number of serious shortcomings that reduce its appeal for obtaining interpretable clusters:

  • It implicitly favors spherical clusters of equal radius. This can lead to unintuitive partitionings on real datasets that contain groupings with irregular sizes and shapes.
  • The number of clusters \(k\) must be specified beforehand and represents a hard cap on the resolution of the clustering. For example, setting \(k\) to be below the number of cell types will always lead to co-clustering of two cell types, regardless of how well separated they are. In contrast, other methods like graph-based clustering will respect strong separation even if the relevant resolution parameter is set to a low value.
  • It is dependent on the randomly chosen initial coordinates. This requires multiple runs to verify that the clustering is stable.

That said, \(k\)-means clustering is still one of the best approaches for sample-based data compression. In this application, we set \(k\) to a large value such as the square root of the number of cells to obtain fine-grained clusters. These are not meant to be interpreted directly, but rather, the centroids are treated as “samples” for further analyses. The idea here is to obtain a single representative of each region of the expression space, reducing the number of samples and computational work in later steps such as trajectory reconstruction (Ji and Ji 2016). This approach will also eliminate differences in cell density across the expression space, ensuring that the most abundant cell type does not dominate downstream results.

10.4.2 Base implementation

Base R provides the kmeans() function that does as its name suggests. We call this on our top PCs to obtain a clustering for a specified number of clusters in the centers= argument, after setting the random seed to ensure that the results are reproducible. In general, the \(k\)-means clusters correspond to the visual clusters on the \(t\)-SNE plot in Figure 10.6, though there are some divergences that are not observed in, say, Figure 10.1. (This is at least partially due to the fact that \(t\)-SNE is itself graph-based and so will naturally agree more with a graph-based clustering strategy.)
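
A sketch of this call, using the top PCs stored in sce.pbmc:

set.seed(100)
# k-means on the top PCs with 10 clusters.
clust.kmeans <- kmeans(reducedDim(sce.pbmc, "PCA"), centers=10)
table(clust.kmeans$cluster)

colLabels(sce.pbmc) <- factor(clust.kmeans$cluster)
plotReducedDim(sce.pbmc, "TSNE", colour_by="label")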

## 
##   1   2   3   4   5   6   7   8   9  10 
## 548  46 408 270 539 199 148 783 163 881

Figure 10.6: \(t\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \(k\)-means clustering.

If we were so inclined, we could obtain a “reasonable” choice of \(k\) by computing the gap statistic using methods from the cluster package. This is the log-ratio of the expected to observed within-cluster sum of squares, where the expected value is computed by randomly distributing cells within the minimum bounding box of the original data. A larger gap statistic represents a lower observed sum of squares - and thus better clustering - compared to a population with no structure. Ideally, we would choose the \(k\) that maximizes the gap statistic, but this is often unhelpful as the tendency of \(k\)-means to favor spherical clusters drives a large \(k\) to capture different cluster shapes. Instead, we choose the most parsimonious \(k\) beyond which the increases in the gap statistic are considered insignificant (Figure 10.7). It must be said, though, that this process is time-consuming and the resulting choice of \(k\) is not always stable.
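
A sketch using clusGap() and maxSE() from the cluster package (this takes some time to run):

library(cluster)
set.seed(110010101)
# Gap statistic for k = 1 to 20 k-means clusterings of the top PCs.
gaps <- clusGap(reducedDim(sce.pbmc, "PCA"), kmeans, K.max=20)
best.k <- maxSE(gaps$Tab[,"gap"], gaps$Tab[,"SE.sim"])
best.k

# Simple visualization of the gap statistic against the number of clusters.
plot(gaps$Tab[,"gap"], xlab="Number of clusters", ylab="Gap statistic")
abline(v=best.k, col="red")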

## [1] 8

Figure 10.7: Gap statistic with respect to increasing number of \(k\)-means clusters in the 10X PBMC dataset. The red line represents the chosen \(k\).

A more practical use of \(k\)-means is to deliberately set \(k\) to a large value to achieve overclustering. This will forcibly partition cells inside broad clusters that do not have well-defined internal structure. For example, we might be interested in the change in expression from one “side” of a cluster to the other, but the lack of any clear separation within the cluster makes it difficult to separate with graph-based methods, even at the highest resolution. \(k\)-means has no such problems and will readily split these broad clusters for greater resolution, though obviously one must be prepared for the additional work involved in interpreting a greater number of clusters.
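
For example, with 20 clusters:

set.seed(100)
# Deliberate overclustering of the top PCs.
clust.kmeans2 <- kmeans(reducedDim(sce.pbmc, "PCA"), centers=20)
table(clust.kmeans2$cluster)

colLabels(sce.pbmc) <- factor(clust.kmeans2$cluster)
plotTSNE(sce.pbmc, colour_by="label", text_by="label")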

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 243  28 202 361 282 166 388 150 114 537 170  96  46 131 162 118 201 257 288  45

Figure 10.8: \(t\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \(k\)-means clustering with \(k=20\).

As an aside: if we were already using clusterRows() from bluster, we can easily switch to \(k\)-means clustering by supplying a KmeansParam() as the second argument. This requires the number of clusters as a fixed integer or as a function of the number of cells - the example below sets the number of clusters to the square root of the number of cells, which is an effective rule-of-thumb for vector quantization.
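
A sketch of this call (the seed is arbitrary):

set.seed(10000)
# centers= accepts a function of the number of cells, here the square root.
sq.clusts <- clusterRows(reducedDim(sce.pbmc, "PCA"), KmeansParam(centers=sqrt))
nlevels(sq.clusts)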

## [1] 63

10.4.3 Assessing cluster separation

The within-cluster sum of squares (WCSS) for each cluster is the most relevant diagnostic for \(k\)-means, given that the algorithm aims to find a clustering that minimizes the WCSS. Specifically, we use the WCSS to compute the root-mean-squared deviation (RMSD) that represents the spread of cells within each cluster. A cluster is more likely to have a low RMSD if it has no internal structure and is separated from other clusters (such that there are not many cells on the boundaries between clusters, which would result in a higher sum of squares from the centroid).
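
These quantities are easily computed from the kmeans() output; the code below assumes the 20-cluster clust.kmeans2 object from above.

# Within-cluster sum of squares and root-mean-squared deviation per cluster.
ncells <- tabulate(clust.kmeans2$cluster)
tab <- data.frame(wcss=clust.kmeans2$withinss, ncells=ncells)
tab$rms <- sqrt(tab$wcss/tab$ncells)
tab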

##    wcss ncells    rms
## 1  3270    243  3.669
## 2  2837     28 10.066
## 3  3240    202  4.005
## 4  3499    361  3.113
## 5  4483    282  3.987
## 6  3325    166  4.476
## 7  6834    388  4.197
## 8  3843    150  5.062
## 9  2277    114  4.470
## 10 4439    537  2.875
## 11 2003    170  3.433
## 12 3342     96  5.900
## 13 6531     46 11.915
## 14 2130    131  4.032
## 15 3627    162  4.731
## 16 3108    118  5.132
## 17 4790    201  4.882
## 18 4663    257  4.260
## 19 6966    288  4.918
## 20 1205     45  5.175

(As an aside, the RMSDs of the clusters are poorly correlated with their sizes in Figure 10.8. This highlights the risks of attempting to quantitatively interpret the sizes of visual clusters in \(t\)-SNE plots.)

To explore the relationships between \(k\)-means clusters, a natural approach is to compute distances between their centroids. This directly lends itself to visualization as a tree after hierarchical clustering (Figure 10.9).
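
For example, using Ward's method on the Euclidean distances between the 20 centroids:

# Hierarchical clustering on the k-means centroids (Figure 10.9).
cent.tree <- hclust(dist(clust.kmeans2$centers), "ward.D2")
plot(cent.tree)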


Figure 10.9: Hierarchy of \(k\)-means cluster centroids, using Ward’s minimum variance method.

10.4.4 In two-step procedures

As previously mentioned, \(k\)-means is most effective in its role of vector quantization, i.e., compressing adjacent cells into a single representative point. This allows \(k\)-means to be used as a prelude to more sophisticated and interpretable - but expensive - clustering algorithms. The clusterRows() function supports a “two-step” mode where \(k\)-means is initially used to obtain representative centroids that are subjected to graph-based clustering. Each cell is then placed in the same graph-based cluster that its \(k\)-means centroid was assigned to (Figure 10.10).
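
One possible invocation is sketched below via bluster's TwoStepParam(); the wrapper and argument names here are assumptions about the API and may not match the exact call used to generate the output.

set.seed(0101010)
# k-means into ~1000 centroids, followed by graph-based clustering of the
# centroids; each cell inherits the cluster of its assigned centroid.
kgraph.clusters <- clusterRows(reducedDim(sce.pbmc, "PCA"),
    TwoStepParam(first=KmeansParam(centers=1000), second=NNGraphParam(k=5)))
table(kgraph.clusters)

plotTSNE(sce.pbmc, colour_by=I(kgraph.clusters))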

## kgraph.clusters
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 191 854 506 541 541 892  46 120  29 132  47  86

Figure 10.10: \(t\)-SNE plot of the PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from combined \(k\)-means/graph-based clustering.

The obvious benefit of this approach over direct graph-based clustering is the speed improvement. We avoid the need to identify nearest neighbors for each cell and to construct a large intermediate graph, while benefiting from the relative interpretability of graph-based clusters compared to those from \(k\)-means. This approach also mitigates the “inflation” effect discussed in Section 10.3. Each centroid serves as a representative of a region of space that is roughly similar in volume, ameliorating differences in cell density that can cause (potentially undesirable) differences in resolution.

The choice of the number of \(k\)-means clusters (defined here by the kmeans.clusters= argument) determines the trade-off between speed and fidelity. Larger values provide a more faithful representation of the underlying distribution of cells, at the cost of requiring more computational work by the second-stage clustering procedure. Note that the second step operates on the centroids, so increasing kmeans.clusters= may have further implications if the second-stage procedure is sensitive to the total number of input observations. For example, increasing the number of centroids would require a concomitant increase in k= (the number of neighbors in graph construction) to maintain the same level of resolution in the final output.

10.5 Hierarchical clustering

10.5.1 Background

Hierarchical clustering is an ancient technique that aims to generate a dendrogram containing a hierarchy of samples. This is most commonly done by greedily agglomerating samples into clusters, then agglomerating those clusters into larger clusters, and so on until all samples belong to a single cluster. Variants of hierarchical clustering methods primarily differ in how they choose to perform the agglomerations. For example, complete linkage aims to merge clusters with the smallest maximum distance between their elements, while Ward’s method aims to minimize the increase in within-cluster variance.

In the context of scRNA-seq, the main advantage of hierarchical clustering lies in the production of the dendrogram. This is a rich summary that describes the relationships between cells and subpopulations at various resolutions and in a quantitative manner based on the branch lengths. Users can easily “cut” the tree at different heights to define clusters with different granularity, where clusters defined at high resolution are guaranteed to be nested within those defined at a lower resolution. (Guaranteed nesting can be helpful for interpretation, as discussed in Section 10.7.) The dendrogram is also a natural representation of the data in situations where cells have descended from a relatively recent common ancestor.

In practice, hierarchical clustering is too slow to be used for anything but the smallest scRNA-seq datasets. Most variants require a cell-cell distance matrix that is prohibitively expensive to compute for many cells. Greedy agglomeration is also likely to result in a quantitatively suboptimal partitioning (as defined by the agglomeration measure) at higher levels of the dendrogram when the number of cells and merge steps is high. Nonetheless, we will still demonstrate the application of hierarchical clustering here, as it can occasionally be useful for squeezing more information out of datasets with very few cells.

10.5.2 Implementation

As the PBMC dataset is too large for this procedure to be practical, we will demonstrate on the smaller 416B dataset instead.

#--- loading ---#
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b") 
sce.416b$block <- factor(sce.416b$block)

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, 
    rowData(sce.416b)$SYMBOL)

#--- quality-control ---#
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent",
    "altexps_ERCC_percent"), batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]

#--- normalization ---#
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)

#--- variance-modelling ---#
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b, prop=0.1)

#--- batch-correction ---#
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b), 
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

#--- dimensionality-reduction ---#
sce.416b <- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs,
    exprs_values="corrected", BSPARAM=BiocSingular::ExactParam())

set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)
## class: SingleCellExperiment 
## dim: 46604 185 
## metadata(0):
## assays(3): counts logcounts corrected
## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1
##   CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(10): Source Name cell line ... block sizeFactor
## reducedDimNames(2): PCA TSNE
## altExpNames(2): ERCC SIRV

We compute a cell-cell distance matrix using the top PCs and we apply hierarchical clustering with Ward’s method. The resulting tree in Figure 10.11 shows a clear split in the population caused by oncogene induction. While both Ward’s method and complete linkage (hclust()’s default) yield compact clusters, we prefer the former as it is less affected by differences in variance between clusters.
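
A sketch of these steps is shown below; the basic plot() call does not reproduce the coloring of leaves by experimental factors in Figure 10.11.

# Cell-cell distances from the top PCs of the 416B dataset, followed by
# hierarchical clustering with Ward's minimum variance method.
dist.416b <- dist(reducedDim(sce.416b, "PCA"))
tree.416b <- hclust(dist.416b, "ward.D2")
plot(tree.416b, labels=FALSE)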


Figure 10.11: Hierarchy of cells in the 416B data set after hierarchical clustering, where each leaf node is a cell that is coloured according to its oncogene induction status (red is induced, blue is control) and plate of origin (light or dark).

To obtain explicit clusters, we “cut” the tree by removing internal branches such that every subtree represents a distinct cluster. This is most simply done by removing internal branches above a certain height of the tree, as performed by the cutree() function. A more sophisticated variant of this approach is implemented in the dynamicTreeCut package, which uses the shape of the branches to obtain a better partitioning for complex dendrograms (Figure 10.12).
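
A sketch using cutreeDynamic(); the minClusterSize= and deepSplit= values are illustrative choices.

library(dynamicTreeCut)
# Dynamic tree cut, using the distance matrix to refine the cluster boundaries.
clust.416b <- cutreeDynamic(tree.416b, distM=as.matrix(dist.416b),
    minClusterSize=10, deepSplit=1)
table(clust.416b)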

##  ..cutHeight not given, setting it to 783  ===>  99% of the (truncated) height range in dendro.
##  ..done.
## clust.416b
##  1  2  3  4 
## 78 69 24 14

Figure 10.12: Hierarchy of cells in the 416B data set after hierarchical clustering, where each leaf node is a cell that is coloured according to its assigned cluster identity from a dynamic tree cut.

This generally corresponds well to the grouping of cells on a \(t\)-SNE plot (Figure 10.13). The exception is cluster 2, which is split across two visual clusters in the plot. We attribute this to a distortion introduced by \(t\)-SNE rather than inappropriate behavior of the clustering algorithm, based on the examination of some later diagnostics.


Figure 10.13: \(t\)-SNE plot of the 416B dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from hierarchical clustering.

Note that the series of calls required to obtain the clusters is also wrapped by clusterRows() for more convenient execution. In this case, we can reproduce clust.416b with the following:
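
A sketch of this call is shown below; note that further arguments (e.g., the minimum cluster size and deepSplit settings for the dynamic tree cut) may need to be passed through HclustParam() to reproduce the earlier result exactly, and the argument names here are assumptions about the bluster API.

# Hierarchical clustering with Ward's method and a dynamic tree cut,
# all through the bluster interface.
clust.416b.again <- clusterRows(reducedDim(sce.416b, "PCA"),
    HclustParam(method="ward.D2", cut.dynamic=TRUE))
table(clust.416b.again)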

## clust.416b.again
##  1  2  3  4 
## 78 69 24 14

10.5.3 Assessing cluster separation

We check the separation of the clusters using the silhouette width (Figure 10.14). For each cell, we compute the average distance to all cells in the same cluster. We also compute the average distance to all cells in another cluster, taking the minimum of the averages across all other clusters. The silhouette width for each cell is defined as the difference between these two values divided by their maximum. Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters.
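
A sketch using silhouette() from the cluster package, reusing the distance matrix from before:

library(cluster)
# Silhouette widths for the dynamic tree cut clusters.
sil <- silhouette(clust.416b, dist=dist.416b)
plot(sil)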

Each cluster would ideally contain large positive silhouette widths, indicating that it is well-separated from other clusters. This is indeed the case in Figure 10.14 - and in fact, cluster 2 has the largest width of all, indicating that it is a more coherent cluster than portrayed in Figure 10.13. Smaller widths can arise from the presence of internal subclusters, which inflates the within-cluster distance; or overclustering, where cells at the boundary of a partition are closer to the neighboring cluster than their own cluster.


Figure 10.14: Silhouette widths for cells in each cluster in the 416B dataset. Each bar represents a cell, grouped by the cluster to which it is assigned.

For a more detailed examination, we identify the closest neighboring cluster for cells with negative widths. This provides a perspective on the relationships between clusters that is closer to the raw data than the dendrogram in Figure 10.12.
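
Using the silhouette object computed above:

# For each cell with a negative width, report its own cluster and the
# closest other cluster.
neg.widths <- sil[,3] < 0
table(Cluster=sil[neg.widths,1], Neighbor=sil[neg.widths,2])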

##        Neighbor
## Cluster 1 2 3
##       2 0 0 3
##       3 1 3 0

The average silhouette width across all cells can also be used to choose clustering parameters. The aim is to maximize the average silhouette width in order to obtain well-separated clusters. This can be helpful to automatically obtain a “reasonable” clustering, though in practice, the clustering that yields the strongest separation often does not provide the most biological insight.

10.6 General-purpose cluster diagnostics

10.6.1 Cluster separation, redux

We previously introduced the silhouette width in the context of hierarchical clustering (Section 10.5.3). While this can be applied with other clustering algorithms, it requires calculation of all pairwise distances between cells and is not scalable for larger datasets. In such cases, we instead use an approximate approach that replaces the average of the distances with the distance to the average (i.e., centroid) of each cluster, with some tweaks to account for the distance due to the within-cluster variance. This is implemented in the approxSilhouette() function from bluster, allowing us to quickly identify poorly separated clusters with mostly negative widths (Figure 10.15).
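
A sketch of this calculation, with a simplified visualization compared to Figure 10.15:

# Approximate silhouette widths for the graph-based PBMC clusters.
sil.approx <- approxSilhouette(reducedDim(sce.pbmc, "PCA"), clusters=clust)
boxplot(split(sil.approx$width, clust),
    xlab="Cluster", ylab="Approximate silhouette width")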


Figure 10.15: Distribution of the approximate silhouette width across cells in each cluster of the PBMC dataset. Each point represents a cell and colored with the identity of its own cluster if its silhouette width is positive and that of the closest other cluster if the width is negative.

Alternatively, we can quantify the degree to which cells from multiple clusters intermingle in expression space. The “clustering purity” is defined for each cell as the proportion of neighboring cells that are assigned to the same cluster. Well-separated clusters should exhibit little intermingling and thus high purity values for all member cells, as demonstrated below in Figure 10.16. Median purity values are consistently greater than 0.9, indicating that most cells in each cluster are primarily surrounded by other cells from the same cluster.
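
A sketch using neighborPurity() from bluster, again with a simplified visualization:

# Proportion of each cell's nearest neighbors that share its cluster label.
pure.pbmc <- neighborPurity(reducedDim(sce.pbmc, "PCA"), clusters=clust)
boxplot(split(pure.pbmc$purity, clust), xlab="Cluster", ylab="Purity")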


Figure 10.16: Distribution of cluster purities across cells in each cluster of the PBMC dataset. Each point represents a cell and colored with the identity of the cluster contributing the largest proportion of its neighbors.

The main difference between these two methods is that the purity is ignorant of the intra-cluster variance. This may or may not be desirable depending on what level of heterogeneity is of interest. In addition, the purity will - on average - only decrease with increasing cluster number/resolution, making it less effective for choosing between different clusterings. However, regardless of the chosen method, it is worth keeping in mind that poor separation is not synonymous with poor quality. In fact, poorly separated clusters will often be observed in non-trivial analyses of scRNA-seq data where the aim is to characterize closely related subtypes or states. These diagnostics are best used to guide interpretation by highlighting clusters that require more investigation rather than to rule out poorly separated clusters altogether.

10.6.2 Comparing different clusterings

As previously mentioned, clustering’s main purpose is to obtain a discrete summary of the data for further interpretation. The diversity of available methods (and the subsequent variation in the clustering results) reflects the many different “perspectives” that can be derived from a high-dimensional scRNA-seq dataset. It is helpful to determine how these perspectives relate to each other by comparing the clustering results. More concretely, we want to know which clusters map to each other across algorithms; inconsistencies may be indicative of complex variation that is summarized differently by each clustering procedure.

A simple yet effective approach for comparing two clusterings of the same dataset is to create a 2-dimensional table of label frequencies (Figure 10.3). We can further improve the interpretability of this table by computing the proportions of cell assignments, which avoids difficulties with dynamic range when visualizing clusters of differing abundances. For example, we may be interested in how our Walktrap clusters from Section 10.3 are redistributed when we switch to using Louvain community detection (Figure 10.17). Note that this heatmap is best interpreted on a row-by-row basis as the proportions are computed within each row and cannot be easily compared between rows.
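
A sketch of this comparison, assuming the Walktrap (clust) and Louvain (clust.louvain) assignments from Section 10.3:

# Proportion of cells from each Walktrap cluster in each Louvain cluster.
tab <- table(Walktrap=clust, Louvain=clust.louvain)
tab <- tab/rowSums(tab)
pheatmap(tab, cluster_rows=FALSE, cluster_cols=FALSE)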


Figure 10.17: Heatmap of the proportions of cells from each Walktrap cluster (rows) across the Louvain clusters (columns) in the PBMC dataset. Each row represents the distribution of cells across Louvain clusters for a given Walktrap cluster.

For clusterings that differ primarily in resolution (usually from different parameterizations of the same algorithm), we can use the clustree package to visualize the relationships between them. Here, the aim is to capture the redistribution of cells from one clustering to another at progressively higher resolution, providing a convenient depiction of how clusters split apart (Figure 10.18). This approach is most effective when the clusterings exhibit a clear gradation in resolution but is less useful for comparisons involving theoretically distinct clustering procedures.
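
A sketch of a clustree call, assuming the three Walktrap clusterings at k=5, 10 and 50 from before; the column-name convention is an assumption about how clustree() expects resolutions to be encoded.

library(clustree)
# Column names encode the resolution so that clustree() can order the clusterings.
combined <- data.frame(k.5=clust.5, k.10=clust, k.50=clust.50)
clustree(combined, prefix="k.", edge_arrow=FALSE)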


Figure 10.18: Graph of the relationships between the Walktrap clusterings of the PBMC dataset, generated with varying \(k\) during the nearest-neighbor graph construction. (A higher \(k\) generally corresponds to a lower resolution clustering.) The size of the nodes is proportional to the number of cells in each cluster, and the edges depict cells in one cluster that are reassigned to another cluster at a different resolution. The color of the edges is defined according to the number of reassigned cells and the opacity is defined from the corresponding proportion relative to the size of the lower-resolution cluster.

We can quantify the agreement between two clusterings by computing the Rand index with bluster’s pairwiseRand(). This is defined as the proportion of pairs of cells that retain the same status (i.e., both cells in the same cluster, or each cell in different clusters) in both clusterings. In practice, we usually compute the adjusted Rand index (ARI) where we subtract the number of concordant pairs expected under random permutations of the clusterings; this accounts for differences in the size and number of clusters within and between clusterings. A larger ARI indicates that the clusters are preserved, up to a maximum value of 1 for identical clusterings. In and of itself, the magnitude of the ARI has little meaning, and it is best used to assess the relative similarities of different clusterings (e.g., “Walktrap is more similar to Louvain than either are to Infomap”). Nonetheless, if one must have a hard-and-fast rule, experience suggests that an ARI greater than 0.5 corresponds to “good” similarity between two clusterings.
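
For example, comparing the default Walktrap clustering to its higher-resolution counterpart at k=5 (the choice of pair here is illustrative):

# Adjusted Rand index between the two clusterings.
pairwiseRand(clust, clust.5, mode="index")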

## [1] 0.7796

The same function can also provide a more granular perspective with mode="ratio", where the ARI is broken down into its contributions from each pair of clusters in one of the clusterings. This mode is helpful if one of the clusterings - in this case, clust - is considered to be a “reference”, and the aim is to quantify the extent to which the reference clusters retain their integrity in another clustering. In the breakdown matrix, each entry is a ratio of the adjusted number of concordant pairs to the adjusted total number of pairs. Low values on the diagonal in Figure 10.19 indicate that cells from the corresponding reference cluster in clust are redistributed to multiple other clusters in clust.5. Conversely, low off-diagonal values indicate that the corresponding pair of reference clusters are merged together in clust.5.
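
A sketch of this breakdown, treating clust as the reference:

# Per-cluster-pair contributions to the ARI, visualized as in Figure 10.19.
breakdown <- pairwiseRand(ref=clust, alt=clust.5, mode="ratio")
pheatmap(breakdown, cluster_rows=FALSE, cluster_cols=FALSE)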


Figure 10.19: ARI-based ratio for each pair of clusters in the reference Walktrap clustering compared to a higher-resolution alternative clustering for the PBMC dataset. Rows and columns of the heatmap represent clusters in the reference clustering. Each entry represents the proportion of pairs of cells involving the row/column clusters that retain the same status in the alternative clustering.

10.6.3 Evaluating cluster stability

A desirable property of a given clustering is that it is stable to perturbations to the input data (Von Luxburg 2010). Stable clusters are logistically convenient as small changes to upstream processing will not change the conclusions; greater stability also increases the likelihood that those conclusions can be reproduced in an independent replicate study. scran uses bootstrapping to evaluate the stability of a clustering algorithm on a given dataset - that is, cells are sampled with replacement to create a “bootstrap replicate” dataset, and clustering is repeated on this replicate to see if the same clusters can be reproduced. We demonstrate below for graph-based clustering on the PCs of the PBMC dataset.
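
The code below sketches this procedure; scran's bootstrapCluster() is assumed to be the bootstrapping utility here, and the clustering function uses Jaccard weights with Louvain community detection.

# Clustering function applied to the original data and to each replicate.
myClusterFUN <- function(x) {
    g <- buildSNNGraph(x, use.dimred="PCA", type="jaccard")
    igraph::cluster_louvain(g)$membership
}

originals <- myClusterFUN(sce.pbmc)
table(originals)

set.seed(0010010100)
# ARI-derived ratios between cluster pairs, averaged over bootstrap replicates.
ratios <- bootstrapCluster(sce.pbmc, FUN=myClusterFUN, clusters=originals)
dim(ratios)

pheatmap(ratios, cluster_rows=FALSE, cluster_cols=FALSE)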

## originals
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
## 127  62  48 343  45  56 124  94 848 290 200 459 233 143 541 372
## [1] 16 16

The function returns a matrix of ARI-derived ratios for every pair of original clusters in originals (Figure 10.20), averaged across bootstrap iterations. High ratios indicate that the clusterings in the bootstrap replicates are highly consistent with that of the original dataset. More specifically, high ratios on the diagonal indicate that cells in the same original cluster are still together in the bootstrap replicates, while high ratios off the diagonal indicate that cells in the corresponding cluster pair are still separated.


Figure 10.20: Heatmap of ARI-derived ratios from bootstrapping of graph-based clustering in the PBMC dataset. Each row and column represents an original cluster and each entry is colored according to the value of the ARI ratio between that pair of clusters.

Bootstrapping is a general approach for evaluating cluster stability that is compatible with any clustering algorithm. The ARI-derived ratio between cluster pairs is also more informative than a single stability measure for the entire clustering (or for each cluster) as the former considers the relationships between clusters, e.g., unstable separation between \(X\) and \(Y\) does not penalize the stability of separation between \(X\) and another cluster \(Z\). Of course, one should take these metrics with a grain of salt, as bootstrapping only considers the effect of sampling noise and ignores other factors that affect reproducibility in an independent study (e.g., batch effects, donor variation). In addition, it is possible for a poor separation to be highly stable, so a highly stable cluster does not necessarily represent a distinct subpopulation.

10.7 Subclustering

Another simple approach to improving resolution is to repeat the feature selection and clustering within a single cluster. This aims to select HVGs and PCs that are more relevant to internal structure, improving resolution by avoiding noise from unnecessary features. Subsetting also encourages clustering methods to separate cells according to more modest heterogeneity in the absence of distinct subpopulations. We demonstrate with a cluster of putative memory T cells from the PBMC dataset, identified according to several markers (Figure 10.21).
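
A sketch of how such a plot might be generated; the specific markers shown in Figure 10.21 are assumptions for illustration.

# Restore the graph-based cluster labels and plot a few T cell markers.
colLabels(sce.pbmc) <- factor(clust)
plotExpression(sce.pbmc, features=c("CD3E", "CCR7", "CD69", "CD44"),
    x="label", colour_by="label")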


Figure 10.21: Distribution of log-normalized expression values for several T cell markers within each cluster in the 10X PBMC dataset. Each cluster is color-coded for convenience.

We apply graph-based clustering within this memory subset to obtain CD4+ and CD8+ subclusters (Figure 10.22). Admittedly, the expression of CD4 is so low that the change is rather modest, but the interpretation is clear enough.
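
A sketch of the subclustering, assuming that cluster 10 corresponds to the putative memory T cells referred to below:

# Subset to the memory cluster, then repeat feature selection and PCA
# before graph-based clustering within the subset.
memory <- 10
sce.memory <- sce.pbmc[,clust==memory]
dec.memory <- modelGeneVar(sce.memory)
sce.memory <- denoisePCA(sce.memory, technical=dec.memory,
    subset.row=getTopHVGs(dec.memory, prop=0.1))
g.memory <- buildSNNGraph(sce.memory, use.dimred="PCA")
clust.memory <- igraph::cluster_walktrap(g.memory)$membership
plotExpression(sce.memory, features=c("CD8A", "CD4"),
    x=I(factor(clust.memory)))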


Figure 10.22: Distribution of CD4 and CD8A log-normalized expression values within each cluster in the memory T cell subset of the 10X PBMC dataset.

For subclustering analyses, it is helpful to define a customized function that calls our desired algorithms to obtain a clustering from a given SingleCellExperiment. This function can then be applied multiple times on different subsets without having to repeatedly copy and modify the code for each subset. For example, quickSubCluster() loops over all subsets and executes this user-specified function to generate a list of SingleCellExperiment objects containing the subclustering results. (Of course, the downside is that this assumes that a similar analysis is appropriate for each subset. If different subsets require extensive reparametrization, copying the code may actually be more straightforward.)
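
A sketch of a quickSubCluster() call that applies the same recipe to every parent cluster; the prepFUN= and clusterFUN= bodies mirror the subclustering above, and the exact parameters used for the output below may differ.

set.seed(1000010)
subcluster.out <- quickSubCluster(sce.pbmc, groups=clust,
    prepFUN=function(x) { # preparing each subset for clustering
        dec <- modelGeneVar(x)
        denoisePCA(x, technical=dec, subset.row=getTopHVGs(dec, prop=0.1))
    },
    clusterFUN=function(x) { # subclustering within each prepared subset
        g <- buildSNNGraph(x, use.dimred="PCA")
        igraph::cluster_walktrap(g)$membership
    }
)
names(subcluster.out)
table(subcluster.out[[1]]$subcluster)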

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16"
## 
## 1.1 1.2 1.3 1.4 1.5 1.6 
##  28  22  34  62  11  48

Subclustering is a general and conceptually straightforward procedure for increasing resolution. It can also simplify the interpretation of the subclusters, which only need to be considered in the context of the parent cluster’s identity - for example, we did not have to re-identify the cells in cluster 10 as T cells. However, this is a double-edged sword as it is difficult for practitioners to consider the uncertainty of identification for parent clusters when working with deep nesting. If cell types or states span cluster boundaries, conditioning on the putative cell type identity of the parent cluster can encourage the construction of a “house of cards” of cell type assignments, e.g., where a subcluster of one parent cluster is actually contamination from a cell type in a separate parent cluster.

Session Info

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] clustree_0.4.3              ggraph_2.0.5               
 [3] dynamicTreeCut_1.63-1       dendextend_1.14.0          
 [5] cluster_2.1.0               pheatmap_1.0.12            
 [7] scater_1.18.6               ggplot2_3.3.3              
 [9] bluster_1.0.0               scran_1.18.5               
[11] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
[13] Biobase_2.50.0              GenomicRanges_1.42.0       
[15] GenomeInfoDb_1.26.4         IRanges_2.24.1             
[17] S4Vectors_0.28.1            BiocGenerics_0.36.0        
[19] MatrixGenerics_1.2.1        matrixStats_0.58.0         
[21] BiocStyle_2.18.1            rebook_1.0.0               

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0          colorspace_2.0-0         
 [3] ellipsis_0.3.1            scuttle_1.0.4            
 [5] XVector_0.30.0            BiocNeighbors_1.8.2      
 [7] farver_2.1.0              graphlayouts_0.7.1       
 [9] ggrepel_0.9.1             fansi_0.4.2              
[11] codetools_0.2-18          sparseMatrixStats_1.2.1  
[13] knitr_1.31                polyclip_1.10-0          
[15] jsonlite_1.7.2            graph_1.68.0             
[17] ggforce_0.3.3             BiocManager_1.30.10      
[19] compiler_4.0.4            backports_1.2.1          
[21] dqrng_0.2.1               assertthat_0.2.1         
[23] Matrix_1.3-2              limma_3.46.0             
[25] tweenr_1.0.1              BiocSingular_1.6.0       
[27] htmltools_0.5.1.1         tools_4.0.4              
[29] rsvd_1.0.3                igraph_1.2.6             
[31] gtable_0.3.0              glue_1.4.2               
[33] GenomeInfoDbData_1.2.4    dplyr_1.0.5              
[35] Rcpp_1.0.6                jquerylib_0.1.3          
[37] vctrs_0.3.6               DelayedMatrixStats_1.12.3
[39] xfun_0.22                 stringr_1.4.0            
[41] ps_1.6.0                  beachmat_2.6.4           
[43] lifecycle_1.0.0           irlba_2.3.3              
[45] statmod_1.4.35            XML_3.99-0.6             
[47] edgeR_3.32.1              zlibbioc_1.36.0          
[49] MASS_7.3-53               scales_1.1.1             
[51] tidygraph_1.2.0           RColorBrewer_1.1-2       
[53] yaml_2.2.1                gridExtra_2.3            
[55] sass_0.3.1                stringi_1.5.3            
[57] highr_0.8                 checkmate_2.0.0          
[59] BiocParallel_1.24.1       rlang_0.4.10             
[61] pkgconfig_2.0.3           bitops_1.0-6             
[63] evaluate_0.14             lattice_0.20-41          
[65] purrr_0.3.4               CodeDepends_0.6.5        
[67] labeling_0.4.2            cowplot_1.1.1            
[69] processx_3.4.5            tidyselect_1.1.0         
[71] magrittr_2.0.1            bookdown_0.21            
[73] R6_2.5.0                  generics_0.1.0           
[75] DelayedArray_0.16.2       DBI_1.1.1                
[77] pillar_1.5.1              withr_2.4.1              
[79] RCurl_1.98-1.3            tibble_3.1.0             
[81] crayon_1.4.1              utf8_1.2.1               
[83] rmarkdown_2.7             viridis_0.5.1            
[85] locfit_1.5-9.4            grid_4.0.4               
[87] callr_3.5.1               digest_0.6.27            
[89] tidyr_1.1.3               munsell_0.5.0            
[91] beeswarm_0.3.1            viridisLite_0.3.0        
[93] vipor_0.4.5               bslib_0.2.4              

Bibliography

Ji, Z., and H. Ji. 2016. “TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.” Nucleic Acids Res. 44 (13): e117.

Von Luxburg, U. 2010. “Clustering Stability: An Overview.” Foundations and Trends in Machine Learning 2 (3): 235–74.

Wolf, F. Alexander, Fiona Hamey, Mireya Plass, Jordi Solana, Joakim S. Dahlin, Berthold Gottgens, Nikolaus Rajewsky, Lukas Simon, and Fabian J. Theis. 2017. “Graph Abstraction Reconciles Clustering with Trajectory Inference Through a Topology Preserving Map of Single Cells.” bioRxiv. https://doi.org/10.1101/208819.

Xu, C., and Z. Su. 2015. “Identification of cell types from single-cell transcriptomes using a novel clustering method.” Bioinformatics 31 (12): 1974–80.


  1. Sten Linnarsson talked about this in SCG2018, but I don’t know where that work ended up. So this is what passes as a reference for the time being.