# Chapter 5 Clustering, redux

## 5.1 Motivation

Basic Chapter 5 described the process of clustering cells to obtain discrete summaries of scRNA-seq datasets.
Here, we describe some diagnostics to evaluate clustering separation and stability,
methods to compare clusterings that represent different views of the data,
and some strategies to choose the number of clusters.
We will again be demonstrating these techniques on the 10X Genomics PBMC dataset, clustered using the default graph-based method in `clusterCells()`

.

```
#--- loading ---#
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)
library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]
#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)
#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)
#--- dimensionality-reduction ---#
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)
set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")
```

```
library(scran)
nn.clust <- clusterCells(sce.pbmc, use.dimred="PCA", full=TRUE)
colLabels(sce.pbmc) <- nn.clust$clusters
sce.pbmc
```

```
## class: SingleCellExperiment
## dim: 33694 3985
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(4): Sample Barcode sizeFactor label
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):
```

## 5.2 Quantifying clustering behavior

### 5.2.1 Motivation

Several metrics are available to quantify the behavior of a clustering method, such as the silhouette width or graph modularity. These metrics are most helpful when they can be computed for each cluster, allowing us to prioritize certain clusters for more careful analysis. Poorly separated clusters can be prioritized for manual inspection to determine the relevance of the differences to its neighboring clusters, while heterogeneous clusters may be subject to further clustering to identify internal structure. We can also quantify the effect of different algorithms or parameter values, providing some insight into how our dataset responds to changes in the clustering method.

As an aside: it is also possible to choose the clustering that optimizes for some of these metrics. This can be helpful to automatically obtain a “reasonable” clustering, though in practice, the clustering that yields the strongest separation often does not provide the most biological insight. Poorly separated clusters will often be observed in non-trivial analyses of scRNA-seq data where the aim is to characterize closely related subtypes or states. Indeed, the most well-separated clusters are rarely interesting as these describe obvious differences between known cell types. These diagnostics are best used to guide interpretation by highlighting clusters that require more investigation rather than to filter out “bad” clusters altogether.

### 5.2.2 Silhouette width

The silhouette width is an established metric for evaluating cluster separation. For each cell, we compute the average distance to all cells in the same cluster. We also compute the average distance to all cells in another cluster, taking the minimum of the averages across all other clusters. The silhouette width for each cell is defined as the difference between these two values divided by their maximum. Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters. Thus, clusters with large positive silhouette widths are well-separated from other clusters.

The silhouette width is a natural diagnostic for hierarchical clustering where the distance matrix is already available.
For larger datasets, we instead use an approximate approach that uses the root of the average squared distances rather than the average distance itself.
The approximation avoids the time-consuming calculation of pairwise distances that would otherwise make this metric impractical.
This is implemented in the `approxSilhouette()`

function from *bluster*,
allowing us to quickly identify poorly separate clusters with mostly negative widths (Figure 5.1).

```
# Performing the calculations on the PC coordinates, like before.
library(bluster)
sil.approx <- approxSilhouette(reducedDim(sce.pbmc, "PCA"), clusters=colLabels(sce.pbmc))
sil.approx
```

```
## DataFrame with 3985 rows and 3 columns
## cluster other width
## <factor> <factor> <numeric>
## AAACCTGAGAAGGCCT-1 8 14 0.369649
## AAACCTGAGACAGACC-1 8 9 0.297713
## AAACCTGAGGCATGGT-1 10 13 0.407494
## AAACCTGCAAGGTTCT-1 2 10 -0.281533
## AAACCTGCAGGCGATA-1 6 11 0.557694
## ... ... ... ...
## TTTGGTTTCGCTAGCG-1 9 8 0.173418
## TTTGTCACACTTAACG-1 12 5 0.322552
## TTTGTCACAGGTCCAC-1 1 10 -0.451044
## TTTGTCAGTTAAGACA-1 3 11 0.609979
## TTTGTCATCCCAAGAT-1 8 9 0.345747
```

```
sil.data <- as.data.frame(sil.approx)
sil.data$closest <- factor(ifelse(sil.data$width > 0, colLabels(sce.pbmc), sil.data$other))
sil.data$cluster <- colLabels(sce.pbmc)
library(ggplot2)
ggplot(sil.data, aes(x=cluster, y=width, colour=closest)) +
ggbeeswarm::geom_quasirandom(method="smiley")
```

For a more detailed examination, we identify the closest neighboring cluster for each cell in each cluster. In the table below, each row corresponds to one cluster; large off-diagonal counts indicate that its cells are easily confused with those from another cluster.

```
##
## Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 174 0 0 0 12 0 0 0 0 3 1 12 0 0 0 3
## 2 0 222 0 0 38 0 0 0 0 154 0 35 59 0 0 0
## 3 0 0 538 0 1 0 0 0 0 1 0 1 0 0 0 0
## 4 0 0 0 46 0 0 0 6 2 1 0 0 0 0 1 0
## 5 8 19 0 0 306 0 0 0 0 8 0 33 0 0 0 0
## 6 0 0 0 0 0 117 0 0 8 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 45 0 0 0 0 0 0 1 0 0
## 8 0 0 0 0 0 0 0 417 1 0 0 0 0 14 0 0
## 9 0 0 0 2 1 1 0 69 205 0 0 0 0 0 24 0
## 10 0 8 0 0 0 0 0 0 0 835 0 3 21 0 0 0
## 11 0 0 0 0 0 0 0 0 0 2 45 0 0 0 0 0
## 12 0 0 0 0 1 0 0 0 0 0 0 154 0 0 0 0
## 13 0 0 0 0 0 0 0 0 0 0 0 0 166 0 0 0
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 61 0 0
## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 84 0
## 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16
```

A useful aspect of the silhouette width is that it naturally captures both underclustering and overclustering.
Cells in heterogeneous clusters will have a large average distance to other cells in the same cluster;
all else being equal, this will decrease their widths compared to cells in homogeneous clusters.
Conversely, cells in overclustered datasets will have low average distances to cells in adjacent clusters, again resulting in low widths.
One can exploit this to obtain a “sensible” initial choice for the number of clusters by maximizing the average silhouette width (Section **??**).

### 5.2.3 Cluster purity

Another metric to assess cluster separation is the degree to which cells from multiple clusters intermingle in expression space. The “clustering purity” is defined for each cell as the proportion of neighboring cells that are assigned to the same cluster, after some weighting to adjust for differences in the number of cells between clusteres. Well-separated clusters should exhibit little intermingling and thus high purity values for all member cells.

```
## DataFrame with 3985 rows and 2 columns
## purity maximum
## <numeric> <factor>
## AAACCTGAGAAGGCCT-1 1.000000 8
## AAACCTGAGACAGACC-1 0.891699 8
## AAACCTGAGGCATGGT-1 1.000000 10
## AAACCTGCAAGGTTCT-1 0.518396 2
## AAACCTGCAGGCGATA-1 1.000000 6
## ... ... ...
## TTTGGTTTCGCTAGCG-1 0.907871 9
## TTTGTCACACTTAACG-1 1.000000 12
## TTTGTCACAGGTCCAC-1 1.000000 1
## TTTGTCAGTTAAGACA-1 1.000000 3
## TTTGTCATCCCAAGAT-1 0.900872 8
```

In Figure 5.2, median purity values are consistently greater than 0.9, indicating that most cells in each cluster are primarily surrounded by other cells from the same cluster. Some clusters have low purity values that may warrant more careful inspection - these probably represent closely related subpopulations.

```
pure.data <- as.data.frame(pure.pbmc)
pure.data$maximum <- factor(pure.data$maximum)
pure.data$cluster <- colLabels(sce.pbmc)
ggplot(pure.data, aes(x=cluster, y=purity, colour=maximum)) +
ggbeeswarm::geom_quasirandom(method="smiley")
```

To determine which clusters contaminate each other, we can identify the cluster with the most neighbors for each cell. In the table below, each row corresponds to one cluster; large off-diagonal counts indicate that its cells are easily confused with those from another cluster.

```
##
## Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 462 0 0 4 0 0 0 0 24 0 9 9 0 0 0
## 3 0 0 541 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 56 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 5 0 0 366 0 0 0 0 0 0 3 0 0 0 0
## 6 0 0 0 0 0 125 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 46 0 0 0 0 0 0 0 0 0
## 8 0 0 0 1 0 0 0 412 9 0 0 0 0 10 0 0
## 9 0 0 0 0 0 2 0 4 291 0 0 0 0 0 5 0
## 10 0 3 0 0 0 0 0 0 0 856 0 0 8 0 0 0
## 11 0 0 0 0 0 0 0 0 0 0 47 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0 0 0 155 0 0 0 0
## 13 0 0 0 0 0 0 0 0 0 0 0 0 166 0 0 0
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 61 0 0
## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 84 0
## 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16
```

The main difference between the cluster purity and silhouette width is that the former ignores the intra-cluster variance. This provides a simpler interpretation of cluster separation; a low silhouette width may still occur in well-separated clusters if the internal heterogeneity is high, while no such complication exists for the cluster purity. Comparing these two metrics can give us an indication of which clusters are heterogeneous, well-separated, or both.

### 5.2.4 Within-cluster sum of squares

The root-mean-squared-deviation (RMSD) for each cluster is defined pretty much as it is named - the root of the mean of the squared differences from the cluster centroid, where the mean is computed across all cells in the cluster. It is closely related to the within-cluster sum of squares (WCSS) and is a natural diagnostic for \(k\)-means clustering, given that the algorithm aims to find a clustering that minimizes the WCSS. However, it can be computed for any set of clusters if the original coordinates are available (Figure 5.3).

```
rmsd <- clusterRMSD(reducedDim(sce.pbmc, "PCA"), colLabels(sce.pbmc))
barplot(rmsd, ylab="RMSD", xlab="Cluster")
```

A large RMSD suggests that a cluster has some internal structure and should be prioritized for subclustering. Of course, this is subject to a number of caveats. Clusters generated from cells with low library sizes will naturally have larger RMSDs due to the greater impact of sequencing noise (Figure 5.4). Immediately adjacent clusters may also have high RMSDs even if they are not heterogeneous. This occurs when there are many cells on the boundaries between clusters, which would result in a higher sum of squares from the centroid.

```
by.clust <- split(sizeFactors(sce.pbmc), colLabels(sce.pbmc))
sf.by.clust <- vapply(by.clust, mean, 0)
plot(rmsd, sf.by.clust, log="xy", pch=16, xlab="RMSD", ylab="Average size factor")
```

### 5.2.5 Using graph modularity

For graph-based clustering, the modularity is a natural metric for evaluating the separation between communities. This is defined as the (scaled) difference between the observed total weight of edges between nodes in the same cluster and the expected total weight if edge weights were randomly distributed across all pairs of nodes. Larger modularity values indicate that there most edges occur within clusters, suggesting that the clusters are sufficiently well separated to avoid edges forming between neighboring cells in different clusters.

The standard approach is to report a single modularity value for a clustering on a given graph.
This is useful for comparing different clusterings on the same graph - and indeed, some community detection algorithms are designed with the aim of maximizing the modularity - but it is less helpful for interpreting a given clustering.
Rather, we use the `pairwiseModularity()`

function from *bluster* with `as.ratio=TRUE`

, which returns the ratio of the observed to expected sum of weights between each pair of clusters.
We use the ratio instead of the difference as the former is less dependent on the number of cells in each cluster.

```
g <- nn.clust$objects$graph
ratio <- pairwiseModularity(g, colLabels(sce.pbmc), as.ratio=TRUE)
dim(ratio)
```

`## [1] 16 16`

In this matrix, each row/column corresponds to a cluster and each entry contains the ratio of the observed to total weight of edges between cells in the respective clusters. A dataset containing well-separated clusters should contain most of the observed total weight on the diagonal entries, i.e., most edges occur between cells in the same cluster. Indeed, concentration of the weight on the diagonal of (Figure 5.5) indicates that most of the clusters are well-separated, while some modest off-diagonal entries represent closely related clusters with more inter-connecting edges.

```
library(pheatmap)
pheatmap(log2(ratio+1), cluster_rows=FALSE, cluster_cols=FALSE,
color=colorRampPalette(c("white", "blue"))(100))
```

One useful approach is to use the `ratio`

matrix to form another graph where the nodes are clusters rather than cells.
Edges between nodes are weighted according to the ratio of observed to expected edge weights between cells in those clusters.
We can then repeat our graph operations on this new cluster-level graph to explore the relationships between clusters.
For example, we could obtain clusters of clusters, or we could simply create a new cluster-based layout for visualization (Figure 5.6).
This is analogous to the “graph abstraction” approach described by Wolf et al. (2017), which can be used to identify trajectories in the data based on high-weight paths between clusters.

```
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
mode="upper", weighted=TRUE, diag=FALSE)
# Increasing the weight to increase the visibility of the lines.
set.seed(11001010)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
layout=igraph::layout_with_lgl)
```

Incidentally, some readers may have noticed that all *igraph* commands were prefixed with `igraph::`

.
We have done this deliberately to avoid bringing `igraph::normalize`

into the global namespace.
Rather unfortunately, this `normalize`

function accepts any argument and returns `NULL`

, which causes difficult-to-diagnose bugs when it overwrites `normalize`

from *BiocGenerics*.

## 5.3 Comparing different clusterings

### 5.3.1 Motivation

As previously mentioned, clustering’s main purpose is to obtain a discrete summary of the data for further interpretation. The diversity of available methods (and the subsequent variation in the clustering results) reflects the many different “perspectives” that can be derived from a high-dimensional scRNA-seq dataset. It is helpful to determine how these perspectives relate to each other by comparing the clustering results. More concretely, we want to know which clusters map to each other across algorithms; inconsistencies may be indicative of complex variation that is summarized differently by each clustering procedure. To illustrate, we will consider different variants of graph-based clustering on our PBMC dataset.

### 5.3.2 Identifying corresponding clusters

The simplest approach for comparing two clusterings of the same dataset is to create a 2-dimensional table of label frequencies. We can then identify how our Walktrap clusters are redistributed when we switch to using Louvain community detection (Figure 5.7). Multiple non-zero entries along a row or column indicate that multiple clusters in one clustering are merged in the other clustering.

```
tab <- table(Walktrap=clust.walktrap, Louvain=clust.louvain)
rownames(tab) <- paste("Walktrap", rownames(tab))
colnames(tab) <- paste("Louvain", colnames(tab))
library(pheatmap)
pheatmap(log10(tab+10), color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE)
```

A more sophisticated approach involves computing the Jaccard index for each pair of clusters. This normalizes for the differences in cluster abundance so that large clusters do not dominate the color scale (Figure 5.8). Jaccard indices of 1 represent a perfect 1:1 mapping between a pair of clusters.

```
jacc.mat <- linkClustersMatrix(clust.walktrap, clust.louvain)
rownames(jacc.mat) <- paste("Walktrap", rownames(jacc.mat))
colnames(jacc.mat) <- paste("Louvain", colnames(jacc.mat))
pheatmap(jacc.mat, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE)
```

We identify the best corresponding clusters based on the largest Jaccard index along each row. The magnitude of the index can be used as a measure of strength for the correspondence between those two clusters. A low index for a cluster indicates that no counterpart exists in the other clustering.

```
best <- max.col(jacc.mat, ties.method="first")
DataFrame(
Cluster=rownames(jacc.mat),
Corresponding=colnames(jacc.mat)[best],
Index=jacc.mat[cbind(seq_len(nrow(jacc.mat)), best)]
)
```

```
## DataFrame with 16 rows and 3 columns
## Cluster Corresponding Index
## <character> <character> <numeric>
## 1 Walktrap 1 Louvain 9 0.965854
## 2 Walktrap 2 Louvain 3 0.606343
## 3 Walktrap 3 Louvain 6 1.000000
## 4 Walktrap 4 Louvain 5 0.946429
## 5 Walktrap 5 Louvain 12 0.923077
## ... ... ... ...
## 12 Walktrap 12 Louvain 7 0.9308176
## 13 Walktrap 13 Louvain 13 0.5674740
## 14 Walktrap 14 Louvain 1 0.0922844
## 15 Walktrap 15 Louvain 11 0.3835616
## 16 Walktrap 16 Louvain 8 0.1379310
```

### 5.3.3 Visualizing differences

The `linkClusters()`

function constructs a graph where nodes are clusters from multiple clusterings and edges are formed between corresponding clusters.
More specifically, the weight of each edge is defined from the number of shared cells, with the Jaccard index being the potential weighting scheme used when `denominator="union"`

.
The idea is to obtain “clusters of clusters”, allowing us to visualize the relationships between clusters from two or more clusterings (Figure 5.9).
As the name suggests, `linkClusters()`

is closely related to the output of the `linkClustersMatrix()`

function;
indeed, the latter simply returns part of the adjacency matrix used to create the graph in the former.

```
clust.infomap <- clusterCells(sce.pbmc, use.dimred="PCA",
BLUSPARAM=NNGraphParam(cluster.fun="infomap"))
linked <- linkClusters(list(
walktrap=clust.walktrap,
louvain=clust.louvain,
infomap=clust.infomap)
)
linked
```

```
## IGRAPH 7b2ca88 UNW- 64 150 --
## + attr: name (v/c), weight (e/n)
## + edges from 7b2ca88 (vertex names):
## [1] walktrap.4 --louvain.1 walktrap.6 --louvain.1 walktrap.8 --louvain.1
## [4] walktrap.9 --louvain.1 walktrap.14--louvain.1 walktrap.2 --louvain.2
## [7] walktrap.10--louvain.2 walktrap.13--louvain.2 walktrap.2 --louvain.3
## [10] walktrap.5 --louvain.3 walktrap.10--louvain.3 walktrap.6 --louvain.4
## [13] walktrap.9 --louvain.4 walktrap.4 --louvain.5 walktrap.3 --louvain.6
## [16] walktrap.2 --louvain.7 walktrap.12--louvain.7 walktrap.1 --louvain.8
## [19] walktrap.2 --louvain.8 walktrap.9 --louvain.8 walktrap.10--louvain.8
## [22] walktrap.11--louvain.8 walktrap.16--louvain.8 walktrap.1 --louvain.9
## + ... omitted several edges
```

For clusterings that differ primarily in resolution (usually from different parameterizations of the same algorithm),
we can use the *clustree* package to visualize the relationships between them.
Here, the aim is to capture the redistribution of cells from one clustering to another at progressively higher resolution,
providing a convenient depiction of how clusters split apart (Figure 5.10).
This approach is most effective when the clusterings exhibit a clear gradation in resolution
but is less useful for comparisons involving theoretically distinct clustering procedures.

```
clust.5 <- clusterCells(sce.pbmc, use.dimred="PCA", BLUSPARAM=NNGraphParam(k=5))
clust.10 <- clusterCells(sce.pbmc, use.dimred="PCA", BLUSPARAM=NNGraphParam(k=10))
clust.50 <- clusterCells(sce.pbmc, use.dimred="PCA", BLUSPARAM=NNGraphParam(k=50))
combined <- cbind(k.5=clust.5, k.10=clust.10, k.50=clust.50)
library(clustree)
set.seed(1111)
clustree(combined, prefix="k.", edge_arrow=FALSE)
```

### 5.3.4 Adjusted Rand index

We can quantify the agreement between two clusterings by computing the Rand index with *bluster*’s `pairwiseRand()`

.
This is defined as the proportion of pairs of cells that retain the same status (i.e., both cells in the same cluster, or each cell in different clusters) in both clusterings.
In practice, we usually compute the adjusted Rand index (ARI) where we subtract the number of concordant pairs expected under random permutations of the clusterings; this accounts for differences in the size and number of clusters within and between clusterings.
A larger ARI indicates that the clusters are preserved, up to a maximum value of 1 for identical clusterings.
In and of itself, the magnitude of the ARI has little meaning, and it is best used to assess the relative similarities of different clusterings (e.g., “Walktrap is more similar to Louvain than either are to Infomap”).
Nonetheless, if one must have a hard-and-fast rule, experience suggests that an ARI greater than 0.5 corresponds to “good” similarity between two clusterings.

`## [1] 0.7796`

The same function can also provide a more granular perspective with `mode="ratio"`

,
where the ARI is broken down into its contributions from each pair of clusters in one of the clusterings.
This mode is helpful if one of the clusterings - in this case, `clust`

- is considered to be a “reference”,
and the aim is to quantify the extent to which the reference clusters retain their integrity in another clustering.
In the `breakdown`

matrix, each entry is a ratio of the adjusted number of concoordant pairs to the adjusted total number of pairs.
Low values on the diagonal in Figure 5.11 indicate that cells from the corresponding reference cluster in `clust`

are redistributed to multiple other clusters in `clust.5`

.
Conversely, low off-diagonal values indicate that the corresponding pair of reference clusters are merged together in `clust.5`

.

```
breakdown <- pairwiseRand(ref=clust.10, alt=clust.5, mode="ratio")
pheatmap(breakdown, color=viridis::magma(100),
cluster_rows=FALSE, cluster_cols=FALSE)
```

## 5.4 Evaluating cluster stability

A desirable property of a given clustering is that it is stable to perturbations to the input data (Von Luxburg 2010).
Stable clusters are logistically convenient as small changes to upstream processing will not change the conclusions;
greater stability also increases the likelihood that those conclusions can be reproduced in an independent replicate study.
*scran* uses bootstrapping to evaluate the stability of a clustering algorithm on a given dataset - that is, cells are sampled with replacement to create a “bootstrap replicate” dataset, and clustering is repeated on this replicate to see if the same clusters can be reproduced.
We demonstrate below for graph-based clustering on the PCs of the PBMC dataset.

```
myClusterFUN <- function(x) {
g <- bluster::makeSNNGraph(x, type="jaccard")
igraph::cluster_louvain(g)$membership
}
pcs <- reducedDim(sce.pbmc, "PCA")
originals <- myClusterFUN(pcs)
table(originals) # inspecting the cluster sizes.
```

```
## originals
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 459 848 127 56 541 343 290 48 200 45 124 233 372 94 62 143
```

```
set.seed(0010010100)
ratios <- bootstrapStability(pcs, FUN=myClusterFUN, clusters=originals)
dim(ratios)
```

`## [1] 16 16`

The function returns a matrix of ARI-derived ratios for every pair of original clusters in `originals`

(Figure 5.12), averaged across bootstrap iterations.
High ratios indicate that the clustering in the bootstrap replicates are highly consistent with that of the original dataset.
More specifically, high ratios on the diagonal indicate that cells in the same original cluster are still together in the bootstrap replicates, while high ratios off the diagonal indicate that cells in the corresponding cluster pair are still separted.

```
pheatmap(ratios, cluster_row=FALSE, cluster_col=FALSE,
color=viridis::magma(100), breaks=seq(-1, 1, length.out=101))
```

Bootstrapping is a general approach for evaluating cluster stability that is compatible with any clustering algorithm. The ARI-derived ratio between cluster pairs is also more informative than a single stability measure for all/each cluster as the former considers the relationships between clusters, e.g., unstable separation between \(X\) and \(Y\) does not penalize the stability of separation between \(X\) and another cluster \(Z\). However, one should take these metrics with a grain of salt, as bootstrapping only considers the effect of sampling noise and ignores other factors that affect reproducibility in an independent study (e.g., batch effects, donor variation). In addition, it is possible for a poor separation to be highly stable, so highly stable cluster may not necessarily represent some distinct subpopulation.

## 5.5 Clustering parameter sweeps

The `clusterSweep()`

function allows us to quickly apply our clustering with a range of different parameters.
For example, we can iterate across different combinations of `k`

and community detection algorithms for graph-based clustering.
We could then use `linkClusters()`

, *clustree* or similar functions to visualize the relationships between different clusterings.

```
# Parallelizing for some speed.
out <- clusterSweep(reducedDim(sce.pbmc, "PCA"),
NNGraphParam(),
k=as.integer(c(5, 10, 15, 20, 25, 30, 35, 40)),
cluster.fun=c("louvain", "walktrap", "infomap"),
BPPARAM=BiocParallel::MulticoreParam(8))
colnames(out$clusters)
```

```
## [1] "k.5_cluster.fun.louvain" "k.10_cluster.fun.louvain"
## [3] "k.15_cluster.fun.louvain" "k.20_cluster.fun.louvain"
## [5] "k.25_cluster.fun.louvain" "k.30_cluster.fun.louvain"
## [7] "k.35_cluster.fun.louvain" "k.40_cluster.fun.louvain"
## [9] "k.5_cluster.fun.walktrap" "k.10_cluster.fun.walktrap"
## [11] "k.15_cluster.fun.walktrap" "k.20_cluster.fun.walktrap"
## [13] "k.25_cluster.fun.walktrap" "k.30_cluster.fun.walktrap"
## [15] "k.35_cluster.fun.walktrap" "k.40_cluster.fun.walktrap"
## [17] "k.5_cluster.fun.infomap" "k.10_cluster.fun.infomap"
## [19] "k.15_cluster.fun.infomap" "k.20_cluster.fun.infomap"
## [21] "k.25_cluster.fun.infomap" "k.30_cluster.fun.infomap"
## [23] "k.35_cluster.fun.infomap" "k.40_cluster.fun.infomap"
```

```
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 523 302 125 45 172 573 249 439 293 95 772 142 38 18 62 38 30 16 15 9
## 21 22
## 16 13
```

Parameter sweeps are particularly useful when combined with one of the metrics described in Section 5.2. This provides a high-level summary of the behavior of the clustering method as the parameters change, as demonstrated in Figure 5.13. We can then make some decisions on which clustering(s) to use for further analyses. For example, we might choose a few clusterings that span several different resolutions, so as to obtain a greater diversity of summaries of the data; conversely, we might be able to save some time and effort by ignoring redundant clusterings with similar values for our metrics.

```
df <- as.data.frame(out$parameters)
df$num.clusters <- vapply(as.list(out$clusters), function(cluster) {
length(unique(cluster))
}, 0L)
all.sil <- lapply(as.list(out$clusters), function(cluster) {
sil <- approxSilhouette(reducedDim(sce.pbmc), cluster)
mean(sil$width)
})
df$silhouette <- unlist(all.sil)
all.wcss <- lapply(as.list(out$clusters), function(cluster) {
sum(clusterRMSD(reducedDim(sce.pbmc), cluster, sum=TRUE), na.rm=TRUE)
})
df$wcss <- unlist(all.wcss)
library(ggplot2)
gridExtra::grid.arrange(
ggplot(df, aes(x=k, y=num.clusters, group=cluster.fun, color=cluster.fun)) +
geom_line(lwd=2) + scale_y_log10(),
ggplot(df, aes(x=k, y=silhouette, group=cluster.fun, color=cluster.fun)) +
geom_line(lwd=2),
ggplot(df, aes(x=k, y=wcss, group=cluster.fun, color=cluster.fun)) +
geom_line(lwd=2),
ncol=3
)
```

We could even use the sweep to automatically choose the “best” clustering by optimizing one or more of these metrics. The simplest strategy is to maximize the silhouette width, though one can imagine more complex scores involving combinations of metrics. This approach is valid but any automatic choice should be treated as a suggestion rather than a rule. As previously discussed, the clustering at the optimal value of a metric may not be the most scientifically informative clustering, given that well-separated clusters typically correspond to cell types that are already known.

## 5.6 Agglomerating graph communities

Some community detection algorithms operate by agglomeration and thus can be used to construct a hierarchical dendrogram based on the pattern of merges between clusters.
The dendrogram itself is not particularly informative as it simply describes the order of merge steps performed by the algorithm; unlike the dendrograms produced by hierarchical clustering (Basic Section 5.4), it does not capture the magnitude of differences between subpopulations.
However, it does provide a convenient avenue for manually tuning the clustering resolution by generating nested clusterings using the `cut_at()`

function, as shown below.

```
##
## 1 2 3 4 5
## 3546 221 125 46 47
```

```
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 462 374 125 437 46 432 302 173 867 47 155 166 104 40 61 84 46 32 16 16
```

If `cut_at()`

-like functionality is desired for non-hierarchical methods, *bluster* provides a `mergeCommunities()`

function to retrospectively tune the clustering resolution.
This function will greedily merge pairs of clusters until a specified number of clusters is achieved, where pairs are chosen to maximize the modularity at each merge step.

```
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 661 857 353 126 53 541 152 116 198 46 219 376 287
```

```
## Error in igraph::cut_at(community.louvain, n = 10) :
## Not a hierarchical communitity structure
```

```
## merged
## 1 2 3 4 6 9 10 11 12 13
## 661 857 353 126 541 198 162 272 528 287
```

## Session Info

```
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] clustree_0.4.4 ggraph_2.0.5
[3] pheatmap_1.0.12 ggplot2_3.3.5
[5] bluster_1.4.0 scran_1.22.0
[7] scuttle_1.4.0 SingleCellExperiment_1.16.0
[9] SummarizedExperiment_1.24.0 Biobase_2.54.0
[11] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
[13] IRanges_2.28.0 S4Vectors_0.32.2
[15] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[17] matrixStats_0.61.0 BiocStyle_2.22.0
[19] rebook_1.4.0
loaded via a namespace (and not attached):
[1] ggbeeswarm_0.6.0 colorspace_2.0-2
[3] ellipsis_0.3.2 XVector_0.34.0
[5] BiocNeighbors_1.12.0 farver_2.1.0
[7] graphlayouts_0.7.1 ggrepel_0.9.1
[9] fansi_0.5.0 codetools_0.2-18
[11] sparseMatrixStats_1.6.0 knitr_1.36
[13] polyclip_1.10-0 jsonlite_1.7.2
[15] cluster_2.1.2 graph_1.72.0
[17] ggforce_0.3.3 BiocManager_1.30.16
[19] compiler_4.1.2 backports_1.3.0
[21] dqrng_0.3.0 assertthat_0.2.1
[23] Matrix_1.3-4 fastmap_1.1.0
[25] limma_3.50.0 tweenr_1.0.2
[27] BiocSingular_1.10.0 htmltools_0.5.2
[29] tools_4.1.2 rsvd_1.0.5
[31] igraph_1.2.8 gtable_0.3.0
[33] glue_1.5.0 GenomeInfoDbData_1.2.7
[35] dplyr_1.0.7 rappdirs_0.3.3
[37] Rcpp_1.0.7 jquerylib_0.1.4
[39] vctrs_0.3.8 DelayedMatrixStats_1.16.0
[41] xfun_0.28 stringr_1.4.0
[43] beachmat_2.10.0 lifecycle_1.0.1
[45] irlba_2.3.3 statmod_1.4.36
[47] XML_3.99-0.8 edgeR_3.36.0
[49] zlibbioc_1.40.0 MASS_7.3-54
[51] scales_1.1.1 tidygraph_1.2.0
[53] parallel_4.1.2 RColorBrewer_1.1-2
[55] yaml_2.2.1 gridExtra_2.3
[57] sass_0.4.0 stringi_1.7.5
[59] highr_0.9 ScaledMatrix_1.2.0
[61] checkmate_2.0.0 filelock_1.0.2
[63] BiocParallel_1.28.0 rlang_0.4.12
[65] pkgconfig_2.0.3 bitops_1.0-7
[67] evaluate_0.14 lattice_0.20-45
[69] purrr_0.3.4 CodeDepends_0.6.5
[71] labeling_0.4.2 tidyselect_1.1.1
[73] magrittr_2.0.1 bookdown_0.24
[75] R6_2.5.1 generics_0.1.1
[77] metapod_1.2.0 DelayedArray_0.20.0
[79] DBI_1.1.1 pillar_1.6.4
[81] withr_2.4.2 RCurl_1.98-1.5
[83] tibble_3.1.6 dir.expiry_1.2.0
[85] crayon_1.4.2 utf8_1.2.2
[87] rmarkdown_2.11 viridis_0.6.2
[89] locfit_1.5-9.4 grid_4.1.2
[91] digest_0.6.28 tidyr_1.1.4
[93] munsell_0.5.0 beeswarm_0.4.0
[95] viridisLite_0.4.0 vipor_0.4.5
[97] bslib_0.3.1
```

### References

Von Luxburg, U. 2010. “Clustering Stability: An Overview.” *Foundations and Trends in Machine Learning* 2 (3): 235–74.

Wolf, F. Alexander, Fiona Hamey, Mireya Plass, Jordi Solana, Joakim S. Dahlin, Berthold Gottgens, Nikolaus Rajewsky, Lukas Simon, and Fabian J. Theis. 2017. “Graph Abstraction Reconciles Clustering with Trajectory Inference Through a Topology Preserving Map of Single Cells.” *bioRxiv*. https://doi.org/10.1101/208819.