Contents

1 Introduction

The bluster package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows. At its core is the clusterRows() generic that controls dispatch to different clustering algorithms. We will demonstrate on some single-cell RNA sequencing data from the scRNAseq package; our aim is to cluster cells into cell populations based on their PC coordinates.

library(scRNAseq)
sce <- ZeiselBrainData()

# Trusting the authors' quality control, and going straight to normalization.
library(scuttle)
sce <- logNormCounts(sce)

# Feature selection based on highly variable genes.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction for work (PCA) and pleasure (t-SNE).
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
sce <- runUMAP(sce, dimred="PCA")

mat <- reducedDim(sce, "PCA")
dim(mat)
## [1] 3005   20

2 Hierarchical clustering

Our first algorithm is good old hierarchical clustering, as implemented using hclust() from the stats package. This automatically sets the cut height to half the dendrogram height.

library(bluster)
hclust.out <- clusterRows(mat, HclustParam())
plotUMAP(sce, colour_by=I(hclust.out))

Advanced users can achieve greater control of the procedure by passing more parameters to the HclustParam() constructor. Here, we use Wardโ€™s criterion for the agglomeration with a dynamic tree cut from the dynamicTreeCut package.

hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE)
hp2
## class: HclustParam
## metric: euclidean
## method: ward.D2
## cut.fun: cutreeDynamic
## cut.params(0):
hclust.out <- clusterRows(mat, hp2)
plotUMAP(sce, colour_by=I(hclust.out))

3 \(k\)-means clustering

Our next algorithm is \(k\)-means clustering, as implemented using the kmeans() function. This requires us to pass in the number of clusters, either as a number:

set.seed(100)
kmeans.out <- clusterRows(mat, KmeansParam(10))
plotUMAP(sce, colour_by=I(kmeans.out))