1 Introduction

Single-Cell Consensus Clustering (SC3) is a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. An interactive graphical implementation makes SC3 accessible to a wide audience of users. SC3 also aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. A manuscript describing SC3 in detail is published in Nature Methods.

2 SingleCellExperiment, QC and scater

SC3 is purely a clustering tool and does not provide functions for sequencing quality control (QC) or normalisation. Instead, users are expected to perform these preprocessing steps in advance. To encourage preprocessing, SC3 is built on top of Bioconductor’s SingleCellExperiment class and uses functionality of the scater package for QC.

3 Quick Start

3.1 SC3 Input

If you already have a SingleCellExperiment object created and QCed using scater, then proceed to the next section.

If you have a matrix containing expression data that was QCed and normalised by some other tool, then we first need to form a SingleCellExperiment object containing the data. For illustrative purposes we will use an example expression matrix provided with SC3. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryos. The authors (Yan et al.) have defined developmental stages of all cells in the original publication (the ann data frame). The rows in the yan dataset correspond to genes and the columns correspond to cells.

library(SingleCellExperiment)
library(SC3)
library(scater)

head(ann)
##                 cell_type1
## Oocyte..1.RPKM.     zygote
## Oocyte..2.RPKM.     zygote
## Oocyte..3.RPKM.     zygote
## Zygote..1.RPKM.     zygote
## Zygote..2.RPKM.     zygote
## Zygote..3.RPKM.     zygote
yan[1:3, 1:3]
##          Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152             0.0             0.0             0.0
## RPS11             1219.9          1021.1           931.6
## ELMO2                7.0            12.2             9.3

The ann data frame contains just the cell_type1 column, which corresponds to the cell labels provided by the authors of the original publication. Note that in general it can also contain more information about the cells, such as plate, run, well, date, etc.

Now we can create a SingleCellExperiment object from the yan expression matrix.

Note that SC3 requires both the counts and logcounts slots to exist in the input SingleCellExperiment object. The counts slot is used for gene filtering, which is based on gene dropout rates. The logcounts slot, which should contain a normalised and log-transformed expression matrix, is used in the main clustering algorithm. In the case of the yan dataset, even though the counts are not available (we only have FPKM values), we can use the FPKM values for gene dropout rate calculations, since FPKM normalisation does not change the dropout rate.
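The claim about dropout rates can be checked on toy data. The sketch below assumes that the dropout rate of a gene is the percentage of cells in which its value is exactly zero, and that a normalisation such as FPKM amounts to multiplying each entry by a positive factor. The matrix and the scaling factors are simulated and are not part of SC3.

```r
set.seed(1)

# Simulated count matrix: 5 genes x 8 cells with many zeros (made-up values)
counts <- matrix(
    rpois(40, lambda = 0.7),
    nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("cell", 1:8))
)

# Hypothetical positive per-gene and per-cell scaling factors,
# standing in for gene length and library size corrections
gene_scale <- runif(5, 0.5, 2)
cell_scale <- runif(8, 0.5, 2)
scaled <- counts * outer(gene_scale, cell_scale)

# Dropout rate: percentage of cells in which a gene has a zero value
dropout_rate <- function(m) rowMeans(m == 0) * 100

# Zeros stay zeros and non-zeros stay non-zero, so the rates agree
print(dropout_rate(counts))
print(dropout_rate(scaled))
```

Because a zero multiplied by any positive factor is still zero, both calls print the same per-gene percentages.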

SC3 also requires the feature_symbol column of the rowData slot of the input SingleCellExperiment object to contain the preferred feature names (genes/transcripts), which will be used in further visualisations.

Additionally, if spike-ins are defined via the isSpike function, SC3 will automatically remove them before clustering:

# create a SingleCellExperiment object
sce <- SingleCellExperiment(
    assays = list(
        counts = as.matrix(yan),
        logcounts = log2(as.matrix(yan) + 1)
    ), 
    colData = ann
)

# define feature names in feature_symbol column
rowData(sce)$feature_symbol <- rownames(sce)
# remove features with duplicated names
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]

# define spike-ins
isSpike(sce, "ERCC") <- grepl("ERCC", rowData(sce)$feature_symbol)

scater allows a user to quickly visualize and assess any SingleCellExperiment object, for example using a PCA plot:

plotPCA(sce, colour_by = "cell_type1")

3.2 Run SC3

If you would like to explore clustering of your data in the range of ks (the number of clusters) from 2 to 4, you just need to run the main sc3 method and define the range of ks using the ks parameter (here we also ask SC3 to calculate biological features based on the identified cell clusters):

sce <- sc3(sce, ks = 2:4, biology = TRUE)
## Setting SC3 parameters...
## Calculating distances between the cells...
## Performing transformations and calculating eigenvectors...
## Performing k-means clustering...
## Calculating consensus matrix...
## Calculating biology...

By default SC3 will use all but one of the cores of your machine. You can manually set the number of cores to be used by setting the n_cores parameter in the sc3 call.

To quickly and easily explore the SC3 solutions using an interactive Shiny application use the following method:

sc3_interactive(sce)

Visual exploration can provide a reasonable estimate of the number of clusters k. Once a preferred k is chosen, it is also possible to export the results into an Excel file:

sc3_export_results_xls(sce)

This will write all results to the sc3_results.xls file. The name of the file can be controlled by the filename parameter.

3.3 colData

SC3 writes all its results obtained for cells to the colData slot of the sce object by adding additional columns to it. This slot also contains all other cell features calculated by the scater package either automatically during the sce object creation or during the calculateQCMetrics call. One can identify the SC3 results using the "sc3_" prefix:

col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])
## DataFrame with 6 rows and 6 columns
##                 sc3_2_clusters sc3_3_clusters sc3_4_clusters
##                       <factor>       <factor>       <factor>
## Oocyte..1.RPKM.              2              2              2
## Oocyte..2.RPKM.              2              2              2
## Oocyte..3.RPKM.              2              2              2
## Zygote..1.RPKM.              2              2              2
## Zygote..2.RPKM.              2              2              2
## Zygote..3.RPKM.              2              2              2
##                 sc3_2_log2_outlier_score sc3_3_log2_outlier_score
##                                <numeric>                <numeric>
## Oocyte..1.RPKM.                        0         1.67032836742406
## Oocyte..2.RPKM.                        0         1.69878936817052
## Oocyte..3.RPKM.                        0         1.16603348178042
## Zygote..1.RPKM.                        0                        0
## Zygote..2.RPKM.                        0                        0
## Zygote..3.RPKM.                        0                        0
##                 sc3_4_log2_outlier_score
##                                <numeric>
## Oocyte..1.RPKM.         1.67032836742396
## Oocyte..2.RPKM.         1.69878936817042
## Oocyte..3.RPKM.          1.1660334817803
## Zygote..1.RPKM.                        0
## Zygote..2.RPKM.                        0
## Zygote..3.RPKM.                        0

Additionally, having SC3 results stored in the same slot makes it possible to highlight them in any of scater’s plotting functions, for example:

plotPCA(
    sce, 
    colour_by = "sc3_3_clusters", 
    size_by = "sc3_3_log2_outlier_score"
)
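Since the cluster assignments are ordinary columns, they can also be compared with any existing annotation. The following is a minimal sketch on a toy data frame that stands in for as.data.frame(colData(sce)); the cell names, labels and scores are made up for illustration only.

```r
# Toy stand-in for as.data.frame(colData(sce)); all values are made up
col_data <- data.frame(
    cell_type1 = c("zygote", "zygote", "2cell", "2cell", "4cell", "4cell"),
    sc3_3_clusters = factor(c(2, 2, 1, 1, 3, 3)),
    sc3_3_log2_outlier_score = c(1.7, 0, 0, 0, 0.4, 0),
    row.names = paste0("cell", 1:6)
)

# Select the SC3 columns by their prefix
sc3_cols <- grep("^sc3_", colnames(col_data), value = TRUE)
print(sc3_cols)

# Contingency table of reference labels versus SC3 clusters
agreement <- table(
    reference = col_data$cell_type1,
    sc3 = col_data$sc3_3_clusters
)
print(agreement)
```

With a real sce object the same table call could be applied to colData(sce)$cell_type1 and colData(sce)$sc3_3_clusters.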

3.4 rowData

SC3 writes all its results obtained for features (genes/transcripts) to the rowData slot of the sce object by adding additional columns to it. This slot also contains all other feature values calculated by the scater package either automatically during the sce object creation or during the calculateQCMetrics call. One can identify the SC3 results using the "sc3_" prefix:

row_data <- rowData(sce)
head(row_data[ , grep("sc3_", colnames(row_data))])
## DataFrame with 6 rows and 13 columns
##   sc3_gene_filter sc3_2_markers_clusts   sc3_2_markers_padj
##         <logical>            <numeric>            <numeric>
## 1           FALSE                   NA                   NA
## 2           FALSE                   NA                   NA
## 3            TRUE                    2 3.42891755294448e-06
## 4            TRUE                    2                    1
## 5           FALSE                   NA                   NA
## 6            TRUE                    1                    1
##   sc3_2_markers_auroc sc3_3_markers_clusts   sc3_3_markers_padj
##             <numeric>            <numeric>            <numeric>
## 1                  NA                   NA                   NA
## 2                  NA                   NA                   NA
## 3   0.905833333333333                    2 8.74957155809462e-08
## 4   0.635833333333333                    2 0.000364998858198823
## 5                  NA                   NA                   NA
## 6   0.549722222222222                    1                    1
##   sc3_3_markers_auroc sc3_4_markers_clusts   sc3_4_markers_padj
##             <numeric>            <numeric>            <numeric>
## 1                  NA                   NA                   NA
## 2                  NA                   NA                   NA
## 3    0.96969696969697                    2 8.97459044385902e-08
## 4   0.827020202020202                    2  0.00038529310418705
## 5                  NA                   NA                   NA
## 6   0.549722222222222                    3                    1
##   sc3_4_markers_auroc        sc3_2_de_padj        sc3_3_de_padj
##             <numeric>            <numeric>            <numeric>
## 1                  NA                   NA                   NA
## 2                  NA                   NA                   NA
## 3    0.96969696969697 3.33654105464895e-06 7.88333987242478e-10
## 4   0.827020202020202                    1  0.00203864534241844
## 5                  NA                   NA                   NA
## 6   0.543928571428571                    1                    1
##          sc3_4_de_padj
##              <numeric>
## 1                   NA
## 2                   NA
## 3 1.86540162226264e-09
## 4  0.00616104934849165
## 5                   NA
## 6                    1

Because the biological features were also calculated for each k, one can find adjusted p-values for both differential expression and marker genes, as well as the area under the ROC curve values (see ?sc3_calc_biology for more information).
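These columns can be filtered like any other data frame columns. The sketch below rebuilds the k = 3 marker columns from the six rows printed above and keeps the genes that pass two thresholds. The cut-offs (adjusted p-value below 0.01, AUROC above 0.85) are illustrative choices made here, not values prescribed by this section.

```r
# The k = 3 marker columns for the six rows printed above
row_data <- data.frame(
    sc3_3_markers_clusts = c(NA, NA, 2, 2, NA, 1),
    sc3_3_markers_padj = c(NA, NA, 8.74957155809462e-08,
                           0.000364998858198823, NA, 1),
    sc3_3_markers_auroc = c(NA, NA, 0.96969696969697,
                            0.827020202020202, NA, 0.549722222222222)
)

# Illustrative thresholds (assumptions, adjust as needed)
padj_max <- 0.01
auroc_min <- 0.85

# which() drops the NA rows, i.e. genes removed by the gene filter
keep <- which(
    row_data$sc3_3_markers_padj < padj_max &
        row_data$sc3_3_markers_auroc > auroc_min
)
markers <- row_data[keep, ]

# Order by cluster, then by decreasing AUROC
markers <- markers[order(markers$sc3_3_markers_clusts,
                         -markers$sc3_3_markers_auroc), ]
print(markers)
```

Of the six rows shown, only row 3 passes both thresholds: row 4 has a small adjusted p-value but an AUROC below 0.85.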

4 Number of Cells

The default settings of SC3 allow clustering (using a single k) of a dataset of 2,000 cells in about 20-30 minutes.

For datasets with more than 2,000 cells SC3 automatically adjusts some of its parameters (see below). This allows a dataset of 5,000 cells to be clustered in about 20-30 minutes. The parameters can also be manually adjusted for datasets with any number of cells.

For datasets with more than 5,000 cells SC3 utilizes a hybrid approach that combines unsupervised and supervised clustering (see below). Namely, SC3 selects a subset of cells uniformly at random, and obtains clusters from this subset. Subsequently, the inferred labels are used to train a Support Vector Machine (SVM), which is employed to assign labels to the remaining cells. Training cells can also be manually selected by providing their indices.
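The hybrid idea can be illustrated outside of SC3 on simulated data. The sketch below is not SC3's implementation: it assumes the e1071 package is installed, uses plain k-means as a stand-in for the consensus clustering step, and picks a linear kernel and a training set size purely for illustration.

```r
library(e1071)
set.seed(1)

# Simulated data: 300 cells x 5 features in 3 well-separated groups
n_per_group <- 100
x <- rbind(
    matrix(rnorm(n_per_group * 5, mean = 0), ncol = 5),
    matrix(rnorm(n_per_group * 5, mean = 4), ncol = 5),
    matrix(rnorm(n_per_group * 5, mean = 8), ncol = 5)
)
rownames(x) <- paste0("cell", seq_len(nrow(x)))
colnames(x) <- paste0("f", 1:5)

# 1. Select the training cells uniformly at random
train_inds <- sample(nrow(x), 60)

# 2. Cluster only the training cells
#    (k-means here, standing in for the consensus clustering)
train_labels <- factor(
    kmeans(x[train_inds, ], centers = 3, nstart = 10)$cluster
)

# 3. Train an SVM on the inferred labels
model <- svm(x[train_inds, ], train_labels, kernel = "linear")

# 4. Assign labels to the remaining cells
labels <- rep(NA_character_, nrow(x))
labels[train_inds] <- as.character(train_labels)
labels[-train_inds] <- as.character(predict(model, x[-train_inds, ]))

print(table(labels))
```

Only the 60 training cells go through the clustering step, which is where the time saving comes from; the other 240 cells receive their labels from the classifier.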

5 Plot Functions

SC3 also provides methods for plotting all figures from the interactive session.

5.1 Consensus Matrix

The consensus matrix is an N by N matrix, where N is the number of cells in the input dataset. It represents similarity between the cells based on the averaging of clustering results from all combinations of clustering parameters. Similarity 0 (blue) means that the two cells are always assigned to different clusters. In contrast, similarity 1 (red) means that the two cells are always assigned to the same cluster. The consensus matrix is clustered by hierarchical clustering and has a diagonal-block structure. Intuitively, the perfect clustering is achieved when all diagonal blocks are completely red and all off-diagonal elements are completely blue.

sc3_plot_consensus(sce, k = 3)
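How such a matrix arises can be sketched in base R. The example below is a simplified illustration rather than SC3's own code: the data are simulated, the individual clustering solutions come from k-means runs with different random starts instead of SC3's parameter combinations, and the hierarchical clustering settings are the hclust defaults.

```r
set.seed(42)

# Simulated data: 3 groups of 10 cells, 2 features each
x <- rbind(
    matrix(rnorm(20, mean = 0), ncol = 2),
    matrix(rnorm(20, mean = 5), ncol = 2),
    matrix(rnorm(20, mean = 10), ncol = 2)
)

# Several individual clustering solutions (k-means, random starts)
runs <- lapply(1:20, function(i) kmeans(x, centers = 3)$cluster)

# One binary N x N matrix per run: 1 if two cells share a cluster
co_clustered <- lapply(runs, function(cl) outer(cl, cl, "==") * 1)

# Consensus matrix: average of the binary matrices, values in [0, 1]
consensus <- Reduce(`+`, co_clustered) / length(co_clustered)

# Hierarchical clustering of the consensus matrix, cut into 3 groups
hc <- hclust(dist(consensus))
labels <- cutree(hc, k = 3)

print(round(consensus[1:5, 1:5], 2))
print(table(labels))
```

An entry of 1 means the two cells were placed together in every run and an entry of 0 means they never were, matching the red and blue extremes in the plot.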