1 Introduction

SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference set. Specifically, for each test cell:

  1. We compute the Spearman correlation between its expression profile and that of each reference sample.
  2. We define the per-label score as a fixed quantile (by default, 0.8) of the distribution of correlations.
  3. We repeat this for all labels and we take the label with the highest score as the annotation for this cell.
  4. We optionally perform a fine-tuning step:
  • The reference dataset is subsetted to only include labels with scores close to the maximum.
  • Scores are recomputed using only marker genes for the subset of labels.
  • This is iterated until one label remains.

Automatic annotation provides a convenient way of transferring biological knowledge across datasets. In this manner, the burden of interpreting clusters and defining marker genes only has to be done once.

2 Using the built-in references

SingleR provides several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated data retrieval functions. For example, to obtain reference data from the Human Primary Cell Atlas:

library(SingleR)
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se
## class: SummarizedExperiment 
## dim: 19363 713 
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(2): label.main label.fine

We use this reference in the SingleR() function to annotate a scRNA-seq dataset from La Manno et al. (2016). For the sake of speed, we will only label the first 100 cells from this dataset.

library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]

# Restrict to common genes between test and reference data:
library(scater)
common <- intersect(rownames(hESCs), rownames(hpca.se))
hpca.se <- hpca.se[common,]
hESCs <- hESCs[common,]
hESCs <- logNormCounts(hESCs)

pred.hpca <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)
pred.hpca
## DataFrame with 100 rows and 5 columns
##                                                                     scores
##                                                                   <matrix>
## 1772122_301_C02  0.118426779945786:0.179699807625087:0.157326274226517:...
## 1772122_180_E05  0.129708246318855:0.236277439793527:0.202370888668263:...
## 1772122_300_H02  0.158201338525345:0.250060222727419:0.211831550178353:...
## 1772122_180_B09   0.158778546217777:0.27716592787528:0.222681369744636:...
## 1772122_180_G04   0.138505219642345:0.236658649096383:0.19092437361406:...
## ...                                                                    ...
## 1772122_299_E07  0.145931041885859:0.241153701803065:0.217382763112476:...
## 1772122_180_D02  0.122983434596168:0.239181076829949:0.181221997276501:...
## 1772122_300_D09  0.129757310468164:0.233775092572195:0.196637664917917:...
## 1772122_298_F09  0.143118885460347:0.262267367714562:0.214329641867196:...
## 1772122_302_A11 0.0912854247387272:0.185945405472165:0.139232371863794:...
##                         first.labels                         tuning.scores
##                          <character>                           <DataFrame>
## 1772122_301_C02 Neuroepithelial_cell   0.18244020296249:0.0991115652997192
## 1772122_180_E05 Neuroepithelial_cell  0.137548373236792:0.0647133734667384
## 1772122_300_H02 Neuroepithelial_cell   0.275798157639906:0.136969040146444
## 1772122_180_B09 Neuroepithelial_cell 0.0851622797320583:0.0819878452425098
## 1772122_180_G04 Neuroepithelial_cell   0.198841544187094:0.101662168246495
## ...                              ...                                   ...
## 1772122_299_E07 Neuroepithelial_cell  0.176002520599547:0.0922503823656398
## 1772122_180_D02 Neuroepithelial_cell   0.196760862365318:0.112480486219438
## 1772122_300_D09 Neuroepithelial_cell 0.0816424287822026:0.0221368018363302
## 1772122_298_F09 Neuroepithelial_cell  0.187249853552379:0.0671892835266423
## 1772122_302_A11 Neuroepithelial_cell   0.156079956344163:0.105132159755961
##                               labels        pruned.labels
##                          <character>          <character>
## 1772122_301_C02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_E05              Neurons              Neurons
## 1772122_300_H02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_B09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_G04 Neuroepithelial_cell Neuroepithelial_cell
## ...                              ...                  ...
## 1772122_299_E07 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_D02 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_300_D09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_298_F09 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_302_A11            Astrocyte            Astrocyte

Each row of the output DataFrame contains prediction results for a single cell. Labels are shown before fine-tuning (first.labels), after fine-tuning (labels) and after pruning (pruned.labels), along with the associated scores. We summarize the distribution of labels across our subset of cells: [6~

table(pred.hpca$labels)
## 
##            Astrocyte Neuroepithelial_cell              Neurons 
##                   14                   81                    5

At this point, it is worth noting that SingleR is workflow/package agnostic. The above example uses SummarizedExperiment objects, but the same functions will accept any (log-)normalized expression matrix.

3 Using single-cell references

3.1 Setting up the data

Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.

library(scRNAseq)
sceM <- MuraroPancreasData()

# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]
table(sceM$label)
## 
##      acinar       alpha        beta       delta        duct endothelial 
##         219         812         448         193         245          21 
##     epsilon mesenchymal          pp     unclear 
##           3          80         101           4
sceM <- logNormCounts(sceM)

We then set up our test dataset from Grun et al. (2016). To speed up this demonstration, we will subset to the first 100 cells.

sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0]
sceG <- logNormCounts(sceG) 
sceG <- sceG[,1:100]

We then restrict to common genes:

common <- intersect(rownames(sceM), rownames(sceG))
sceM <- sceM[common,]
sceG <- sceG[common,]

3.2 Defining custom markers

The default marker definition in SingleR() is intended for references derived from bulk RNA-seq data. When using single-cell data as a reference, we suggest building your own marker list. This involves a series of pairwise comparisons between labels to define markers that distinguish each label from another, and is easy to perform with functions from scran. For example, we can perform pairwise \(t\)-tests and obtain the top 10 marker genes from each pairwise comparison.

library(scran)
out <- pairwiseTTests(logcounts(sceM), sceM$label, direction="up")
markers <- getTopMarkers(out$statistics, out$pairs, n=10)

We then supply these genes to SingleR() directly via the genes= argument. A more focused gene set also allows annotation to be performed more quickly compared to the default approach.

pred <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=markers)
table(pred$labels)
## 
##  acinar    beta   delta    duct      pp unclear 
##      59       4       1      34       1       1

In some cases, markers may only be available for specific labels rather than for pairwise comparisons between labels. This is accommodated by supplying a named list of character vectors to genes. Note that this is likely to be less powerful than the list-of-lists approach as information about pairwise differences is discarded.

label.markers <- lapply(markers, unlist, recursive=FALSE)
pred2 <- SingleR(test=sceG, ref=sceM, labels=sceM$label, genes=label.markers)
table(pred$labels, pred2$labels)
##          
##           acinar beta delta duct pp
##   acinar      53    0     0    6  0
##   beta         0    4     0    0  0
##   delta        0    0     1    0  0
##   duct         0    0     0   34  0
##   pp           0    0     0    0  1
##   unclear      0    0     0    1  0

4 Annotation diagnostics

4.1 Based on the scores within cells

SingleR provides a few basic yet powerful visualization tools. plotScoreHeatmap() displays the scores for all cells across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. We can also display clusters (or other metadata information) for each cell by setting clusters= or annotation_col=. In this case, we display which donor the cells came from and the labels assigned to each cell.

plotScoreHeatmap(pred, show.labels = TRUE,
    annotation_col=data.frame(donor=sceG$donor,
        row.names=rownames(pred)))

For this plot, the key point is to examine the spread of scores within each cell. Ideally, each cell (i.e., column) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain.

4.2 Based on the deltas across cells

The pruneScores() function will remove potentially poor-quality or ambiguous assignments. In particular, ambiguous assignments are identified based on the per-cell delta, i.e., the difference between the score for the assigned label and the median across all labels for each cell. Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell’s true label does not exist in the reference. The exact threshold used for pruning is identified using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts.

to.remove <- pruneScores(pred)
summary(to.remove)
##    Mode   FALSE 
## logical     100

By default, SingleR() will also report pruned labels where low-quality assignments are replaced with NA. However, the default pruning thresholds may not be appropriate for every dataset - see ?pruneScores for a more detailed discussion. We also provide the plotScoreDistribution() to help determine whether the thresholds are appropriate. This displays the per-label distribution of the differences-from-median across cells, from which pruneScores() defines an appropriate threshold as 3 median absolute deviations (MADs) below the median.

plotScoreDistribution(pred, show = "delta.med", ncol = 3, show.nmads = 3)