SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference set. Specifically, for each test cell:
Automatic annotation provides a convenient way of transferring biological knowledge across datasets. In this manner, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this knowledge can be propagated to new datasets in an automated manner.
SingleR provides several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated data retrieval functions.
For example, we obtain reference data from the Human Primary Cell Atlas using the
which returns a
SummarizedExperiment object containing matrix of log-expression values with sample-level labels.
library(SingleR) hpca.se <- HumanPrimaryCellAtlasData() hpca.se
## class: SummarizedExperiment ## dim: 19363 713 ## metadata(0): ## assays(1): logcounts ## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3 ## rowData names(0): ## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234 ## colData names(3): label.main label.fine label.ont
Our test dataset will is taken from La Manno et al. (2016).
For the sake of speed, we will only label the first 100 cells from this dataset.
library(scRNAseq) hESCs <- LaMannoBrainData('human-es') hESCs <- hESCs[,1:100] # SingleR() expects log-counts, but the function will also happily take raw # counts for the test dataset. The reference, however, must have log-values. library(scater) hESCs <- logNormCounts(hESCs)
We use our
hpca.se reference to annotate each cell in
hESCs via the
SingleR() function, which uses the algorithm described above.
Note that the default marker detection method is to take the genes with the largest positive log-fold changes in the per-label medians for each gene.
pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main) pred.hesc
## DataFrame with 100 rows and 5 columns ## scores first.labels ## <matrix> <character> ## 1772122_301_C02 0.118427:0.179700:0.157326:... Neuroepithelial_cell ## 1772122_180_E05 0.129708:0.236277:0.202371:... Neuroepithelial_cell ## 1772122_300_H02 0.158201:0.250060:0.211832:... Neuroepithelial_cell ## 1772122_180_B09 0.158779:0.277166:0.222681:... Neuroepithelial_cell ## 1772122_180_G04 0.138505:0.236659:0.190924:... Neuroepithelial_cell ## ... ... ... ## 1772122_299_E07 0.1459310:0.241154:0.217383:... Neuroepithelial_cell ## 1772122_180_D02 0.1229834:0.239181:0.181222:... Neuroepithelial_cell ## 1772122_300_D09 0.1297573:0.233775:0.196638:... Neuroepithelial_cell ## 1772122_298_F09 0.1431189:0.262267:0.214330:... Neuroepithelial_cell ## 1772122_302_A11 0.0912854:0.185945:0.139232:... Neuroepithelial_cell ## tuning.scores labels pruned.labels ## <DataFrame> <character> <character> ## 1772122_301_C02 0.1824402:0.0991116 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_180_E05 0.1375484:0.0647134 Neurons Neurons ## 1772122_300_H02 0.2757982:0.1369690 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_180_B09 0.0851623:0.0819878 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_180_G04 0.1988415:0.1016622 Neuroepithelial_cell Neuroepithelial_cell ## ... ... ... ... ## 1772122_299_E07 0.1760025:0.0922504 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_180_D02 0.1967609:0.1124805 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_300_D09 0.0816424:0.0221368 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_298_F09 0.1872499:0.0671893 Neuroepithelial_cell Neuroepithelial_cell ## 1772122_302_A11 0.1560800:0.1051322 Astrocyte Astrocyte
Each row of the output
DataFrame contains prediction results for a single cell.
Labels are shown before fine-tuning (
first.labels), after fine-tuning (
labels) and after pruning (
pruned.labels), along with the associated scores.
We summarize the distribution of labels across our subset of cells:
## ## Astrocyte Neuroepithelial_cell Neurons ## 14 81 5
At this point, it is worth noting that SingleR is workflow/package agnostic.
The above example uses
SummarizedExperiment objects, but the same functions will accept any (log-)normalized expression matrix.
Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.
library(scRNAseq) sceM <- MuraroPancreasData() # One should normally do cell-based quality control at this point, but for # brevity's sake, we will just remove the unlabelled libraries here. sceM <- sceM[,!is.na(sceM$label)] sceM <- logNormCounts(sceM)
We then set up our test dataset from Grun et al. (2016). To speed up this demonstration, we will subset to the first 100 cells.
sceG <- GrunPancreasData() sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts. sceG <- logNormCounts(sceG) sceG <- sceG[,1:100]
We then run
SingleR() as described previously but with a marker detection mode that considers the variance of expression across cells.
Here, we will use the Wilcoxon ranked sum test to identify the top markers for each pairwise comparison between labels.
This is slower but more appropriate for single-cell data compared to the default marker detection algorithm (which may fail for low-coverage data where the median is frequently zero).
pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox") table(pred.grun$labels)
## ## acinar beta delta duct endothelial ## 53 4 1 41 1
SingleR provides a few basic yet powerful visualization tools.
plotScoreHeatmap() displays the scores for all cells across all reference labels,
which allows users to inspect the confidence of the predicted labels across the dataset.
The actual assigned label for each cell is shown in the color bar at the top;
note that this may not be the visually top-scoring label if fine-tuning is applied, as the only the pre-tuned scores are directly comparable across all labels.
For this plot, the key point is to examine the spread of scores within each cell. Ideally, each cell (i.e., column of the heatmap) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain, though this may be acceptable if the uncertainty is distributed across similar cell types that cannot be easily resolved.
We can also display other metadata information for each cell by setting
This is occasionally useful for examining potential batch effects, differences in cell type composition between conditions, relationship to clusters from an unsupervised analysis, etc.
In the code below, we display which donor each cell comes from:
pruneScores() function will remove potentially poor-quality or ambiguous assignments.
In particular, ambiguous assignments are identified based on the per-cell “delta”, i.e., the difference between the score for the assigned label and the median across all labels for each cell.
Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell’s true label does not exist in the reference.
The exact threshold used for pruning is identified using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts.
to.remove <- pruneScores(pred.grun) summary(to.remove)
## Mode FALSE TRUE ## logical 96 4
SingleR() will report pruned labels in the
pruned.labels field where low-quality assignments are replaced with
However, the default pruning thresholds may not be appropriate for every dataset - see
?pruneScores for a more detailed discussion.
We provide the
plotScoreDistribution() to help in determining whether the thresholds are appropriate by using information across cells with the same label.
This displays the per-label distribution of the deltas across cells, from which
pruneScores() defines an appropriate threshold as 3 median absolute deviations (MADs) below the median.
plotScoreDistribution(pred.grun, show = "delta.med", ncol = 3, show.nmads = 3)