Chapter 8 Doublet detection

8.1 Overview

In single-cell RNA sequencing (scRNA-seq) experiments, doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols (Zheng et al. 2017) involving thousands of cells. Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, doublets can be mistaken for intermediate populations or transitory states that do not actually exist. Thus, it is desirable to identify and remove doublet libraries so that they do not compromise interpretation of the results.

Several experimental strategies are available for doublet removal. One approach exploits natural genetic variation when pooling cells from multiple donor individuals (Kang et al. 2018). Doublets can be identified as libraries with allele combinations that do not exist in any single donor. Another approach is to mark a subset of cells (e.g., all cells from one sample) with an antibody conjugated to a different oligonucleotide (Stoeckius et al. 2018). Upon pooling, libraries that are observed to have different oligonucleotides are considered to be doublets and removed. These approaches can be highly effective but rely on experimental information that may not be available.

A more general approach is to infer doublets from the expression profiles alone (Dahlin et al. 2018). In this workflow, we will describe two purely computational approaches for detecting doublets from scRNA-seq data. The main difference between these two methods is whether or not they need cluster information beforehand. We will demonstrate the use of these methods on 10X Genomics data from a droplet-based scRNA-seq study of the mouse mammary gland (Bach et al. 2017).

#--- loading ---#
library(scRNAseq)
sce.mam <- BachMammaryData(samples="G_1")

#--- gene-annotation ---#
library(scater)
rownames(sce.mam) <- uniquifyFeatureNames(
    rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol)

library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.mam)$SEQNAME <- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl,
    keytype="GENEID", column="SEQNAME")

#--- quality-control ---#
is.mito <- rowData(sce.mam)$SEQNAME == "MT"
stats <- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito)))
qc <- quickPerCellQC(stats, percent_subsets="subsets_Mito_percent")
sce.mam <- sce.mam[,!qc$discard]

#--- normalization ---#
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.mam)
sce.mam <- computeSumFactors(sce.mam, clusters=clusters)
sce.mam <- logNormCounts(sce.mam)

#--- variance-modelling ---#
set.seed(00010101)
dec.mam <- modelGeneVarByPoisson(sce.mam)
top.mam <- getTopHVGs(dec.mam, prop=0.1)

#--- dimensionality-reduction ---#
library(BiocSingular)
set.seed(101010011)
sce.mam <- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam)
sce.mam <- runTSNE(sce.mam, dimred="PCA")

#--- clustering ---#
snn.gr <- buildSNNGraph(sce.mam, use.dimred="PCA", k=25)
colLabels(sce.mam) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
sce.mam
## class: SingleCellExperiment 
## dim: 27998 2772 
## metadata(0):
## assays(2): counts logcounts
## rownames(27998): Xkr4 Gm1992 ... Vmn2r122 CAAA01147332.1
## rowData names(3): Ensembl Symbol SEQNAME
## colnames: NULL
## colData names(5): Barcode Sample Condition sizeFactor label
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

8.2 Doublet detection with clusters

The findDoubletClusters() function from the scDblFinder package identifies clusters with expression profiles lying between two other clusters (Bach et al. 2017). We consider every possible triplet of clusters consisting of a query cluster and two putative “source” clusters. Under the null hypothesis that the query consists of doublets from the two sources, we compute the number of genes (num.de) that are differentially expressed in the same direction in the query cluster compared to both of the source clusters. Such genes would be unique markers for the query cluster and provide evidence against the null hypothesis. For each query cluster, the best pair of putative sources is identified based on the lowest num.de. Clusters are then ranked by num.de where those with the few unique genes are more likely to be composed of doublets.

# Like 'findMarkers', this function will automatically
# retrieve cluster assignments from 'colLabels'.
library(scDblFinder)
dbl.out <- findDoubletClusters(sce.mam)
dbl.out
## DataFrame with 10 rows and 9 columns
##        source1     source2    num.de median.de        best     p.value
##    <character> <character> <integer> <numeric> <character>   <numeric>
## 6            2           1        13     507.5       Pcbp2 1.28336e-03
## 2           10           3       109     710.5        Pigr 4.34790e-21
## 4            6           5       111     599.5       Cotl1 1.09709e-08
## 5           10           7       139     483.5        Gde1 9.30195e-12
## 10           8           5       191     392.0       Krt18 5.54539e-20
## 7            8           5       270     784.5    AF251705 3.29661e-24
## 9            8           5       295     514.5       Fabp4 2.21523e-32
## 8           10           9       388     658.5      Col1a1 6.82664e-32
## 1            8           6       513    1834.0       Acta2 1.07294e-24
## 3            6           5       530    1751.5      Sapcd2 6.08574e-16
##    lib.size1 lib.size2       prop
##    <numeric> <numeric>  <numeric>
## 6   0.811531  0.516399 0.03030303
## 2   0.619865  1.411579 0.28823954
## 4   1.540751  0.688651 0.16305916
## 5   1.125474  1.167854 0.00865801
## 10  0.888432  0.888514 0.00865801
## 7   0.856192  0.856271 0.01875902
## 9   0.655624  0.655685 0.01154401
## 8   1.125578  1.525264 0.01406926
## 1   0.865449  1.936489 0.19841270
## 3   0.872951  0.390173 0.25829726

If a more concrete threshold is necessary, we can identify clusters that have unusually low num.de using an outlier-based approach.

library(scater)
chosen.doublet <- rownames(dbl.out)[isOutlier(dbl.out$num.de, type="lower", log=TRUE)]
chosen.doublet
## [1] "6"

The function also reports the ratio of the median library size in each source to the median library size in the query (lib.size fields). Ideally, a potential doublet cluster would have ratios lower than unity; this is because doublet libraries are generated from a larger initial pool of RNA compared to libraries for single cells, and thus the former should have larger library sizes. The proportion of cells in the query cluster should also be reasonable - typically less than 5% of all cells, depending on how many cells were loaded onto the 10X Genomics device.

Examination of the findDoubletClusters() output indicates that cluster 6 has the fewest unique genes and library sizes that are comparable to or greater than its sources. We see that every gene detected in this cluster is also expressed in either of the two proposed source clusters (Figure 8.1).

library(scran)
markers <- findMarkers(sce.mam, direction="up")
dbl.markers <- markers[[chosen.doublet]]

library(scater)
chosen <- rownames(dbl.markers)[dbl.markers$Top <= 10]
plotHeatmap(sce.mam, order_columns_by="label", features=chosen, 
    center=TRUE, symmetric=TRUE, zlim=c(-5, 5))
Heatmap of mean-centered and normalized log-expression values for the top set of markers for cluster 6 in the mammary gland dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Figure 8.1: Heatmap of mean-centered and normalized log-expression values for the top set of markers for cluster 6 in the mammary gland dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Closer examination of some known markers suggests that the offending cluster consists of doublets of basal cells (Acta2) and alveolar cells (Csn2) (Figure 8.2). Indeed, no cell type is known to strongly express both of these genes at the same time, which supports the hypothesis that this cluster consists solely of doublets rather than being an entirely novel cell type.

plotExpression(sce.mam, features=c("Acta2", "Csn2"), 
    x="label", colour_by="label")
Distribution of log-normalized expression values for _Acta2_ and _Csn2_ in each cluster. Each point represents a cell.

Figure 8.2: Distribution of log-normalized expression values for Acta2 and Csn2 in each cluster. Each point represents a cell.

The strength of findDoubletClusters() lies in its simplicity and ease of interpretation. Suspect clusters can be quickly flagged based on the metrics returned by the function. However, it is obviously dependent on the quality of the clustering. Clusters that are too coarse will fail to separate doublets from other cells, while clusters that are too fine will complicate interpretation. The method is also somewhat biased towards clusters with fewer cells, where the reduction in power is more likely to result in a low N. (Fortunately, this is a desirable effect as doublets should be rare in a properly performed scRNA-seq experiment.)

8.3 Doublet detection by simulation

8.3.1 Computing doublet densities

The other doublet detection strategy involves in silico simulation of doublets from the single-cell expression profiles (Dahlin et al. 2018). This is performed using the computeDoubletDensity() function from scDblFinder, which will:

  1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
  2. For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
  3. For each original cell, compute the density of other observed cells in the neighborhood.
  4. Return the ratio between the two densities as a “doublet score” for each cell.

This approach assumes that the simulated doublets are good approximations for real doublets. The use of random selection accounts for the relative abundances of different subpopulations, which affect the likelihood of their involvement in doublets; and the calculation of a ratio avoids high scores for non-doublet cells in highly abundant subpopulations.

We see the function in action below. To speed up the density calculations, computeDoubletDensity() will perform a PCA on the log-expression matrix, and we perform some (optional) parametrization to ensure that the computed PCs are consistent with that from our previous analysis on this dataset.

library(BiocSingular)
set.seed(100)

# Setting up the parameters for consistency with denoisePCA();
# this can be changed depending on your feature selection scheme.
dbl.dens <- computeDoubletDensity(sce.mam, subset.row=top.mam, 
    d=ncol(reducedDim(sce.mam)))
summary(dbl.dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.249   0.527   1.041   1.188  14.570

The highest doublet scores are concentrated in a single cluster of cells in the center of Figure 8.3.

sce.mam$DoubletScore <- dbl.dens
plotTSNE(sce.mam, colour_by="DoubletScore")
t-SNE plot of the mammary gland data set. Each point is a cell coloured according to its doublet density.

Figure 8.3: t-SNE plot of the mammary gland data set. Each point is a cell coloured according to its doublet density.

We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample (Pijuan-Sala et al. 2019). Here, we only have one sample so the call below is fairly simple, but it can also support multiple samples if the input data.frame has a sample column.

dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
    method="griffiths", returnType="call")
summary(dbl.calls)
## singlet doublet 
##    2400     372

From the clustering information, we see that the affected cells belong to the same cluster that was identified using findDoubletClusters() (Figure 8.4), which is reassuring.

plotColData(sce.mam, x="label", y="DoubletScore", colour_by=I(dbl.calls))
Distribution of doublet scores for each cluster in the mammary gland data set. Each point is a cell and is colored according to whether it was called as a doublet.

Figure 8.4: Distribution of doublet scores for each cluster in the mammary gland data set. Each point is a cell and is colored according to whether it was called as a doublet.

The advantage of computeDoubletDensity() is that it does not depend on clusters, reducing the sensitivity of the results to clustering quality. The downside is that it requires some strong assumptions about how doublets form, such as the combining proportions and the sampling from pure subpopulations. In particular, computeDoubletDensity() treats the library size of each cell as an accurate proxy for its total RNA content. If this is not true, the simulation will not combine expression profiles from different cells in the correct proportions. This means that the simulated doublets will be systematically shifted away from the real doublets, resulting in doublet scores that are too low.

(As an aside, the issue of unknown combining proportions can be solved if spike-in information is available, e.g., in plate-based protocols. This will provide an accurate estimate of the total RNA content of each cell. To this end, spike-in-based size factors from Basic Section 2.4 can be supplied to the computeDoubletDensity() function via the size.factors.content= argument. This will use the spike-in size factors to scale the contribution of each cell to a doublet library.)

8.3.2 Doublet classification

The scDblFinder() function (Germain et al. 2021) combines the simulated doublet density with an iterative classification scheme. For each observed cell, an initial score is computed by combining the fraction of simulated doublets in its neighborhood with another score based on co-expression of mutually exclusive gene pairs (Bais and Kostka 2020). A threshold is chosen that best distinguishes between the real and simulated cells, allowing us to obtain putative doublet calls among the real cells. The threshold and scores are then iteratively refined by training a classifier on the putative calls with a variety of metrics to characterize the doublet neighborhood. These metrics include the low-dimensional embeddings in PC space, the fraction of doublets among the \(k\) nearest neighbors for a variety of \(k\), the distance to the closest real cell, the expected within-cluster doublet formation rate, the observed number of cells in each cluster, and so on. The classifier distills all of these metrics into a single score based on their learnt importance (Figure 8.5).

set.seed(10010101)
sce.mam.dbl <- scDblFinder(sce.mam, clusters=colLabels(sce.mam))
plotTSNE(sce.mam.dbl, colour_by="scDblFinder.score")
t-SNE plot of the mammary gland data set where each point is a cell coloured according to its `scDblFinder()` score.

Figure 8.5: t-SNE plot of the mammary gland data set where each point is a cell coloured according to its scDblFinder() score.

We can also extract explicit doublet calls for each cell based on the final threshold from the iterative process.

table(sce.mam.dbl$scDblFinder.class)
## 
## singlet doublet 
##    2598     174

We set clusters= to instruct scDblFinder() to exclusively simulate doublets between different clusters in Figure 8.5. This improves the efficiency of the simulation process by eliminating intra-cluster doublets that are indistinguishable from singlets. Doing so is entirely optional and can be omitted if clustering information is not available or the clusters are not well-defined. The default of clusters=NULL will cause scDblFinder() to fall back to a random simulation scheme, similar to that used by computeDoubletDensity().

Compared to computeDoubletDensity(), scDblFinder() provides a more sophisticated approach that makes greater use of the information in the simulated doublets. This can improve the performance of doublet identification in difficult scenarios, e.g., when one of the two cells contributes considerably more to the doublet, or when doublets are transcriptionally similar to real intermediate stages. However, as with all algorithms of this class, accuracy is dependent on the correctness of the simulation process.

8.3.3 Further comments

Simply removing cells with high doublet scores will typically not be sufficient to eliminate real doublets from the data set. In some cases, only a subset of the cells in the putative doublet cluster actually have high scores, and removing these would still leave enough cells in that cluster to mislead downstream analyses. In fact, even defining a threshold on the doublet score is difficult as the interpretation of the score is relative. There is no general definition for a fixed threshold above which libraries are to be considered doublets.

We recommend interpreting these scores in the context of cluster annotation. All cells from a cluster with a large average doublet score should be considered suspect, and close neighbors of problematic clusters should be treated with caution. A cluster containing only a small proportion of high-scoring cells is safer, though this prognosis comes with the caveat that true doublets often lie immediately adjacent to their source populations and end up being assigned to the same cluster. It is worth confirming that any interesting results of downstream analyses are not being driven by those cells, e.g., by checking that DE in an interesting gene is not driven solely by cells with high doublet scores. While clustering is still required for interpretation, the simulation-based strategy is more robust than findDoubletClusters() to the quality of the clustering as the scores are computed on a per-cell basis.

8.4 Doublet detection in multiplexed experiments

8.4.1 Background

For multiplexed samples (Kang et al. 2018; Stoeckius et al. 2018), we can identify doublet cells based on the cells that have multiple labels. The idea here is that cells from the same sample are labelled in a unique manner, either implicitly with genotype information or experimentally with hashing tag oligos (HTOs). Cells from all samples are then mixed together and the multiplexed pool is subjected to scRNA-seq, avoiding batch effects and simplifying the logistics of processing a large number of samples. Importantly, most per-cell libraries are expected to contain one label that can be used to assign that cell to its sample of origin. Cell libraries containing two labels are thus likely to be doublets of cells from different samples.

To demonstrate, we will use some data from the original cell hashing study (Stoeckius et al. 2018). Each sample’s cells were stained with an antibody against a ubiquitous surface protein, where the antibody was conjugated to a sample-specific HTO. Sequencing of the HTO-derived cDNA library ultimately yields a count matrix where each row corresponds to a HTO and each column corresponds to a cell barcode.

library(scRNAseq)
hto.sce <- StoeckiusHashingData(mode="hto")
dim(hto.sce)
## [1]     8 65000

8.4.2 Identifying inter-sample doublets

Before we proceed to doublet detection, we simplify the problem by first identifying the barcodes that contain cells. This is most conventionally done using the gene expression matrix for the same set of barcodes, as shown in Section 7.2. Here, though, we will keep things simple and apply emptyDrops() directly on the HTO count matrix. The considerations are largely the same as that for gene expression matrices; the main difference is that the default lower= is often too low for deeply sequenced HTOs, so we instead estimate the ambient profile by excluding the top by.rank= barcodes with the largest totals (under the assumption that no more than by.rank= cells were loaded). The barcode-rank plots are quite similar to what one might expect from gene expression data (Figure 8.6).

library(DropletUtils)

set.seed(101)
hash.calls <- emptyDrops(counts(hto.sce), by.rank=40000)
is.cell <- which(hash.calls$FDR <= 0.001)
length(is.cell)
## [1] 21780
par(mfrow=c(1,2))
r <- rank(-hash.calls$Total)
plot(r, hash.calls$Total, log="xy", xlab="Rank", ylab="Total HTO count", main="")
hist(log10(hash.calls$Total[is.cell]), xlab="Log[10] HTO count", main="")
Cell-calling statistics from running `emptyDrops()` on the HTO counts in the cell hashing study. Left: Barcode rank plot of the HTO counts in the cell hashing study. Right: distribution of log-total counts for libraries identified as cells.

Figure 8.6: Cell-calling statistics from running emptyDrops() on the HTO counts in the cell hashing study. Left: Barcode rank plot of the HTO counts in the cell hashing study. Right: distribution of log-total counts for libraries identified as cells.

We then run hashedDrops() on the subset of cell barcode libraries that actually contain cells. This returns the likely sample of origin for each barcode library based on its most abundant HTO, using abundances adjusted for ambient contamination in the ambient= argument. (The adjustment process itself involves a fair number of assumptions that we will not discuss here; see ?hashedDrops for more details.) For quality control, it returns the log-fold change between the first and second-most abundant HTOs in each barcode libary (Figure 7.9), allowing us to quantify the certainty of each assignment. Confidently assigned singlets are marked using the Confident field in the output.

hash.stats <- hashedDrops(counts(hto.sce)[,is.cell],
    ambient=metadata(hash.calls)$ambient)

hist(hash.stats$LogFC, xlab="Log fold-change from best to second HTO", main="")
Distribution of log-fold changes from the first to second-most abundant HTO in each cell.

Figure 7.9: Distribution of log-fold changes from the first to second-most abundant HTO in each cell.

# Raw assignments:
table(hash.stats$Best)
## 
##    1    2    3    4    5    6    7    8 
## 2703 3276 2752 2782 2493 2381 2586 2807
# Confident assignments based on (i) a large log-fold change 
# and (ii) not being a doublet, see below.
table(hash.stats$Best[hash.stats$Confident])
## 
##    1    2    3    4    5    6    7    8 
## 2349 2779 2457 2275 2090 1994 2176 2458

Of greater interest here is how we can use the hashing information to detect doublets. This is achieved by reporting the log-fold change between the count for the second HTO and the estimated contribution from ambient contamination. A large log-fold change indicates that the second HTO still has an above-expected abundance, consistent with a doublet containing HTOs from two samples. We use outlier detection to explicitly identify putative doublets as those barcode libraries that have large log-fold changes; this is visualized in Figure 8.7, which shows a clear separation between the putative singlets and doublets.

summary(hash.stats$Doublet)
##    Mode   FALSE    TRUE 
## logical   18742    3038
colors <- rep("grey", nrow(hash.stats))
colors[hash.stats$Doublet] <- "red"
colors[hash.stats$Confident] <- "black"

plot(hash.stats$LogFC, hash.stats$LogFC2,
    xlab="Log fold-change from best to second HTO",
    ylab="Log fold-change of second HTO over ambient",
    col=colors)
Log-fold change of the second-most abundant HTO over ambient contamination, compared to the log-fold change of the first HTO over the second HTO. Each point represents a cell where potential doublets are shown in red while confidently assigned singlets are shown in black.

Figure 8.7: Log-fold change of the second-most abundant HTO over ambient contamination, compared to the log-fold change of the first HTO over the second HTO. Each point represents a cell where potential doublets are shown in red while confidently assigned singlets are shown in black.

8.4.3 Guilt by association for unmarked doublets

One obvious limitation of this approach is that doublets of cells marked with the same HTO are not detected. In a simple multiplexing experiment involving \(N\) samples with similar numbers of cells, we would expect around \(1/N\) of all doublets to involve cells from the same sample. For typical values of \(N\) of 5 to 12, this may still be enough to cause the formation of misleading doublet clusters even after the majority of known doublets are removed. To avoid this, we recover the remaining intra-sample doublets based on their similarity with known doublets in gene expression space (hence, “guilt by association”). We illustrate by loading the gene expression data for this study:

sce.hash <- StoeckiusHashingData(mode="human")

# Subsetting to all barcodes detected as cells. Requires an intersection,
# because `hto.sce` and `sce.hash` are not the same dimensions! 
common <- intersect(colnames(sce.hash), rownames(hash.stats))
sce.hash <- sce.hash[,common]
colData(sce.hash) <- hash.stats[common,]

sce.hash
## class: SingleCellExperiment 
## dim: 27679 20828 
## metadata(0):
## assays(1): counts
## rownames(27679): A1BG A1BG-AS1 ... hsa-mir-8072 snoU2-30
## rowData names(0):
## colnames(20828): ACTGCTCAGGTGTTAA ATGAGGGAGATGTTAG ... CACCAGGCACACAGAG
##   CTCGGAGTCTAACTCT
## colData names(7): Total Best ... Doublet Confident
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

For each cell, we calculate the proportion of its nearest neighbors that are known doublets. Intra-sample doublets should have high proportions under the assumption that their gene expression profiles are similar to inter-sample doublets involving the same combination of cell states/types. Unlike in Section 8.3, the use of experimentally derived doublet calls avoids any assumptions about the relative quantity of total RNA or the probability of doublet formation across different cell types.

# Performing a quick-and-dirty analysis to get some PCs to use
# for nearest neighbor detection inside recoverDoublets().
library(scran)
sce.hash <- logNormCounts(sce.hash)
dec.hash <- modelGeneVar(sce.hash)
top.hash <- getTopHVGs(dec.hash, n=1000)
set.seed(1011110)
sce.hash <- runPCA(sce.hash, subset_row=top.hash, ncomponents=20)

# Recovering the intra-sample doublets:
hashed.doublets <- recoverDoublets(sce.hash, use.dimred="PCA",
    doublets=sce.hash$Doublet, samples=table(sce.hash$Best))
hashed.doublets
## DataFrame with 20828 rows and 3 columns
##       proportion     known predicted
##        <numeric> <logical> <logical>
## 1           0.10      TRUE     FALSE
## 2           0.02     FALSE     FALSE
## 3           0.14     FALSE     FALSE
## 4           0.08     FALSE     FALSE
## 5           0.18     FALSE     FALSE
## ...          ...       ...       ...
## 20824       0.04     FALSE     FALSE
## 20825       0.04     FALSE     FALSE
## 20826       0.02     FALSE     FALSE
## 20827       0.04     FALSE     FALSE
## 20828       0.10     FALSE     FALSE

The recoverDoublets() function also returns explicit intra-sample doublet predictions based on the doublet neighbor proportions. Given the distribution of cells across multiplexed samples in samples=, we estimate the fraction of doublets that would not be observed from the HTO counts. This is converted into an absolute number based on the number of observed doublets; the top set of libraries with the highest proportions are then marked as intra-sample doublets (Figure 8.8)

set.seed(1000101001)
sce.hash <- runTSNE(sce.hash, dimred="PCA")
sce.hash$proportion <- hashed.doublets$proportion
sce.hash$predicted <- hashed.doublets$predicted

gridExtra::grid.arrange(
    plotTSNE(sce.hash, colour_by="proportion") + ggtitle("Doublet proportions"),
    plotTSNE(sce.hash, colour_by="Doublet") + ggtitle("Known doublets"),
    ggcells(sce.hash) +
        geom_point(aes(x=TSNE.1, y=TSNE.2), color="grey") +
        geom_point(aes(x=TSNE.1, y=TSNE.2), color="red", 
            data=function(x) x[x$predicted,]) +
        ggtitle("Predicted intra-sample doublets"),
    ncol=2        
)
$t$-SNE plots for gene expression data from the cell hashing study, where each point is a cell and is colored by the doublet proportion (top left), whether or not it is a known inter-sample doublet (top right) and whether it is a predicted intra-sample doublet (bottom left).

Figure 8.8: \(t\)-SNE plots for gene expression data from the cell hashing study, where each point is a cell and is colored by the doublet proportion (top left), whether or not it is a known inter-sample doublet (top right) and whether it is a predicted intra-sample doublet (bottom left).

As an aside, it is worth noting that even known doublets may not necessarily have high doublet neighbor proportions. This is typically observed for doublets involving cells of the same type or state, which are effectively intermixed in gene expression space with the corresponding singlets. The latter are much more abundant in most (well-controlled) experiments, which results in low proportions for the doublets involved (Figure 8.9). This effect can generally be ignored given the mostly harmless nature of these doublets.

state <- ifelse(hashed.doublets$predicted, "predicted",
    ifelse(hashed.doublets$known, "known", "singlet"))
ggplot(as.data.frame(hashed.doublets)) + 
    geom_violin(aes(x=state, y=proportion)) 
Distribution of doublet neighbor proportions for all cells in the cell hashing study, stratified by doublet detection status.

Figure 8.9: Distribution of doublet neighbor proportions for all cells in the cell hashing study, stratified by doublet detection status.

8.5 Further comments

Doublet detection procedures should only be applied to libraries generated in the same experimental batch. It is obviously impossible for doublets to form between two cells that were captured separately. Thus, some understanding of the experimental design is required prior to the use of the above functions. This avoids unnecessary concerns about the validity of batch-specific clusters that cannot possibly consist of doublets.

It is also difficult to interpret doublet predictions in data containing cellular trajectories. By definition, cells in the middle of a trajectory are always intermediate between other cells and are liable to be incorrectly detected as doublets. Some protection is provided by the non-linear nature of many real trajectories, which reduces the risk of simulated doublets coinciding with real cells in computeDoubletDensity(). One can also put more weight on the relative library sizes in findDoubletClusters() instead of relying solely on N, under the assumption that sudden spikes in RNA content are unlikely in a continuous biological process.

The best solution to the doublet problem is experimental - that is, to avoid generating them in the first place. This should be a consideration when designing scRNA-seq experiments, where the desire to obtain large numbers of cells at minimum cost should be weighed against the general deterioration in data quality and reliability when doublets become more frequent.

Session Info

R version 4.3.0 RC (2023-04-13 r84269)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] DropletUtils_1.20.0         scRNAseq_2.14.0            
 [3] BiocSingular_1.16.0         scran_1.28.1               
 [5] scater_1.28.0               ggplot2_3.4.2              
 [7] scuttle_1.10.1              scDblFinder_1.14.0         
 [9] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.1
[11] Biobase_2.60.0              GenomicRanges_1.52.0       
[13] GenomeInfoDb_1.36.0         IRanges_2.34.0             
[15] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[17] MatrixGenerics_1.12.0       matrixStats_0.63.0         
[19] BiocStyle_2.28.0            rebook_1.10.0              

loaded via a namespace (and not attached):
  [1] later_1.3.1                   BiocIO_1.10.0                
  [3] bitops_1.0-7                  filelock_1.0.2               
  [5] R.oo_1.25.0                   tibble_3.2.1                 
  [7] CodeDepends_0.6.5             graph_1.78.0                 
  [9] XML_3.99-0.14                 lifecycle_1.0.3              
 [11] edgeR_3.42.2                  lattice_0.21-8               
 [13] ensembldb_2.24.0              MASS_7.3-60                  
 [15] magrittr_2.0.3                limma_3.56.1                 
 [17] sass_0.4.6                    rmarkdown_2.21               
 [19] jquerylib_0.1.4               yaml_2.3.7                   
 [21] metapod_1.8.0                 httpuv_1.6.11                
 [23] cowplot_1.1.1                 DBI_1.1.3                    
 [25] RColorBrewer_1.1-3            zlibbioc_1.46.0              
 [27] Rtsne_0.16                    R.utils_2.12.2               
 [29] purrr_1.0.1                   AnnotationFilter_1.24.0      
 [31] RCurl_1.98-1.12               rappdirs_0.3.3               
 [33] GenomeInfoDbData_1.2.10       ggrepel_0.9.3                
 [35] irlba_2.3.5.1                 pheatmap_1.0.12              
 [37] dqrng_0.3.0                   DelayedMatrixStats_1.22.0    
 [39] codetools_0.2-19              DelayedArray_0.26.2          
 [41] xml2_1.3.4                    tidyselect_1.2.0             
 [43] farver_2.1.1                  ScaledMatrix_1.8.1           
 [45] viridis_0.6.3                 BiocFileCache_2.8.0          
 [47] GenomicAlignments_1.36.0      jsonlite_1.8.4               
 [49] BiocNeighbors_1.18.0          ellipsis_0.3.2               
 [51] tools_4.3.0                   progress_1.2.2               
 [53] Rcpp_1.0.10                   glue_1.6.2                   
 [55] gridExtra_2.3                 xfun_0.39                    
 [57] dplyr_1.1.2                   HDF5Array_1.28.1             
 [59] withr_2.5.0                   BiocManager_1.30.20          
 [61] fastmap_1.1.1                 rhdf5filters_1.12.1          
 [63] bluster_1.10.0                fansi_1.0.4                  
 [65] digest_0.6.31                 rsvd_1.0.5                   
 [67] R6_2.5.1                      mime_0.12                    
 [69] colorspace_2.1-0              biomaRt_2.56.0               
 [71] RSQLite_2.3.1                 R.methodsS3_1.8.2            
 [73] utf8_1.2.3                    generics_0.1.3               
 [75] data.table_1.14.8             rtracklayer_1.60.0           
 [77] prettyunits_1.1.1             httr_1.4.6                   
 [79] S4Arrays_1.0.4                pkgconfig_2.0.3              
 [81] gtable_0.3.3                  blob_1.2.4                   
 [83] XVector_0.40.0                htmltools_0.5.5              
 [85] bookdown_0.34                 ProtGenerics_1.32.0          
 [87] scales_1.2.1                  png_0.1-8                    
 [89] knitr_1.42                    rjson_0.2.21                 
 [91] curl_5.0.0                    rhdf5_2.44.0                 
 [93] cachem_1.0.8                  stringr_1.5.0                
 [95] BiocVersion_3.17.1            parallel_4.3.0               
 [97] vipor_0.4.5                   AnnotationDbi_1.62.1         
 [99] restfulr_0.0.15               pillar_1.9.0                 
[101] grid_4.3.0                    vctrs_0.6.2                  
[103] promises_1.2.0.1              dbplyr_2.3.2                 
[105] beachmat_2.16.0               xtable_1.8-4                 
[107] cluster_2.1.4                 beeswarm_0.4.0               
[109] evaluate_0.21                 GenomicFeatures_1.52.0       
[111] cli_3.6.1                     locfit_1.5-9.7               
[113] compiler_4.3.0                Rsamtools_2.16.0             
[115] rlang_1.1.1                   crayon_1.5.2                 
[117] labeling_0.4.2                ggbeeswarm_0.7.2             
[119] stringi_1.7.12                viridisLite_0.4.2            
[121] BiocParallel_1.34.1           munsell_0.5.0                
[123] Biostrings_2.68.1             lazyeval_0.2.2               
[125] Matrix_1.5-4                  dir.expiry_1.8.0             
[127] ExperimentHub_2.8.0           hms_1.1.3                    
[129] sparseMatrixStats_1.12.0      bit64_4.0.5                  
[131] Rhdf5lib_1.22.0               KEGGREST_1.40.0              
[133] statmod_1.5.0                 shiny_1.7.4                  
[135] interactiveDisplayBase_1.38.0 highr_0.10                   
[137] AnnotationHub_3.8.0           igraph_1.4.2                 
[139] memoise_2.0.1                 bslib_0.4.2                  
[141] bit_4.0.5                     xgboost_1.7.5.1              

References

Bach, K., S. Pensa, M. Grzelak, J. Hadfield, D. J. Adams, J. C. Marioni, and W. T. Khaled. 2017. “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nat Commun 8 (1): 2128.

Bais, A. S., and D. Kostka. 2020. “Scds: Computational Annotation of Doublets in Single-Cell RNA Sequencing Data.” Bioinformatics 36 (4): 1150–8.

Dahlin, J. S., F. K. Hamey, B. Pijuan-Sala, M. Shepherd, W. W. Y. Lau, S. Nestorowa, C. Weinreb, et al. 2018. “A single-cell hematopoietic landscape resolves 8 lineage trajectories and defects in Kit mutant mice.” Blood 131 (21): e1–e11.

Germain, Pierre-Luc, Aaron Lun, Will Macnair, and Mark D. Robinson. 2021. “Doublet Identification in Single-Cell Sequencing Data Using scDblFinder.” F1000Research, no. 10:979. https://doi.org/10.12688/f1000research.73600.1.

Kang, H. M., M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, et al. 2018. “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.” Nat. Biotechnol. 36 (1): 89–94.

Pijuan-Sala, B., J. A. Griffiths, C. Guibentif, T. W. Hiscock, W. Jawaid, F. J. Calero-Nieto, C. Mulas, et al. 2019. “A Single-Cell Molecular Map of Mouse Gastrulation and Early Organogenesis.” Nature 566 (7745): 490–95.

Stoeckius, M., S. Zheng, B. Houck-Loomis, S. Hao, B. Z. Yeung, W. M. Mauck, P. Smibert, and R. Satija. 2018. “Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics.” Genome Biol. 19 (1): 224.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.