# 1 scDblFinder

scDblFinder identifies doublets in single-cell RNAseq by creating artificial doublets and looking at their prevalence (as well as that of any known doublets) in the neighborhood of each cell, along with a few other covariates. The rough logic is very similar to other methods (e.g. DoubletFinder, scds), with a few twists that make it more efficient, more accurate, and provide extra features.

## 1.1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scDblFinder")

# or, to get that latest developments:
BiocManager::install("plger/scDblFinder")

## 1.2 Usage

The input of scDblFinder is an object sce of class SingleCellExperiment (empty drops having already been removed) containing at least the counts (assay ‘counts’). If normalized expression (assay ‘logcounts’) and PCA (reducedDim ‘PCA’) are also present, these will be used for the clustering step (unless clusters are given.)

Given an SCE object:

set.seed(123)
library(scDblFinder)
# we create a dummy dataset; since it's small we set a higher doublet rate
sce <- mockDoubletSCE(dbl.rate=0.1)
# we run scDblFinder (providing the unsually high doublet rate)
sce <- scDblFinder(sce, dbr=0.1)
## Clustering cells...
## 4 clusters
## Creating ~5000 artifical doublets...
## Dimensional reduction
## Evaluating kNN...
## Training model...
## iter=0, 27 cells excluded from training.
## Threshold found:0.854
## 24 (4.6%) doublets called

For 10x data, it is usually safe to leave the dbr empty, and it will be automatically estimated. scDblFinder will add a number of columns to the colData of sce prefixed with ‘scDblFinder’, the most important of which are:

• sce$scDblFinder.score : the final doublet score • sce$scDblFinder.class : the classification (doublet or singlet)
table(truth=sce$type, call=sce$scDblFinder.class)
##          call
## truth     singlet doublet
##   singlet     500       0
##   doublet       0      24

### 1.2.1 Multiple samples

If you have multiple samples (understood as different cell captures), then it is preferable to look for doublets separately for each sample (for multiplexed samples with cell hashes, this means for each batch). You can do this by simply providing a vector of the sample ids to the samples parameter of scDblFinder or, if these are stored in a column of colData, the name of the column. In this case, you might also consider multithreading it using the BPPARAM parameter (assuming you’ve got enough RAM!). For example:

library(BiocParallel)
sce <- scDblFinder(sce, samples="sample_id", BPPARAM=MulticoreParam(3))
table(sce$scDblFinder.class) Note that if you are running multiple samples, clustering will be performed sample-wise. While this is not an issue for doublet identification, it means that the cluster labels (and putative origins of doublets) won’t match between samples. If you are interested in these, it is preferable to first cluster (for example using sce$cluster <- fastcluster(sce)) and then provide the clusters to scDblFinder, which will ensure concordant labels across samples.

Of note, if you have very large differences in number of cells between samples the scores will not be directly comparable. We are working on improving this, but in the meantime it would be preferable to stratify similar samples and threshold the sets separately.

## 1.3 Description of the method

Wrapped in the scDblFinder function are the following steps:

### 1.3.1 Splitting captures

Doublets can only arise within a given sample or capture, and for this reason are better sought independently for each sample, which also speeds up the analysis. If the samples argument is given, scDblFinder will use it to split the cells into samples/captures, and process each of them in parallel if the BPPARAM argument is given. The classifier will be trained globally, but thresholds will be optimized on a per-sample basis. If your samples are multiplexed, i.e. the different samples are mixed in different batches, then the batches should be what you provide to this argument.

### 1.3.2 Reducing and clustering the data

The analysis can be considerably sped up, at little if any cost in accuracy, by reducing the dataset to only the top expressed genes (controlled by the nfeatures argument). Then, if the clusters argument isn’t provided, the cells will be clustered using a fast clustering procedure on the PCA space (taken from the ‘PCA’ dimRed if present, otherwise generated internally). The aim, here, is to favour over-clustering so as to avoid collapsing pairs of celltypes whose doublets could be distinguishable. The internal clustering procedure can be accessed via the fastcluster function, and the cluster assignments are included in the output.

If the dataset is not expected to contain distinct subpopulations but rather continuous gradients, e.g. trajectories, then it might be advisable to employ a different approach and setting clusters to a positive integer (see below).

### 1.3.3 Generating artificial doublets

In addition to a proportion of artificial doublets generated by combining random cells, a large fraction of the doublets are generated on all pairs of non-identical clusters (this can be performed manually using the getArtificialDoublets function). The rationale is that homotypic doublets are nearly impossible to distinguish without cellular barcodes, and therefore that creating that kind of doublets is a waste of computational resources.

If the dataset is not expected to contain distinct subpopulations, but rather trajectories, a different approach can be used. A first option is to generate artificial doublets randomly, which can be done using the propRandom=1 argument. A better approach is (unless you already have your clusters) setting the clusters argument of scDblFinder to a positive integer (depending on the number of cells and complexity, e.g. k=12 for smaller datasets), which will split the cells using k-means clustering (where k=clusters). Then, there are two possible strategies. Running scDblFinder with otherwise default settings will automatically determine which combinations of clusters are distinguishable. Alternatively, using trajectoryMode=TRUE will relate the clusters’ average profiles and generate fewer doublets between very nearby points in the trajectory (where they would be wasted, such doublets being nearly undistinguishable from real cells). This alternative however disrupts the proportionality of the doublets to the cluster sizes, and may therefore decrease accuracy when the cells are unevenly distributed.

### 1.3.4 Examining the k-nearest neighbors (kNN) of each cell

A new PCA is performed on the combination of real and artificial cells, from which a kNN network is generated. Using this kNN, a number of parameters are gathered for each cell, such as the proportion of doublets (i.e. artificial doublets or known doublets provided through the knownDoublets argument, if given) among the KNN, ratio of the distances to the nearest doublet and nearest non-doublet, etc. Several of this features are reported in the output with the ‘scDblFinder.’ prefix, e.g.:

• distanceToNearest : distance to the nearest cell (real or artificial)
• nearestClass : whether the nearest cell is a doublet or singlet
• ratio : the proportion of the KNN that are doublets. (If more than one value of k is given, the various ratios will be used during classification and will be reported)
• weighted : the proportion of the KNN that are doublets, weighted by their distance (useful for isolated cells)

### 1.3.5 Training a classifier

Unless the score argument is set to ‘weighted’ or ‘ratio’ (in which case the aforementioned ratio is directly used as a doublet score), scDblFinder then uses gradient boosted trees trained on the kNN-derived properties along with a few additional features (e.g. library size, number of non-zero features, and an estimate of the difficultly of detecting artificial doublets in the cell’s neighborhood, etc.) to distinguish doublets (either artificial or given) from other cells, and assigns a score on this basis. If the use.cxds=TRUE, the cxds score from the scds package will also be included among the predictors.

One problem of using a classifier for this task is that some of the real cells (the actual doublets) are mislabeled as singlet, so to speak. scDblFinder therefore iteratively retrains the classifier, each time excluding from the training the (real) cells called as doublets in the previous step (the number of steps being controlled by the iter parameter).

This score is available in the output, in the scDblFinder.score colData column, and can be interpreted as a probability. If the data is multi-sample, a single model is trained for all samples.

### 1.3.6 Thresholding

Rather than thresholding on some arbitrary cutoff of the score, scDblFinder uses the expected number of doublets to establish a threshold. Unless it is manually given through the dbr argument, the expected doublet rate is first estimated using the empirical rule of thumb applicable to 10X data, namely roughly 1% per 1000 cells captures (so with 5000 cells, (0.01*5)*5000 = 250 doublets, and the more cells you capture the higher the chance of creating a doublet). If samples were specified, and if the dbr is automatically calculated, thresholding is performed separately across samples.

Thresholding then tries to simultaneously minimize: 1) the classification error (in terms of the proportion of known doublets below the threshold) and 2) the deviation from the expected number of doublets among real cells (as a ratio of the total number of expected doublets within the range determined by dbr.sd, and adjusted for homotypic doublets), giving an equal weight to each. This means that, if you have no idea about the doublet rate, setting dbr.sd=1 will make the threshold depend entirely on the misclassification rate.

### 1.5.3 ‘Size factors should be positive’ error

You will get this error if you have some cells that have zero (or very close to zero) reads. After filtering out these cells the error should go away. Note, however, that we recommend not having too stringent filtering before running scDblFinder.

### 1.5.4 How can I make this reproducible?

Because it relies on the partly random generation of artificial doublets, running scDblFinder multiple times on the same data will yield slightly different results. You can ensure reproducibility using set.seed(), however this will not be sufficient when multithreading. In such case, using the following procedure:

bp <- MulticoreParam(3, RNGseed=1234)
bpstart(bp)
sce <- scDblFinder(sce, clusters="cluster", samples="sample", BPPARAM=bp)
bpstop(bp)

### 1.5.5 Can I use this in combination with Seurat or other tools?

If the input SCE already contains a logcounts assay or a reducedDim slot named ‘PCA’, scDblFinder will use them for the clustering step. In addition, a clustering can be manually given using the clusters argument of scDblFinder(). In this way, seurat clustering could for instance be used to create the artificial doublets (see ?Seurat::as.SingleCellExperiment.Seurat for conversion to SCE).

After artificial doublets generation, the counts of real and artificial cells must then be reprocessed (i.e. normalization and PCA) together, which is performed internally using scater. If you wish this step to be performed differently, you may provide your own function for doing so (see the processing argument in ?scDblFinder). We note, however, that the impact of variations of this step on doublet detection is far milder than it is for, say, clustering: indeed, not performing any normalization at all for instance decreases doublet identification accuracy, but by very little.

### 1.5.6 Can this be used with scATACseq data?

Yes, specifically working on peak-level data. Because of the much greater sparsity of single-cell ATAC-seq data and the fact that scDblFinder works with a selection of features, using it with standard parameters will yield a poor performance. We therefore recommend to use aggregateFeatures=TRUE, which will aggregate similar features (rather than selecting features) before the normal scDblFinder procedure, and yields excellent results.