Chapter 7 Advanced options

7.1 Preconstructed indices

Advanced users can split the SingleR() workflow into two separate training and classification steps. This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once for any reference. The resulting data structure can then be re-used across multiple classifications with different test datasets, provided the gene annotation in the test dataset is identical to or a superset of the genes in the training set. To illustrate, we will consider the DICE reference dataset (Schmiedel et al. 2018) from the celldex package.

library(celldex)
dice <- DatabaseImmuneCellExpressionData(ensembl=TRUE)
dice
## class: SummarizedExperiment 
## dim: 29914 1561 
## metadata(0):
## assays(1): logcounts
## rownames(29914): ENSG00000121410 ENSG00000268895 ... ENSG00000159840
##   ENSG00000074755
## rowData names(0):
## colnames(1561): TPM_1 TPM_2 ... TPM_101 TPM_102
## colData names(3): label.main label.fine label.ont
table(dice$label.fine)
##                   B cells, naive                 Monocytes, CD14+ 
##                              106                              106 
##                 Monocytes, CD16+                         NK cells 
##                              105                              105 
##               T cells, CD4+, TFH               T cells, CD4+, Th1 
##                              104                              104 
##              T cells, CD4+, Th17            T cells, CD4+, Th1_17 
##                              104                              104 
##               T cells, CD4+, Th2       T cells, CD4+, memory TREG 
##                              104                              104 
##             T cells, CD4+, naive        T cells, CD4+, naive TREG 
##                              103                              104 
## T cells, CD4+, naive, stimulated             T cells, CD8+, naive 
##                              102                              104 
## T cells, CD8+, naive, stimulated 
##                              102

Let’s say we want to use the DICE reference to annotate the PBMC dataset from Chapter 1.

library(TENxPBMCData)
sce <- TENxPBMCData("pbmc3k")

We use the trainSingleR() function to do all the necessary calculations that are independent of the test dataset. (Almost; see comments below about common.) This yields a list of various components that contains all identified marker genes and precomputed rank indices to be used in the score calculation. We can also turn on aggregation with aggr.ref=TRUE (Section 3.4) to further reduce computational work.

common <- intersect(rownames(sce), rownames(dice))

library(SingleR)
trained <- trainSingleR(dice[common,], labels=dice$label.fine, aggr.ref=TRUE)

We then use the trained object to annotate our dataset of interest through the classifySingleR() function. As we can see, this yields exactly the same result as applying SingleR() directly. The advantage here is that trained can be re-used for multiple classifySingleR() calls - possibly on different datasets - without having to repeat unnecessary steps when the reference is unchanged.

pred <- classifySingleR(sce, trained, assay.type=1)
table(pred$labels)
##             B cells, naive           Monocytes, CD14+ 
##                        344                        515 
##           Monocytes, CD16+                   NK cells 
##                        187                        320 
##         T cells, CD4+, TFH         T cells, CD4+, Th1 
##                        365                        222 
##        T cells, CD4+, Th17      T cells, CD4+, Th1_17 
##                         64                         62 
##         T cells, CD4+, Th2 T cells, CD4+, memory TREG 
##                         69                        169 
##       T cells, CD4+, naive  T cells, CD4+, naive TREG 
##                        115                         57 
##       T cells, CD8+, naive 
##                        211
# Comparing to the direct approach.
direct <- SingleR(sce, ref=dice, labels=dice$label.fine,
    assay.type.test=1, aggr.ref=TRUE)
identical(pred$labels, direct$labels)
## [1] TRUE

The big caveat is that the universe of genes in the test dataset must be a superset of that of the reference. This is the reason behind the intersection to common genes and the subsequent subsetting of dice. Practical use of preconstructed indices is best combined with some prior information about the gene-level annotation; for example, we might know that we always use a particular version of the Ensembl gene models, so we would filter out any genes in the reference dataset that are not in our test datasets.
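
The superset requirement can be checked before any classification is attempted. The following is a minimal sketch in base R, using made-up gene identifiers and a hypothetical helper named coversReference(); in practice the inputs would be row names such as rownames(dice) and rownames(sce).

```r
# Made-up gene identifiers for one reference and two test datasets.
ref.genes <- c("geneA", "geneB", "geneC", "geneD")
test.a <- c("geneD", "geneA", "geneB", "geneC", "geneE")
test.b <- c("geneA", "geneC")

# Hypothetical helper: TRUE if every reference gene is present in the test dataset.
coversReference <- function(test.genes, ref.genes) {
    all(ref.genes %in% test.genes)
}

coversReference(test.a, ref.genes) # test.a contains all reference genes.
coversReference(test.b, ref.genes) # test.b lacks some reference genes.

# Genes shared by the reference and all test datasets, in reference order.
universe <- Reduce(intersect, list(ref.genes, test.a, test.b))
universe
```

Subsetting the reference to such a shared universe before training plays the same role as the intersection to common above, but allows one trained object to serve several test datasets.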

7.2 Parallelization

Parallelization is an obvious approach to increasing annotation throughput. This is done using the framework in the BiocParallel package, which provides several options for parallelization depending on the available hardware. On POSIX-compliant systems (i.e., Linux and macOS), the simplest method is to use forking by passing MulticoreParam() to the BPPARAM= argument:

library(BiocParallel)
pred2a <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine,
    BPPARAM=MulticoreParam(8)) # 8 CPUs.

Alternatively, one can use separate processes with SnowParam(), which is slower but can be used on all systems - including Windows, our old nemesis.

pred2b <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine,
    BPPARAM=SnowParam(8)) # 8 workers, as in the MulticoreParam() example.
identical(pred2a$labels, pred2b$labels)
## [1] TRUE

When working on a cluster, passing BatchtoolsParam() to SingleR() allows us to seamlessly interface with various job schedulers like SLURM, LSF and so on. This permits heavy-duty parallelization across hundreds of CPUs for highly intensive jobs, though often some configuration is required - see the vignette for more details.

7.3 Approximate algorithms

It is possible to sacrifice accuracy to squeeze more speed out of SingleR. The most obvious approach is to simply turn off the fine-tuning with fine.tune=FALSE, which avoids the time-consuming fine-tuning iterations. When the reference labels are well-separated, this is probably an acceptable trade-off.

pred3a <- SingleR(sce, ref=dice, assay.type.test=1,
    labels=dice$label.main, fine.tune=FALSE)
table(pred3a$labels)
##       B cells     Monocytes      NK cells T cells, CD4+ T cells, CD8+ 
##           348           705           357           950           340

Another approximation is based on the fact that the initial score calculation is done using a nearest-neighbors search. By default, this is an exact search, but we can switch to an approximate algorithm via the BNPARAM= argument. In the example below, we use the Annoy algorithm via the BiocNeighbors framework, which yields mostly similar results. (Note, though, that the Annoy method does involve a considerable amount of overhead, so for small jobs it will actually be slower than the exact search.)

library(BiocNeighbors)
pred3b <- SingleR(sce, ref=dice, assay.type.test=1,
    labels=dice$label.main, fine.tune=FALSE, # for comparison with pred3a.
    BNPARAM=AnnoyParam())
table(pred3a$labels, pred3b$labels)
##                 B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
##   B cells           348         0        0             0             0
##   Monocytes           0       705        0             0             0
##   NK cells            0         0      357             0             0
##   T cells, CD4+       0         0        0           950             0
##   T cells, CD8+       0         0        0             0           340
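
To see why a nearest-neighbors search suits a rank-based score, assume the score is a rank correlation (Spearman's rho). The rho between two profiles is then a simple function of the squared Euclidean distance between their centered, unit-length rank vectors, namely d^2 = 2(1 - rho), so the closest neighbor is also the most correlated one. The sketch below checks this identity in base R on simulated data. The helper scaledRank() is hypothetical, and how SingleR uses this relationship internally is not described here.

```r
set.seed(42)
x <- rnorm(50)       # simulated expression profile for one "cell".
y <- x + rnorm(50)   # a second, correlated profile.

# Hypothetical helper: convert values to ranks, then center them
# and scale them to unit length.
scaledRank <- function(v) {
    r <- rank(v)
    r <- r - mean(r)
    r / sqrt(sum(r^2))
}

# Squared Euclidean distance between the scaled rank vectors.
d2 <- sum((scaledRank(x) - scaledRank(y))^2)

# Spearman correlation computed directly.
rho <- cor(x, y, method="spearman")

# The two quantities agree up to numerical precision.
all.equal(d2, 2 * (1 - rho))
```

An approximate search such as Annoy may then return a neighbor that is close but not the closest, which is why the approximate results can differ slightly from the exact ones.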

7.4 Cluster-level annotation

The default philosophy of SingleR is to perform annotation of each individual cell in the test dataset. An alternative strategy is to perform annotation of aggregated profiles for groups or clusters of cells. To demonstrate, we will perform a quick-and-dirty clustering of our PBMC dataset with a variety of Bioconductor packages.

library(scuttle)
sce <- logNormCounts(sce)

library(scran)
dec <- modelGeneVarByPoisson(sce)
sce <- denoisePCA(sce, dec, subset.row=getTopHVGs(dec, n=5000))

library(bluster)
colLabels(sce) <- clusterRows(reducedDim(sce), NNGraphParam())

library(scater)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="label")

By passing clusters= to SingleR(), we direct the function to compute an aggregated profile per cluster. Annotation is then performed on the cluster-level profiles rather than on the single-cell level. This has the major advantage of being much faster to compute as there are obviously fewer clusters than cells; it is also easier to interpret as it directly returns the likely cell type identity of each cluster.

SingleR(sce, dice, clusters=colLabels(sce), labels=dice$label.main)
## DataFrame with 11 rows and 4 columns
##                             scores        labels delta.next pruned.labels
##                           <matrix>   <character>  <numeric>   <character>
## 1  0.1545149:0.261593:0.601408:... T cells, CD4+  0.0295797 T cells, CD4+
## 2  0.2095510:0.236086:0.356562:... T cells, CD4+  0.0449275 T cells, CD4+
## 3  0.0526260:0.271140:0.727792:...      NK cells  0.3343791      NK cells
## 4  0.1590303:0.761772:0.212253:...     Monocytes  0.5495187     Monocytes
## 5  0.1450583:0.782498:0.205440:...     Monocytes  0.5770585     Monocytes
## 6  0.6409026:0.332129:0.224033:...       B cells  0.3087738       B cells
## 7  0.2020630:0.275696:0.412119:... T cells, CD4+  0.1159168 T cells, CD4+
## 8  0.2275805:0.602347:0.211547:...     Monocytes  0.3747668     Monocytes
## 9  0.1679136:0.753156:0.259175:...     Monocytes  0.4939805     Monocytes
## 10 0.2535745:0.258933:0.328738:... T cells, CD4+  0.0569244 T cells, CD4+
## 11 0.0713926:0.223101:0.117047:...     Monocytes  0.1060540            NA
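
To make the notion of an aggregated profile concrete, the sketch below builds per-cluster profiles from a toy count matrix in base R. The data, the cluster assignments and the choice between summing and averaging are all assumptions made for illustration; this is not necessarily how SingleR() aggregates internally. For a rank-based score the choice matters little, as summed and averaged profiles rank the genes identically within each cluster.

```r
set.seed(100)
# Toy data (hypothetical): 5 genes by 12 cells, with 3 clusters.
mat <- matrix(rpois(60, lambda=5), nrow=5,
    dimnames=list(paste0("gene", 1:5), paste0("cell", 1:12)))
clusters <- rep(c("A", "B", "C"), times=c(5, 4, 3))

# rowsum() sums rows by group, so we transpose to sum cells by cluster
# and transpose back to obtain a gene-by-cluster matrix.
summed <- t(rowsum(t(mat), group=clusters))

# Divide each cluster's column by its number of cells to get means.
ncells <- as.vector(table(clusters)[colnames(summed)])
averaged <- sweep(summed, 2, ncells, "/")

# Within each cluster, sums and means give the same gene ranking.
identical(apply(summed, 2, rank), apply(averaged, 2, rank))
```

The result has one column per cluster instead of one per cell, which is where the speed-up of cluster-level annotation comes from.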

This approach assumes that each cluster in the test dataset corresponds to exactly one reference label. If a cluster actually contains a mixture of multiple labels, this will not be reflected in its lone assigned label. (We note that it would be very difficult to determine the composition of the mixture from the SingleR() scores.) Indeed, there is no guarantee that the clustering is driven by the same factors that distinguish the reference labels, decreasing the reliability of the annotations when novel heterogeneity is present in the test dataset. The default per-cell strategy is safer and provides more information about the ambiguity of the annotations, which is important for closely related labels where a close correspondence between clusters and labels cannot be expected.
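
One way to gauge whether the one-label-per-cluster assumption is reasonable is to cross-tabulate per-cell labels against the clusters. The sketch below does this in base R with made-up labels and cluster assignments. With real data, the inputs might be the labels from a per-cell SingleR() call and colLabels(sce), which is an assumption about how the objects above would be reused.

```r
# Made-up per-cell labels and cluster assignments for 10 cells.
cell.labels <- c("T", "T", "T", "NK", "B", "B", "B", "B", "Mono", "Mono")
clusters <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)

# Rows are clusters, columns are labels.
tab <- table(cluster=clusters, label=cell.labels)
tab

# Fraction of each cluster occupied by its most frequent label.
purity <- apply(tab, 1, max) / rowSums(tab)
purity
```

A purity well below 1 suggests that the cluster mixes several labels, in which case a single cluster-level label would hide that heterogeneity.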

Session information

R version 4.4.0 beta (2024-04-15 r86425)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/ 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scater_1.32.0               ggplot2_3.5.1              
 [3] bluster_1.14.0              scran_1.32.0               
 [5] scuttle_1.14.0              BiocNeighbors_1.22.0       
 [7] BiocParallel_1.38.0         SingleR_2.6.0              
 [9] TENxPBMCData_1.21.0         HDF5Array_1.32.0           
[11] rhdf5_2.48.0                DelayedArray_0.30.0        
[13] SparseArray_1.4.0           S4Arrays_1.4.0             
[15] abind_1.4-5                 Matrix_1.7-0               
[17] SingleCellExperiment_1.26.0 ensembldb_2.28.0           
[19] AnnotationFilter_1.28.0     GenomicFeatures_1.56.0     
[21] AnnotationDbi_1.66.0        celldex_1.13.3             
[23] SummarizedExperiment_1.34.0 Biobase_2.64.0             
[25] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
[27] IRanges_2.38.0              S4Vectors_0.42.0           
[29] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
[31] matrixStats_1.3.0           BiocStyle_2.32.0           
[33] rebook_1.14.0              

loaded via a namespace (and not attached):
  [1] jsonlite_1.8.8            CodeDepends_0.6.6        
  [3] magrittr_2.0.3            ggbeeswarm_0.7.2         
  [5] gypsum_1.0.0              farver_2.1.1             
  [7] rmarkdown_2.26            BiocIO_1.14.0            
  [9] zlibbioc_1.50.0           vctrs_0.6.5              
 [11] memoise_2.0.1             Rsamtools_2.20.0         
 [13] DelayedMatrixStats_1.26.0 RCurl_1.98-1.14          
 [15] htmltools_0.5.8.1         AnnotationHub_3.12.0     
 [17] curl_5.2.1                Rhdf5lib_1.26.0          
 [19] sass_0.4.9                alabaster.base_1.4.0     
 [21] bslib_0.7.0               httr2_1.0.1              
 [23] cachem_1.0.8              GenomicAlignments_1.40.0 
 [25] igraph_2.0.3              mime_0.12                
 [27] lifecycle_1.0.4           pkgconfig_2.0.3          
 [29] rsvd_1.0.5                R6_2.5.1                 
 [31] fastmap_1.1.1             GenomeInfoDbData_1.2.12  
 [33] digest_0.6.35             colorspace_2.1-0         
 [35] paws.storage_0.5.0        dqrng_0.3.2              
 [37] irlba_2.3.5.1             ExperimentHub_2.12.0     
 [39] RSQLite_2.3.6             beachmat_2.20.0          
 [41] labeling_0.4.3            filelock_1.0.3           
 [43] fansi_1.0.6               httr_1.4.7               
 [45] compiler_4.4.0            bit64_4.0.5              
 [47] withr_3.0.0               viridis_0.6.5            
 [49] DBI_1.2.2                 highr_0.10               
 [51] alabaster.ranges_1.4.0    alabaster.schemas_1.4.0  
 [53] rappdirs_0.3.3            rjson_0.2.21             
 [55] tools_4.4.0               vipor_0.4.7              
 [57] beeswarm_0.4.0            glue_1.7.0               
 [59] restfulr_0.0.15           rhdf5filters_1.16.0      
 [61] grid_4.4.0                Rtsne_0.17               
 [63] cluster_2.1.6             generics_0.1.3           
 [65] snow_0.4-4                gtable_0.3.5             
 [67] metapod_1.12.0            BiocSingular_1.20.0      
 [69] ScaledMatrix_1.12.0       utf8_1.2.4               
 [71] XVector_0.44.0            ggrepel_0.9.5            
 [73] BiocVersion_3.19.1        pillar_1.9.0             
 [75] limma_3.60.0              dplyr_1.1.4              
 [77] BiocFileCache_2.12.0      lattice_0.22-6           
 [79] rtracklayer_1.64.0        bit_4.0.5                
 [81] tidyselect_1.2.1          paws.common_0.7.2        
 [83] locfit_1.5-9.9            Biostrings_2.72.0        
 [85] knitr_1.46                gridExtra_2.3            
 [87] bookdown_0.39             ProtGenerics_1.36.0      
 [89] edgeR_4.2.0               xfun_0.43                
 [91] statmod_1.5.0             UCSC.utils_1.0.0         
 [93] lazyeval_0.2.2            yaml_2.3.8               
 [95] evaluate_0.23             codetools_0.2-20         
 [97] tibble_3.2.1              alabaster.matrix_1.4.0   
 [99] BiocManager_1.30.22       graph_1.82.0             
[101] cli_3.6.2                 munsell_0.5.1            
[103] jquerylib_0.1.4           Rcpp_1.0.12              
[105] dir.expiry_1.12.0         dbplyr_2.5.0             
[107] png_0.1-8                 XML_3.99-0.16.1          
[109] parallel_4.4.0            blob_1.2.4               
[111] sparseMatrixStats_1.16.0  bitops_1.0-7             
[113] viridisLite_0.4.2         alabaster.se_1.4.0       
[115] scales_1.3.0              purrr_1.0.2              
[117] crayon_1.5.2              rlang_1.1.3              
[119] cowplot_1.1.3             KEGGREST_1.44.0          


Schmiedel, Benjamin J., Divya Singh, Ariel Madrigal, Alan G. Valdovino-Gonzalez, Brandie M. White, Jose Zapardiel-Gonzalo, Brendan Ha, et al. 2018. “Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression.” Cell 175 (6): 1701–1715.e16.