Chapter 5 Using multiple references
In some cases, we may wish to use multiple references for annotation of a test dataset. This yields a more comprehensive set of cell types than any individual reference can cover, especially when differences in resolution are considered. However, combining references is not trivial due to batch effects across references (from differences in technology, experimental protocol or the biological system) as well as differences in the annotation vocabulary between investigators.
Several strategies are available to combine inferences from multiple references:
- using reference-specific labels in a combined reference
- using harmonized labels in a combined reference
- combining scores across multiple references
This chapter discusses the various strengths and weaknesses of each strategy and provides some practical demonstrations of each. Here, we will use the HPCA and Blueprint/ENCODE datasets as our references and (yet another) PBMC dataset as the test.
5.2 Using reference-specific labels
In this strategy, each label is defined in the context of its reference dataset. This means that a label - say, “B cell” - in reference dataset X is considered to be different from a “B cell” label in reference dataset Y. Use of reference-specific labels is most appropriate if there are relevant biological differences between the references; for example, if one reference is concerned with healthy tissue while the other reference considers diseased tissue, it can be helpful to distinguish between the same cell type in different biological contexts.
We can easily implement this approach by combining the expression matrices together
and pasting the reference name onto the corresponding character vector of labels.
This modification ensures that the downstream SingleR() call
will treat each label-reference combination as a distinct entity.
It is then straightforward to perform annotation with the usual methods.
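As a minimal sketch of the label-pasting step, the example below uses small simulated matrices in place of the real references. The object names (hpca.mat, bpe.mat, combined and so on) and all values are hypothetical, and only base R is used.

```r
# Toy stand-ins for two references (genes x samples, log-expression).
# All object names and values here are hypothetical.
set.seed(42)
hpca.mat <- matrix(rnorm(6 * 4), nrow=6,
    dimnames=list(paste0("gene", 1:6), paste0("hpca", 1:4)))
bpe.mat <- matrix(rnorm(5 * 3), nrow=5,
    dimnames=list(paste0("gene", 2:6), paste0("bpe", 1:3)))
hpca.labels <- c("B_cell", "B_cell", "T_cells", "T_cells")
bpe.labels <- c("B-cells", "CD4+ T-cells", "CD8+ T-cells")

# Paste the reference name onto each label so that, e.g., "HPCA.B_cell"
# and "BPE.B-cells" are treated as distinct entities.
hpca.labels2 <- paste0("HPCA.", hpca.labels)
bpe.labels2 <- paste0("BPE.", bpe.labels)

# Combine the expression matrices over the genes shared by both references.
shared <- intersect(rownames(hpca.mat), rownames(bpe.mat))
combined <- cbind(hpca.mat[shared,], bpe.mat[shared,])
combined.labels <- c(hpca.labels2, bpe.labels2)

table(combined.labels)
```

The combined matrix and label vector would then be supplied to SingleR() through its ref= and labels= arguments, in the same way as a single reference.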
Number of test cells assigned to each reference-specific label:

    ##      BPE.B-cells BPE.CD4+ T-cells BPE.CD8+ T-cells          BPE.HSC
    ##             1179             1708             2656               20
    ##    BPE.Monocytes     BPE.NK cells  HPCA.HSC_-G-CSF   HPCA.Platelets
    ##             2348              460                1                7
    ##     HPCA.T_cells
    ##                2
However, this strategy identifies markers by directly comparing expression values across references, meaning that the marker set is likely to contain genes responsible for uninteresting batch effects. This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results. The use of reference-specific labels also complicates interpretation of the results as the cell type is always qualified by its reference of origin.
5.3 Comparing scores across references
5.3.1 Combining inferences from individual references
Another strategy - and the default approach implemented in SingleR() -
involves performing classification separately within each reference,
and then collating the results to choose the label with the highest score across references.
This is a relatively expedient approach that avoids the need for explicit harmonization
while also reducing exposure to reference-specific batch effects.
To use this method, we simply pass multiple objects to the
ref= and labels= arguments in SingleR().
The combining strategy is as follows:
- The function first annotates the test dataset with each reference individually
in the same manner as described in Section 1.2.
This step is almost equivalent to simply looping over all individual references and running SingleR() on each.
- For each cell, the function collects its predicted labels across all references. In doing so, it also identifies the union of markers that are upregulated in the predicted label in each reference.
- The function identifies the overall best-scoring label as the final prediction for that cell. This step involves a recomputation of the scores across the identified marker subset to ensure that these scores are derived from the same set of genes (and are thus comparable across references).
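The three steps above can be illustrated with a toy example in base R. This is only a sketch under simplifying assumptions: each label is represented by a single made-up expression profile, the score is a plain Spearman correlation, and the names refs, score_labels, common and recomputed are hypothetical. It is not the code that SingleR() runs internally.

```r
# Toy log-expression for one test cell and per-label profiles in two
# hypothetical references (all values are made up for illustration).
genes <- paste0("g", 1:6)
cell <- setNames(c(6, 5, 4, 3, 2, 1), genes)

refs <- list(
    A=list(
        profiles=cbind(
            "B_cell"=setNames(c(10, 9, 3, 4, 2, 1), genes),
            "T_cells"=setNames(c(1, 2, 5, 6, 9, 10), genes)),
        markers=list("B_cell"=c("g1", "g2"), "T_cells"=c("g5", "g6"))),
    B=list(
        profiles=cbind(
            "B-cells"=setNames(c(8, 7, 9, 2, 3, 1), genes),
            "CD4+ T-cells"=setNames(c(2, 3, 1, 8, 7, 9), genes)),
        markers=list("B-cells"=c("g1", "g3"), "CD4+ T-cells"=c("g4", "g6")))
)

# Spearman correlation between the cell and each label profile over 'use'.
score_labels <- function(profiles, use) {
    apply(profiles[use, , drop=FALSE], 2,
        function(p) cor(cell[use], p, method="spearman"))
}

# Step 1: annotate with each reference separately, using its own markers.
individual <- lapply(refs, function(r) {
    scores <- score_labels(r$profiles, unique(unlist(r$markers)))
    list(label=names(which.max(scores)), scores=scores)
})
predicted <- vapply(individual, function(x) x$label, "")

# Step 2: take the union of markers for the label predicted in each reference.
common <- unique(unlist(Map(function(r, lab) r$markers[[lab]], refs, predicted)))

# Step 3: recompute the scores of the predicted labels on the common genes,
# so that they are comparable across references, and pick the best one.
recomputed <- mapply(function(r, lab) {
    unname(score_labels(r$profiles[, lab, drop=FALSE], common))
}, refs, predicted)
best <- names(which.max(recomputed))

data.frame(reference=best, label=predicted[[best]], score=recomputed[[best]])
```

In this toy example, the label predicted by reference B scores 0.8 on that reference's own markers but -0.5 on the common gene set, so the label from reference A is reported. This shows why the original per-reference scores cannot be compared directly.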
The function will then return a
DataFrame of combined results for each cell in the test dataset,
including the overall label and the reference from which it was assigned.
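Number of cells assigned to each label in the combined results:

    ##          B-cells           B_cell     CD4+ T-cells     CD8+ T-cells
    ##             1170               14             1450             2936
    ##              GMP              HSC         Monocyte        Monocytes
    ##                1               22              753             1560
    ##         NK cells          NK_cell        Platelets Pre-B_cell_CD34-
    ##              372               10                9               16
    ##          T_cells
    ##               68

Number of cells for which each reference supplied the final label:

    ##    1    2
    ## 7510  871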
The main appeal of this approach lies in the fact that it is based on the results of annotation with individual references. This avoids batch effects from comparing expression values across references; it reduces the need for any coordination in the label scheme between references; and simultaneously provides the per-reference annotations in the results. The last feature is particularly useful as it allows for more detailed diagnostics, troubleshooting and further analysis.
Per-reference labels for the first six cells, from Blueprint/ENCODE:

    ## [1] "B-cells"      "Monocytes"    "CD8+ T-cells" "CD8+ T-cells" "Monocytes"
    ## [6] "Monocytes"

and from HPCA:

    ## [1] "B_cell"   "Monocyte" "T_cells"  "T_cells"  "Monocyte" "Monocyte"
The main downside is that it is somewhat suboptimal if there are many labels that are unique to one reference, as markers are not identified with the aim of distinguishing a label in one reference from another label in another reference. The continued lack of consistency in the labels across references also complicates interpretation of the results, though we can overcome this by using harmonized labels as described below.
5.3.2 Combined diagnostics
All of the diagnostic plots in SingleR will naturally operate on these combined results.
For example, we can create a heatmap of the scores in all of the individual references
as well as for the recomputed scores in the combined results (Figure 5.1).
Note that scores are only recomputed for the labels predicted in the individual references,
so all labels outside of those are simply set to
NA - hence the swathes of grey.
The deltas for each individual reference can also be plotted with
plotDeltaDistribution() (Figure 5.2).
No deltas are shown for the recomputed scores as the assumption described in Section 4.3
may not be applicable across the predicted labels from the individual references.
For example, if all individual references suggest the same cell type with similar recomputed scores,
any delta would be low even though the assignment is highly confident.
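A small numeric illustration of this point follows. It assumes, as in Section 4.3, that the delta is the difference between the score of the assigned label and the median score across labels; the scores and object names are made up.

```r
# Hypothetical scores for one cell within a single reference, where the
# assigned label clearly stands out from the other labels.
scores.single <- c("B-cells"=0.70, "Monocytes"=0.35,
    "NK cells"=0.30, "CD8+ T-cells"=0.25)
delta.single <- max(scores.single) - median(scores.single)

# Hypothetical recomputed scores for the same cell, with one entry per
# reference. Both references predict a B cell with very similar scores.
scores.recomputed <- c("BPE.B-cells"=0.70, "HPCA.B_cell"=0.71)
delta.recomputed <- max(scores.recomputed) - median(scores.recomputed)

c(single=delta.single, recomputed=delta.recomputed)
```

The delta across the recomputed scores is close to zero even though both references agree on the cell type, so it would be misleading as a measure of confidence.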
We can similarly extract marker genes to use in heatmaps as described in Section 4.4.
As annotation was performed with each individual reference,
we can simply extract the marker genes from the nested
DataFrames as shown in Figure 5.3.
    # Extract the markers detected within each individual reference.
    hpca.markers <- metadata(com.res2$orig.results$HPCA)$de.genes
    bpe.markers <- metadata(com.res2$orig.results$BPE)$de.genes

    # c() joins the two marker lists before unlist() flattens them.
    mono.markers <- unique(unlist(c(hpca.markers$Monocyte, bpe.markers$Monocytes)))

    library(scater)
    plotHeatmap(logNormCounts(pbmc),
        order_columns_by=list(I(com.res2$labels)),
        features=mono.markers)
5.4 Using harmonized labels
5.4.2 Manual label harmonization
The matchReferences() function provides a simple approach for label harmonization between two references.
Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed,
i.e., for each pair of labels, what is the probability that a cell with one label is assigned the other and vice versa?
Probabilities close to 1 in Figure 5.4 indicate there is a 1:1 relation between that pair of labels;
on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.
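To make the notion of mutual assignment concrete, the following base R sketch computes it from hypothetical cross-annotation results. It assumes that the mutual probability for a pair of labels is the product of the two conditional assignment proportions. The label vectors and object names are made up for illustration, and this is not the internal code of matchReferences().

```r
# Label vocabularies of two hypothetical references.
lev1 <- c("B-cells", "CD4+ T-cells", "CD8+ T-cells")
lev2 <- c("B_cell", "T_cells")

# 'true1' holds the labels of the reference 1 samples and 'pred1' the labels
# assigned to them when reference 2 is used for annotation.
true1 <- rep(lev1, each=2)
pred1 <- c("B_cell", "B_cell", "T_cells", "T_cells", "T_cells", "T_cells")

# The converse: reference 2 samples annotated with reference 1.
true2 <- rep(lev2, c(2, 4))
pred2 <- c("B-cells", "B-cells", "CD4+ T-cells", "CD4+ T-cells",
    "CD8+ T-cells", "CD8+ T-cells")

# Proportion of samples with each row label that receive each column label.
p12 <- unclass(prop.table(table(factor(true1, lev1), factor(pred1, lev2)), margin=1))
p21 <- unclass(prop.table(table(factor(true2, lev2), factor(pred2, lev1)), margin=1))

# Probability of mutual assignment for every pair of labels
# (rows are reference 1 labels, columns are reference 2 labels).
mutual <- p12 * t(p21)
round(mutual, 2)
```

Here, the two B cell labels form a 1:1 pair with a probability of 1, while the CD4+ and CD8+ T cell labels each reach only 0.5 against the coarser T_cells label because the latter is split between them.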
This function can be used to guide harmonization to enforce a consistent vocabulary between two sets of labels. However, some manual intervention is still required in this process given the ambiguities posed by differences in biological systems and technologies. In the example above, neurons are considered to be unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data. CD4+ and CD8+ T cells are also both assigned to “T cells”, so some decision about the acceptable resolution of the harmonized labels is required here.
As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This involves substituting the cluster assignments as proxies for the labels, allowing us to match up clusters and integrate conclusions from multiple datasets without the difficulties of batch correction and reclustering.