Chapter 14 HCA human bone marrow (10X Genomics)

14.1 Introduction

Here, we use an example dataset from the Human Cell Atlas immune cell profiling project on bone marrow, which contains scRNA-seq data for 380,000 cells generated using the 10X Genomics technology. This is a fairly big dataset that represents a good use case for the techniques in Advanced Chapter 14.

14.2 Data loading

This dataset is loaded via the HCAData package, which provides a ready-to-use SingleCellExperiment object.

sce.bone <- HCAData('ica_bone_marrow', as.sparse=TRUE)
sce.bone$Donor <- sub("_.*", "", sce.bone$Barcode)

We use symbols in place of IDs for easier interpretation later.

rowData(sce.bone)$Chr <- mapIds(EnsDb.Hsapiens.v86, keys=rownames(sce.bone),
    column="SEQNAME", keytype="GENEID")

rownames(sce.bone) <- uniquifyFeatureNames(rowData(sce.bone)$ID,
    names = rowData(sce.bone)$Symbol)

14.3 Quality control

Cell calling was not performed (see here) so we will perform QC using all metrics and block on the donor of origin during outlier detection. We perform the calculation across multiple cores to speed things up.

bpp <- MulticoreParam(8)
sce.bone <- unfiltered <- addPerCellQC(sce.bone, BPPARAM=bpp,

qc <- quickPerCellQC(colData(sce.bone), batch=sce.bone$Donor,
sce.bone <- sce.bone[,!qc$discard]
unfiltered$discard <- qc$discard

    plotColData(unfiltered, x="Donor", y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, x="Donor", y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, x="Donor", y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent"),
Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded.

Figure 14.1: Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded.

plotColData(unfiltered, x="sum", y="subsets_Mito_percent", 
    colour_by="discard") + scale_x_log10()
Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded.

Figure 14.2: Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded.

14.4 Normalization

For a minor speed-up, we use already-computed library sizes rather than re-computing them from the column sums.

sce.bone <- logNormCounts(sce.bone, size_factors = sce.bone$sum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.05    0.47    0.65    1.00    0.89   42.38

14.5 Variance modeling

We block on the donor of origin to mitigate batch effects during HVG selection. We select a larger number of HVGs to capture any batch-specific variation that might be present.

dec.bone <- modelGeneVarByPoisson(sce.bone, 
    block=sce.bone$Donor, BPPARAM=bpp)
top.bone <- getTopHVGs(dec.bone, n=5000)
blocked.stats <- dec.bone$per.block
for (i in colnames(blocked.stats)) {
    current <- blocked.stats[[i]]
    plot(current$mean, current$total, main=i, pch=16, cex=0.5,
        xlab="Mean of log-expression", ylab="Variance of log-expression")
    curfit <- metadata(current)
    curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances.

Figure 14.3: Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances.

14.6 Data integration

Here we use multiple cores, randomized SVD and approximate nearest-neighbor detection to speed up this step.


merged.bone <- fastMNN(sce.bone, batch = sce.bone$Donor, subset.row = top.bone,
     BSPARAM=BiocSingular::RandomParam(deferred = TRUE), 

reducedDim(sce.bone, 'MNN') <- reducedDim(merged.bone, 'corrected')

We use the percentage of variance lost as a diagnostic measure:

##      MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7
## [1,]  0.007133  0.006508  0.000000  0.000000  0.000000  0.000000  0.000000
## [2,]  0.006314  0.006883  0.023528  0.000000  0.000000  0.000000  0.000000
## [3,]  0.005117  0.003096  0.005115  0.019703  0.000000  0.000000  0.000000
## [4,]  0.001991  0.001888  0.001890  0.001766  0.023451  0.000000  0.000000
## [5,]  0.002391  0.001914  0.001735  0.002805  0.002563  0.023692  0.000000
## [6,]  0.003053  0.003180  0.002958  0.002522  0.003211  0.003342  0.024807
## [7,]  0.001826  0.001591  0.002290  0.001881  0.001473  0.002174  0.001908
##      MantonBM8
## [1,]   0.00000
## [2,]   0.00000
## [3,]   0.00000
## [4,]   0.00000
## [5,]   0.00000
## [6,]   0.00000
## [7,]   0.03235

14.7 Dimensionality reduction

We set external_neighbors=TRUE to replace the internal nearest neighbor search in the UMAP implementation with our parallelized approximate search. We also set the number of threads to be used in the UMAP iterations.

sce.bone <- runUMAP(sce.bone, dimred="MNN",

14.8 Clustering

Graph-based clustering generates an excessively large intermediate graph so we will instead use a two-step approach with \(k\)-means. We generate 1000 small clusters that are subsequently aggregated into more interpretable groups with a graph-based method. If more resolution is required, we can increase centers in addition to using a lower k during graph construction.


colLabels(sce.bone) <- clusterRows(reducedDim(sce.bone, "MNN"),
    TwoStepParam(KmeansParam(centers=1000), NNGraphParam(k=5)))

##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 18859 15812 36360 47699 26528 10869 65650 18584 35321  8009 14930  3601  4206 
##    14    15    16 
##  3155  4824  2318

We observe mostly balanced contributions from different samples to each cluster (Figure 14.4), consistent with the expectation that all samples are replicates from different donors.

tab <- table(Cluster=colLabels(sce.bone), Donor=sce.bone$Donor)
pheatmap(log10(tab+10), color=viridis::viridis(100))
Heatmap of log~10~-number of cells in each cluster (row) from each sample (column).

Figure 14.4: Heatmap of log10-number of cells in each cluster (row) from each sample (column).

# TODO: add scrambling option in scater's plotting functions.
scrambled <- sample(ncol(sce.bone))

    plotUMAP(sce.bone, colour_by="label", text_by="label"),
    plotUMAP(sce.bone[,scrambled], colour_by="Donor")
UMAP plots of the HCA bone marrow dataset after merging. Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom).

Figure 14.5: UMAP plots of the HCA bone marrow dataset after merging. Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom).

14.9 Differential expression

We identify marker genes for each cluster while blocking on the donor.

markers.bone <- findMarkers(sce.bone, block = sce.bone$Donor, 
    direction = 'up', lfc = 1, BPPARAM=bpp)

We visualize the top markers for a randomly chosen cluster using a “dot plot” in Figure 14.6. The presence of upregulated genes like LYZ, S100A8 and VCAN is consistent with a monocyte identity for this cluster.

top.markers <- markers.bone[["4"]]
best <- top.markers[top.markers$Top <= 10,]
lfcs <- getMarkerEffects(best)

pheatmap(lfcs, breaks=seq(-5, 5, length.out=101))
Heatmap of log~2~-fold changes for the top marker genes (rows) of cluster 4 compared to all other clusters (columns).

Figure 14.6: Heatmap of log2-fold changes for the top marker genes (rows) of cluster 4 compared to all other clusters (columns).

14.10 Cell type classification

We perform automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile. This is faster than the per-cell approaches described in Chapter 10.9 at the cost of the resolution required to detect heterogeneity inside a cluster. Nonetheless, it is often sufficient for a quick assignment of cluster identity, and indeed, cluster 4 is also identified as consisting of monocytes from this analysis.

se.aggregated <- sumCountsAcrossCells(sce.bone, id=colLabels(sce.bone), BPPARAM=bpp)

hpc <- HumanPrimaryCellAtlasData()

anno.single <- SingleR(se.aggregated, ref = hpc, labels = hpc$label.main,
## DataFrame with 16 rows and 4 columns
##                             scores      labels pruned.labels
##                           <matrix> <character>  <numeric>   <character>
## 1   0.384401:0.751148:0.651234:...         GMP  0.0913786           GMP
## 2   0.343557:0.567261:0.479100:...     T_cells  0.4298632       T_cells
## 3   0.323043:0.647364:0.558334:...     T_cells  0.0959201       T_cells
## 4   0.299294:0.745584:0.535751:...    Monocyte  0.2935059      Monocyte
## 5   0.310761:0.672644:0.540285:...      B_cell  0.6024293        B_cell
## ...                            ...         ...        ...           ...
## 12  0.294203:0.707235:0.528198:...    Monocyte  0.3586359      Monocyte
## 13  0.343741:0.731258:0.600058:...    Monocyte  0.1019188            NA
## 14  0.369798:0.652467:0.582201:...      B_cell  0.1976631            NA
## 15  0.378580:0.690882:0.781190:...         MEP  0.0614135           MEP
## 16  0.333963:0.679341:0.559147:...         GMP  0.1114087           GMP

Session Info

R version 4.4.0 beta (2024-04-15 r86425)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/ 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleR_2.6.0               celldex_1.14.0             
 [3] pheatmap_1.0.12             bluster_1.14.0             
 [5] BiocNeighbors_1.22.0        batchelor_1.20.0           
 [7] scran_1.32.0                BiocParallel_1.38.0        
 [9] scater_1.32.0               ggplot2_3.5.1              
[11] scuttle_1.14.0              EnsDb.Hsapiens.v86_2.99.0  
[13] ensembldb_2.28.0            AnnotationFilter_1.28.0    
[15] GenomicFeatures_1.56.0      AnnotationDbi_1.66.0       
[17] rhdf5_2.48.0                HCAData_1.20.0             
[19] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[21] Biobase_2.64.0              GenomicRanges_1.56.0       
[23] GenomeInfoDb_1.40.0         IRanges_2.38.0             
[25] S4Vectors_0.42.0            BiocGenerics_0.50.0        
[27] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[29] BiocStyle_2.32.0            rebook_1.14.0              

loaded via a namespace (and not attached):
  [1] BiocIO_1.14.0             bitops_1.0-7             
  [3] filelock_1.0.3            tibble_3.2.1             
  [5] CodeDepends_0.6.6         graph_1.82.0             
  [7] XML_3.99-0.16.1           lifecycle_1.0.4          
  [9] httr2_1.0.1               edgeR_4.2.0              
 [11] lattice_0.22-6            alabaster.base_1.4.0     
 [13] magrittr_2.0.3            limma_3.60.0             
 [15] sass_0.4.9                rmarkdown_2.26           
 [17] jquerylib_0.1.4           yaml_2.3.8               
 [19] metapod_1.12.0            cowplot_1.1.3            
 [21] DBI_1.2.2                 RColorBrewer_1.1-3       
 [23] ResidualMatrix_1.14.0     abind_1.4-5              
 [25] zlibbioc_1.50.0           purrr_1.0.2              
 [27] RCurl_1.98-1.14           rappdirs_0.3.3           
 [29] GenomeInfoDbData_1.2.12   ggrepel_0.9.5            
 [31] irlba_2.3.5.1             dqrng_0.3.2              
 [33] DelayedMatrixStats_1.26.0 codetools_0.2-20         
 [35] DelayedArray_0.30.0       tidyselect_1.2.1         
 [37] UCSC.utils_1.0.0          farver_2.1.1             
 [39] ScaledMatrix_1.12.0       viridis_0.6.5            
 [41] BiocFileCache_2.12.0      GenomicAlignments_1.40.0 
 [43] jsonlite_1.8.8            tools_4.4.0              
 [45] Rcpp_1.0.12               glue_1.7.0               
 [47] gridExtra_2.3             SparseArray_1.4.0        
 [49] xfun_0.43                 dplyr_1.1.4              
 [51] HDF5Array_1.32.0          gypsum_1.0.0             
 [53] withr_3.0.0               BiocManager_1.30.22      
 [55] fastmap_1.1.1             rhdf5filters_1.16.0      
 [57] fansi_1.0.6               digest_0.6.35            
 [59] rsvd_1.0.5                R6_2.5.1                 
 [61] mime_0.12                 colorspace_2.1-0         
 [63] RSQLite_2.3.6             paws.storage_0.5.0       
 [65] utf8_1.2.4                generics_0.1.3           
 [67] rtracklayer_1.64.0        httr_1.4.7               
 [69] S4Arrays_1.4.0            uwot_0.2.2               
 [71] pkgconfig_2.0.3           gtable_0.3.5             
 [73] blob_1.2.4                XVector_0.44.0           
 [75] htmltools_0.5.8.1         bookdown_0.39            
 [77] ProtGenerics_1.36.0       scales_1.3.0             
 [79] alabaster.matrix_1.4.0    png_0.1-8                
 [81] knitr_1.46                rjson_0.2.21             
 [83] curl_5.2.1                cachem_1.0.8             
 [85] BiocVersion_3.19.1        parallel_4.4.0           
 [87] vipor_0.4.7               restfulr_0.0.15          
 [89] pillar_1.9.0              grid_4.4.0               
 [91] alabaster.schemas_1.4.0   vctrs_0.6.5              
 [93] BiocSingular_1.20.0       dbplyr_2.5.0             
 [95] beachmat_2.20.0           cluster_2.1.6            
 [97] beeswarm_0.4.0            evaluate_0.23            
 [99] cli_3.6.2                 locfit_1.5-9.9           
[101] compiler_4.4.0            Rsamtools_2.20.0         
[103] rlang_1.1.3               crayon_1.5.2             
[105] paws.common_0.7.2         labeling_0.4.3           
[107] ggbeeswarm_0.7.2          viridisLite_0.4.2        
[109] alabaster.se_1.4.0        munsell_0.5.1            
[111] Biostrings_2.72.0         lazyeval_0.2.2           
[113] Matrix_1.7-0              dir.expiry_1.12.0        
[115] ExperimentHub_2.12.0      sparseMatrixStats_1.16.0 
[117] bit64_4.0.5               Rhdf5lib_1.26.0          
[119] KEGGREST_1.44.0           statmod_1.5.0            
[121] alabaster.ranges_1.4.0    highr_0.10               
[123] AnnotationHub_3.12.0      igraph_2.0.3             
[125] memoise_2.0.1             bslib_0.7.0              
[127] bit_4.0.5                

Bach, K., S. Pensa, M. Grzelak, J. Hadfield, D. J. Adams, J. C. Marioni, and W. T. Khaled. 2017. “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nat Commun 8 (1): 2128.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2): 266–77.

Lawlor, N., J. George, M. Bolisetty, R. Kursawe, L. Sun, V. Sivakamasundari, I. Kycia, P. Robson, and M. L. Stitzel. 2017. “Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.” Genome Res. 27 (2): 208–22.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11): 1795–1806.

Messmer, T., F. von Meyenn, A. Savino, F. Santos, H. Mohammed, A. T. L. Lun, J. C. Marioni, and W. Reik. 2019. “Transcriptional heterogeneity in naive and primed human pluripotent stem cells at single-cell resolution.” Cell Rep 26 (4): 815–24.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4): 385–94.

Nestorowa, S., F. K. Hamey, B. Pijuan Sala, E. Diamanti, M. Shepherd, E. Laurenti, N. K. Wilson, D. G. Kent, and B. Gottgens. 2016. “A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation.” Blood 128 (8): 20–31.

Paul, F., Y. Arkin, A. Giladi, D. A. Jaitin, E. Kenigsberg, H. Keren-Shaul, D. Winter, et al. 2015. “Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors.” Cell 163 (7): 1663–77.

Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1): 171–81.

Pollen, A. A., T. J. Nowakowski, J. Shuga, X. Wang, A. A. Leyrat, J. H. Lui, N. Li, et al. 2014. “Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.” Nat. Biotechnol. 32 (10): 1053–8.

Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4): 593–607.

Wajapeyee, N., S. Z. Wang, R. W. Serra, P. D. Solomon, A. Nagarajan, X. Zhu, and M. R. Green. 2010. “Senescence induction in human fibroblasts and hematopoietic progenitors by leukemogenic fusion proteins.” Blood 115 (24): 5057–60.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.

Zeng, H., E. H. Shen, J. G. Hohmann, S. W. Oh, A. Bernard, J. J. Royall, K. J. Glattfelder, et al. 2012. “Large-scale cellular-resolution gene profiling in human neocortex reveals species-specific molecular signatures.” Cell 149 (2): 483–96.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.