Chapter 37 Paul mouse HSC (MARS-seq)

37.1 Introduction

This workflow performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with MARS-seq (Paul et al. 2015). Cells were extracted from multiple mice under different experimental conditions (i.e., sorting protocols), and libraries were prepared using a series of 384-well plates.

37.2 Data loading

After loading and annotation, we inspect the resulting SingleCellExperiment object:

## class: SingleCellExperiment 
## dim: 17483 10368 
## metadata(0):
## assays(1): counts
## rownames(17483): ENSMUSG00000007777 ENSMUSG00000107002 ...
##   ENSMUSG00000039068 ENSMUSG00000064363
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(10368): W29953 W29954 ... W76335 W76336
## colData names(13): Well_ID Seq_batch_ID ... CD34_measurement
##   FcgR3_measurement
## reducedDimNames(0):
## altExpNames(0):

37.3 Quality control

For some reason, only one mitochondrial transcript is available, so we will perform quality control using only the library size and number of detected features. Ideally, we would simply block on the plate of origin to account for differences in processing. Unfortunately, many plates have a large proportion (if not an outright majority) of cells with poor values for both metrics, which violates the assumption that most cells in each batch are of acceptable quality and makes the per-plate outlier thresholds too lenient. We identify such plates based on the presence of very low outlier thresholds, for some arbitrary definition of “low”; we then redefine thresholds using information from the other (presumably high-quality) plates.
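The two-pass strategy described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used to generate the results in this chapter: the function names, the `min_ok` cutoff, the choice of three MADs on the log scale, and the pooling of cells from the retained plates are all assumptions made for the sketch, and the toy library sizes are invented.

```python
import math
from statistics import median


def lower_threshold(values, nmads=3.0):
    """Lower outlier threshold on the log10 scale: median - nmads * MAD."""
    logs = [math.log10(v) for v in values]  # assumes strictly positive values
    med = median(logs)
    # Scale the MAD by 1.4826 so it estimates the standard deviation under normality.
    mad = 1.4826 * median(abs(x - med) for x in logs)
    return 10 ** (med - nmads * mad)


def plate_thresholds(lib_sizes, plates, min_ok=100.0, nmads=3.0):
    """Per-plate thresholds, replacing those from plates flagged as low quality."""
    by_plate = {}
    for size, plate in zip(lib_sizes, plates):
        by_plate.setdefault(plate, []).append(size)

    # First pass: each plate defines its own threshold.
    per_plate = {p: lower_threshold(v, nmads) for p, v in by_plate.items()}

    # Plates with a very low threshold are assumed to be dominated by poor cells.
    # `min_ok` is a hypothetical cutoff for this arbitrary definition of "low".
    ignore = {p for p, t in per_plate.items() if t < min_ok}
    if len(ignore) == len(per_plate):
        raise ValueError("no high-quality plates left to define a threshold")

    # Second pass (assumption of this sketch): pool cells from the retained
    # plates and use their threshold for every flagged plate.
    pooled = [s for p, v in by_plate.items() if p not in ignore for s in v]
    shared = lower_threshold(pooled, nmads)
    final = {p: (shared if p in ignore else t) for p, t in per_plate.items()}
    return final, ignore


# Toy data: plate "A" is healthy, plate "B" consists mostly of poor libraries.
lib_sizes = [1000, 1200, 900, 1100, 1050, 950, 20, 5, 800, 10, 1000, 15, 8]
plates = ["A"] * 6 + ["B"] * 7

thresholds, ignore = plate_thresholds(lib_sizes, plates)
discard = [size < thresholds[plate] for size, plate in zip(lib_sizes, plates)]
```

In the toy data, plate "B" would retain almost all of its poor cells if it defined its own threshold, because the median itself sits among the poor cells. Borrowing the threshold from plate "A" removes them while keeping the two healthy libraries on plate "B".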

We examine the number of cells discarded for each reason.

##   low_lib_size low_n_features        discard 
##           1695           1781           1783
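The counts above can be read with a little inclusion-exclusion arithmetic. The sketch below assumes that `discard` is simply the union of the two per-reason flags, which is consistent with the totals shown but is not stated explicitly in the output; the function names and the toy flags are hypothetical.

```python
def count_discards(low_lib_size, low_n_features):
    """Count cells failing each check, and cells failing at least one."""
    discard = [a or b for a, b in zip(low_lib_size, low_n_features)]
    return {
        "low_lib_size": sum(low_lib_size),
        "low_n_features": sum(low_n_features),
        "discard": sum(discard),
    }


def overlap(n_a, n_b, n_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union


# Toy flags for five cells.
toy = count_discards(
    [True, True, False, False, False],
    [True, False, True, False, False],
)

# Applied to the counts reported above (assuming discard is the union).
both = overlap(1695, 1781, 1783)
only_lib_size = 1695 - both
only_n_features = 1781 - both
```

Under that assumption, nearly all cells with a low library size also have a low number of detected features, so the two metrics are flagging largely the same cells.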

We create some diagnostic plots for each metric (Figure 37.1).

Figure 37.1: Distribution of each QC metric across cells in the Paul HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded.

37.4 Normalization

We examine some key metrics for the distribution of size factors, and compare them to the library sizes as a sanity check (Figure 37.2).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.057   0.422   0.775   1.000   1.335   9.654

Figure 37.2: Relationship between the library size factors and the deconvolution size factors in the Paul HSC dataset.
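The mean of exactly 1.000 in the summary above is consistent with size factors being scaled to unit mean. As a minimal sketch of the simpler of the two quantities being compared, library size factors can be computed by dividing each cell's library size by the mean library size. The function name and toy values are hypothetical, and the deconvolution size factors are not reproduced here.

```python
from statistics import mean


def library_size_factors(lib_sizes):
    """Scale library sizes so that the resulting size factors have a mean of 1."""
    centre = mean(lib_sizes)
    return [size / centre for size in lib_sizes]


# Toy library sizes for four cells.
lib_sizes = [500, 1000, 1500, 3000]
size_factors = library_size_factors(lib_sizes)
```

Because the scaling is a single constant, the relative differences between cells are preserved, so a cell with twice the library size gets twice the size factor.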

37.5 Variance modelling

We fit a mean-variance trend to the endogenous genes to detect highly variable genes. Unfortunately, the plates are confounded with an experimental treatment (Batch_desc), so we cannot block on the plate of origin.

Figure 37.3: Per-gene variance as a function of the mean for the log-expression values in the Paul HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson noise.
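To give some intuition for the shape of a Poisson-based trend, the sketch below computes the variance of log-transformed Poisson counts at a few mean values. It makes several simplifying assumptions that are not taken from this chapter: all size factors are 1, the transformation is log2 with a pseudo-count of 1, and the expectation is evaluated by direct summation over the Poisson probability mass function instead of by simulation. The function name is hypothetical.

```python
import math


def poisson_log_var(lam, pseudo=1.0):
    """Mean and variance of log2(X + pseudo) for X ~ Poisson(lam), with lam > 0."""
    # Sum far enough into the upper tail that the remaining mass is negligible.
    upper = int(lam + 12 * math.sqrt(lam) + 30)
    e1 = 0.0  # E[f(X)]
    e2 = 0.0  # E[f(X)^2]
    for k in range(upper + 1):
        # Poisson probability of k, evaluated in log space for numerical stability.
        p = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        f = math.log2(k + pseudo)
        e1 += p * f
        e2 += p * f * f
    return e1, e2 - e1 * e1


# Technical variance at a low, moderate and high mean count.
trend = {lam: poisson_log_var(lam)[1] for lam in (0.01, 1.0, 100.0)}
```

The variance of the log-values is small at very low means (most counts are zero), peaks at moderate means, and declines again at high means, which produces a hump-shaped trend. Genes lying well above such a trend are candidates for being highly variable.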

37.7 Clustering

There is a strong relationship between the cluster and the experimental treatment (Figure 37.4), which is to be expected. Of course, this may also be attributable to some batch effect; the confounded nature of the experimental design makes it difficult to make any confident statements either way.

Figure 37.4: Heatmap of the distribution of cells across clusters (rows) for each experimental treatment (columns).
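The table underlying a heatmap like this is a cross-tabulation of cluster labels against treatment labels. The following is a minimal sketch with invented labels; the function names and the treatment names `treatA` and `treatB` are hypothetical and do not correspond to the actual levels of `Batch_desc`.

```python
from collections import Counter


def cross_tab(clusters, treatments):
    """Count cells for every (cluster, treatment) pair; rows are clusters."""
    counts = Counter(zip(clusters, treatments))
    rows = sorted(set(clusters))
    cols = sorted(set(treatments))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def row_fractions(table):
    """Convert each row to proportions, to see which treatment dominates a cluster."""
    return [[n / sum(row) for n in row] for row in table]


# Toy labels for six cells.
clusters = ["1", "1", "2", "2", "2", "3"]
treatments = ["treatA", "treatA", "treatB", "treatB", "treatA", "treatB"]

rows, cols, table = cross_tab(clusters, treatments)
fractions = row_fractions(table)
```

Clusters whose row is dominated by a single column are those that are most strongly associated with one treatment, which is the pattern described above. As noted, such a pattern cannot distinguish biology from batch effects in a confounded design.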

Figure 37.5: Obligatory \(t\)-SNE plot of the Paul HSC dataset, where each point represents a cell and is colored according to the assigned cluster.

Figure 37.6: Obligatory \(t\)-SNE plot of the Paul HSC dataset faceted by the treatment condition, where each point represents a cell and is colored according to the assigned cluster.

Session Info

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scran_1.18.0                scater_1.18.0              
 [3] ggplot2_3.3.2               AnnotationHub_2.22.0       
 [5] BiocFileCache_1.14.0        dbplyr_1.4.4               
 [7] ensembldb_2.14.0            AnnotationFilter_1.14.0    
 [9] GenomicFeatures_1.42.0      AnnotationDbi_1.52.0       
[11] scRNAseq_2.4.0              SingleCellExperiment_1.12.0
[13] SummarizedExperiment_1.20.0 Biobase_2.50.0             
[15] GenomicRanges_1.42.0        GenomeInfoDb_1.26.0        
[17] IRanges_2.24.0              S4Vectors_0.28.0           
[19] BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
[21] matrixStats_0.57.0          BiocStyle_2.18.0           
[23] rebook_1.0.0               

loaded via a namespace (and not attached):
  [1] Rtsne_0.15                    ggbeeswarm_0.6.0             
  [3] colorspace_1.4-1              ellipsis_0.3.1               
  [5] scuttle_1.0.0                 bluster_1.0.0                
  [7] XVector_0.30.0                BiocNeighbors_1.8.0          
  [9] farver_2.0.3                  bit64_4.0.5                  
 [11] interactiveDisplayBase_1.28.0 xml2_1.3.2                   
 [13] codetools_0.2-16              sparseMatrixStats_1.2.0      
 [15] knitr_1.30                    Rsamtools_2.6.0              
 [17] pheatmap_1.0.12               graph_1.68.0                 
 [19] shiny_1.5.0                   BiocManager_1.30.10          
 [21] compiler_4.0.3                httr_1.4.2                   
 [23] dqrng_0.2.1                   assertthat_0.2.1             
 [25] Matrix_1.2-18                 fastmap_1.0.1                
 [27] lazyeval_0.2.2                limma_3.46.0                 
 [29] later_1.1.0.1                 BiocSingular_1.6.0           
 [31] htmltools_0.5.0               prettyunits_1.1.1            
 [33] tools_4.0.3                   igraph_1.2.6                 
 [35] rsvd_1.0.3                    gtable_0.3.0                 
 [37] glue_1.4.2                    GenomeInfoDbData_1.2.4       
 [39] dplyr_1.0.2                   rappdirs_0.3.1               
 [41] Rcpp_1.0.5                    vctrs_0.3.4                  
 [43] Biostrings_2.58.0             ExperimentHub_1.16.0         
 [45] rtracklayer_1.50.0            DelayedMatrixStats_1.12.0    
 [47] xfun_0.19                     stringr_1.4.0                
 [49] ps_1.4.0                      beachmat_2.6.0               
 [51] mime_0.9                      lifecycle_0.2.0              
 [53] irlba_2.3.3                   statmod_1.4.35               
 [55] XML_3.99-0.5                  edgeR_3.32.0                 
 [57] zlibbioc_1.36.0               scales_1.1.1                 
 [59] hms_0.5.3                     promises_1.1.1               
 [61] ProtGenerics_1.22.0           RColorBrewer_1.1-2           
 [63] yaml_2.2.1                    curl_4.3                     
 [65] gridExtra_2.3                 memoise_1.1.0                
 [67] biomaRt_2.46.0                stringi_1.5.3                
 [69] RSQLite_2.2.1                 highr_0.8                    
 [71] BiocVersion_3.12.0            BiocParallel_1.24.0          
 [73] rlang_0.4.8                   pkgconfig_2.0.3              
 [75] bitops_1.0-6                  evaluate_0.14                
 [77] lattice_0.20-41               purrr_0.3.4                  
 [79] labeling_0.4.2                GenomicAlignments_1.26.0     
 [81] CodeDepends_0.6.5             cowplot_1.1.0                
 [83] bit_4.0.4                     processx_3.4.4               
 [85] tidyselect_1.1.0              magrittr_1.5                 
 [87] bookdown_0.21                 R6_2.5.0                     
 [89] generics_0.1.0                DelayedArray_0.16.0          
 [91] DBI_1.1.0                     pillar_1.4.6                 
 [93] withr_2.3.0                   RCurl_1.98-1.2               
 [95] tibble_3.0.4                  crayon_1.3.4                 
 [97] rmarkdown_2.5                 viridis_0.5.1                
 [99] progress_1.2.2                locfit_1.5-9.4               
[101] grid_4.0.3                    blob_1.2.1                   
[103] callr_3.5.1                   digest_0.6.27                
[105] xtable_1.8-4                  httpuv_1.5.4                 
[107] openssl_1.4.3                 munsell_0.5.0                
[109] viridisLite_0.3.0             beeswarm_0.2.3               
[111] vipor_0.4.5                   askpass_1.1                  

Bibliography

Paul, F., Y. Arkin, A. Giladi, D. A. Jaitin, E. Kenigsberg, H. Keren-Shaul, D. Winter, et al. 2015. “Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors.” Cell 163 (7): 1663–77.