Chapter 5 Overview
This chapter provides an overview of the framework of a typical scRNA-seq analysis workflow (Figure 5.1). Subsequent chapters will describe each analysis step in more detail.
5.2 Experimental Design
Before starting the analysis itself, some comments on experimental design may be helpful. The most obvious question is the choice of technology, which can be roughly divided into:
- Droplet-based: 10X Genomics, inDrop, Drop-seq
- Plate-based with unique molecular identifiers (UMIs): CEL-seq, MARS-seq
- Plate-based with reads: Smart-seq2
- Other: sci-RNA-seq, Seq-Well
Each of these methods have their own advantages and weaknesses that are discussed extensively elsewhere (Mereu et al. 2019; Ziegenhain et al. 2017). In practical terms, droplet-based technologies are the current de facto standard due to their throughput and low cost per cell. Plate-based methods can capture other phenotypic information (e.g., morphology) and are more amenable to customization. Read-based methods provide whole-transcript coverage, which is useful in some applications (e.g., splicing, exome mutations); otherwise, UMI-based methods are more popular as they mitigate the effects of PCR amplification noise. The choice of method is left to the reader’s circumstances - we will simply note that most aspects of our analysis pipeline are technology-agnostic.
The next question is how many cells should be captured, and to what depth they should be sequenced. The short answer is “as much as you can afford to spend”. The long answer is that it depends on the aim of the analysis. If we are aiming to discover rare cell subpopulations, then we need more cells. If we are aiming to characterize subtle differences, then we need more sequencing depth. As of time of writing, an informal survey of the literature suggests that typical droplet-based experiments would capture anywhere from 10,000 to 100,000 cells, sequenced at anywhere from 1,000 to 10,000 UMIs per cell (usually in inverse proportion to the number of cells). Droplet-based methods also have a trade-off between throughput and doublet rate that affects the true efficiency of sequencing.
For studies involving multiple samples or conditions, the design considerations are the same as those for bulk RNA-seq experiments. There should be multiple biological replicates for each condition and conditions should not be confounded with batch. Note that individual cells are not replicates; rather, we are referring to samples derived from replicate donors or cultures.
5.3 Obtaining a count matrix
Sequencing data from scRNA-seq experiments must be converted into a matrix of expression values that can be used for statistical analysis. Given the discrete nature of sequencing data, this is usually a count matrix containing the number of UMIs or reads mapped to each gene in each cell. The exact procedure for quantifying expression tends to be technology-dependent:
- For 10X Genomics data, the
CellRangersoftware suite provides a custom pipeline to obtain a count matrix. This uses STAR to align reads to the reference genome and then counts the number of unique UMIs mapped to each gene.
- Pseudo-alignment methods such as
alevincan be used to obtain a count matrix from the same data with greater efficiency. This avoids the need for explicit alignment, which reduces the compute time and memory usage.
- For other highly multiplexed protocols, the scPipe package provides a more general pipeline for processing scRNA-seq data. This uses the Rsubread aligner to align reads and then counts UMIs per gene.
- For CEL-seq or CEL-seq2 data, the scruff package provides a dedicated pipeline for quantification.
- For read-based protocols, we can generally re-use the same pipelines for processing bulk RNA-seq data.
- For any data involving spike-in transcripts, the spike-in sequences should be included as part of the reference genome during alignment and quantification.
After quantification, we import the count matrix into R and create a
This can be done with base methods (e.g.,
read.table()) followed by applying the
Alternatively, for specific file formats, we can use dedicated methods from the DropletUtils (for 10X data) or tximport/tximeta packages (for pseudo-alignment methods).
Depending on the origin of the data, this requires some vigilance:
- Some feature-counting tools will report mapping statistics in the count matrix (e.g., the number of unaligned or unassigned reads).
While these values can be useful for quality control, they would be misleading if treated as gene expression values.
Thus, they should be removed (or at least moved to the
colData) prior to further analyses.
- Be careful of using the
^ERCCregular expression to detect spike-in rows in human data where the row names of the count matrix are gene symbols. An ERCC gene family actually exists in human annotation, so this would result in incorrect identification of genes as spike-in transcripts. This problem can be avoided by using count matrices with standard identifiers (e.g., Ensembl, Entrez).
5.4 Data processing and downstream analysis
In the simplest case, the workflow has the following form:
- We compute quality control metrics to remove low-quality cells that would interfere with downstream analyses. These cells may have been damaged during processing or may not have been fully captured by the sequencing protocol. Common metrics includes the total counts per cell, the proportion of spike-in or mitochondrial reads and the number of detected features.
- We convert the counts into normalized expression values to eliminate cell-specific biases (e.g., in capture efficiency). This allows us to perform explicit comparisons across cells in downstream steps like clustering. We also apply a transformation, typically log, to adjust for the mean-variance relationship.
- We perform feature selection to pick a subset of interesting features for downstream analysis. This is done by modelling the variance across cells for each gene and retaining genes that are highly variable. The aim is to reduce computational overhead and noise from uninteresting genes.
- We apply dimensionality reduction to compact the data and further reduce noise. Principal components analysis is typically used to obtain an initial low-rank representation for more computational work, followed by more aggressive methods like \(t\)-stochastic neighbor embedding for visualization purposes.
- We cluster cells into groups according to similarities in their (normalized) expression profiles. This aims to obtain groupings that serve as empirical proxies for distinct biological states. We typically interpret these groupings by identifying differentially expressed marker genes between clusters.
Additional steps such as data integration and cell annotation will be discussed in their respective chapters.
5.5 Quick start
Here, we use the a droplet-based retina dataset from Macosko et al. (2015), provided in the scRNAseq package. This starts from a count matrix and finishes with clusters (Figure 5.2) in preparation for biological interpretation. Similar workflows are available in abbreviated form in the Workflows.,
library(scRNAseq) sce <- MacoskoRetinaData() # Quality control. library(scater) is.mito <- grepl("^MT-", rownames(sce)) qcstats <- perCellQCMetrics(sce, subsets=list(Mito=is.mito)) filtered <- quickPerCellQC(qcstats, percent_subsets="subsets_Mito_percent") sce <- sce[, !filtered$discard] # Normalization. sce <- logNormCounts(sce) # Feature selection. library(scran) dec <- modelGeneVar(sce) hvg <- getTopHVGs(dec, prop=0.1) # Dimensionality reduction. set.seed(1234) sce <- runPCA(sce, ncomponents=25, subset_row=hvg) sce <- runUMAP(sce, dimred = 'PCA', external_neighbors=TRUE) # Clustering. g <- buildSNNGraph(sce, use.dimred = 'PCA') colLabels(sce) <- factor(igraph::cluster_louvain(g)$membership) # Visualization. plotUMAP(sce, colour_by="label")
R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C  LC_TIME=en_US.UTF-8 LC_COLLATE=C  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  parallel stats4 stats graphics grDevices utils datasets  methods base other attached packages:  scran_1.18.0 scater_1.18.0  ggplot2_3.3.2 scRNAseq_2.4.0  SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0  Biobase_2.50.0 GenomicRanges_1.42.0  GenomeInfoDb_1.26.0 IRanges_2.24.0  S4Vectors_0.28.0 BiocGenerics_0.36.0  MatrixGenerics_1.2.0 matrixStats_0.57.0  BiocStyle_2.18.0 rebook_1.0.0 loaded via a namespace (and not attached):  ggbeeswarm_0.6.0 colorspace_1.4-1  ellipsis_0.3.1 scuttle_1.0.0  bluster_1.0.0 XVector_0.30.0  BiocNeighbors_1.8.0 farver_2.0.3  bit64_4.0.5 RSpectra_0.16-0  interactiveDisplayBase_1.28.0 AnnotationDbi_1.52.0  xml2_1.3.2 codetools_0.2-16  sparseMatrixStats_1.2.0 knitr_1.30  Rsamtools_2.6.0 dbplyr_1.4.4  uwot_0.1.8 graph_1.68.0  shiny_1.5.0 BiocManager_1.30.10  compiler_4.0.3 httr_1.4.2  dqrng_0.2.1 assertthat_0.2.1  Matrix_1.2-18 fastmap_1.0.1  lazyeval_0.2.2 limma_3.46.0  later_220.127.116.11 BiocSingular_1.6.0  htmltools_0.5.0 prettyunits_1.1.1  tools_4.0.3 igraph_1.2.6  rsvd_1.0.3 gtable_0.3.0  glue_1.4.2 GenomeInfoDbData_1.2.4  dplyr_1.0.2 rappdirs_0.3.1  Rcpp_1.0.5 vctrs_0.3.4  Biostrings_2.58.0 ExperimentHub_1.16.0  rtracklayer_1.50.0 DelayedMatrixStats_1.12.0  xfun_0.19 stringr_1.4.0  ps_1.4.0 beachmat_2.6.0  mime_0.9 lifecycle_0.2.0  irlba_2.3.3 ensembldb_2.14.0  statmod_1.4.35 XML_3.99-0.5  edgeR_3.32.0 AnnotationHub_2.22.0  zlibbioc_1.36.0 scales_1.1.1  hms_0.5.3 promises_1.1.1  ProtGenerics_1.22.0 AnnotationFilter_1.14.0  yaml_2.2.1 curl_4.3  gridExtra_2.3 memoise_1.1.0  biomaRt_2.46.0 stringi_1.5.3  RSQLite_2.2.1 BiocVersion_3.12.0  highr_0.8 GenomicFeatures_1.42.0  BiocParallel_1.24.0 rlang_0.4.8  pkgconfig_2.0.3 bitops_1.0-6  evaluate_0.14 lattice_0.20-41  purrr_0.3.4 labeling_0.4.2  GenomicAlignments_1.26.0 CodeDepends_0.6.5  cowplot_1.1.0 bit_4.0.4  processx_3.4.4 tidyselect_1.1.0  magrittr_1.5 bookdown_0.21  R6_2.5.0 generics_0.1.0  DelayedArray_0.16.0 DBI_1.1.0  pillar_1.4.6 withr_2.3.0  RCurl_1.98-1.2 tibble_3.0.4  crayon_1.3.4 BiocFileCache_1.14.0  rmarkdown_2.5 viridis_0.5.1  progress_1.2.2 locfit_1.5-9.4  grid_4.0.3 blob_1.2.1  callr_3.5.1 digest_0.6.27  xtable_1.8-4 httpuv_1.5.4  openssl_1.4.3 munsell_0.5.0  viridisLite_0.3.0 beeswarm_0.2.3  vipor_0.4.5 askpass_1.1
Macosko, E. Z., A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5): 1202–14.
Mereu, Elisabetta, Atefeh Lafzi, Catia Moutinho, Christoph Ziegenhain, Davis J. MacCarthy, Adrian Alvarez, Eduard Batlle, et al. 2019. “Benchmarking Single-Cell Rna Sequencing Protocols for Cell Atlas Projects.” bioRxiv. https://doi.org/10.1101/630087.
Ziegenhain, C., B. Vieth, S. Parekh, B. Reinius, A. Guillaumet-Adkins, M. Smets, H. Leonhardt, H. Heyn, I. Hellmann, and W. Enard. 2017. “Comparative Analysis of Single-Cell RNA Sequencing Methods.” Mol. Cell 65 (4): 631–43.