For details on the concept and technicalities of DS analysis, and the methods presented here, consider having a look at our publication:

Crowell HL, Soneson C*, Germain P-L*, Calini D,
Collin L, Raposo C, Malhotra D, and Robinson MD:
muscat detects subpopulation-specific state transitions from
multi-sample multi-condition single-cell transcriptomics data.
Nature Communications 11, 6077 (2020).
DOI: 10.1038/s41467-020-19894-4

Load packages


1 Introduction

1.1 What is DS analysis?

A fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data is the identification of systematic transcriptional changes (Stegle, Teichmann, and Marioni 2015). Such analyses are a critical step in the understanding of molecular responses, and have applications in development, in perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis methods are designed to test one set of cells against another (or more generally, multiple sets together), and can be used to compare cell clusters (e.g., for identifying marker genes) or across conditions (cells from one condition versus another) (Soneson and Robinson 2018). In such statistical models, the cells are the experimental units and thus represent the population that inferences will extrapolate to.

Using established terminology, we refer to cell identity as the combination of cell type, a stable molecular signature, and cell state, a transient snapshot of a cell’s molecular events (Wagner, Regev, and Yosef 2016; Trapnell 2015). This classification is inherently arbitrary, but still provides a basis for biological interpretation and a framework for discovering interesting expression patterns from scRNA-seq datasets. For example, T cells could be defined as a single (albeit diverse) cell type or could be divided into discrete subtypes, if relevant information to categorize each cell at this level were available. In either case, the framework presented here would be able to focus on the cell type of interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the goal becomes making sample-level inferences (i.e., experimental units are samples). Thus, differential state (DS) analysis is defined as following a given cell type across a set of samples (e.g., individuals) and experimental conditions (e.g., treatments), in order to identify cell-type-specific responses, i.e., changes in cell state. DS analysis: i) should be able to detect diluted changes that only affect a single cell type, a subset of cell types or even a subset of a single subpopulation; and, ii) is intended to be orthogonal to clustering or cell type assignment.

1.2 Starting point

The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).

The approach presented here is modular and thus subpopulation labels could originate from an earlier step in the analysis, such as clustering (Duò, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2019) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019).

2 Getting started

2.1 Data description

For this vignette, we will use a SingleCellExperiment (SCE) containing 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained before and after 6h-treatment with IFN-\(\beta\) (Kang et al. 2018). The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.

2.2 Loading the data

The Kang et al. (2018) dataset has been made available through Bioconductor’s ExperimentHub and can be loaded into R as follows: We first initialize a Hub instance to search for and load available data with the ExperimentHub function, and store the complete list of records in the variable eh. Using query, we then retrieve any records that match our keyword(s) of interest, as well as their corresponding accession ID (EH1234).

eh <- ExperimentHub()
query(eh, "Kang")
## ExperimentHub with 3 records
## # snapshotDate(): 2024-04-29
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: character, SingleCellExperiment, BSseq
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1661"]]' 
##            title                                               
##   EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH2259 | Kang18_8vs8

Finally, we load the data of interest into R via [[ and the corresponding accession ID. The dataset contains >35,000 genes and ~29,000 cells:

(sce <- eh[["EH2259"]])
## class: SingleCellExperiment 
## dim: 35635 29065 
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):

2.3 Preprocessing

The scater package (McCarthy et al. 2017) provides a variety of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to

  • remove undetected genes
  • remove cells with very few or many detected genes
  • remove very lowly expressed genes
  • compute normalized expression values for visualization

For more thorough preprocessing, we refer to the Quality control with scater vignette.

# remove undetected genes
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
## [1] 18890 29065

We use perCellQCMetrics to compute various per-cell quality control metrics, and proceed with filtering cells and genes as noted above:

# calculate per-cell quality control (QC) metrics
qc <- perCellQCMetrics(sce)

# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
## [1] 18890 26820
# remove lowly expressed genes
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
## [1]  7118 26820

Finally, we use logNormCounts to calculate log\(_2\)-transformed normalized expression values by dividing each count by its size factor, adding a pseudo-count of 1, and log-transforming1 Note that, in this workflow, expression values are used for visualization only, and that differential analyses are performed on pseudobulks (section 3.3) or the count data directly (section 3.4)..

# compute sum-factors & normalize
sce <- computeLibraryFactors(sce)
sce <- logNormCounts(sce)

Alternatively, expression values could be obtained via vst (variance stabilizing transformation) from the sctransform package (Hafemeister and Satija 2019), which returns Pearson residuals from a regularized negative binomial regression model that can be interpreted as normalized expression values:

assays(sce)$vstresiduals <- vst(counts(sce), verbosity = FALSE)$y

By default, scater’s functions will try to access the assay data specified via argument exprs_values (default logcounts) for e.g. visualization and dimension reduction. When an alternative assay such as the vstresiduals above should be used, it is thus necessary to explicitly specify this, for example, via runUMAP(sce, exprs_values = "vstresiduals") to compute UMAP cell embeddings on the assay data compute above.

2.4 Data preparation

muscat expects a certain format of the input SCE. Specifically, the following cell metadata (colData) columns have to be provided:

  • "sample_id": unique sample identifiers (e.g., PeterPan_ref1, Nautilus_trt3, …)
  • "cluster_id": subpopulation (cluster) assignments (e.g., T cells, monocytes, …)
  • "group_id": experimental group/condition (e.g., control/treatment, healthy/diseased, …)
sce$id <- paste0(sce$stim, sce$ind)
(sce <- prepSCE(sce, 
    kid = "cell", # subpopulation assignments
    gid = "stim",  # group IDs (ctrl/stim)
    sid = "id",   # sample IDs (ctrl/stim.1234)
    drop = TRUE))  # drop all other colData columns
## class: SingleCellExperiment 
## dim: 7118 26820 
## metadata(1): experiment_info
## assays(2): counts logcounts
## rownames(7118): NOC2L HES4 ... S100B PRMT2
## rowData names(2): ENSEMBL SYMBOL
## colData names(3): cluster_id sample_id group_id
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):

For consistency and easy accession throughout this vignette, we will store cluster and sample IDs, as well as the number of clusters and samples into the following simple variables:

nk <- length(kids <- levels(sce$cluster_id))
ns <- length(sids <- levels(sce$sample_id))
names(kids) <- kids; names(sids) <- sids

2.5 Data overview

2.5.1 Cluster-sample sizes

As we will be aggregating measurements at the cluster-sample level, it is of particular importance to check the number of cells captured for each such instance. While aggregateData (see Section 3.1) allows excluding cluster-sample combinations with less than a threshold number of cells, clusters or samples with overall very low cell-counts may be excluded from further analysis at this point already.

For the Kang et al. (2018) dataset, for example, one might consider removing the Dendritic cells and Megakaryocytes clusters, as these contain less than 50 cells across all samples.

# nb. of cells per cluster-sample
t(table(sce$cluster_id, sce$sample_id))
##            B cells CD14+ Monocytes CD4 T cells CD8 T cells Dendritic cells
##   ctrl101      113             186         336          95               5
##   ctrl1015     476             783         919         226              11
##   ctrl1016     144             419         526         671              10
##   ctrl1039      30             116         202          30               5
##   ctrl107       51             222         197          31               2
##   ctrl1244     134             429        1215          82              28
##   ctrl1256     240             383        1136         156              12
##   ctrl1488     234             317        1343          78              17
##   stim101      144             222         437         121              20
##   stim1015     357             683         814         153              17
##   stim1016     129             361         426         600              10
##   stim1039      39             154         318          40               7
##   stim107       56             185         217          22               7
##   stim1244      94             318         980          46              19
##   stim1256     211             369        1047         133              16
##   stim1488     283             370        1658          73              34
##            FCGR3A+ Monocytes Megakaryocytes NK cells
##   ctrl101                 81             12       84
##   ctrl1015               232             25      208
##   ctrl1016               126             15      151
##   ctrl1039                28              5       20
##   ctrl107                 29              5       49
##   ctrl1244                53             19      131
##   ctrl1256                50             15      275
##   ctrl1488                99             21      120
##   stim101                126              7      120
##   stim1015               222             24      224
##   stim1016               124             12      239
##   stim1039                36             13       32
##   stim107                 36              4       51
##   stim1244                35             14      136
##   stim1256                73             21      257
##   stim1488               139             35      187

2.5.2 Dimension reduction

The dimension reductions (DR) available within the SCE can be accessed via reducedDims from the scater package. The data provided by Kang et al. (2018) already contains t-SNE coordinates; however, we can of course compute additional dimension reductions using one of scater’s runX functions:

# compute UMAP using 1st 20 PCs
sce <- runUMAP(sce, pca = 20)

Using scater’s plotReducedDim function, we can plot t-SNE and UMAP representations colored by cluster and group IDs, respectively. We additionally create a small wrapper function, .plot_dr(), to improve the readability of color legends and simplify the plotting theme:

# wrapper to prettify reduced dimension plots
.plot_dr <- function(sce, dr, col)
  plotReducedDim(sce, dimred = dr, colour_by = col) +
    guides(fill = guide_legend(override.aes = list(alpha = 1, size = 3))) +
    theme_minimal() + theme(aspect.ratio = 1)

For our dataset, the t-SNE and UMAP colored by cluster_ids show that cell-populations are well-separated from one another. IFN-\(\beta\) stimulation manifests as a severe shift in the low-dimensional projection of cells when coloring by group_ids, indicating widespread, genome-scale transcriptional changes.

# downsample to max. 100 cells per cluster
cs_by_k <- split(colnames(sce), sce$cluster_id)
cs100 <- unlist(sapply(cs_by_k, function(u) 
  sample(u, min(length(u), 100))))

# plot t-SNE & UMAP colored by cluster & group ID
for (dr in c("TSNE", "UMAP"))
  for (col in c("cluster_id", "group_id"))
    .plot_dr(sce[, cs100], dr, col)