About This Document »

Package name cytofWorkflow
Built with Bioconductor (R) 3.6 (3.4.2)
Last Built Wed, 29 Nov 2017 04:35:12 -0800
Last Modified Mon, 27 Nov 2017 13:12:51 -0800 (r132217)
Source Package cytofWorkflow_1.0.4.tar.gz
Windows Binary cytofWorkflow_1.0.4.zip
R Script cytofWorkflow.R

To install this workflow under Bioconductor 3.6, start R and enter:


CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets

Malgorzata Nowicka1,2, Carsten Krieg3, Lukas M. Weber1,2, Felix J. Hartmann3, Silvia Guglietta4, Burkhard Becher3, Mitchell P. Levesque5 and Mark D. Robinson1,2*

1Institute for Molecular Life Sciences, University of Zurich, CH-8057 Zurich, Switzerland
2SIB Swiss Institute of Bioinformatics, University of Zurich, CH-8057 Zurich, Switzerland
3Institute of Experimental Immunology, University of Zurich, CH-8057 Zurich, Switzerland
4Department of Experimental Oncology, European Institute of Oncology, Via Adamello 16, I-20139 Milan, Italy
5Department of Dermatology, University Hospital Zurich, CH-8091 Zurich, Switzerland



High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals).

Table of Contents
Data description

1 Introduction

Flow cytometry and the more recently introduced CyTOF (cytometry by time-of-flight mass spectrometry or mass cytometry) are high-throughput technologies that measure protein abundance on the surface or within cells. In flow cytometry, antibodies are labeled with fluorescent dyes and fluorescence intensity is measured using lasers and photodetectors. CyTOF utilizes antibodies tagged with metal isotopes from the lanthanide series, which have favorable chemistry and do not occur in biological systems; abundances per cell are recorded with a time-of-flight mass spectrometer. In either case, fluorescence intensities (flow cytometry) or ion counts (mass cytometry) are assumed to be proportional to the expression level of the antibody-targeted antigens of interest.

Due to the differences in acquisition, further distinct characteristics should be noted. Conventional fluorophore-based flow cytometry is non-destructive and can be used to sort cells for further analysis. However, because of the spectral overlap between fluorophores, compensation of the data needs to be performed (Roederer 2001), which also limits the number of parameters that can be measured simultaneously. Thus, standard flow cytometry experiments measure 6-12 parameters with modern systems measuring up to 20 channels (Mahnke and Roederer 2007), while new developments (e.g. BD FACSymphony) promise to increase this capacity towards 50. Moreover, flow cytometry offers the highest throughput with tens of thousands of cells measured per second at relatively low operating costs per sample.

By using rare metal isotopes in CyTOF, cell autofluorescence can be avoided and spectral overlap is drastically reduced. However, the sensitivity of mass spectrometry results in the measurement of metal impurities and oxide formations, which need to be carefully considered in antibody panel design (e.g. through antibody concentrations and coupling of antibodies to neighboring metals). Leipold et al. recently commented that minimal spillover does not equal no spillover (Leipold 2015). Nonetheless, CyTOF offers a high dimension of parameters measured per cell, with current panels using ~40 parameters and the promise of up to 100. Throughput of CyTOF is lower, at the rate of hundreds of cells per second, and cells are destroyed during ionization.

The ability of flow cytometry and mass cytometry to analyze individual cells at high-throughput scales has resulted in a wide range of biological and medical applications. For example, immunophenotyping assays are used to detect and quantify cell populations of interest, to uncover new cell populations and compare abundance of cell populations between different conditions, for example between patient groups (Unen et al. 2016). Thus, it can be used as a biomarker discovery tool.

Various methodological approaches aim for biomarker discovery (Saeys, Gassen, and Lambrecht 2016). A common strategy, which we will refer to through this workflow as the “classic” approach, is to first identify cell populations of interest by manual gating or automated clustering (Hartmann et al. 2016; Pejoski et al. 2016). Second, using statistical tests, one can determine which of the cell subpopulations or protein markers are associated with a phenotype (e.g. clinical outcome) of interest. Typically, cell subpopulation abundance expressed as cluster cell counts or median marker expression would be used in the statistical model to relate to the sample-level phenotype.

Importantly, there are many alternatives to what we propose below, and several new methods are emerging. For instance, citrus (Bruggner et al. 2014) tackles the differential discovery problem by strong over-clustering of the cells, and by building a hierarchy of clusters from very specific to general ones. Using model selection and regularization techniques, clusters and markers that associate with the outcome are identified. A new machine learning approach, CellCnn (Arvaniti and Claassen 2017), learns the representation of clusters that are associated with the considered phenotype by means of convolutional neural networks, which makes it particularly applicable to detecting discriminating rare cell populations. However, there are tradeoffs to consider. citrus performs feature selection but does not provide significance levels, such as p-values, for the strength of associations. Due to its computational requirements, citrus can not be run on entire mass cytometry datasets and one typically must analyze a subset of the data. The “filters” from CellCnn may identify one or more cell subsets that distinguish experimental groups, while these groups may not necessarily correspond to any of the canonical cell types, since they are learned with a data-driven approach.

A noticeable distinction between the machine-learning approaches and our classical regression approach is how the model is designed. citrus and CellCnn model the patient response as a function of the measured HDCyto values, whereas the classical approach models the HDCyto data itself as the response, thus putting the distributional assumptions on the experimental HDCyto data. This carries the distinct advantage that covariates (e.g. age, gender, batch) can be included, together with finding associations of the phenotype to the predictors of interest (e.g. cell type abundance). Specifically, neither citrus nor CellCnn are able to directly account for complex designs, such as paired experiments or presence of batches.

Within the classical approach, hybrid methods are certainly possible, where discovery of interesting cell populations is done with one algorithm, and quantifications or signal aggregations are modeled in standard regression frameworks. For instance, CellCnn provides p-values from a t-test or Mann-Whitney U-test conducted on the frequencies of previously detected cell populations. The models we propose below are flexible extensions of this strategy.

Step by step, this workflow presents differential discovery analyses assembled from a suite of tools and methods that, in our view, lead to a higher level of flexibility and robust, statistically-supported and interpretable results. Cell population identification is conducted by means of unsupervised clustering using the FlowSOM and ConsensusClusterPlus packages, which together were among the best performing clustering approaches for high-dimensional cytometry data (Weber and Robinson 2016). Notably, FlowSOM scales easily to millions of cells and thus no subsetting of the data is required.

To be able to analyze arbitrary experimental designs (e.g. batch effects, paired experiments, etc.), we show how to conduct the differential analysis of cell population abundances using the generalized linear mixed models (GLMM) and of marker intensities using linear models (LM) and linear mixed models (LMM). Model fitting is performed with lme4 and stats packages, and hypothesis testing with the multcomp package.

We use the ggplot2 package as our graphical engine. Notably, we propose a suite of useful visual representations of HDCyto data characteristics, such as an MDS (multidimensional scaling) plot of aggregated signal for exploring sample similarities. The obtained cell populations are visualized using dimension reduction techniques (e.g. t-SNE via the Rtsne package) and heatmaps (via the pheatmap and ComplexHeatmap packages) to represent characteristics of the annotated cell populations and identified biomarkers.

The workflow is intentionally not fully automatic. First, we strongly advocate for exploratory data analysis to get an understanding of data characteristics before formal statistical modeling. Second, the workflow involves an optional step where the user can manually merge and annotate clusters (see Cluster merging and annotation section) but in a way that is easily reproducible. The CyTOF data used here (see Data description section) is already preprocessed; i.e. the normalization and de-barcoding, as well as removal of doublets, debris and dead cells, were already performed. To see how such an analysis could be performed, please see the Data preprocessing section.

Notably, this workflow is equally applicable to flow or mass cytometry datasets, for which the preprocessing steps have already been performed. In addition, the workflow is modular and can be adapted as new algorithms or new knowledge about how to best use existing tools comes to light. Alternative clustering algorithms such as the popular PhenoGraph algorithm (Levine et al. 2015) (e.g. via the Rphenograph package), dimensionality reduction techniques, such as diffusion maps (Haghverdi, Buettner, and Theis 2015) via the destiny package (Angerer et al. 2016), and SIMLR (Wang et al. 2017) via the SIMLR package could be inserted to the workflow.

Note: To cite this workflow, please refer to this F1000 article https://f1000research.com/articles/6-748/v1 .

Table of Contents
Data preprocessing

2 Data description

We use a subset of CyTOF data originating from Bodenmiller et al. (Bodenmiller et al. 2012) that was also used in the citrus paper (Bruggner et al. 2014). In the original study, peripheral blood mononuclear cells (PBMCs) in unstimulated and after 11 different stimulation conditions were measured for 8 healthy donors. For each sample, expression of 10 cell surface markers and 14 signaling markers was recorded. We perform our analysis on samples from the reference and one stimulated condition where cell were crosslinked for 30 minutes with B cell receptor/Fc receptor known as BCR/FcR-XL, resulting in 16 samples in total (8 patients, unstimulated and stimulated for each).

The original data is available from the Cytobank report. The subset used here can be downloaded from the Citrus Cytobank repository (files with _BCR-XL.fcs or _Reference.fcs endings) or from our web server (see Data import section).

In both the Bodenmiller et al. and citrus manuscripts, the 10 lineage markers were used to identify cell subpopulations. These were then investigated for differences between reference and stimulated cell subpopulations separately for each of the 14 functional markers. The same strategy is used in this workflow; 10 lineage markers are used for cell clustering and 14 functional markers are tested for differential expression between the reference and BCR/FcR-XL stimulation. Even though differential analysis of cell abundance was not in the scope of the Bodenmiller et al. experiment, we present it here to highlight the generality of the discovery.

Table of Contents
Data import
Data description

3 Data preprocessing

Conventional flow cytometers and mass cytometers produce .fcs files that can be manually analyzed using programs such as FlowJo [TriStar] or Cytobank (Kotecha, Krutzik, and Irish 2001), or using the R/Bioconductor packages, such as flowWorkspace (Finak and Jiang 2011) and openCyto (Finak, Frelinger, et al. 2014). During this initial analysis step, dead cells are removed, compensation is checked and with simple two dimensional scatter plots (e.g. marker intensity versus time), marker expression patterns are checked. Often, modern experiments are barcoded in order to remove analytical biases due to individual sample variation or acquisition time. Preprocessing steps including normalization using bead standards (Finck et al. 2013), de-barcoding (Zunder et al. 2015) and compensation can be completed with the CATALYST package (Chevrier et al. 2017), which also provides a Shiny app for the interactive analysis. Of course, preprocessing steps can occur using custom scripts within R or outside of R (e.g. Normalizer (Finck et al. 2013)).

Table of Contents
Data transformation
Data preprocessing

4 Data import

We recommend as standard practice to keep an independent record of all samples collected, with additional information about the experimental condition, including sample or patient identifiers, processing batch and so on. That is, we recommend having a trail of metadata for each experiment. In our example, the metadata file, PBMC8_metadata.xlsx, can be downloaded from the Robinson Lab server with the download.file function. For the workflow, the user should place it in the current working directory (getwd()). Here, we load it into R with the read_excel function from the readxl package and save it into a variable called md, but other file types and interfaces to read them in are also possible.

The data frame md contains the following columns:

  • file_name with names of the .fcs files corresponding to the reference (suffix “Reference”) and BCR/FcR-XL stimulation (suffix “BCR-XL”) samples,

  • sample_id with shorter unique names for each sample containing information about conditions and patient IDs,

  • condition describes whether samples originate from the reference (Ref) or stimulated (BCRXL) condition,

  • patient_id defines the IDs of patients.

The sample_id variable is used as row names in metadata and will be used all over the workflow to label the samples. It is important to carefully check whether variables are of the desired type (factor, numeric, character), since input methods may convert columns into different data types. For the statistical modeling, we want to make the condition variable a factor with the reference (Ref) samples being the reference level, where the order of factor levels can be defined with the levels parameter of the factor function. We also specify colors for the different conditions in a variable color_conditions.

url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
metadata_filename <- "PBMC8_metadata.xlsx"
download.file(paste0(url, "/", metadata_filename), destfile = metadata_filename,
  mode = "wb")
md <- read_excel(metadata_filename)

## Make sure condition variables are factors with the right levels
md$condition <- factor(md$condition, levels = c("Ref", "BCRXL"))
##                            file_name sample_id condition patient_id
## 1    PBMC8_30min_patient1_BCR-XL.fcs    BCRXL1     BCRXL   Patient1
## 2 PBMC8_30min_patient1_Reference.fcs      Ref1       Ref   Patient1
## 3    PBMC8_30min_patient2_BCR-XL.fcs    BCRXL2     BCRXL   Patient2
## 4 PBMC8_30min_patient2_Reference.fcs      Ref2       Ref   Patient2
## 5    PBMC8_30min_patient3_BCR-XL.fcs    BCRXL3     BCRXL   Patient3
## 6 PBMC8_30min_patient3_Reference.fcs      Ref3       Ref   Patient3
## Define colors for conditions
color_conditions <- c("#6A3D9A", "#FF7F00")
names(color_conditions) <- levels(md$condition)

The .fcs files listed in the metadata can be downloaded manually from the Citrus Cytobank repository or automatically from the Robinson Lab server where they are saved in a compressed archive file, PBMC8_fcs_files.zip.

fcs_filename <- "PBMC8_fcs_files.zip"
download.file(paste0(url, "/", fcs_filename), destfile = fcs_filename, 
  mode = "wb")

To load the content of the .fcs files into R, we use the flowCore package (Hahne et al. 2009). Using read.flowSet, we read in all files into a flowSet object, which is a general container for HDCyto data. Importantly, read.flowSet and the underlying read.FCS functions, by default, may transform the marker intensities and remove cells with extreme positive values. We keep these options off to be sure that we control the exact preprocessing steps.

fcs_raw <- read.flowSet(md$file_name, transformation = FALSE, 
  truncate_max_range = FALSE)

In our example, information about the panel is also available in a file called PBMC8_panel.xlsx, and can be downloaded from the Robinson Lab server and loaded into a variable called panel. It contains columns for Isotope and Metal that define the atomic mass number and the symbol of the chemical element conjugated to the antibody, respectively, and Antigen, which specifies the protein marker that was targeted; two additional columns specify whether a channel belongs to the lineage or surface type of marker.

The isotope, metal and antigen information that the instrument receives is also stored in the flowFrame (container for one sample) or flowSet (container for multiple samples) objects. You can type fcs_raw[[1]] to see the first flowFrame, which contains a table with columns name and desc. Their content can be accessed with functions pData(parameters()), which is identical for all the flowFrame objects in the flowSet. The variable name corresponds to the column names in the flowSet object, you can type in R colnames(fcs_raw).

It should be checked that elements from panel can be matched to their corresponding entries in the flowSet object to make the analysis less prone to subsetting mistakes. Here, for example, the entries in panel$Antigen have their exact equivalents in the desc columns of the flowFrame objects. In the following analysis, we will often use marker IDs as column names in the tables containing expression values. As a cautionary note, during object conversion from one type to another (e.g. in the creation of data.frame from a matrix), some characters (e.g. dashes) in the dimension names are replaced with dots, which may cause problems in matching. To avoid this problem, we replace all the dashes with underscores. Also, we define two variables that indicate the lineage and functional markers.

panel_filename <- "PBMC8_panel.xlsx"
download.file(paste0(url, "/", panel_filename), destfile = panel_filename, 
  mode = "wb")
panel <- read_excel(panel_filename)
##   Metal Isotope Antigen Lineage Functional
## 1    Cd 110:114     CD3       1          0
## 2    In     115    CD45       1          0
## 3    La     139     BC1       0          0
## 4    Pr     141     BC2       0          0
## 5    Nd     142   pNFkB       0          1
## 6    Nd     144    pp38       0          1
# Replace problematic characters 
panel$Antigen <- gsub("-", "_", panel$Antigen)

panel_fcs <- pData(parameters(fcs_raw[[1]]))
##               name        desc   range  minRange maxRange
## $P1           Time        Time 2377271   0.00000  2377270
## $P2    Cell_length Cell_length      66   0.00000       65
## $P3 CD3(110:114)Dd         CD3    1212 -13.66756     1211
## $P4  CD45(In115)Dd        CD45    2654   0.00000     2653
## $P5   BC1(La139)Dd         BC1   13357   0.00000    13356
## $P6   BC2(Pr141)Dd         BC2      39 -66.97583       38
# Replace problematic characters 
panel_fcs$desc <- gsub("-", "_", panel_fcs$desc)

# Lineage markers
(lineage_markers <- panel$Antigen[panel$Lineage == 1])
##  [1] "CD3"    "CD45"   "CD4"    "CD20"   "CD33"   "CD123"  "CD14"   "IgM"   
##  [9] "HLA_DR" "CD7"
# Functional markers
(functional_markers <- panel$Antigen[panel$Functional == 1])
##  [1] "pNFkB"  "pp38"   "pStat5" "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3"
##  [9] "pSlp76" "pBtk"   "pPlcg2" "pErk"   "pLat"   "pS6"
# Spot checks
all(lineage_markers %in% panel_fcs$desc)
## [1] TRUE
all(functional_markers %in% panel_fcs$desc)
## [1] TRUE

Table of Contents
Diagnostic plots
Data import

5 Data transformation

Usually, the raw marker intensities read by a cytometer have strongly skewed distributions with varying ranges of expression, thus making it difficult to distinguish between the negative and positive cell populations. It is common practice to transform CyTOF marker intensities using, for example, arcsinh (hyperbolic inverse sine) with cofactor 5 (Bendall et al. 2011 Figure S2; Bruggner et al. 2014) to make the distributions more symmetric and to map them to a comparable range of expression, which is important for clustering. A cofactor of 150 has been promoted for flow cytometry, but users are free to implement alternative transformations, some of which are available from the transform function of the flowCore package. In the following step, we include only those channels that correspond to the lineage and functional markers. We also rename the columns in the flowSet to the antigen names from panel$desc.

## arcsinh transformation and column subsetting
fcs <- fsApply(fcs_raw, function(x, cofactor = 5){
  colnames(x) <- panel_fcs$desc
  expr <- exprs(x)
  expr <- asinh(expr[, c(lineage_markers, functional_markers)] / cofactor)
  exprs(x) <- expr
## A flowSet with 16 experiments.
##   column names:
##   CD3 CD45 CD4 CD20 CD33 CD123 CD14 IgM HLA_DR CD7 pNFkB pp38 pStat5 pAkt pStat1 pSHP2 pZap70 pStat3 pSlp76 pBtk pPlcg2 pErk pLat pS6

For some of the further analysis, it is more convenient for us to work using a matrix (called expr) that contains marker expression for cells from all samples. We create such a matrix with the fsApply function that extracts the expression matrices (function exprs) from each element of the flowSet object, and by default, concatenates them into one matrix.

## Extract expression
expr <- fsApply(fcs, exprs)
## [1] 172791     24

As the ranges of marker intensities can vary substantially, we apply another transformation that scales expression of all markers to values between 0 and 1 using low (e.g. 1%) and high (e.g. 99%) percentiles as the boundary. This additional transformation of the arcsinh-transformed data can sometimes give better representation of relative differences in marker expression between annotated cell populations, however, it is only used here for visualization.

rng <- colQuantiles(expr, probs = c(0.01, 0.99))
expr01 <- t((t(expr) - rng[, 1]) / (rng[, 2] - rng[, 1]))
expr01[expr01 < 0] <- 0
expr01[expr01 > 1] <- 1

Table of Contents
Marker ranking based on the non-redundancy score
Data transformation

6 Diagnostic plots

We propose some quick checks to verify whether the data we analyze globally represents what we expect; for example, whether samples that are replicates of one condition are more similar and are distinct from samples from another condition. Another important check is to verify that marker expression distributions do not have any abnormalities such as having different ranges or distinct distributions for a subset of the samples. This could highlight problems with the sample collection or HDCyto acquisition, or batch effects that were unexpected. Depending on the situation, one can then consider removing problematic markers or samples from further analysis; in the case of batch effects, a covariate column could be added to the metadata table and used below in the statistical analyses.

The step below generates a plot with per-sample marker expression distributions, colored by condition (see Figure 1). Here, we can already see distinguishing markers, such as pNFkB and CD20, between stimulated and unstimulated conditions.

## Generate sample IDs corresponding to each cell in the `expr` matrix
sample_ids <- rep(md$sample_id, fsApply(fcs_raw, nrow))

ggdf <- data.frame(sample_id = sample_ids, expr)
ggdf <- melt(ggdf, id.var = "sample_id", 
  value.name = "expression", variable.name = "antigen")
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- md$condition[mm]

ggplot(ggdf, aes(x = expression, color = condition, 
  group = sample_id)) +
  geom_density() +
  facet_wrap(~ antigen, nrow = 4, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
    strip.text = element_text(size = 7), axis.text = element_text(size = 5)) +
  scale_color_manual(values = color_conditions)
Per-sample smoothed densities of marker expression (arcsinh-transformed) of 10 lineage markers and 14 functional markers in the PBMC dataset. Two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) for each of the 8 healthy donors are presented and colored by experimental condition.

Figure 1: Per-sample smoothed densities of marker expression (arcsinh-transformed) of 10 lineage markers and 14 functional markers in the PBMC dataset. Two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) for each of the 8 healthy donors are presented and colored by experimental condition.

Another spot check is the number of cells per sample (see Figure 2). This plot can be used as a guide together with other readouts to identify samples where not enough cells were assayed.

cell_table <- table(sample_ids)

ggdf <- data.frame(sample_id = names(cell_table), 
  cell_counts = as.numeric(cell_table))
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- md$condition[mm]

ggplot(ggdf, aes(x = sample_id, y = cell_counts, fill = condition)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = cell_counts), hjust=0.5, vjust=-0.5, size = 2.5) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_fill_manual(values = color_conditions, drop = FALSE) +
  scale_x_discrete(drop = FALSE)
Barplot showing the number of cells measured for each sample in the PBMC dataset. Bars are colored by experimental condition: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL). Numbers in the names on the x-axis indicate patient IDs. Numbers on top of the bars indicate the cell counts.

Figure 2: Barplot showing the number of cells measured for each sample in the PBMC dataset. Bars are colored by experimental condition: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL). Numbers in the names on the x-axis indicate patient IDs. Numbers on top of the bars indicate the cell counts.

Table of Contents
Diagnostic plots

6.1 MDS plot

In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. Such plots show similarities between samples measured in an unsupervised way and give a sense of how much differential expression can be detected before conducting any formal tests. An MDS plot can be generated with the plotMDS function from the limma package. In transcriptomics, distances between samples are calculated based on the expression of the top varying genes. We propose a similar plot for HDCyto data using median marker expression over all cells to calculate dissimilarities between samples (other aggregations are also possible, and one could reduce the number of top varying markers to include in the calculation). Ideally, samples should cluster well within the same condition, although this depends on the magnitude of the difference between experimental conditions. With this diagnostic, one can identify the outlier samples and eliminate them if the circumstances warrant it.

In our MDS plot on median marker expression values (see Figure 3), we can see that the first dimension (MDS1) separates the unstimulated and stimulated samples reasonably well. The second dimension (MDS2) represents, to some degree, differences between patients. Most of the samples that originate from the same patient are placed at a similar point along the y-axis, for example, samples from patients 7, 5, and 8 are at the top of the plot, samples from patient 4 are located at the bottom of the plot. This also indicates that the marker expression of individual patients is driving similarity and perhaps should be formally accounted for in the downstream statistical modeling.

# Get the median marker expression per sample
expr_median_sample_tbl <- data.frame(sample_id = sample_ids, expr) %>%
  group_by(sample_id) %>% summarize_all(funs(median))

expr_median_sample <- t(expr_median_sample_tbl[, -1])
colnames(expr_median_sample) <- expr_median_sample_tbl$sample_id

mds <- plotMDS(expr_median_sample, plot = FALSE)

ggdf <- data.frame(MDS1 = mds$x, MDS2 = mds$y, 
  sample_id = colnames(expr_median_sample))
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- md$condition[mm]

ggplot(ggdf, aes(x = MDS1, y = MDS2, color = condition)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_label_repel(aes(label = sample_id)) +
  theme_bw() +
  scale_color_manual(values = color_conditions) +
MDS plot for the unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) samples obtained for each of the 8 healthy donors in the PBMC dataset. Calculations are based on the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample.  Distances between samples in the plot approximate the typical change in medians. Numbers in the label names indicate patient IDs.

Figure 3: MDS plot for the unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) samples obtained for each of the 8 healthy donors in the PBMC dataset. Calculations are based on the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample. Distances between samples in the plot approximate the typical change in medians. Numbers in the label names indicate patient IDs.

In contrast to genomic applications, the number of variables measured for each sample is much lower in HDCyto data. In the former, thousands of genes are surveyed, whereas in the latter, ~20-50 antigens are typically targeted. Similar to the MDS plot above, a heatmap of the same data also gives insight into the structure of the data. The heatmap shows median marker intensities with clustered columns (samples) and rows (markers). We have used hierarchical clustering with average linkage and euclidean distance, but also Ward’s linkage could be used (Bruggner et al. 2014), and in CyTOF applications, a cosine distance shows good performance (Bendall et al. 2014). In this plot, we can see which markers drive the observed clustering of samples (see Figure 4).

As with the MDS plot, the dendrogram separates the reference and stimulated samples very well. Also, similar groupings of patients within experimental conditions are observed (patients 1-2, 5-7-8 and 3-4 are together in both conditions).


# Column annotation for the heatmap
mm <- match(colnames(expr_median_sample), md$sample_id)
annotation_col <- data.frame(condition = md$condition[mm],
  row.names = colnames(expr_median_sample))
annotation_colors <- list(condition = color_conditions)

# Colors for the heatmap
color <- colorRampPalette(brewer.pal(n = 9, name = "YlGnBu"))(100)

pheatmap(expr_median_sample, color = color, display_numbers = TRUE, 
  number_color = "black", fontsize_number = 5, annotation_col = annotation_col, 
  annotation_colors = annotation_colors, clustering_method = "average")
Heatmap of the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample in the PBMC dataset. Color-coded with yellow for lower expression and blue for higher expression. The numbers in the heatmap represent the actual expression values. Dendrograms present clustering of samples (columns) and markers (rows) which is based on hierarchical clustering with Euclidean distance metric and average linkage. The two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) for each of the 8 healthy donors are presented with a bar colored by experimental condition on top of the heatmap. Numbers in the column label names indicate patient IDs.

Figure 4: Heatmap of the median (arcsinh-transformed) marker expression of 10 lineage markers and 14 functional markers across all cells measured for each sample in the PBMC dataset. Color-coded with yellow for lower expression and blue for higher expression. The numbers in the heatmap represent the actual expression values. Dendrograms present clustering of samples (columns) and markers (rows) which is based on hierarchical clustering with Euclidean distance metric and average linkage. The two conditions: unstimulated (Ref) and stimulated with BCR/FcR-XL (BCRXL) for each of the 8 healthy donors are presented with a bar colored by experimental condition on top of the heatmap. Numbers in the column label names indicate patient IDs.

Table of Contents
Cell population identification with FlowSOM and ConsensusClusterPlus
Diagnostic plots

7 Marker ranking based on the non-redundancy score

In this step, we identify the ability of markers to explain the variance observed in each sample. In particular, we calculate the PCA-based non-redundancy score (NRS) (Levine et al. 2015). Markers with higher score explain a larger portion of variability present in a given sample.

The average NRS can be used to select a subset of markers that are non-redundant in each sample but at the same time capture the overall diversity between samples. Such a subset of markers can be then used for cell population identification analysis (i.e. clustering). We note that there is no precise rule on how to choose the right cutoff for marker inclusion, but one of the approaches is to select a suitable number of the top-scoring markers. The number can be chosen by analyzing the plot with the NR scores (see Figure 5), where the markers are sorted by the decreasing average NRS. Based on the prior biological knowledge, one can refine the marker selection and drop out markers that are not likely to distinguish cell populations of interest, even if they have high scores, and add in markers with low scores but known to be important in discerning cell subgroups (Levine et al. 2015). Thus, the NRS analysis serves more as a guide to marker selection and is not meant as a hardcoded rule.

In the dataset considered here (Bodenmiller et al. 2012; Bruggner et al. 2014) we want to use all the 10 lineage markers, so there is no explicit need to restrict the set of cell surface markers, and the NRS serve as another quality control step. There may be other situations where this feature selection step would be of interest, for example, in panel design (Levine et al. 2015).

## Define a function that calculates the NRS per sample 
NRS <- function(x, ncomp = 3){
  pr <- prcomp(x, center = TRUE, scale. = FALSE) 
  score <- rowSums(outer(rep(1, ncol(x)), 
    pr$sdev[1:ncomp]^2) * abs(pr$rotation[,1:ncomp]))

## Calculate the score
nrs_sample <- fsApply(fcs[, lineage_markers], NRS, use.exprs = TRUE)
rownames(nrs_sample) <- md$sample_id
nrs <- colMeans(nrs_sample, na.rm = TRUE)

## Plot the NRS for ordered markers
lineage_markers_ord <- names(sort(nrs, decreasing = TRUE))
nrs_sample <- data.frame(nrs_sample)
nrs_sample$sample_id <- rownames(nrs_sample)

ggdf <- melt(nrs_sample, id.var = "sample_id", 
  value.name = "nrs", variable.name = "antigen")

ggdf$antigen <- factor(ggdf$antigen, levels = lineage_markers_ord)
mm <- match(ggdf$sample_id, md$sample_id)
ggdf$condition <- md$condition[mm]

ggplot(ggdf, aes(x = antigen, y = nrs)) +
  geom_point(aes(color = condition), alpha = 0.9, 
    position = position_jitter(width = 0.3, height = 0)) +
  geom_boxplot(outlier.color = NA, fill = NA) +
  stat_summary(fun.y = "mean", geom = "point", shape = 21, fill = "white") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_color_manual(values = color_conditions)
Non-redundancy scores for each of the 10 lineage markers and all samples in the PBMC dataset. The full points represent the per-sample NR scores (colored by experimental conditions), while empty black circles indicate the mean NR scores from all the samples. Markers on the x-axis are sorted according to the decreasing average NRS.

Figure 5: Non-redundancy scores for each of the 10 lineage markers and all samples in the PBMC dataset. The full points represent the per-sample NR scores (colored by experimental conditions), while empty black circles indicate the mean NR scores from all the samples. Markers on the x-axis are sorted according to the decreasing average NRS.

Table of Contents
Differential analysis
Marker ranking based on the non-redundancy score

8 Cell population identification with FlowSOM and ConsensusClusterPlus

Cell population identification typically has been carried out by manual gating, a method based on visual inspection of a series of two-dimensional scatterplots. At each step, a subset of cells, either positive or negative for the two visualized markers, is selected and further stratified in the subsequent iterations until populations of interest across a range of marker combinations are captured. However, manual gating has drawbacks, such as subjectivity, bias toward well-known cell types, and inefficiency when analyzing large datasets, which also contribute to a lack of reproducibility (Saeys, Gassen, and Lambrecht 2016).

Considerable effort has been made to improve and automate cell population identification, such as unsupervised clustering (Aghaeepour et al. 2013). However, not all methods scale well in terms of performance and speed from the lower dimensionality flow cytometry data to the higher dimensionality mass cytometry data (Weber and Robinson 2016), since clustering in higher dimensions can suffer the “curse of dimensionality”.

Beside the mathematical and algorithmic challenges of clustering, cell population identification may be difficult due to the chemical and biological aspects of the cytometry experiment itself. Therefore, caution should be taken when designing panels aimed at detecting rare cell populations by assigning higher sensitivity metals to rare markers. The right choice of a marker panel used for clustering can also be important. It should include all markers that are relevant for cell type identification.

In this workflow, we conduct cell clustering with FlowSOM (Van Gassen et al. 2015) and ConsensusClusterPlus (Wilkerson and Hayes 2010), which appeared amongst the fastest and best performing clustering approaches in a recent study of HDCyto datasets (Weber and Robinson 2016). This ensemble showed strong performance in detecting both high and low frequency cell populations and is one of the fastest methods to run, which enables its interactive usage. We use a slight modification of the original workflow presented in the FlowSOM vignette, which we find more flexible. In particular, we directly call the ConsensusClusterPlus function that is embedded in metaClustering_consensus. Thus, we are able to access all the functionality of the ConsensusClusterPlus package to identify the number of clusters.

The FlowSOM workflow consists of three main steps. First, a self-organizing map (SOM) is built using the BuildSOM function, where cells are assigned according to their similarities to 100 (by default) grid points (or, so-called codebook vectors or codes) of the SOM. The building of a minimal spanning tree, which is mainly used for graphical representation of the clusters, is skipped in this pipeline. And finally, metaclustering of the SOM codes, is performed directly with the ConsensusClusterPlus function. Additionally, we add an optional round of manual expert-based merging of the metaclusters and allow this to be done in a reproducible fashion.

FlowSOM output can be sensitive to random starts (Weber and Robinson 2016). To make results reproducible, one must specify the seed for the random number generation in R using function set.seed. It is also advisable to rerun analyses with multiple random seeds, for two reasons. First, one can see how robust the detected clusters are, and second, when the goal is to find smaller cell populations, it may happen that, in some runs, random starting points do not represent rare cell populations, as the chance of selecting starting cells from them is low and they are merged into a larger cluster.

It is important to point out that we cluster all cells from all samples together. This strategy is beneficial, since we directly obtain cluster assignment for each cell, we label cell populations only once and the mapping of cell types between samples is automatically consistent. For a list of alternative approaches and their advantages and disadvantages, please refer to the Discussion section, where we consider: clustering per sample, clustering of data from different measurement batches and down-sampling in case of widely varying numbers of cells per sample.

In our analysis, cell populations are identified using only the 10 lineage markers as defined in the BuildSOM function with the colsToUse argument.

fsom <- ReadInput(fcs, transform = FALSE, scale = FALSE)
som <- BuildSOM(fsom, colsToUse = lineage_markers)
## Get the cell clustering into 100 SOM codes
cell_clustering_som <- som$map$mapping[,1]

Automatic approaches for selecting the number of clusters in HDCyto data do not always succeed (Weber and Robinson 2016). In general, we therefore recommend some level of over-clustering, and if desired, manual merging of clusters. Such a hierarchical approach is especially suited when the goal is to detect smaller cell populations.

The SPADE analysis performed by Bodenmiller et al. (Bodenmiller et al. 2012) identified 6 main cell types (T-cells, monocytes, dendritic cells, B-cells, NK cells and surface- cells) that were further stratified into 14 more specific subpopulations (CD4+ T-cells, CD8+ T-cells, CD14+ HLA-DR high monocytes, CD14+ HLA-DR med monocytes, CD14+ HLA-DR low monocytes, CD14- HLA-DR high monocytes, CD14- HLA-DR med monocytes, CD14- HLA-DR low monocytes, dendritic cells, IgM+ B-cells, IgM- B-cells, NK cells, surface- CD14+ cells and surface- CD14- cells). In our analysis, we are interested in identifying the 6 main PBMC populations, including: CD4+ T-cells, CD8+ T-cells, monocytes, dendritic cells, NK cells and B-cells. Following the concept of over-clustering we perform the metaclustering of the (by default) 100 SOM codes into more than expected number of groups. For example, stratification into 20 groups should give enough resolution to detect these main clusters. We can explore the clustering in a wide variety of visualizations: t-SNE plots, heatmaps and a plot generated by ConsensusClusterPlus called “delta area”.

When the interest is in studying more specific subpopulations at higher detail, one can follow a strategy of reclustering as described in the Obtaining higher resolution section, where we propose to repeat the workflow (clustering and differential analyses) after gating out a selected cell type (e.g. one of the large populations).

We call ConsensusClusterPlus with maximum number of clusters maxK = 20 and other clustering parameters set to the values as in the metaClustering_consensus function from the FlowSOM package. Again, to ensure that the analyses are reproducible, we define the random seed.

## Metaclustering into 20 clusters with ConsensusClusterPlus

codes <- som$map$codes
plot_outdir <- "consensus_plots"
nmc <- 20

mc <- ConsensusClusterPlus(t(codes), maxK = nmc, reps = 100, 
  pItem = 0.9, pFeature = 1, title = plot_outdir, plot = "png", 
  clusterAlg = "hc", innerLinkage = "average", finalLinkage = "average", 
  distance = "euclidean", seed = 1234)

## Get cluster ids for each cell
code_clustering1 <- mc[[nmc]]$consensusClass
cell_clustering1 <- code_clustering1[cell_clustering_som]

We can then investigate characteristics of identified clusters with heatmaps that illustrate median marker expression in each cluster (see Figure 6). As the range of marker expression can vary substantially from marker to marker, we use the 0-1 transformed data (expr01) for some visualizations. However, to stay consistent with FlowSOM and ConsensusClusterPlus, we use the (arcsinh-transformed) unscaled data (expr) to generate the dendrogram of the hierarchical structure of metaclusters.

Instead of using only medians, which do not give a full representation of cluster specifics, one can plot the entire marker expression distribution in each cluster (see Figure 7). Such a plot gives more detailed profile of each cluster, but represents an increase in the amount of information to interpret. Heatmaps give the overall overview of clusters, are quicker and easier to interpret, and together with the dendrogram can be a good basis for further cluster merging (see Cluster merging and annotation section).

Since we will use the heatmap and density plots again later on in this workflow, in code chunks below, we create wrapper functions that generate these two types of plots.

# Define cluster colors (here there are 30 colors)
color_clusters <- c("#DC050C", "#FB8072", "#1965B0", "#7BAFDE", "#882E72", 
  "#B17BA6", "#FF7F00", "#FDB462", "#E7298A", "#E78AC3", 
  "#33A02C", "#B2DF8A", "#55A1B1", "#8DD3C7", "#A6761D", 
  "#E6AB02", "#7570B3", "#BEAED4", "#666666", "#999999", 
  "#aa8282", "#d4b7b7", "#8600bf", "#ba5ce3", "#808000", 
  "#aeae5c", "#1e90ff", "#00bfff", "#56ff0d", "#ffff00")
plot_clustering_heatmap_wrapper <- function(expr, expr01, 
  cell_clustering, color_clusters, cluster_merging = NULL){
  # Calculate the median expression
  expr_median <- data.frame(expr, cell_clustering = cell_clustering) %>%
    group_by(cell_clustering) %>% summarize_all(funs(median))
  expr01_median <- data.frame(expr01, cell_clustering = cell_clustering) %>%
    group_by(cell_clustering) %>% summarize_all(funs(median))
  # Calculate cluster frequencies
  clustering_table <- as.numeric(table(cell_clustering))
  clustering_prop <- round(clustering_table / sum(clustering_table) * 100, 2)
  # Sort the cell clusters with hierarchical clustering
  d <- dist(expr_median[, colnames(expr)], method = "euclidean")
  cluster_rows <- hclust(d, method = "average")
  expr_heat <- as.matrix(expr01_median[, colnames(expr01)])
  rownames(expr_heat) <- expr01_median$cell_clustering
  # Colors for the heatmap
  color_heat <- colorRampPalette(rev(brewer.pal(n = 9, name = "RdYlBu")))(100)
  legend_breaks = seq(from = 0, to = 1, by = 0.2)
  labels_row <- paste0(expr01_median$cell_clustering, " (", clustering_prop , 
  # Annotation for the original clusters
  annotation_row <- data.frame(Cluster = factor(expr01_median$cell_clustering))
  rownames(annotation_row) <- rownames(expr_heat)
  color_clusters1 <- color_clusters[1:nlevels(annotation_row$Cluster)]
  names(color_clusters1) <- levels(annotation_row$Cluster)
  annotation_colors <- list(Cluster = color_clusters1)
  # Annotation for the merged clusters
    cluster_merging$new_cluster <- factor(cluster_merging$new_cluster)
    annotation_row$Cluster_merging <- cluster_merging$new_cluster 
    color_clusters2 <- color_clusters[1:nlevels(cluster_merging$new_cluster)]
    names(color_clusters2) <- levels(cluster_merging$new_cluster)
    annotation_colors$Cluster_merging <- color_clusters2
  pheatmap(expr_heat, color = color_heat, cluster_cols = FALSE, 
    cluster_rows = cluster_rows, labels_row = labels_row, 
    display_numbers = TRUE, number_color = "black", 
    fontsize = 8, fontsize_number = 6, legend_breaks = legend_breaks,
    annotation_row = annotation_row, annotation_colors = annotation_colors)

plot_clustering_heatmap_wrapper(expr = expr[, lineage_markers_ord], 
  expr01 = expr01[, lineage_markers_ord], 
  cell_clustering = cell_clustering1, color_clusters = color_clusters)