Contents

1 Introduction

This vignette is centered around the application of pipeComp to scRNA-seq clustering pipelines, and assumes a general understanding of pipeComp (for an overview, see the pipeComp vignette).

The scRNAseq PipelineDefinition comes in two variants determined by the object used as a backbone, either SingleCellExperiment (SCE) or seurat (see ?scrna_pipeline). Both use the same evaluation metrics, and most method wrappers included in the package have been made so that they are compatible with both. For simplicity, we will therefore stick to just one variant here, and will focus on few very basic comparisons to illustrate the main functionalities, metrics and evaluation plots. For more detailed benchmarks, refer to our preprint:

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools
Pierre-Luc Germain, Anthony Sonrel & Mark D Robinson, bioRxiv 2020.02.02.930578



1.1 The PipelineDefinition

The PipelineDefinition can be obtained with the following function:

library(pipeComp)
# we use the variant of the pipeline used in the paper
pipDef <- scrna_pipeline(pipeClass = "seurat")
pipDef
## A PipelineDefinition object with the following steps:
##   - doublet(x, doubletmethod) * 
## Takes a SCE object with the `phenoid` colData column, passes it through the 
## function `doubletmethod`, and outputs a filtered SCE.
##   - filtering(x, filt) * 
## Takes a SCE object, passes it through the function `filt`, and outputs a 
## filtered Seurat object.
##   - normalization(x, norm) * 
## Passes the object through function `norm` and returns it with the normalized and scale data slots filled.
##   - selection(x, sel, selnb=2000)
## Returns a Seurat object with the VariableFeatures filled with `selnb` features
##  using the function `sel`.
##   - dimreduction(x, dr, maxdim=50) * 
## Returns a Seurat object with the PCA reduction with up to `maxdim` components
## using the `dr` function.
##   - clustering(x, clustmethod, dims=20, k=20, steps=8, resolution=c(0.01, 0.1, 0.5, 0.8, 1), min.size=50) * 
## Uses function `clustmethod` to return a named vector of cell clusters.

1.2 Example run

To illustrate the use of the pipeline, we will run a basic comparison using wrappers that are included in the package. However, in order for pipeComp not to systematically require the installation of all dependencies related to all methods for which there are wrappers, they were not included in the package code but rather as source files, which can be loaded in the following way:

source(system.file("extdata", "scrna_alternatives.R", package="pipeComp"))

(To know which packages are required by the set of wrappers you intend to use, see ?checkPipelinePackages)

Any function that has been loaded in the environment can then be used as alternative. We define a small set of alternatives to test:

alternatives <- list(
  doubletmethod=c("none"),
  filt=c("filt.lenient", "filt.stringent"),
  norm=c("norm.seurat", "norm.sctransform", "norm.scran"),
  sel=c("sel.vst"),
  selnb=2000,
  dr=c("seurat.pca"),
  clustmethod=c("clust.seurat"),
  dims=c(10, 15, 20, 30),
  resolution=c(0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 1.2, 2)
)

We also assume three datasets in (SCE) format (not included in the package - see the last section of this vignette) and run the pipeline:

# available on https://doi.org/10.6084/m9.figshare.11787210.v1
datasets <- c( mixology10x5cl="/path/to/mixology10x5cl.SCE.rds",
               simMix2="/path/to/simMix2.SCE.rds",
               Zhengmix8eq="/path/to/Zhengmix8eq.SCE.rds" )
# not run
res <- runPipeline( datasets, alternatives, pipDef, nthreads=3)

Instead of running the analyses here, we will load the final example results:

data("exampleResults", package = "pipeComp")
res <- exampleResults



2 Exploring the metrics

Benchmark metrics are organized according to the step at which they are computed, and will be presented here in this fashion. This does not mean that they are relevant only for that step: alternative parameters at a given step can also be evaluated with respect to the metrics defined in all downstream steps.

2.1 Doublet detection and cell filtering

The evaluation performed after the first two steps (doublet detection and filtering) is the same:

head(res$evaluation$filtering)
##          dataset doubletmethod           filt subpopulation N.before N.lost
## 1 mixology10x5cl          none   filt.lenient          A549     1244      0
## 2 mixology10x5cl          none   filt.lenient         H1975      515      1
## 3 mixology10x5cl          none   filt.lenient         H2228      751      0
## 4 mixology10x5cl          none   filt.lenient          H838      840      0
## 5 mixology10x5cl          none   filt.lenient        HCC827      568      0
## 6 mixology10x5cl          none filt.stringent          A549     1244     40
##   pc.lost
## 1   0.000
## 2   0.194
## 3   0.000
## 4   0.000
## 5   0.000
## 6   3.215

For each method and subpopulation, we report:

  • N.before the number of cells before the step
  • N.lost the number of cells excluded by the step
  • pc.lost the percentage of cells lost (relative to the supopulation)

As noted in our manuscript, stringent filtering can lead to strong bias against certain supopulations. We therefore especially monitor the max pc.lost of different methods in relation to the impact on clustering accuracy (privileging, at this step, metrics that are not dependent on the relative abundances of the subpopulations, such as the mean F1 score per subpopulation). This can conveniently be done using the following function:

scrna_evalPlot_filtering(res)

2.2 Evaluation based on the reduced space

Evaluations based on the reduced space are much more varied:

names(res$evaluation$dimreduction)
## [1] "silhouette"             "varExpl.subpops"        "corr.covariate"        
## [4] "meanAbsCorr.covariate2" "PC1.covar.adjR2"

2.2.1 Subpopulation silhouette

The silhouette slot contains information about the silhouettes width of true subpopulations. Depending on the methods used for dimensionality (i.e. fixed vs estimated number of dimensions), there will be a single output or outputs for different sets of dimensions, as it is the case in our example:

names(res$evaluation$dimreduction$silhouette)
## [1] "top_10_dims" "top_20_dims" "all_50_dims"

For each of them we have a data.frame including, for each subpopulation in each analysis (i.e. combination of parameters), the minimum, maximum, median and mean silhouette width:

head(res$evaluation$dimreduction$silhouette$top_10_dims)
##          dataset doubletmethod         filt             norm     sel selnb
## 1 mixology10x5cl          none filt.lenient      norm.seurat sel.vst  2000
## 2 mixology10x5cl          none filt.lenient      norm.seurat sel.vst  2000
## 3 mixology10x5cl          none filt.lenient      norm.seurat sel.vst  2000
## 4 mixology10x5cl          none filt.lenient      norm.seurat sel.vst  2000
## 5 mixology10x5cl          none filt.lenient      norm.seurat sel.vst  2000
## 6 mixology10x5cl          none filt.lenient norm.sctransform sel.vst  2000
##           dr maxdim subpopulation minSilWidth meanSilWidth medianSilWidth
## 1 seurat.pca     50          A549   0.3723309    0.6396069      0.6516611
## 2 seurat.pca     50         H1975  -0.6114923    0.1949538      0.2873076
## 3 seurat.pca     50         H2228  -0.1413490    0.4832060      0.4954937
## 4 seurat.pca     50          H838   0.3459593    0.5961926      0.6059980
## 5 seurat.pca     50        HCC827   0.2384042    0.5800197      0.5967093
## 6 seurat.pca     50          A549   0.1339476    0.6333684      0.6518073
##   maxSilWidth
## 1   0.7367229
## 2   0.4217019
## 3   0.6130284
## 4   0.7056178
## 5   0.6876504
## 6   0.7326905

This information can be plotted using the function scrna_evalPlot_silh; the function outputs a ComplexHeatmap, which means that many arguments of that package and options can be used, for instance:

library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.6.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##   genomic data. Bioinformatics 2016.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
scrna_evalPlot_silh( res )

h <- scrna_evalPlot_silh( res, heatmap_legend_param=list(direction="horizontal") )
draw(h, heatmap_legend_side="bottom", annotation_legend_side = "bottom", merge_legend=TRUE)

See ?scrna_evalPlot_silh for more options.

2.2.2 Variance in the PCs explained by the subpopulations

The slot varExpl.subpops indicates, for each analysis, the proportion of variance of each principal component explained by the true supopulations.

res$evaluation$dimreduction$varExpl.subpops[1:5,1:15]
##          dataset doubletmethod           filt             norm     sel selnb
## 1 mixology10x5cl          none   filt.lenient      norm.seurat sel.vst  2000
## 2 mixology10x5cl          none   filt.lenient norm.sctransform sel.vst  2000
## 3 mixology10x5cl          none   filt.lenient       norm.scran sel.vst  2000
## 4 mixology10x5cl          none filt.stringent      norm.seurat sel.vst  2000
## 5 mixology10x5cl          none filt.stringent norm.sctransform sel.vst  2000
##           dr maxdim      PC_1       PC_10       PC_11        PC_12        PC_13
## 1 seurat.pca     50 0.9631018 0.007716938 0.001831543 0.0015462150 0.0022304080
## 2 seurat.pca     50 0.9511425 0.007395324 0.002717310 0.0006250593 0.0118729485
## 3 seurat.pca     50 0.2369999 0.005925415 0.009554253 0.0055803311 0.0002206834
## 4 seurat.pca     50 0.9643754 0.007433402 0.002014857 0.0017917112 0.0026884483
## 5 seurat.pca     50 0.9586863 0.008118786 0.003153175 0.0005636794 0.0124421034
##          PC_14       PC_15
## 1 0.0007939924 0.002998006
## 2 0.0044840040 0.004285101
## 3 0.0006480071 0.000983346
## 4 0.0008178260 0.003457408
## 5 0.0033090652 0.003800675

2.2.3 Correlation with covariates

The following slots in res$evaluation$dimreduction track the correlation between principal components (PCs) and predefined cell-level covariates such as library size and number of detected genes: * corr.covariate contains the pearson correlation between the covariates and each PC; however, since there are major differences in library sizes between subpopulations, we advise against using this directly. * meanAbsCorr.covariate2 circumvents this bias by computing the mean absolute correlation (among the first 5 components) for each subpopulation, and averaging them. * PC1.covar.adjR2 gives the difference in adjusted R^2 between a model fit on PC1 containing the covariate along with subpopulations (PC1~subpopulation+covariate) and one without the covariate (PC1~subpopulation).

These metrics, as well as the following ones, can be plotted using generic pipeComp plotting functions, for example:

evalHeatmap(res, step="dimreduction", what="log10_total_features", 
            what2="meanAbsCorr.covariate2")

Since the output of these plotting functions are of class ComplexHeatmap, they can be combined:

evalHeatmap(res, step="dimreduction", what="log10_total_features", 
            what2="meanAbsCorr.covariate2") +
  evalHeatmap(res, step="dimreduction", what="log10_total_counts", 
            what2="meanAbsCorr.covariate2")

Alternatively, when the other arguments are shared, the following construction can also be used:

evalHeatmap( res, step="dimreduction", what2="meanAbsCorr.covariate2", 
             what=c("log10_total_features","log10_total_counts"), 
             row_title="mean(abs(correlation))\nwith covariate" )