
1 Introduction

optimalFlow is a package dedicated to applying optimal-transport techniques to supervised flow cytometry gating based on the results in del Barrio et al. (2019).

We provide novel methods for grouping (clustering) gated cytometries. Clustering a set of cytometries produces groups (clusters) of cytometries that have lower variability than the whole collection, which in turn can greatly improve the performance of any supervised learning procedure. Once we have a partition (clustering) of a collection of cytometries, we provide several methods for obtaining an artificial cytometry (prototype, template) that represents, in some optimal way, the cytometries in each group. These prototypes can be used, among other things, for matching populations between different cytometries. Moreover, a procedure able to group similar cytometries could help to detect individuals who share a particular condition, for instance some kind of disease.

optimalFlowTemplates is our procedure for clustering cytometries and obtaining templates. It is based on recent developments in the field of optimal transport, such as a similarity distance between clusterings, and on barycenters (Fréchet means) and k-barycenters of probability distributions.
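To give some intuition for the optimal-transport tools mentioned above, the following minimal sketch computes the 2-Wasserstein distance between two multivariate normal distributions, for which a closed form is known. It uses base R only. The helper names sqrtm_psd and w2_gaussian are hypothetical and are not part of optimalFlow; the sketch illustrates one building block and should not be read as the package's internal implementation of the similarity distance.

```r
# Hypothetical helper: square root of a symmetric positive semi-definite matrix
sqrtm_psd <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  vals <- sqrt(pmax(e$values, 0))  # clip tiny negative eigenvalues from rounding
  e$vectors %*% diag(vals, nrow = length(vals)) %*% t(e$vectors)
}

# Hypothetical helper: 2-Wasserstein distance between N(m1, S1) and N(m2, S2)
# W2^2 = |m1 - m2|^2 + tr(S1 + S2 - 2 * (S1^{1/2} S2 S1^{1/2})^{1/2})
w2_gaussian <- function(m1, S1, m2, S2) {
  r1 <- sqrtm_psd(S1)
  cross <- sqrtm_psd(r1 %*% S2 %*% r1)
  w2_squared <- sum((m1 - m2)^2) + sum(diag(S1 + S2 - 2 * cross))
  sqrt(max(w2_squared, 0))
}

# Same covariance, shifted mean: the distance reduces to the Euclidean shift
w2_gaussian(c(0, 0), diag(2), c(3, 4), diag(2))  # 5
```

When the covariances coincide, only the distance between the means remains; differences in shape add the trace term.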

We introduce optimalFlowClassification, a supervised classification tool for the case when a database of gated cytometries is available. The procedure uses the prototypes obtained by optimalFlowTemplates on the database. These are used to initialize tclust, a robust extension of k-means that allows for non-spherical shapes, for gating a new cytometry (see Garcia-Escudero et al. (2008)). By using a similarity distance between the best clustering obtained by tclust and the artificial cytometries provided by optimalFlowTemplates, we can assign the new cytometry to the most similar template (and the respective group of cytometries). We provide several options for assigning cell types to the new cytometry using the most relevant information, represented by the assigned template and the respective cluster of cytometries.

2 Installation

Installation procedure:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("optimalFlow")

3 optimalFlowTemplates

library(optimalFlowData)
library(optimalFlow)
library(ellipse)

We start by providing a database of gated cytometries. In this case we select as a learning set 15 cytometries of healthy individuals, from the data provided in optimalFlowData. We will use Cytometry1 to test the results of our procedures. For simplicity and for the sake of a good visualisation we will select only some of the cell types, in particular a subset of 4 cell types.

database <- buildDatabase(
  dataset_names = paste0('Cytometry', c(2:5, 7:9, 12:17, 19, 21)),
  population_ids = c('Monocytes', 'CD4+CD8-', 'Mature SIg Kappa', 'TCRgd-')
)

Then we apply optimalFlowTemplates to obtain a clustering of the database and a template cytometry for each group.

templates.optimalFlow <-
  optimalFlowTemplates(
    database = database
    )

When optimalFlowTemplates is run in its default mode, it produces a plot like the one in the figure below and then asks how many clusters to look for.

Figure 0: learning clustering

From the plot it seems reasonable to look for 5 clusters of cytometries, so we could type 5 and press enter; the procedure then returns a clustering of the learning database and the respective templates. Since this interaction is hard to show in a vignette, an equivalent approach is to execute the command below, where we ask for 5 clusters directly.

templates.optimalFlow <-
  optimalFlowTemplates(
    database = database, templates.number = 5, cl.paral = 1
    )
## [1] "step 1: 9.33483386039734 secs"
## [1] "step 2: 2.39247488975525 secs"
## [1] "Execution time: 11.7278881072998 secs"
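The idea of inspecting a plot and then fixing the number of clusters can be illustrated on toy data with base R. The sketch below assumes a hierarchical clustering built from pairwise distances; this is an assumption made for illustration and not a description of what optimalFlowTemplates does internally. The data in x are invented.

```r
# Toy one-dimensional data with three visually separated groups (invented values)
x <- c(0, 0.1, 0.2, 5, 5.1, 10)

# Pairwise distances and a complete-linkage hierarchical clustering
d <- dist(x)
tree <- hclust(d, method = "complete")

# Interactive route: inspect the dendrogram and decide on the number of groups
# plot(tree)

# Non-interactive route: ask directly for a fixed number of groups
groups <- cutree(tree, k = 3)
groups
```

Fixing k in the call plays the same role as templates.number above: it replaces the interactive choice with an explicit one, which keeps scripts reproducible.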

Now let us look at what optimalFlowTemplates returns. The entry templates contains the artificial cytometries, viewed as mixtures of multivariate normal distributions, one for each cluster of the cytometries in the database argument.

length(templates.optimalFlow$templates) # The number of clusters, and, hence, of templates 
## [1] 5
length(templates.optimalFlow$templates[[1]]) # The number of elements of the first template: one per cell type, four here
## [1] 4
templates.optimalFlow$templates[[1]][[1]] # The first element of the first template
## $mean
##  [1] 2192.494 3810.563 6952.128 5639.010 5384.128 2326.237 5922.809 2433.990
##  [9] 1616.252 1213.455
## 
## $cov
##              [,1]      [,2]        [,3]        [,4]       [,5]       [,6]
##  [1,] 266639.3056 -22292.14   -239.0103    535.9488   1567.078   6667.185
##  [2,] -22292.1404 836892.17 139258.7106 -11083.0742 -26847.427 -85912.718
##  [3,]   -239.0103 139258.71  84833.6993   2831.9105  -1760.813 -19372.852
##  [4,]    535.9488 -11083.07   2831.9105  16117.5767   9840.943   9733.885
##  [5,]   1567.0784 -26847.43  -1760.8132   9840.9435  17040.124  19173.992
##  [6,]   6667.1845 -85912.72 -19372.8520   9733.8847  19173.992 260893.832
##  [7,] -10530.5174 -12155.85 -12970.2311   8010.5740  11112.548  24739.592
##  [8,]   6396.7471 -53880.87 -11968.9063   7805.0808  10494.817  63602.554
##  [9,]   3686.7700 -83687.89 -17876.9602  13542.7223  17424.684  43574.279
## [10,]   2391.6169 -61685.79 -13296.8697  13933.2554  17777.542  40569.455
##             [,7]       [,8]      [,9]      [,10]
##  [1,] -10530.517   6396.747   3686.77   2391.617
##  [2,] -12155.854 -53880.868 -83687.89 -61685.794
##  [3,] -12970.231 -11968.906 -17876.96 -13296.870
##  [4,]   8010.574   7805.081  13542.72  13933.255
##  [5,]  11112.548  10494.817  17424.68  17777.542
##  [6,]  24739.592  63602.554  43574.28  40569.455
##  [7,] 135789.172   7525.035  21765.08  20272.825
##  [8,]   7525.035 153027.526  25157.58  26236.325
##  [9,]  21765.079  25157.581  76070.14  43082.883
## [10,]  20272.825  26236.325  43082.88  51273.104
## 
## $weight
## [1] 0.5845872
## 
## $type
## [1] "CD4+CD8-"
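Since each template is a list of components with entries mean, cov, weight and type, as in the output above, a compact overview can be built with base R. The following sketch works on a hand-made toy template with invented values, so it runs without the package; the same pattern should apply to templates.optimalFlow$templates[[1]], assuming it has the structure printed above.

```r
# Toy template mimicking the printed structure (all values are invented)
toy_template <- list(
  list(mean = c(1, 2), cov = diag(2), weight = 0.6, type = "CD4+CD8-"),
  list(mean = c(4, 1), cov = 2 * diag(2), weight = 0.4, type = "Monocytes")
)

# One row per component: cell type, mixture weight and number of markers
template_summary <- data.frame(
  type = vapply(toy_template, function(comp) comp$type, character(1)),
  weight = vapply(toy_template, function(comp) comp$weight, numeric(1)),
  n_dims = vapply(toy_template, function(comp) length(comp$mean), integer(1)),
  stringsAsFactors = FALSE
)
template_summary
```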

The entry clustering contains the cluster label assigned to each cytometry in the database argument.

templates.optimalFlow$clustering
##  [1] 1 2 3 3 3 3 4 4 4 1 2 3 5 5 5
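The label vector can be turned into a list of members per cluster. The sketch below copies the labels printed above and the dataset names passed to buildDatabase, so it runs without the package. It assumes that the order of the labels follows the order of dataset_names; this is an assumption and should be checked against the actual database.

```r
# Labels copied from the output above
clustering <- c(1, 2, 3, 3, 3, 3, 4, 4, 4, 1, 2, 3, 5, 5, 5)

# Names as passed to buildDatabase (assumed to be in the same order as the labels)
dataset_names <- paste0("Cytometry", c(2:5, 7:9, 12:17, 19, 21))

# Number of cytometries in each cluster
cluster_sizes <- table(clustering)

# Members of each cluster, under the ordering assumption stated above
members <- split(dataset_names, clustering)
members[["3"]]
```

Under that assumption, cluster 3 is the largest group, with five cytometries.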

The entry database.elliptical is a list with one element per cytometry in the database, each viewed as a mixture distribution.

length(templates.optimalFlow$database.elliptical) # the number of elements in the database
## [1] 15
length(templates.optimalFlow$database.elliptical[[1]]) # the number of cell types in the first element of the database
## [1] 4
templates.optimalFlow$database.elliptical[[1]][[1]] # the parameters corresponding to the first cell type in the first cytometry of the database 
## $mean
## CD19/TCRgd:PE Cy7-A LOGICAL       CD38:APC H7-A LOGICAL 
##                    2143.244                    3769.941 
##           CD3:APC-A LOGICAL       CD4+CD20:PB-A LOGICAL 
##                    6912.857                    5613.595 
##           CD45:PO-A LOGICAL       CD56+IgK:PE-A LOGICAL 
##                    5404.314                    2068.134 
##   CD5:PerCP Cy5-5-A LOGICAL      CD8+IgL:FITC-A LOGICAL 
##                    5784.786                    2403.888 
##                FSC-A LINEAR           SSC-A Exp-SSC Low 
##                    1578.943                    1226.318 
## 
## $cov
##                             CD19/TCRgd:PE Cy7-A LOGICAL CD38:APC H7-A LOGICAL
## CD19/TCRgd:PE Cy7-A LOGICAL                 304347.8532            -31430.672
## CD38:APC H7-A LOGICAL                       -31430.6717            786050.701
## CD3:APC-A LOGICAL                            -2056.8937            150481.538
## CD4+CD20:PB-A LOGICAL                          666.2252            -10919.205
## CD45:PO-A LOGICAL                             2933.2678            -34008.120
## CD56+IgK:PE-A LOGICAL                         7953.3654            -68622.841
## CD5:PerCP Cy5-5-A LOGICAL                    -7811.8852              5441.639
## CD8+IgL:FITC-A LOGICAL                        4737.6958            -42476.068
## FSC-A LINEAR                                  3825.7518            -87894.919
## SSC-A Exp-SSC Low                             3422.1236            -72960.311
##                             CD3:APC-A LOGICAL CD4+CD20:PB-A LOGICAL
## CD19/TCRgd:PE Cy7-A LOGICAL         -2056.894              666.2252
## CD38:APC H7-A LOGICAL              150481.538           -10919.2047
## CD3:APC-A LOGICAL                   93068.194             2462.9102
## CD4+CD20:PB-A LOGICAL                2462.910            15537.7342
## CD45:PO-A LOGICAL                   -4062.620             9567.5234
## CD56+IgK:PE-A LOGICAL              -11052.254             8858.6946
## CD5:PerCP Cy5-5-A LOGICAL           -6831.220             8039.8935
## CD8+IgL:FITC-A LOGICAL              -8281.683             7577.5080
## FSC-A LINEAR                       -21798.893            14235.2106
## SSC-A Exp-SSC Low                  -17098.563            15356.9021
##                             CD45:PO-A LOGICAL CD56+IgK:PE-A LOGICAL
## CD19/TCRgd:PE Cy7-A LOGICAL          2933.268              7953.365
## CD38:APC H7-A LOGICAL              -34008.120            -68622.841
## CD3:APC-A LOGICAL                   -4062.620            -11052.254
## CD4+CD20:PB-A LOGICAL                9567.523              8858.695
## CD45:PO-A LOGICAL                   18350.645             23190.956
## CD56+IgK:PE-A LOGICAL               23190.956            305557.028
## CD5:PerCP Cy5-5-A LOGICAL           10778.661             25521.650
## CD8+IgL:FITC-A LOGICAL              10729.306             36294.993
## FSC-A LINEAR                        20416.550             42905.671
## SSC-A Exp-SSC Low                   22260.189             46700.463
##                             CD5:PerCP Cy5-5-A LOGICAL CD8+IgL:FITC-A LOGICAL
## CD19/TCRgd:PE Cy7-A LOGICAL                 -7811.885               4737.696
## CD38:APC H7-A LOGICAL                        5441.639             -42476.068
## CD3:APC-A LOGICAL                           -6831.220              -8281.683
## CD4+CD20:PB-A LOGICAL                        8039.894               7577.508
## CD45:PO-A LOGICAL                           10778.661              10729.306
## CD56+IgK:PE-A LOGICAL                       25521.650              36294.993
## CD5:PerCP Cy5-5-A LOGICAL                  132092.612               2550.771
## CD8+IgL:FITC-A LOGICAL                       2550.771             152836.556
## FSC-A LINEAR                                22131.202              20286.385
## SSC-A Exp-SSC Low                           21582.260              25351.863
##                             FSC-A LINEAR SSC-A Exp-SSC Low
## CD19/TCRgd:PE Cy7-A LOGICAL     3825.752          3422.124
## CD38:APC H7-A LOGICAL         -87894.919        -72960.311
## CD3:APC-A LOGICAL             -21798.893        -17098.563
## CD4+CD20:PB-A LOGICAL          14235.211         15356.902
## CD45:PO-A LOGICAL              20416.550         22260.189
## CD56+IgK:PE-A LOGICAL          42905.671         46700.463
## CD5:PerCP Cy5-5-A LOGICAL      22131.202         21582.260
## CD8+IgL:FITC-A LOGICAL         20286.385         25351.863
## FSC-A LINEAR                   73060.828         50515.337
## SSC-A Exp-SSC Low              50515.337         59919.397
## 
## $weight
## [1] 0.6898461
## 
## $type
## [1] "CD4+CD8-"

In order to get some intuition about our methodology we are going to give some visual examples. Users can do it for their own data once they have applied optimalFlowTemplates.

We start with a two-dimensional representation of the cytometries in the cluster labelled 3. Since the cytometries in the database are gated, we know every cell type; moreover, we can consider every cytometry as a mixture of multivariate Gaussian distributions, which is stored in templates.optimalFlow$database.elliptical. The user just has to select the variables onto which the cytometries are projected, through the argument dimensions.

cytoPlotDatabase(
  templates.optimalFlow$database.elliptical[which(templates.optimalFlow$clustering == 3)],
  dimensions = c(4, 3), xlim = c(0, 8000), ylim = c(0, 8000), xlab = "", ylab = ""
)
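To see which markers the indices in dimensions refer to, one can look at the names of the mean vector of any component. The sketch below copies the marker names and the mean of the first component of the first cytometry from the output of database.elliptical shown earlier, so it runs without the package. That the first index is drawn on the horizontal axis is an assumption.

```r
# Marker names and mean values copied from the database.elliptical output above
marker_names <- c(
  "CD19/TCRgd:PE Cy7-A LOGICAL", "CD38:APC H7-A LOGICAL",
  "CD3:APC-A LOGICAL", "CD4+CD20:PB-A LOGICAL",
  "CD45:PO-A LOGICAL", "CD56+IgK:PE-A LOGICAL",
  "CD5:PerCP Cy5-5-A LOGICAL", "CD8+IgL:FITC-A LOGICAL",
  "FSC-A LINEAR", "SSC-A Exp-SSC Low"
)
mean_full <- c(2143.244, 3769.941, 6912.857, 5613.595, 5404.314,
               2068.134, 5784.786, 2403.888, 1578.943, 1226.318)
names(mean_full) <- marker_names

# The indices used in the plot call
dims <- c(4, 3)

# Markers selected by the indices, and the projected (marginal) mean
marker_names[dims]
mean_full[dims]
```

With dimensions = c(4, 3) the projection uses CD4+CD20:PB-A and CD3:APC-A. For a multivariate normal component, the projection is again normal, with the selected entries of the mean and the corresponding rows and columns of the covariance matrix.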

Black ellipses correspond to the cell type CD4+CD8- in each cytometry and enclose 95% of the probability mass of the respective projected normal distributions. Red ellipses correspond to Mature SIg Kappa, and so on.
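The meaning of a 95% ellipse can be made concrete with a short base R computation. For a bivariate normal distribution, the region enclosing 95% of the probability is the set of points whose squared Mahalanobis distance to the mean is at most the 0.95 quantile of a chi-squared distribution with 2 degrees of freedom. The sketch below uses the entries for dimensions 4 and 3 of the first component of the first template printed earlier; it is an illustration of the geometry and not the plotting code of the package.

```r
# Mean and covariance restricted to dimensions c(4, 3), copied from the
# first component of the first template shown earlier
m <- c(5639.010, 6952.128)
S <- matrix(c(16117.5767, 2831.9105,
              2831.9105, 84833.6993), nrow = 2, byrow = TRUE)

# Squared radius of the 95% region for a bivariate normal distribution
r2 <- qchisq(0.95, df = 2)

# Map the unit circle onto the ellipse boundary using S = L %*% t(L)
theta <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))
L <- t(chol(S))
boundary <- t(m + sqrt(r2) * (L %*% circle))

# Every boundary point has squared Mahalanobis distance r2 from the mean
d2 <- mahalanobis(boundary, center = m, cov = S)
range(d2 - r2)
```

Calling plot(boundary, type = "l") draws the ellipse.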

A three-dimensional plot of the same case is provided as a static image and can be obtained using the following code.

cytoPlotDatabase3d(
  templates.optimalFlow$database.elliptical[which(templates.optimalFlow$clustering == 3)],
  dimensions = c(4, 3, 9), xlim = c(0, 8000), ylim = c(0, 8000), zlim = c(0, 8000)
)
Figure 1: pooling database


optimalFlowTemplates provides a template cytometry for each cluster, stored in the entry templates. We now show how to visualize in two dimensions the consensus cytometry (the template) corresponding to cluster 3. Recall that the cytometries belonging to cluster 3 have been plotted above. The code is straightforward: we access templates in templates.optimalFlow and select the third element of the list, since we are interested in cluster 3.

cytoPlot(templates.optimalFlow$templates[[3]], dimensions = c(4,3), xlim = c(0, 8000), ylim = c(0, 8000), xlab = "", ylab = "")