1 Introduction

In this document, the user is presented with an analysis that DepecheR has been written to perform. There are lots of tweaks to this general outline, so the user is encouraged to read the help files for each function individually in addition. In cases where bugs are identified, feedback is most welcome, primarily on the github site github.com/theorell/DepecheR. Now let us get started.

2 Installation

This is how to install the package, if that has not already been done:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DepecheR")

# Example data description

The data used in this example is a semi-synthedic dataset, consisting of 1000
cytotoxic lymphocytes from each of 20 individuals. These have been categorized 
into two groups, and after this, alterations have been added to the sizes of 
some cell populations in both groups. This means that the groups can be 
separated based onthe sizes of certain cell types in the data. And this 
excersize will show how to identify these, and tell us what markers that define 
the separating cell types in question. 

Importantly, DepecheR does not provide any pre-processing tools, such as for
compensation/spectral unmixing of flow cytometry files. The clustering function
does have an internal algorithm to detect data with extreme tails, but this does
not circumvent the need to transform flow- or mass cytometry data. This can be 
done using either commercially available software or with R packages, such as 
Biocpkg("flowCore") or Biocpkg("flowVS").
library(DepecheR)
data('testData')
str(testData)
## 'data.frame':    20000 obs. of  16 variables:
##  $ ids   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SYK   : num  11.2 21.3 23.7 22.1 24.8 ...
##  $ FcER1g: num  15.4 19.8 18.7 17.9 28.2 ...
##  $ CD16  : num  27.4 23.7 17.9 19.2 19.8 ...
##  $ CD57  : num  6.45 5.55 9.65 17.41 86.38 ...
##  $ EAT.2 : num  21.2 18.2 23.6 39.6 25.9 ...
##  $ CD4   : num  80.7 82.6 88.2 89.6 12.7 ...
##  $ TCRgd : num  17.5 21.6 20.2 29 14.3 ...
##  $ CD8   : num  21.2 17.6 16.4 -11.4 78.3 ...
##  $ iCD3  : num  88.9 86.7 82.7 90.6 87.6 ...
##  $ NKG2C : num  21.2 56.3 36.5 62.9 21.5 ...
##  $ CD2   : num  43.2 64.6 73.6 14.7 75.6 ...
##  $ CD45RO: num  34.9 34.1 27.3 59.5 22.5 ...
##  $ CD3   : num  83.3 90.7 91.5 103.4 76.2 ...
##  $ CD56  : num  21.1 39.7 28.3 15.5 69.5 ...
##  $ label : int  0 0 0 0 0 0 0 0 0 0 ...

As can be noted here, the expected input format is either a dataframe or a matrix with cells as rows and markers/variables as columns. This is in accordance with the .fcs file convention. In this case, however, the different samples (coming from donors) should be added to the same dataframe, and a donor column should specify which cells that belong to which donor. This setup is chosen to make the software available to more different kinds of data.

3 depeche clustering

With the depeche clustering function, all necessary scaling and parameter selection is performed under the hood, so all we have to do, when we have the file of interest in the right format, is to run the function on the variables that we want to cluster on.

testDataDepeche <- depeche(testData[, 2:15])
## [1] "Files will be saved to ~/Desktop"
## [1] "As the dataset has less than 100 columns, peak centering is applied."
## [1] "Set 1 with 7 iterations completed in 14 seconds."
## [1] "Set 2 with 7 iterations completed in 6 seconds."
## [1] "Set 3 with 7 iterations completed in 6 seconds."
## [1] "The optimization was iterated 21 times."
str(testDataDepeche)
## List of 4
##  $ clusterVector     : int [1:20000] 2 2 2 2 6 3 5 2 1 1 ...
##  $ clusterCenters    : num [1:8, 1:14] 0 0 0 40.2 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:14] "SYK" "FcER1g" "CD16" "CD57" ...
##  $ essenceElementList:List of 8
##   ..$ 1: chr [1:3] "CD4" "NKG2C" "CD45RO"
##   ..$ 2: chr [1:5] "CD4" "iCD3" "NKG2C" "CD2" ...
##   ..$ 3: chr [1:3] "CD57" "CD8" "CD2"
##   ..$ 4: chr [1:10] "SYK" "FcER1g" "CD16" "CD57" ...
##   ..$ 5: chr [1:3] "CD57" "CD8" "CD45RO"
##   ..$ 6: chr [1:4] "CD57" "CD8" "CD2" "CD56"
##   ..$ 7: chr [1:6] "SYK" "FcER1g" "iCD3" "CD2" ...
##   ..$ 8: chr [1:4] "TCRgd" "CD2" "CD45RO" "CD56"
##  $ penaltyOptList    :List of 2
##   ..$ :'data.frame': 1 obs. of  2 variables:
##   .. ..$ bestPenalty: num 16
##   .. ..$ k          : num 30
##   ..$ :'data.frame': 11 obs. of  2 variables:
##   .. ..$ ARI   : num [1:11] 0.59 0.581 0.683 0.689 0.857 ...
##   .. ..$ nClust: num [1:11] 29.4 28 27.1 24.3 20.4 ...

As can be seen above, the output from the function is relatively complex. If the names of each list element is not suficiently self explanatory, see (?depeche) for information about each slot.

3.1 depeche function output graphs

Two graphs are part of the output from the depeche function.

3.1.1 Adjusted Rand Index as a function of penalty values

This graph shows how internally reproducible the results were for each of the tested penalties. An Adjusted Rand Index of 1 shows that if any random subset of observations is clustered two times, each observation will be assigned to the same cluster both times. Conversely, an Adjusted Rand Index of 0 indicates the opposite, i.e. totally random distribution. The adjustment in “Adjusted” Rand index takes the divering probabilities of ending up with a high or low overlap in the special cases of very few and very many clusters into consideration.

Adjusted Rand Index as a function of penalty values. Each value is based on at least 7 clustering pairs (DepecheR standard in a 8 core machine).

Adjusted Rand Index as a function of penalty values. Each value is based on at least 7 clustering pairs (DepecheR standard in a 8 core machine).

3.1.2 Cluster centers

This graph shows in a heatmap format where the cluster center is located for each of the markers that are defined for the cluster in question. A light color indicates a high expression, whereas a dark color indicates low or absent expression. Grey color, on the other hand, indicates that the cluster in question did not contribute to defining the cluster in question. In some cases, the results might seem strange, as a cluster might have an expression very close to the center of the full dataset, but this expression still defined the cluster. This is due to an internal, and for stability reasons necessary, effect of the algotihm: a specific penalty will have a larger effect on a cluster with fewer observations, than on a cluster with many observations.

Cluster center heatmap. Marker names on the x-axis, cluster numbers on the y-axis.

Cluster center heatmap. Marker names on the x-axis, cluster numbers on the y-axis.

4 tSNE generation

To be able to visualize the results, we need to generate a stochastic neighbor embedding for the same data as we used to generate the depeche clustering. Any tool that generates tSNE vector can be used for this purpose. Personally, I use the multicore version of the Rtsne implementation, but other versions, that do not require openMP, might be preferrable to some. If you want to use the openMP version on a mac, check this link. EDIT 2019-09-12: I would now recommend to run umap instead of tSNE, as it is superior in speed, separation of clusters and interpretability. I will change the vinjette to incorporate this soon, but it has no other effects than making the clusters better separated on the plots, so I will not make that change now.

library(Rtsne)
testDataSNE <- Rtsne(testData[,2:15], pca=FALSE)

5 Visualization of depeche clusters on tSNE

Now, we want to evaluate how the different clusters are distributed on the tSNE. To do this, we need to generate a color vector from the cluster vector in the testDataDepeche. This cluster vector is then overlayed over the tSNE, and to make things easier to interpret, a separate legend is included as well. The reason that the legend is in a separate plot is for making it easier to use the plots for publication purposes. For file size reasons, it has namely been necessary to use PNG and not PDF for the plot files.

NB! The resolution of the files normally generated by DepecheR is considerably higher than in this vinjette, due to size restrictions.

dColorPlot(colorData = testDataDepeche$clusterVector, xYData = testDataSNE$Y, 
           colorScale = "dark_rainbow", plotName = "Cluster")
## png 
##   2