```
# Load required packages
suppressPackageStartupMessages({
library(treekoR)
library(SingleCellExperiment)
library(ggtree)
})
```

```
# Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("adam2o1o/treekoR")
library(treekoR)
```

treekoR is a framework that uses the hierarchical nature of single cell cytometry data to find robust and interpretable associations between cell subsets and patient clinical end points. It does so by deriving a tree structure over the cell clusters and then measuring, for each node, the %parent (the proportion of cells in the node relative to the number of cells in its immediate parent node) in addition to the %total (the proportion of cells in the node relative to all cells). These proportions are then used in significance testing and classification models to determine which cell subpopulation proportions are most strongly associated with the patient clinical outcome of interest. treekoR then provides an interactive visualisation that highlights these results.
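The difference between %total and %parent can be illustrated with a small worked example. This is only a sketch in base R on made-up data: the cluster labels and the tree (clusters `A` and `B` sharing a parent node) are hypothetical and are not taken from the package.

```r
# Toy example: six cells from one sample, assigned to three leaf clusters.
cells <- c("A", "A", "A", "B", "C", "C")

# Hypothetical tree: clusters A and B share a parent node "AB";
# "AB" and C are children of the root.
n_A   <- sum(cells == "A")            # 3 cells in node A
n_AB  <- sum(cells %in% c("A", "B"))  # 4 cells in the parent node of A
n_all <- length(cells)                # 6 cells in total

prop_total  <- n_A / n_all  # %total of A:  3 / 6 = 0.5
prop_parent <- n_A / n_AB   # %parent of A: 3 / 4 = 0.75
```

The same node can therefore look very different depending on the denominator, which is why both proportions are carried forward into testing.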

- `DeBiasi_COVID_CD8_samp`: A `SingleCellExperiment` containing samples of flow cytometry expression data from 39 patients. This data represents a subset of a dataset that was originally used by De Biasi et al. (2020) for the characterisation of CD8+ T cells, comparing COVID-19 patients with healthy controls.

```
data(COVIDSampleData)
sce <- DeBiasi_COVID_CD8_samp
```

treekoR requires the following information in the variables:

- `exprs`: single cell expression data (\(n \times p\)), where \(p\) is the number of markers and \(n\) is the number of cells
- `clusters`: a vector of length \(n\) representing the cell type or cluster of each cell (can be `character` or `numeric`)
- `classes`: a vector of length \(n\) containing the patient outcome/class each cell belongs to
- `samples`: a vector of length \(n\) identifying the patient each cell belongs to

In this example, `clusters` contains 100 clusters generated by FlowSOM; `classes` identifies whether each cell belongs to a COVID-19 or healthy patient; and `samples` identifies which patient each cell comes from.

```
exprs <- t(assay(sce, "exprs"))
clusters <- colData(sce)$cluster_id
classes <- colData(sce)$condition
samples <- colData(sce)$sample_id
```
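Because all four inputs must describe the same \(n\) cells, it can be worth checking them before building the tree. The helper below is a hypothetical sketch (it is not part of treekoR) and is run here on toy data; it assumes that each sample should belong to exactly one class.

```r
# Hypothetical helper (not part of treekoR): check that the inputs agree.
check_inputs <- function(exprs, clusters, classes, samples) {
  n <- nrow(exprs)  # cells are rows, markers are columns
  stopifnot(
    length(clusters) == n,
    length(classes) == n,
    length(samples) == n
  )
  # Assumption: every sample maps to exactly one class.
  classes_per_sample <- tapply(as.character(classes), as.character(samples),
                               function(x) length(unique(x)))
  stopifnot(all(classes_per_sample == 1))
  invisible(TRUE)
}

# Toy data standing in for the real inputs: 4 cells, 2 markers.
toy_exprs <- matrix(rnorm(8), nrow = 4,
                    dimnames = list(NULL, c("CD3", "CD8")))
ok <- check_inputs(toy_exprs,
                   clusters = c(1, 1, 2, 2),
                   classes  = c("COVID", "COVID", "healthy", "healthy"),
                   samples  = c("s1", "s1", "s2", "s2"))
```

With the vignette data, the equivalent call would be `check_inputs(exprs, clusters, classes, samples)`.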

The scaled median marker expression of each cluster is calculated and then used to construct a hierarchical tree.

In this step, the hierarchical aggregation method, which determines the structure of the tree, is chosen via the `hierarchy_method` argument. By default the framework uses HOPACH to construct the tree; however, any of the methods in `hclust` can be used (see 3.5.1).

```
clust_tree <- getClusterTree(exprs,
                             clusters,
                             hierarchy_method="hopach")
```
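To make the idea behind this step concrete, the sketch below computes per-cluster median marker expression, scales it, and aggregates the clusters with `hclust`. It uses base R on simulated data and is only an illustration of the general approach; the exact scaling and aggregation performed inside `getClusterTree` may differ, and the marker names and linkage method here are arbitrary choices.

```r
set.seed(1)

# Simulated data: 60 cells, 3 markers, 6 clusters of 10 cells each.
toy_exprs <- matrix(rnorm(60 * 3), ncol = 3,
                    dimnames = list(NULL, c("m1", "m2", "m3")))
toy_clusters <- rep(1:6, each = 10)

# Median expression of each marker within each cluster (clusters x markers).
clust_med <- apply(toy_exprs, 2,
                   function(marker) tapply(marker, toy_clusters, median))

# Centre and scale each marker so that no single marker dominates distances.
clust_med_scaled <- scale(clust_med)

# Aggregate clusters hierarchically; "average" linkage is an arbitrary choice.
hc <- hclust(dist(clust_med_scaled), method = "average")
```

Calling `plot(hc)` draws the resulting dendrogram, which plays the same role as the cluster tree used in the later steps.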

The proportion of each cell cluster in the tree is calculated in two ways: relative to all cells, and relative to the cells in its hierarchical parent. Each type of proportion is then used in a two sample t-test, testing for equal means between the patient clinical outcome groups.

```
tested_tree <- testTree(phylo=clust_tree$clust_tree,
                        clusters=clusters,
                        samples=samples,
                        classes=classes,
                        pos_class_name=NULL)
```
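The test applied to a single node can be sketched in base R as follows. The data, cluster labels and tree (clusters `A` and `B` sharing a parent) are made up for illustration, and the sketch uses the default Welch form of `t.test`; whether `testTree` uses exactly this variant is an assumption.

```r
# Toy data: 4 samples (2 per class), 10 cells each, leaf clusters A, B, C.
toy <- data.frame(
  sample  = rep(c("s1", "s2", "s3", "s4"), each = 10),
  class   = rep(c("COVID", "healthy"), each = 20),
  cluster = c(rep(c("A", "B", "C"), c(6, 2, 2)),
              rep(c("A", "B", "C"), c(5, 2, 3)),
              rep(c("A", "B", "C"), c(2, 5, 3)),
              rep(c("A", "B", "C"), c(3, 4, 3)))
)

# Cell counts per sample (rows) and cluster (columns).
counts <- table(toy$sample, toy$cluster)

# Proportions of node A per sample; hypothetical tree: A and B share a parent.
prop_total  <- as.numeric(counts[, "A"]) / as.numeric(rowSums(counts))
prop_parent <- as.numeric(counts[, "A"]) /
  as.numeric(counts[, "A"] + counts[, "B"])

# Class of each sample, in the same order as the rows of `counts`.
sample_class <- toy$class[match(rownames(counts), toy$sample)]
is_covid <- sample_class == "COVID"

# Two sample t-tests comparing COVID vs healthy for each type of proportion.
test_total  <- t.test(prop_total[is_covid],  prop_total[!is_covid])
test_parent <- t.test(prop_parent[is_covid], prop_parent[!is_covid])
```

The sign of each statistic depends on which class is placed first, which is presumably what the `pos_class_name` argument controls in `testTree`.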

The results can be extracted as a table with the following columns:

- `node`: unique identifier for each node in the hierarchical tree
- `parent`: the node of the parent
- `isTip`: whether the node is a leaf node in the tree
- `clusters`: the clusters belonging to the corresponding node
- `stat_total`: test statistic obtained from testing between conditions using the proportion of the node relative to all cells (%total) in each sample. `pval_total` is the corresponding p-value (unadjusted)
- `stat_parent`: test statistic obtained from testing between conditions using the proportion of the node relative to cells in the parent node (%parent) in each sample. `pval_parent` is the corresponding p-value (unadjusted)

```
res_df <- getTreeResults(tested_tree)
head(res_df, 10)
#> parent node isTip
#> 51 139 51 TRUE
#> 52 139 52 TRUE
#> 63 147 63 TRUE
#> 67 150 67 TRUE
#> 136 135 136 FALSE
#> 71 150 71 TRUE
#> 125 100 125 FALSE
#> 20 115 20 TRUE
#> 116 115 116 FALSE
#> 140 100 140 FALSE
#> clusters
#> 51 77
#> 52 66
#> 63 41
#> 67 32
#> 136 58, 67, 57
#> 71 54
#> 125 85, 84, 95, 97, 87, 74, 76, 96, 86, 69, 78, 56, 77, 66, 75, 68, 58, 67, 57
#> 20 89
#> 116 59, 88, 70, 60
#> 140 35, 43, 46, 73, 64, 100, 45, 21, 42, 31, 41, 33, 81, 34, 32, 52, 55, 44, 54, 53, 63, 24, 82, 23, 61, 51, 71, 62
#> stat_total stat_parent pval_total pval_parent
#> 51 1.0823881 4.439944 0.290116635 0.0003193122
#> 52 -1.9247385 -4.439944 0.069760334 0.0003193122
#> 63 1.8107720 3.337019 0.080505219 0.0022734266
#> 67 1.7508315 3.229075 0.092603817 0.0033277959
#> 136 1.4290582 -3.198607 0.163963670 0.0041906737
#> 71 -2.3514750 -2.829766 0.033743471 0.0082310590
#> 125 2.7073249 2.707325 0.011833319 0.0118333189
#> 20 -0.7764417 -2.897438 0.446585533 0.0124821272
#> 116 3.0537776 2.897438 0.005462897 0.0124821272
#> 140 -2.5102007 -2.510201 0.017809639 0.0178096392
```
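One way to read this table is to look for nodes that are significant relative to their parent but not relative to all cells. The sketch below does this on a small data frame whose values are copied from five rows of the output above; the 0.05 threshold and the Benjamini-Hochberg adjustment are illustrative choices, not part of the treekoR workflow shown here.

```r
# Five rows copied from the printed results (nodes 51, 52, 71, 125, 20).
toy_res <- data.frame(
  node        = c(51, 52, 71, 125, 20),
  pval_total  = c(0.290116635, 0.069760334, 0.033743471,
                  0.011833319, 0.446585533),
  pval_parent = c(0.0003193122, 0.0003193122, 0.0082310590,
                  0.0118333189, 0.0124821272)
)

# Nodes significant by %parent but not by %total (illustrative 0.05 cut-off).
parent_only <- toy_res[toy_res$pval_parent < 0.05 &
                         toy_res$pval_total >= 0.05, ]

# The reported p-values are unadjusted. On real results, an adjustment would
# be applied to the full table rather than to a subset of rows as done here.
toy_res$padj_parent <- p.adjust(toy_res$pval_parent, method = "BH")
```

Here `parent_only` keeps nodes 51, 52 and 20, which are the kind of signal that testing on %total alone would miss.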

The results of the previous steps are visualised as a coloured tree with a corresponding heatmap. The heatmap displays the median scaled marker expression of each cluster, to help identify which cell type each cluster may represent. The tree shows how the clusters have been hierarchically aggregated and also carries the test results: each node is coloured by the test statistic obtained using its proportion relative to all cells, and each branch connecting a child to its parent is coloured by the test statistic obtained using the child node's proportion relative to the parent.

```
plotInteractiveHeatmap(tested_tree,
                       clust_med_df = clust_tree$median_freq,
                       clusters=clusters)
#> Warning: Removed 1 rows containing missing values (geom_interactive_point).
```