TrajectoryUtils 1.0.0

The *TrajectoryUtils* package contains low-level utilities to support trajectory inference packages.
*Support* is the key word here: this package does not perform any inference itself, deferring that to higher-level packages like *TSCAN* and *slingshot*.
In fact, most of the code in this package was factored out of those two packages after we realized their commonalities.
By rolling this out into a separate package, we can cut down on redundancy, facilitate better propagation of features and simplify maintenance.
If youโre a developer and you have a general utility for trajectory inference, contributions are welcome.

The construction of a cluster-based minium spanning tree is the primary motivator for this package,
as it was implemented separately in *TSCAN* and *slingshot*.
The idea is simple - cluster cells and create a minimum spanning tree from the cluster centroids to serve as the backbone for the trajectory reconstruction.

```
# Mocking up a Y-shaped trajectory.
centers <- rbind(c(0,0), c(0, -1), c(1, 1), c(-1, 1))
rownames(centers) <- seq_len(nrow(centers))
clusters <- sample(nrow(centers), 1000, replace=TRUE)
cells <- centers[clusters,]
cells <- cells + rnorm(length(cells), sd=0.5)
# Creating the MST:
library(TrajectoryUtils)
mst <- createClusterMST(cells, clusters)
plot(mst)
```

The implementation in `createClusterMST()`

combines several useful ideas from the two aforementioned packages:

- With
`outgroup=TRUE`

, users can add an outgroup to break apart distant clusters. This ensures that the MST does not form spurious links between unrelated parts of the dataset. - With
`dist.method="mnn"`

, users can create the MST based on the distances between mutually nearest neighbors. This avoids penalizing the formation of edges between adjacent heterogeneous clusters. - With
`dist.method="slingshot"`

, the Mahalanobis distance is computed from the covariance matrix for cells in each cluster. This accounts for the shape and spread of clusters when constructing the MST. - With
`use.median=TRUE`

, the centroids are computed by taking the median instead of the mean. This protects against clusters with many outliers.

Once the MST is constructed, we can use the `guessMSTRoots()`

function to guess the possible roots.
One root is guessed for each component of the MST, though the choice of root depends on the `method`

in use.
For example, we can choose a root node that maximizes the steps/distance to all terminal (degree-1) nodes;
this results in fewer branch events when defining paths through the MST.

`guessMSTRoots(mst, method="maxstep")`

`## [1] "2"`

`guessMSTRoots(mst, method="minstep")`

`## [1] "1"`

Once a root is guessed, we can define paths from the root to all terminal nodes using the `defineMSTPaths()`

function.
The interpretation of each path depends on the appropriateness of the chosen root;
for example, in developmental contexts, each path can be treated as a lineage if the root corresponds to some undifferentiated state.

`defineMSTPaths(mst, roots=c("1"))`

```
## [[1]]
## [1] "1" "2"
##
## [[2]]
## [1] "1" "3"
##
## [[3]]
## [1] "1" "4"
```

`defineMSTPaths(mst, roots=c("2"))`

```
## [[1]]
## [1] "2" "1" "3"
##
## [[2]]
## [1] "2" "1" "4"
```

Alternatively, if timing information is available (e.g., from an experimental time-course, or from computational methods like RNA velocity), we can use that to define paths through the MST based on the assumed movement of cells from local minima to the nearest local maxima in time. In the mock example below, the branch point at cluster 1 has the highest time so all paths converge from the terminal nodes to cluster 1.

`defineMSTPaths(mst, times=c(`1`=4, `2`=0, `3`=1, `4`=0))`

```
## [[1]]
## [1] "2"
##
## [[2]]
## [1] "3" "1"
##
## [[3]]
## [1] "4" "1"
```

We can also use a root-free method of defining paths through the MST, by simply defining all unbranched segments as separate paths. This decomposes the MST into simpler structures for, e.g., detection of marker genes, without explicitly defining a root or requiring external information. In fact, one might prefer this approach as it allows us to interpret each section of the MST in a more modular manner, without duplication of effort caused by the sharing of nodes across different paths from a common root.

`splitByBranches(mst)`

```
## [[1]]
## [1] "2" "1"
##
## [[2]]
## [1] "3" "1"
##
## [[3]]
## [1] "4" "1"
```

`PseudotimeOrdering`

classWe can create a compact representation of pseudotime orderings in the form of a matrix where rows are cells and columns are paths through the trajectory.
Each cell receives a pseudotime value along a path - or `NA`

, if the cell does not lie on that path.
On occasion, we may wish to annotate this with metadata on the cells or on the paths.
This is supported by the `PseudotimeOrdering`

class:

```
# Make up a matrix of pseudotime orderings.
ncells <- 200
npaths <- 5
orderings <- matrix(rnorm(1000), ncells, npaths)
# Default constructor:
(pto <- PseudotimeOrdering(orderings))
```

```
## class: PseudotimeOrdering
## dim: 200 5
## metadata(0):
## pathStats(1): ''
## cellnames: NULL
## cellData names(0):
## pathnames: NULL
## pathData names(0):
```

It is then straightforward to add metadata on the cells:

```
pto$cluster <- sample(LETTERS[1:5], ncells, replace=TRUE)
cellData(pto)
```

```
## DataFrame with 200 rows and 1 column
## cluster
## <character>
## 1 D
## 2 B
## 3 C
## 4 E
## 5 D
## ... ...
## 196 C
## 197 A
## 198 A
## 199 B
## 200 D
```

Or on the paths:

```
pathData(pto)$description <- sprintf("PATH-%i", seq_len(npaths))
pathData(pto)
```

```
## DataFrame with 5 rows and 1 column
## description
## <character>
## 1 PATH-1
## 2 PATH-2
## 3 PATH-3
## 4 PATH-4
## 5 PATH-5
```

Additional matrices can also be added if multiple statistics are available for each cell/path combination. For example, one might imagine assigning the confidence to which each cell is assigned to each path.

```
pathStat(pto, "confidence") <- matrix(runif(1000), ncells, npaths)
pathStatNames(pto)
```

`## [1] "" "confidence"`

For convenience, we also provide the `averagePseudotime()`

function to compute the average pseudotime for each cell.
This is sometimes necessary in applications that only expect a single set of pseudotime values, e.g., for visualization.

`summary(averagePseudotime(pto))`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.46135 -0.37279 -0.01841 -0.07571 0.22530 0.96807
```

`sessionInfo()`

```
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TrajectoryUtils_1.0.0 SingleCellExperiment_1.14.0
## [3] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [5] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
## [7] IRanges_2.26.0 S4Vectors_0.30.0
## [9] BiocGenerics_0.38.0 MatrixGenerics_1.4.0
## [11] matrixStats_0.58.0 knitr_1.33
## [13] BiocStyle_2.20.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 highr_0.9 bslib_0.2.5.1
## [4] compiler_4.1.0 BiocManager_1.30.15 jquerylib_0.1.4
## [7] XVector_0.32.0 bitops_1.0-7 tools_4.1.0
## [10] zlibbioc_1.38.0 digest_0.6.27 jsonlite_1.7.2
## [13] evaluate_0.14 lattice_0.20-44 pkgconfig_2.0.3
## [16] rlang_0.4.11 igraph_1.2.6 Matrix_1.3-3
## [19] DelayedArray_0.18.0 magick_2.7.2 yaml_2.2.1
## [22] xfun_0.23 GenomeInfoDbData_1.2.6 stringr_1.4.0
## [25] sass_0.4.0 grid_4.1.0 R6_2.5.0
## [28] rmarkdown_2.8 bookdown_0.22 magrittr_2.0.1
## [31] htmltools_0.5.1.1 stringi_1.6.2 RCurl_1.98-1.3
```