# destiny main vignette: Start here!

library(base64enc)

suppressPackageStartupMessages({
library(destiny)
library(Biobase)
})

#setHook('on.rgl.close', function(...) {
#    name <- tempfile()
#    par3d(windowRect = c(0, 0, 1200, 1200))
#    Sys.sleep(1)
#
#    rgl.snapshot(  filename = paste0(name, '.png'))
#   #rgl.postscript(filename = paste0(name, '.pdf'), fmt='pdf')  # doesn’t work with spheres
#
#    publish_mimebundle(list(
#        'image/png'       = base64encode(paste0(name, '.png'))
#     #, 'application/pdf' = base64encode(paste0(name, '.pdf'))
#    ))
#}, 'replace')

Diffusion maps are spectral method for non-linear dimension reduction introduced by Coifman et al. (2005). Diffusion maps are based on a distance metric (diffusion distance) which is conceptually relevant to how differentiating cells follow noisy diffusion-like dynamics, moving from a pluripotent state towards more differentiated states.

The R package destiny implements the formulation of diffusion maps presented in Haghverdi, Buettner, and Theis (2015) which is especially suited for analyzing single-cell gene expression data from time-course experiments. It implicitly arranges cells along their developmental path, with bifurcations where differentiation events occur.

In particular we follow Haghverdi, Buettner, and Theis (2015) and present an implementation of diffusion maps in R that is less affected by sampling density heterogeneities and thus capable of identifying both abundant and rare cell populations. In addition, destiny implements complex noise models reflecting zero-inflation/censoring due to drop-out events in single-cell qPCR data and allows for missing values. Finally, we further improve on the implementation from Haghverdi, Buettner, and Theis (2015), and implement a nearest neighbour approximation capable of handling very large data sets of up to 300.000 cells.

For those familiar with R, and data preprocessing, we recommend the section Plotting.

All code in this vignette is accessible via edit(vignette('destiny'))

# Preprocessing of single qPCR data

As an example, we present in the following the preprocessing of data from Guo et al. (2010). This dataset was produced by the Biomark RT-qPCR system and contains Ct values for 48 genes of 442 mouse embryonic stem cells at 7 different developmental time points, from the zygote to blastocyst.

Starting at the totipotent 1-cell stage, cells transition smoothly in the transcriptional landscape towards either the trophoectoderm lineage or the inner cell mass. Subsequently, cells transition from the inner cell mass either towards the endoderm or epiblast lineage. This smooth transition from one developmental state to another, including two bifurcation events, is reflected in the expression profiles of the cells and can be visualized using destiny.

## Import

## # A tibble: 9 × 9
##   Cell   Actb  Ahcy  Aqp3 Atp12a  Bmp4  Cdx2 Creb312 Cebpa
##   <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 1C 1   14.0  19.3  23.9     28    28  21.3    20.8  28
## 2 1C 2   13.7  18.6  28       28    28  23.4    20.9  28
## 3 1C 3   13.4  18.2  26.2     28    28  22.9    19.6  28
## 4 1C 4   13.7  18.6  28       28    28  23.3    20.7  28
## 5 1C 5   13.5  18.6  24.2     28    28  24.2    21.8  23.7
## 6 1C 6   12.9  17.4  25.5     28    28  21.9    21.3  28
## 7 1C 7   13.0  17.4  23.9     28    28  22.7    21.1  28
## 8 1C 8   12.8  18.4  23.7     28    28  24.1    19.8  28
## 9 1C 9   13.3  18.3  28       28    28  21.9    21.2  28

The value 28 is the assumed background expression of 28 cycle times.

In order to easily clean and normalize the data without mangling the annotations, we convert the data.frame into a Bioconductor ExpressionSet using the as.ExpressionSet function from the destiny package, and assign it to the name ct:

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48 features, 442 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: 1 2 ... 442 (442 total)
##   varLabels: Cell
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

The advantage of ExpressionSet over data.frame is that tasks like normalizing the expressions are both faster and do not accidentally destroy annotations by applying “normalization” to columns that are not expressions. The approach of handling a separate expression matrix and annotation data.frame requires you to be careful when adding or removing samples on both variables, while ExpressionSet does it internally for you.

## [1] 10
##
## $max_cycles_norm ## [1] 25 We then also need to set the normalized values that should be censored – namely all data points were no expression was detected after the LoD – to this special value, because the values at the cycle threshold were changed due to normalization. This version of the dataset is avaliable as data(guo) from the destiny package. Now we call the the DiffusionMap function using the censoring model: Compared to the diffusion map created without censoring model, this map looks more homogeneous since it contains more data points. ## Missing values Gene expression experiments may fail to produce some data points, conventionally denoted as “not available” (NA). By calling DiffusionMap(..., missings = c(total.minimum, total.maximum)), you can specify the parameters for the missing value model. As in the data from Guo et al. (2010) no missing values occurred, we illustrate the capacity of destiny to handle missing values by artificially treating ct values of 999 (i. e. data points were no expression was detected after 40 cycles) as missing. This is purely for illustrative purposes and in practice these values should be treated as censored as illustrated in the previous section. We then perform normalization on this version of the data: Finally, we create a diffusion map with both missing value model and the censoring model from before: This result looks very similar to our previous diffusion map since only six additional data points have been added. However if your platform creates more missing values, including missing values will be more useful. # Prediction In order to project cells into an existing diffusion map, for example to compare two experiments measured by the same platform or to add new data to an existing map, we implemented dm.predict. It calculates the transition probabilities between datapoints in old and new data and projects cells into the diffusion map using the existing diffusion components. As an example we assume that we created a diffusion map from one experiment on 64 cell stage embryos: ct64 <- guo[, guo$num_cells == 64]
dm64 <- DiffusionMap(ct64)

Let us compare the expressions from the 32 cell state embryos to the existing map. We use dm.predict to create the diffusion components for the new cells using the existing diffusion components from the old data:

ct32 <- guo[, guo\$num_cells == 32]
pred32 <- dm_predict(dm64, ct32)

By providing the more and col.more parameters of the plot function, we show the first two DCs for both old and new data:

plot(dm64,    1:2,     col     = palette()[[6]],
new_dcs = pred32, col_new = palette()[[4]])

Clearly, the 32 and 64 cell state embryos occupy similar regions in the map, while the cells from the 64 cell state are developed further.

# Troubleshooting

There are several properties of data that can yield subpar results. This section explains a few strategies of dealing with them:

Preprocessing: if there is a strong dependency of the variance on the mean in your data (as for scRNA-Seq count data), use a variance stabilizing transformation such as the square root or a (regularized) logarithm before running destiny.

Outliers: If a Diffusion Component strongly separates some outliers from the remaining cells such that there is a much greater distance between them than within the rest of the cells (i. e. almost two discrete values), consider removing those outliers and recalculating the map, or simply select different Diffusion Components. It may also a be a good idea to check whether the outliers are also present in a PCA plot to make sure they are not biologically relevant.

Large datasets: If memory is not sufficient and no machine with more RAM is available, the k parameter could be decreased. In addition (particularly for >500,000 cells), you can also downsample the data (possibly in a density dependent fashion).

“Large-p-small-n” data: E.g. for scRNA-Seq, it is may be necessary to first perform a Principal Component Analysis (PCA) on the data (e.g. using prcomp or princomp) and to calculate the Diffusion Components from the Principal Components (typically using the top 50 components yields good results).

# References

Coifman, R. R., S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. 2005. “Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps.” PNAS. http://www.pnas.org/content/102/21/7426.long.

Guo, Guoji, Mikael Huss, Guo Qing Tong, Chaoyang Wang, Li Li Sun, Neil D. Clarke, and Paul Robson. 2010. “Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst.” Developmental Cell 18 (4): 675–85. https://doi.org/10.1016/j.devcel.2010.02.012.

Haghverdi, Laleh, Florian Buettner, and Fabian J. Theis. 2015. “Diffusion Maps for High-Dimensional Single-Cell Analysis of Differentiation Data.” Bioinformatics, no. in revision.