# Additional Community Typing

## K-means clustering

Let's now try k-means clustering. Here observations are divided into clusters so
that the distances between observations and cluster centers are
minimized; an observation belongs to cluster whose center is the
nearest.

The algorithm starts by dividing observation to random clusters whose
number is defined by user. The centroids of the clusters are then
calculated. After that, observations' allocation to clusters are
updated so that the means are minimized. Again, the centroids are
calculated, and the algorithm continues iteratively until the assignments
do not change.

As an alternative to `NbClust` get the optimal number of clusters, we can 
visualize the silhouette analysis thanks to the library `factoextra`.

```{r kmeans1}
library(factoextra)

# Convert dist object into matrix
diss <- as.matrix(diss)

# Perform silhouette analysis and plot the result
fviz_nbclust(diss, kmeans, method = "silhouette")
```

Based on the result of silhouette analysis, we confirm that 2 is the optimal
number of clusters in k-means clustering.

```{r kmeans2}
# The first step is random, add seed for reproducibility
set.seed(15463)

# Perform k-means clustering with 2 clusters
km <- kmeans(diss, 2, nstart = 25)

# Add the result to colData
colData(tse)$clusters <- as.factor(km$cluster)

# Perform PCoA so that we can visualize clusters
tse <- addMDS(tse, assay.type = "relabundance", 
              FUN = vegan::vegdist, method = "bray")

# Plot PCoA and color clusters
plotReducedDim(tse, "MDS", colour.by = "clusters")
```

```{r data, message=FALSE, warning=FALSE}
# Load data
library(mia)
library(microbiomeDataSets)
tse <- SprockettTHData()
```


## Graph-based clustering

Another approach for discovering communities within the samples of the
data, is to run community detection algorithms after building a
graph. The following demonstration builds a graph based on the k
nearest-neighbors and performs the community detection on the fly.

Here, we will be using the `addCluster` function with a graph-based clustering 
function, enabling the community detection task.

The algorithm used is "short random walks" [@Pons2006]. The graph is
constructed using different k values (the number of nearest neighbors
to consider during graph construction) using the robust centered log
ratio (rclr) assay data. Then plotting the communities using UMAP
[@McInnes2018] ordination as a visual exploration aid. Let us cluster the 
`enterotype` dataset.

```{r NNGraph1, message=FALSE, warning=FALSE}
library(patchwork) # For arranging several plots as a grid

# Get enterotype dataset and transform data with rclr
tse <- enterotype
tse <- transformAssay(tse, method = "rclr")

# Perform and store UMAP
tse <- runUMAP(tse, name = "UMAP", assay.type = "rclr")

# Set different k values to test
k <- c(2, 3, 5, 10)

ClustAndPlot <- function(x) {
    # Add the clustering data from the short random walks algorithm to the TSE
    tse <- addCluster(tse, assay.type = "rclr", 
                   by = "cols", NNGraphParam(k = x))
    
    # Plot the results of the clustering as a color for each sample
    plotUMAP(tse, colour.by = I(colData(tse)$clusters)) +
    labs(title = paste0("k = ", x))
}

# Apply the function for different k values
plots <- lapply(k, ClustAndPlot)

# Display plots in a grid
plots[[1]] + plots[[2]] + plots[[3]] + plots[[4]] + plot_layout(ncol = 4)
```

In this graph, we can clearly see the impact of the k choice on the quality
of the clustering

Similarly, the _`bluster`_ [@R_bluster] package offers clustering
diagnostics that can be used for judging the clustering quality (see
[Assorted clustering
diagnostics](http://bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html)).
In the following, Silhouette width as a diagnostic tool is computed
and results are visualized for each case presented earlier. For more
about Silhouettes read [@Rousseeuw1987].

```{r NNGraph2, message=FALSE, warning=FALSE}
ClustDiagPlot <- function(x) {
  # Get the clustering results
    tse <- addCluster(tse, assay.type = "rclr", 
                   by = "cols", NNGraphParam(k = x))

    # Compute the diagnostic info
    sil <- approxSilhouette(t(assays(tse)$rclr), colData(tse)$clusters)

    # Plot as a boxplot to observe cluster separation
    boxplot(split(sil$width, colData(tse)$clusters), main = paste0("k = ", x))
}
# Apply the function for different k values
res <- lapply(k, ClustDiagPlot)
```

## Community composition

### Composition barplot

A typical way to visualize microbiome composition is by using a composition barplot.
In the following, we agglomerate to the phylum level and subset by the country "Finland" to avoid long computation times. The samples in the barplot are ordered by "Firmicutes":

```{r, message=FALSE, warning=FALSE}
library(miaViz)

# Only consider Forest samples
tse <- tse[, colData(tse)$Country == "Finland"]
# Agglomerate to phylum level
tse <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)
# Get top taxa
rel_taxa <- getTop(tse, top = 8, assay.type = "counts")
# Take only the top taxa
tse <- tse[is.element(rownames(tse), rel_taxa), ]

# Visualise composition barplot, with samples order by "Firmicutes"
plotAbundance(tse, rank = "Phylum", order.row.by = "abund", order.col.by = "Firmicutes")
```

### Composition heatmap

The community composition can be visualized with a heatmap where one axis represents the samples and the other taxa. The color of each line represents the abundance of a taxon in a specific sample.

Here, the CLR + Z-transformed abundances are shown.

```{r, message=FALSE, warning=FALSE}
library(ComplexHeatmap)
library(grid)
library(RColorBrewer)
# Agglomerate to phylum level
tse <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)
# Take only the top taxa
tse <- tse[is.element(rownames(tse), rel_taxa),]

# CLR and standardize transforms
tse <- transformAssay(tse, MARGIN = "cols", assay.type = "counts",
                       method = "clr", pseudocount=1)
tse <- transformAssay(tse, MARGIN = "rows", assay.type = "clr", method = "standardize")

Countries <- data.frame("Country" = colData(tse)$Country)
rownames(Countries) <- colData(tse)$Sample_ID
# count matrix for pheatmap
mat <- assay(tse, "standardize")

# Order community types
mat <- mat[, order(Countries$Country)]
colnames(mat) <- colnames(mat)[order(Countries$Country)]
rownames(mat) <- stringr::str_remove(rownames(mat), "Phylum:")
# Make grid for heatmap
breaks <- seq(-3, 3, length.out = 10)
setHook("grid.newpage", function() pushViewport(viewport(x = 1, y = 1, width = 0.9, 
                                                         height = 0.9, name = "vp", 
                                                         just = c("right","top"))), 
        action = "prepend")
pheatmap(mat, color = rev(brewer.pal(9, "RdBu")), breaks = breaks, main = "Countries", treeheight_row = 0, treeheight_col = 0, show_colnames = 0, annotation_col = Countries, cluster_cols = F)
setHook("grid.newpage", NULL, "replace")
grid.text("Sample", x = 0.39, y = -0.04, gp = gpar(fontsize = 16))
grid.text("Phylum", x = -0.04, y = 0.47, rot = 90, gp = gpar(fontsize = 16))
```

## Cluster into CSTs

The burden of specifying the number of clusters falls on the researcher. To help make an informed decision, we turn to previously established methods for doing so. In this section we introduce three such methods (aside from DMM analysis) to cluster similar samples. They include the [Elbow Method, Silhouette Method, and Gap Statistic Method](https://uc-r.github.io/kmeans_clustering). All of them will utilize the [`kmeans'](https://uc-r.github.io/kmeans_clustering) algorithm which essentially assigns clusters and minimizes the distance within clusters (a sum of squares calculation). The default distance metric used is the Euclidean metric.

The scree plot allows us to see how much of the variance is captured by each dimension in the MDS ordination.

```{r scree}
library(ggplot2); th <- theme_bw()

# Only consider Finland samples
tse <- tse[, colData(tse)$Country == "Finland"]

# MDS analysis with the first 20 dimensions
tse  <- scater::addMDS(tse, FUN = vegan::vegdist, method = "bray", 
                       name = "MDS_BC", assay.type = "counts", ncomponents = 20)
ord  <- reducedDim(tse, "MDS_BC", withDimnames = TRUE)
# retrieve eigenvalues
eigs <- attr(ord, "eig")

# variance in each of the axes
var <- eigs / sum(eigs)
# first 12 values
df <- data.frame(x = c(1:12), y = var[1:12])
# create scree plot
p <- ggplot(df, aes(x, y)) +
     geom_bar(stat="identity") +
     xlab("Principal Component") +
     ylab("Variance") +
     ggtitle("Scree Plot")
p
```

From the scree plot (above), we can see that the first two dimensions can account for 15.5% of the total variation, but dimensions beyond 2 may be useful so let's try to remove the less relevant dimensions.

```{r}
# histogram of MDS eigenvalues from the fifth dimension onward
h <- hist(eigs[3:length(eigs)], 100)
```

```{r message = FALSE, warning = FALSE}
plot(h$mids, h$count, log = "y", type = "h", lwd = 10, lend = 2)
```

As you can see, dimensions 3, 4, and 5 still stand out so we will include 5 MDS dimensions.

### Elbow Method

This method graphs the sum of the sum of squares for each $k$ (number of clusters), where $k$ ranges between 1 and 10. Where the bend or `elbow' occurs is the optimal value of $k$.

```{r elbow, message = FALSE}
library(factoextra)

# take only first 5 dimensions
NDIM <- 5
x    <- ord[, 1:NDIM]

# Elbow Method
factoextra::fviz_nbclust(x, kmeans, method = "wss") +
                         geom_vline(xintercept = 3, linetype = 2) +
                         labs(subtitle = "Elbow Method") + th
```

The function says that the bend occurs at $k=3$, however it is hard to tell that the bend couldn't equally occur at $k=4$, $5$, or $6$.

### Silhouette Method

This method on the otherhand returns a width for each $k$. In this case, we want the $k$ that maximizes the width.

```{r silhouette}
# Silhouette method
factoextra::fviz_nbclust(x, kmeans, method = "silhouette") +
                         labs(subtitle = "Silhouette method") + th
```

The graph shows the maximum occurring at $k=6$. At the very least, there is strong evidence for $\geq k=2$ clusters because of the jump in the plot; this result is consistent with what we obtained from the elbow method.

### Gap-Statistic Method

The Gap-Statistic Method is the most complicated among the methods discussed here. With the gap statistic method, we typically want the $k$ value that maximizes the output (local and global maxima), but we also want to pay attention to where the plot jumps if the maximum value doesn't turn out to be helpful. 

```{r gap-statistic}
# Gap Statistic Method
factoextra::fviz_nbclust(x, kmeans, method = "gap_stat", nboot = 50)+
                         labs(subtitle = "Gap Statistic Method") + th
```

The peak suggests $k=6$ clusters. If we also look to the points where the graph jumps, we can see there is evidence for $k=2$, $k=6$, and $k=8$. The output indicates that there should be at least three clusters present. Since we have previous evidence for the existence of six clusters from the silhouette and elbow methods, we will go with $k=6$. 

At this point it helps to visualize the clustering in an MDS or NMDS plot. 

Now, let's divide the subjects into their respective clusters.

```{r create clusters}
library(cluster)
tse2 <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)

# assume 6 clusters
K <- 6
x <- ord[, 1:NDIM]

clust <- as.factor(pam(x, k = K, cluster.only = TRUE))
# assign CSTs
colData(tse2)$CST <- clust
CSTs <- as.character(seq(K))
```

Let's take a look at the MDS ordination with the Bray-Curtis dissimilarity in the first four dimensions.

```{r message = FALSE, warning = FALSE}
library(scater)
library(RColorBrewer)
library(patchwork)

# set up colors
CSTColors <- brewer.pal(6, "Paired")[c(2, 5, 3, 4, 1, 6)]
names(CSTColors) <- CSTs

CSTColorScale <- scale_colour_manual(name = "CST", values = CSTColors)
CSTFillScale <- scale_fill_manual(name = "CST", values = CSTColors)

# plot MDS with Bray-Curtis dimensions 1 and 2
p1 <- scater::plotReducedDim(tse2, "MDS_BC", colour_by = "CST", point_alpha = 1, 
                             percentVar = var[c(1, 2)]*100) + th + labs(title = "Ordination by Cluster") +
                             theme(plot.title = element_text(hjust = 0.5))
# plot MDS with Bray-Curtis dimensions 3 and 4
p2 <- scater::plotReducedDim(tse2, "MDS_BC", colour_by = "CST", point_alpha = 1, 
                             ncomponents = c(3, 4), percentVar = var[c(1, 2, 3, 4)]*100) + th
# show results
(p1 + CSTColorScale) / (p2 + CSTColorScale)
```

And now nMDS.

```{r message = FALSE, warning = FALSE}
tse2  <- runNMDS(tse2, FUN = vegan::vegdist, method = "bray", 
                name = "NMDS_BC", assay.type = "counts", ncomponents = 20)
scater::plotReducedDim(tse2, "NMDS_BC", colour_by = "CST", point_alpha = 1) + th + 
        labs(title = "NMDS Bray-Curtis by Cluster") +
        theme(plot.title = element_text(hjust = 0.5)) + CSTColorScale
```

## CST Analysis

Looking at heatmaps of the CSTs can reveal structure on a different level. Using all of the data again (for the top 8 taxa) and taking the `standardize` transformation of the clr transformed counts, we have

```{r message = FALSE, warning = FALSE, results = FALSE}

# Z transform of CLR counts
tse2 <- transformAssay(tse2, MARGIN = "cols", assay.type = "counts",
                        method = "clr", pseudocount=1)
tse2 <- transformAssay(tse2, MARGIN = "rows", assay.type = "clr", method = "standardize")
# get top taxa
tse2 <- tse2[is.element(rownames(tse2), rel_taxa), ]

mat <- assay(tse2, "standardize")

# Order CSTs
mat <- mat[, order(clust)]
colnames(mat) <- names(sort(clust))
rownames(mat) <- stringr::str_remove(rownames(mat), "Phylum:")
```

```{r messages = FALSE, warning = FALSE}
# Plot
CSTs        <- as.data.frame(sort(clust))
names(CSTs) <- "CST"
breaks <- seq(-2, 2, length.out = 10)
# Make grid for heatmap
setHook("grid.newpage", function() pushViewport(viewport(x = 1, y = 1, width = 0.9, 
                                                         height = 0.9, name = "vp", 
                                                         just = c("right","top"))), 
        action = "prepend")
pheatmap(mat, color = rev(brewer.pal(9, "RdBu")), breaks = breaks, main = "All CSTs", treeheight_row = 0, treeheight_col = 0, show_colnames = 0, annotation_col = CSTs, cluster_cols = F)
setHook("grid.newpage", NULL, "replace")
grid.text("Sample", x = 0.39, y = -0.04, gp = gpar(fontsize = 16))
grid.text("Phylum", x = -0.04, y = 0.47, rot = 90, gp = gpar(fontsize = 16))
```

## Dirichlet Multinomial Mixtures (DMM)

This section focus on DMM analysis. 

A different technique that allows us to search for groups of samples that are
similar to one another is the [Dirichlet-Multinomial Mixture
Model](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030126). In
DMM, we first determine the number of clusters (k) that best fit the
data (model evidence) using the Laplace approximation. After fitting the
model with k clusters, we obtain, for each sample, k probabilities that
reflect the probability that a sample belongs to a given cluster.

Let's cluster the data with DMM clustering. 

```{r dmm}
# Runs model and calculates the most likely number of clusters from 1 to 7.
# Since this is a large dataset it takes a long time to run
# For this reason we use only a subset of the data agglomerated to the phylum level
tse <- agglomerateByRank(tse, rank = "Phylum", agglomerateTree=TRUE)
tse_dmn <- runDMN(tse, name = "DMN", k = 1:7)
```

```{r}
# It is stored in metadata
tse_dmn
```

Return information on metadata that the object contains.

```{r}
names(metadata(tse_dmn))
```

This returns a list of DMN objects for closer investigation.

```{r}
getDMN(tse_dmn)
```


Show Laplace approximation (model evidence) for each model of every value of $k$.

```{r}
library(miaViz)
plotDMNFit(tse_dmn, type = "laplace")
```

Return the model that has the best fit.

```{r}
getBestDMNFit(tse_dmn, type = "laplace")
```
### PCoA for ASV-level data with Bray-Curtis

Group samples and return DMNGroup object that contains a summary.
Sample country is used for grouping.

```{r}
dmn_group <- calculateDMNgroup(tse_dmn, variable = "Country",  assay.type = "counts",
                               k = 3, seed=.Machine$integer.max)

dmn_group
```

Mixture weights (rough measure of the cluster size).


```{r}
DirichletMultinomial::mixturewt(getBestDMNFit(tse_dmn))
```


Samples-cluster assignment probabilities / how probable it is that a sample belongs
to each cluster

```{r}
DirichletMultinomial::mixture(getBestDMNFit(tse_dmn)) |> head()
```

Contribution of each taxa to each component

```{r}
DirichletMultinomial::fitted(getBestDMNFit(tse_dmn)) |> head()
```
Get the assignment probabilities

```{r}
prob <- DirichletMultinomial::mixture(getBestDMNFit(tse_dmn))
# Add column names
colnames(prob) <- paste0("comp", seq_len(ncol(prob)))

# For each row, finds column that has the highest value. Then extract the column 
# names of highest values.
vec <- colnames(prob)[max.col(prob,ties.method = "first")]
```

Computing the bray PCoA and storing it as a dataframe

```{r}
# Calculate relative abundances
tse_dmn <- transformAssay(tse_dmn, method = "relabundance")
# Does principal coordinate analysis
bray_pcoa_df <- getMDS(tse_dmn, FUN = vegan::vegdist, method = "bray", 
                             assay.type = "relabundance")

# Convert to data.frame
bray_pcoa_df <- as.data.frame(bray_pcoa_df)
colnames(bray_pcoa_df) <- c("pcoa1", "pcoa2")
bray_pcoa_df |> head()
```

```{r}
# Creates a data frame that contains principal coordinates and DMM information
bray_dmm_pcoa_df <- bray_pcoa_df 
bray_dmm_pcoa_df$dmm_component <- vec
# Creates a plot
bray_dmm_plot <- ggplot(data = bray_dmm_pcoa_df, 
                        aes(x = pcoa1, y = pcoa2, color = dmm_component)) +
  geom_point() +
  labs(x = "Coordinate 1",
       y = "Coordinate 2",
       title = "PCoA with Bray-Curtis Dissimilarity") +  
  theme(title = element_text(size = 12)) + theme_bw() # makes titles smaller

bray_dmm_plot
```

Visualise dmm clusters in a heatmap.

```{r}
# CLR and standardize transforms
tse_dmn <- transformAssay(tse_dmn, MARGIN = "cols",
                           assay.type = "counts", method = "clr",
			   pseudocount = 1)
tse_dmn <- transformAssay(tse_dmn, MARGIN = "rows",
                           assay.type = "clr", method = "standardize")
# objects containing dmm component information
clust <- factor(vec)
names(clust) <- colnames(tse_dmn)
# get top taxa
tse_dmn <- tse_dmn[is.element(rownames(tse_dmn), rel_taxa), ]
# get just counts
mat <- assay(tse_dmn, "standardize")
# order according to dmm component
mat <- mat[, order(clust)]
colnames(mat) <- names(sort(clust))
rownames(mat) <- stringr::str_remove(rownames(mat), "Phylum:")
# Plot
CSTs        <- as.data.frame(sort(clust))
names(CSTs) <- "CST"
breaks <- seq(-2, 2, length.out = 10)
# Make grid for heatmap
setHook("grid.newpage", function() pushViewport(viewport(x = 1, y = 1, width = 0.9, 
                                                         height = 0.9, name = "vp", 
                                                         just = c("right","top"))), 
        action = "prepend")
pheatmap(mat, color = rev(brewer.pal(9, "RdBu")), breaks = breaks, main = "All CSTs", treeheight_row = 0, treeheight_col = 0, show_colnames = 0, annotation_col = CSTs, cluster_cols = F)
setHook("grid.newpage", NULL, "replace")
grid.text("Sample", x = 0.39, y = -0.04, gp = gpar(fontsize = 16))
grid.text("Phylum", x = -0.04, y = 0.47, rot = 90, gp = gpar(fontsize = 16))
```

## Session Info

```{r}
sessionInfo()
```

## Bibliography

Robert L. Thorndike. 1953. "Who Belongs in the Family?". Psychometrika. 18 (4): 267–276. doi:10.1007/BF02289263

Peter J. Rousseeuw. 1987. "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.

Robert Tibshirani, Guenther Walther, and Trevor Hastie. 2002. Estimating the number of clusters in a data set via the gap statistic method. (63): 411-423. doi:10.1111/1467-9868.00293.

Daniel B. DiGiulio et al. 2015. Temporal and spatial variation of the human microbiota during pregnancy. (112): 11060--11065. doi:10.1073/pnas.1502875112

Holmes I, Harris K, Quince C, 2012 Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics. PLoS ONE 7(2): e30126. doi:10.1371/journal.pone.0030126.

Sprockett, D.D., Martin, M., Costello, E.K. et al. (2020) Microbiota assembly, structure, and dynamics among Tsimane horticulturalists of the Bolivian Amazon. Nat Commun 11, 3772 https://doi.org/10.1038/s41467-020-17541-6
