Contents

1 Introduction

The goal of this package is to encourage the user to try many different clustering algorithms in one package structure, and we provide strategies for creating a unified clustering from these many clustering resutls. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithm tools to find common shared clustering patterns. We implement common post-processing steps unrelated to the specific clustering algorithm (e.g. subsampling the data for stability, finding cluster-specific markers via differential expression, etc).

The other main goal of this package is to implement strategies that we have developed in the RSEC algorithm (Resampling-based Sequential Ensemble Clustering) for finding a single robust clustering based on the many clusterings that the user might create by perturbing various parameters of a clustering algorithm. There are several steps to these strategies that we call our standard clustering workflow. The RSEC function is our preferred realization of this workflow that depends on subsampling and other ensemble methods to provide robust clusterings, particularly for single-cell sequencing experiments and other large mRNA-Seq experiments.

We also provide a class ClusterExperiment that inherits from SummarizedExperiment to store the many clusterings and related information, and a class ClusterFunction that encodes a clustering routine so that users can create customized clustering routines in a standardized way that can interact with our clustering workflow algorithms.

All of our methods also have a barebones version that allows input of matrices and greater control. This comes at the expense of the user having to manage and keep track of the clusters, input data, transformation of the data, etc. We do not discuss these barebone versions in this tutorial. Instead, we focus on using the SummarizedExperiment object as the input and working with the resulting ClusterExperiment object. See the help pages of each method for more on how to allow for matrix input.

Although this package was developed with (single-cell) RNA-seq data in mind, its use is not limited to RNA-seq or even to gene expression data. Any dataset characterized by high dimensionality could benefit from the methods implemented here.

1.1 The RSEC clustering workflow

The package encodes many common practices that are shared across clustering algorithms, like subsampling the data, computing silhouette width, sequential clustering procedures, and so forth. It also provides novel strategies that we developed as part of the RSEC algorithm (Resampling-based Sequential Ensemble Clustering) .

As mentioned above, RSEC is a specific algorithm for creating a clustering that follows these basic steps:

  • Implement many different clusterings using different choices of parameters using the function clusterMany. This results in a large collection of clusterings, where each clustering is based on different parameters.
  • Find a unifying clustering across these many clusterings using the makeConsensus function.
  • Determine whether some clusters should be merged together into larger clusters. This involves two steps:
    • Find a hierarchical clustering of the clusters found by makeConsensus using makeDendrogram
    • Merge together clusters of this hierarchy based on the percentage of differential expression, using mergeClusters.

The basic premise of RSEC is to find small, robust clusters of samples, and then merge them into larger clusters as relevant. We find that many algorithmic methods for choosing the appropriate number of clusters for methods err on the side of too few clusters. However, we find in practice that we tend to prefer to err on finding many clusters and then merging them based on examining the data.

The RSEC function is a wrapper around these steps that makes many specific choices in this basic workflow. However, many steps of this workflow are useful for users separately and for this reason, the clusterExperiment package generalizes this workflow so that the user can follow this workflow with their own choices. We call this generalization the clustering workflow, as oppose to the specific choices set in RSEC.

Users can also run or add their own clusters to this workflow at different stages. Additional functionality for creating a single clustering is also available in the clusterSingle function, and the user should see the documentation in the help page of that function.

2 Quickstart

In this section we give a quick introduction to the package and the RSEC wrapper which creates the clustering. We will also demonstrate how to find features (biomarkers) that go along with the clusters.

2.1 The Data

We will make use of a small single cell RNA sequencing experiment produced by Fluidigm and made available in the scRNAseq package. Fluidigm’s original data (and that provided by scRNASeq) contained 130 samples, but there are only 65 actual cells, because each cell library is sequenced twice at different sequencing depth. We have provided within the clusterExperiment package a subset of this data set, limited to only those 65 cells sequenced at high depth1 scRNAseq also provides multiple gene summaries of the data, and we save only the “tophat_counts” and “rsem_tpm” values.. See ?fluidigmData to see the code to recreate the data we provide here.

We will load the datasets, and data containing information about each of the samples, and then store that information together in a SummarizedExperiment object.

set.seed(14456) ## for reproducibility, just in case
library(clusterExperiment)
data(fluidigmData) ## list of the two datasets (tophat_counts and rsem_tpm)
data(fluidigmColData)
se<-SummarizedExperiment(fluidigmData,colData=fluidigmColData)

We can access the gene counts data with assay and the metadata on the samples with colData.

assay(se)[1:5,1:10]
##          SRR1275356 SRR1275251 SRR1275287 SRR1275364 SRR1275269 SRR1275263
## A1BG              0          0          0          0          0          0
## A1BG-AS1          0          0          0          0          0          0
## A1CF              0          0          0          0          0          0
## A2M               0          0         31          0         46          0
## A2M-AS1           0          0          0          0          0          0
##          SRR1275338 SRR1274117 SRR1274089 SRR1274125
## A1BG              0          0          0         10
## A1BG-AS1          0          0          0          0
## A1CF              0          0          0          0
## A2M               0         29          0          0
## A2M-AS1           0        133          0          0
colData(se)[,1:5]
## DataFrame with 65 rows and 5 columns
##               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER
##            <numeric> <numeric> <numeric> <numeric> <numeric>
## SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638
## SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827
## SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685
## SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284
## SRR1275269  10729700   7806230   72.7536   50.4285 0.0204349
## ...              ...       ...       ...       ...       ...
## SRR1275259   5949930   4181040   70.2705   52.5975 0.0205253
## SRR1275253  10319900   7458710   72.2747   54.9637 0.0205342
## SRR1275285   5300270   4276650   80.6873   41.6394 0.0227383
## SRR1275366   7701320   6373600   82.7600   68.9431 0.0266275
## SRR1275261  13425000   9554960   71.1727   62.0001 0.0200522
NCOL(se) #number of samples
## [1] 65

2.2 Filtering and normalization

We will filter out lowly expressed genes: we retain only those genes with at least 10 reads in at least 10 cells.

wh_zero <- which(rowSums(assay(se))==0)
pass_filter <- apply(assay(se), 1, function(x) length(x[x >= 10]) >= 10)
se <- se[pass_filter,]
dim(se)
## [1] 7069   65

This removed 19186 genes out of 26255. We now have 7069 genes (or features) remaining. Notice that it is important to at least remove genes with zero counts in all samples (we had 9673 genes which were zero in all samples here). Otherwise, PCA dimensionality reductions and other implementations may have a problem.

Normalization is an important step in any RNA-seq data analysis and many different normalization methods have been proposed in the literature. Comparing normalization methods or finding the best performing normalization in this dataset is outside of the scope of this vignette. Instead, we will use a simple quantile normalization that will at least make our clustering reflect the biology rather than the difference in sequencing depth among the different samples.

#provides matrix of normalized counts
fq <- round(limma::normalizeQuantiles(assay(se)))

SummarizedExperiment objects allow for the storage of multiple data matrices of the same dimensions, so we will add this normalized data as an additional assay in our se object. Note that we will put the normalized counts first, since by default the clusterExperiment package uses the first assay in the object (see Working with other assays for how to switch between different assays)

assays(se) <- c(SimpleList(normalized_counts=fq),assays(se))
assays(se)
## List of length 3
## names(3): normalized_counts tophat_counts rsem_tpm

We see that we’ve added the normalized counts to our two existing datasets already (we’ve been working with the tophat counts, as the first assay, but also saved in our object are RSEM TPM values).

As one last step, we are going to change the name of the columns “Cluster1” and “Cluster2” that came in the dataset; they refer to published clustering results from the paper. We will use the terms “Published1” and “Published2” to better distinguish them in later plots from other clustering we will do.

wh<-which(colnames(colData(se)) %in% c("Cluster1","Cluster2"))
colnames(colData(se))[wh]<-c("Published1","Published2")

2.3 Clustering with RSEC

We will now run RSEC to find clusters of the cells using the default settings. We set isCount=TRUE to indicate that the data in se is count data, so that the log-transform and other count methods should be applied. We also choose the number of cores on which we want to run the operation in parallel via the parameter ncores. This is a relatively small number of samples, compared to most single-cell sequencing experiments, so we choose cluster on the top 10 PCAs of the data by setting reduceMethod="PCA" and nReducedDims=10 (the default is 50). Finally, we set the minimum cluster size in our ensemble clustering to be 3 cells (consensusMinSize=3). We do this not for biological reasons, but for instructional purposes to allow us to get a larger number of clusters.

Because this procedure is slightly computationally intensive, depending on the user’s machine, we have saved the results from this call as a data object in the package and will use the following code to load it into the vignette:

data("rsecFluidigm")
# package version the object was created under
metadata(rsecFluidigm)$packageVersion
## [1] '2.7.2.9001'

However, it doesn’t take very long (roughly 1-2 minutes or less) so we recommend users try it themselves. The following code is the basic RSEC call corresponding to the above object (the vignette does not run this code but the user can):

## Example of RSEC call (not run by vignette)
## Will overwrite the rsecFluidigm brought in above.
library(clusterExperiment)
system.time(rsecFluidigm<-RSEC(se, isCount = TRUE, 
    reduceMethod="PCA", nReducedDims=10,
    k0s = 4:15, 
    alphas=c(0.1, 0.2, 0.3),
    ncores=1, random.seed=176201))

However, to exactly replicate the data object rsecFluidigm that is provided in the package, you would need to set a large number of parameters to be exactly the same as this object (in different versions of the package, some defaults have changed and we try to not change the rsecFluidigm object everytime). The code that creates rsecFluidigm can be found by typing makeRsecFluidigmObject in the command line (see ?rsecFluidigm).

2.3.1 The output

We can look at the object that was created.

rsecFluidigm
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: mad_makeConsensus 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##  -1 m01 m02 m03 m04 m05 
##  26  13   4   4   3  15 
## Total number of clusterings: 38 
## Dendrogram run on 'makeConsensus' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? Yes

The print out tells us about the clustering(s) that were created (namely 38 clusterings) and which steps of the workflow have been called (all of them have because we used the wrapper RSEC that does the whole thing). Recall from our brief description above that RSEC clusters the data many times using different parameters before finding an consensus clustering. All of these intermediate clusterings are saved. Each of these intermediate clusterings used subsampling of the data and sequential clustering of the data to find the clustering, while the different clusterings represent the different parameters that were adjusted.

We can see that rsecFluidigm has a built in (S4) class called a ClusterExperiment object. This is a class built for this package and explained in the section on ClusterExperiment Objects. In the object rsecFluidigm the clusterings are stored along with corresponding information for each clustering. Furthermore, all of the information in the original SummarizedExperiment is retained. The print out also tells us information about the “primaryCluster” of rsecFluidigm. Each ClusterExperiment object has a “primaryCluster”, which is the default cluster that the many functions will use unless specified by the user. We are told that the “primaryCluster” for rsecFluidigm is has the label “mergeClusters” – which is the defaul label given to the last cluster of the RSEC function because the last call of the RSEC function is to mergeClusters.

There are many accessor functions that help you get at the information in a ClusterExperiment object and some of the most relevant are described in the section on ClusterExperiment Objects. (ClusterExperiment objects are S4 objects, and are not lists).

For right now we will only mention the most basic such function that retrieves the actual cluster assignments. The final clustering created by RSEC is saved as the primary clustering of the object.

head(primaryCluster(rsecFluidigm),20)
##  [1] -1 -1  3  2 -1 -1 -1  5  5  5  1  1 -1  1 -1 -1 -1  5 -1 -1
tableClusters(rsecFluidigm)
## mergeClusters
##  -1 m01 m02 m03 m04 m05 
##  26  13   4   4   3  15

The clusters are encoded by consecutive integers. Notice that some of the samples are assigned the value of -1. -1 is the value assigned in this package for samples that are not assigned to any cluster. Why certain samples are not clustered depends on the underlying choices of the clustering routine and we won’t get into here until we learn a bit more about RSEC. Another special value is -2 discussed in the section on ClusterExperiment objects

This final result of RSEC is the result of running many clusterings and finding the ensembl consensus between them. All of the these intermediate clusterings are saved in rsecFluidigm object. They can be accessed by the clusterMatrix function, that returns a matrix where the columns are the different clusterings and the rows are samples. We show a subset of this matrix here:

head(clusterMatrix(rsecFluidigm)[,1:4])
##            mergeClusters makeConsensus k0=4,alpha=0.1 k0=5,alpha=0.1
## SRR1275356            -1            -1             -1             -1
## SRR1275251            -1            -1             -1             -1
## SRR1275287             3             5             -1             -1
## SRR1275364             2             4              3             -1
## SRR1275269            -1            -1             -1             -1
## SRR1275263            -1            -1             -1             -1

The “mergeClusters” clustering is the final clustering from RSEC and matches the primary clustering that we saw above. The “makeConsensus” clustering is the result of the initial ensembl concensus among all of the many clusterings that were run, while “mergeClusters” is the result of merging smaller clusters together that did not show enough signs of differences between clusters. The remaining clusters are the result of changing the parameters, and a couple of such clusterings a shown in the above printout of the cluster matrix.

The column names are the clusterLabels for each clustering and can be accessed (and assigned new values!) via the clusterLabels function.

head(clusterLabels(rsecFluidigm))
## [1] "mergeClusters"  "makeConsensus"  "k0=4,alpha=0.1" "k0=5,alpha=0.1"
## [5] "k0=6,alpha=0.1" "k0=7,alpha=0.1"

We can see the names of more clusterings, and see that the different parameter values tried in each clustering are saved in the names of the clustering. We can also see the different parameter combinations that went into the consensus clustering by using getClusterManyParams (here only 2 different parameters).

head(getClusterManyParams(rsecFluidigm))
##                clusteringIndex k alpha
## k0=4,alpha=0.1               3 4   0.1
## k0=5,alpha=0.1               4 5   0.1
## k0=6,alpha=0.1               5 6   0.1
## k0=7,alpha=0.1               6 7   0.1
## k0=8,alpha=0.1               7 8   0.1
## k0=9,alpha=0.1               8 9   0.1

2.3.2 Visualizing the output

clusterExperiment also provides many ways to visualize the output of RSEC (or any set of clusterings run in clusterExperiment, as we’ll show below).

2.3.2.1 Visualizing many clusterings

The first such useful visualization is a plot of all of the clusterings together using the plotClusters command. For this visualization, it is useful to change the amount of space on the left of the plot to allow for the labels of the clusterings, so we will reset the mar option in par. We also decrease the axisLine argument that decides the amount of space between the axis and the labels to give more space to the labels (axisLine is passed internally to the line option in axis).

defaultMar<-par("mar")
plotCMar<-c(1.1,8.1,4.1,1.1)
par(mar=plotCMar)
plotClusters(rsecFluidigm,main="Clusters from RSEC", whichClusters="workflow", colData=c("Biological_Condition","Published2"), axisLine=-1)

This plot shows the samples in the columns, and different clusterings on the rows. Each sample is color coded based on its clustering for that row, where the colors have been chosen to try to match up clusters across different clusterings that show large overlap. Moreover, the samples have been ordered so that each subsequent clustering (starting at the top and going down) will try to order the samples to keep the clusters together, without rearranging the clustering blocks of the previous clustering/row.

We also added a colData argument in our call, indicating that we also want to visualize some information about the samples saved in the colData slot (inherited from our original fluidigm object). We chose the columns “Biological_Condition” and “Published2” from colData, which correspond to the original biological condition of the experiment, and the clusters reported in the original paper, respectively. The data from colData (when requested) are always shown at the bottom of the plot.

Notice that some samples are white. This indicates that they have the value -1, meaning they were not clustered. In fact, for many clusterings, there is a large amount of white here. This is likely do to the fact that there are only 65 cells here, and the default parameters of RSEC are better suited for a large number of cells, such as seen in more modern single-cell sequencing experiments. The sequential clustering may be problematic for small numbers of cells.

We can use an alternative version of plotClusters called plotClustersWorkflow that will better emphasize the more final clusterings from the ensemble/concensus step and merging steps (it currently does not allow for showing the colData as well, however – only clustering results).

par(mar=plotCMar)
plotClustersWorkflow(rsecFluidigm)

2.3.2.2 Barplots & contingency tables

We can examine size distribution of a single clustering with the function plotBarplot. By default, the cluster picked will be the primary cluster.

plotBarplot(rsecFluidigm,main=paste("Distribution of samples of",primaryClusterLabel(rsecFluidigm)))

We can also pick a particular intermediate clustering, say our intial consensus clustering before merging.

plotBarplot(rsecFluidigm,whichClusters=c("makeConsensus" ))

We can also compare two specific clusters with a simple barplot using plotBarplot. Here we compare the “makeConsensus” and the “mergeClusters” clusterings.

plotBarplot(rsecFluidigm,whichClusters=c("mergeClusters" ,"makeConsensus"))

Since “makeConsensus” is a partition of “mergeClusters”, there is perfect subsetting within the clusters of “mergeClusters”.

A related plot is to plot a heatmap of the contingency table between two clusterings provided by plotClustersTable. This function plots a heatmap of the results of tableClusters, optionally converting them to proportions using prop.table function based on the parameter margin. Here, we’ll set margin=1, meaning we will show each row (corresponding to a cluster of the mergeCluster clustering), as a proportion – i.e. the grey scale of the heatmap gives (in percentages) how the samples in that row’s cluster are distributed across the clusters of the other clustering, makeConsensus

plotClustersTable(rsecFluidigm,whichClusters=c("mergeClusters" ,"makeConsensus"), margin=1)

Again, since makeConsensus clusters are all completely contained in mergeClusters, this plot has less information than if we were comparing competing clusterings (e.g. different results from mergingClusters, see below). For example, there is nothing on the off-diagonal. But we can still see about how the smaller makeConsensus make up the mergeClusters.

Another version of this plot is given by choosing plotType="bubble", where now the size of the dot at each pair of clusters corresponds to the absolute size of the overlap, and the color scale is again the percentage overlap.

plotClustersTable(rsecFluidigm,whichClusters=c("mergeClusters" ,"makeConsensus"), margin=1,plotType="bubble")

2.3.2.3 Co-Clustering

We can also visualize the proportion of times samples were together in the individual clusterings (i.e. before the consensus clustering):

plotCoClustering(rsecFluidigm,whichClusters=c("mergeClusters","makeConsensus"))

Note that this is not the result from any particular subsampling (which was done repeatedly for each clustering, and those many matrices are not stored), but rather the proportion of times across the final results of the clusterings we ran. The initial consensus clustering in makeConsensus was made based on these proportions and a particular cutoff of the required proportion of times the samples needed to be together.

2.3.2.4 Plot of Hierarchy of Clusters

We can visualize how the initial ensembl cluster in makeConsensus was clustered into a hierarchy and merged to give us the final clustering in mergeClusters:

plotDMar<-c(8.1,1.1,5.1,8.1)
par(mar=plotDMar)
plotDendrogram(rsecFluidigm,whichClusters=c("makeConsensus","mergeClusters"))

As shown in this plot, the individual clusters of the makeConsensus ensembl clustering were hierarchically clustered (hence the note that the dendrogram was made from the makeConsensus clustering), and similar sister clusters were merged if there were not enough gene differences between them.

2.3.2.5 2D plot of clusters

Finally, we can plot a 2-dimensional representation of the data with PCA and color code the samples to get a sense of how the data cluster.

plotReducedDims(rsecFluidigm)

We can also look at a higher number of dimensions (or different dimensions) by changing the parameter ‘whichDims’.

plotReducedDims(rsecFluidigm,whichDims=c(1:4))

In this case we can see that higher dimensions show us a greater amount of separation between the clusters than in just 2 dimensions.

2.4 Rerunning RSEC with different parameters

In the next section, we will describe more about the options we could adjust in RSEC. As an example of a few options, we might, for example, want to change the proportion of co-clustering we required in making our makeConsensus clustering (which used the default of 0.7), or change the proportion of genes that must show differences in order to not merge clusters or the method of deciding. We can call RSEC again on our object rsecFluidigm, and it will not redo the many individual clustering steps which are time intensive (unless we request it to rerun it by including the argument rerunClusterMany=TRUE). We demonstrate this in our next command where we change these choices in the following ways:

  • set the proportion of co-clustering required by the argument consensusProportion=0.6,
  • make the merge cutoff mergeCutoff=0.01
  • decide to use a different method of estimating the proportion differential for merge by setting mergeMethod="Storey" instead of the default (“adjP”).
  • no longer adjust the minimum cluster size and use the default (consensusMinSize=5).

These are common parameters we might frequently want to tweak in RSEC.

rsecFluidigm<-RSEC(rsecFluidigm,isCount=TRUE,consensusProportion=0.6,mergeMethod="JC",mergeCutoff=0.05,stopOnErrors =TRUE)

Notice that we save the output over our original object. This is the standard way to work with the ClusterExperiment objects, since the package’s commands just continues to add the clusterings, without deleting anything from before. In this way, we do not duplicate the actual data in our workspace (which is often large).

We can compare the results of our changes with the whichClusters command to explicitly choose the clusterings we want to plot:

defaultMar<-par("mar")
plotCMar<-c(1.1,8.1,4.1,1.1)
par(mar=plotCMar)
plotClusters(rsecFluidigm,main="Clusters from RSEC", whichClusters=c("mergeClusters.1","makeConsensus.1","mergeClusters","makeConsensus"), colData=c("Biological_Condition","Published2"), axisLine=-1)

The clusterings with the .1 appended to the labels are the previous makeConsensus and mergeClusters clusterings from the default setting (see Rerunning to see how different versions are labeled and stored internally). We can see that we lost several clusters with these options.

In practice, it can be useful to interactively make choices about these parameters by rerunning each the individual steps of the workflow separately and visualizing the changes before moving to the next step, as we do below during our overview of the steps.

2.5 Finding Features related to the clusters

For this last section of the quickstart, we are for convenience going to reload the rsecFluidigm object so that we can go back to the original RSEC results (we could also rerun RSEC with the right parameters).

data(rsecFluidigm)

A common practice after determining a set of clusters is to perform differential gene expression analysis in order to find genes that show the greatest differences amongst the clusters. We would stress that this is purely an exploratory technique, and any p-values that result from this analysis are not valid, in the sense that they are likely to be inflated. This is because the same data was used to define the clusters and to perform differential expression analysis.

Since this is a common task, we provide the function getBestFeatures to perform various kinds of differential expression analysis between the clusters. A common F-statistic between groups can be chosen. However, we find that it is far more informative to do pairwise comparisons between clusters, or one cluster against all, in order to find genes that are specific to a particular cluster. An option for all of these choices is provided in the getBestFeatures function.

The getBestFeatures function uses the DE analysis provided by the limma package (Smyth 2004, @Ritchie:2015fa) or edgeR package (Robinson, Mccarthy, and Smyth 2010). In addition, the getBestFeatures function provides an option to do use the “voom” correction in the limma package (Law et al. 2014) to account for the mean-variance relationship that is common in count data. The tests performed by getBestFeatures are specific contrasts between clustering groups; these contrasts can be retrieved without performing the tests using clusterContrasts, including in a format appropriate for the MAST algorithm.

As mentioned above, there are several types of tests that can be performed to identify features that are different between the clusters which we describe in the section entitled Finding Features related to a Clustering. Here we simply perform all pairwise tests between the clusters.

pairsAllRSEC<-getBestFeatures(rsecFluidigm,contrastType="Pairs",p.value=0.05,
                          number=nrow(rsecFluidigm),DEMethod="edgeR")
head(pairsAllRSEC)
##   IndexInOriginal ContrastName InternalName  Contrast Feature      logFC
## 1            4281      m01-m02    Cl01-Cl02 Cl01-Cl02    PLK4 -12.442306
## 2            4752      m01-m02    Cl01-Cl02 Cl01-Cl02   RBM14 -13.410476
## 3            2142      m01-m02    Cl01-Cl02 Cl01-Cl02 GAPDHP1  -8.526106
## 4            5686      m01-m02    Cl01-Cl02 Cl01-Cl02   SSFA2 -11.504317
## 5            2307      m01-m02    Cl01-Cl02 Cl01-Cl02  GPR180 -11.123672
## 6            3558      m01-m02    Cl01-Cl02 Cl01-Cl02   MYO1B -11.940020
##     logCPM       LR      P.Value    adj.P.Val
## 1 7.735908 39.67752 2.995537e-10 1.545653e-07
## 2 5.816573 36.26129 1.725574e-09 6.814570e-07
## 3 4.634763 33.80920 6.079083e-09 2.017513e-06
## 4 6.711049 30.44903 3.427540e-08 8.158006e-06
## 5 6.478820 30.34149 3.622926e-08 8.508459e-06
## 6 5.796921 30.23114 3.835018e-08 8.947109e-06

2.5.1 Heatmaps of features

We can visualize only these significantly different pair-wise features with a heatmap using the plotHeatmap function.

First we need to identify the genes to plot for the heatmap. Our output pairsAllRSEC has a column IndexInOriginal that identifies the index of the genes in the original data matrix.

However, notice that the same genes can be found in different contrasts, meaning our list of significant genes will not be unique:

# We have duplicates of some genes:
any(duplicated(pairsAllRSEC$Feature))
## [1] TRUE

So some genes are significant for multiple of the tests. For our heatmap, we will descide to show only unique gene values so that they are not plotted multiple times in our heatmap. We will signify which genes to plot with the argument clusterFeaturesData (i.e. which data to use for the rows of the heatmap).

plotHeatmap(rsecFluidigm, whichClusters=c("makeConsensus","mergeClusters"),clusterSamplesData="dendrogramValue",
            clusterFeaturesData=unique(pairsAllRSEC[,"IndexInOriginal"]),
            main="Heatmap of features w/ significant pairwise differences",
            breaks=.99)

Notice that the samples clustered into the -1 cluster (i.e. not assigned) are clustered as an outgroup. This is a choice that is made when the dendrogram of the clusters was created (described below). These samples can also be mixed into the dendrogram at the time of creation (see makeDendrogram)

We can alternatively show a heatmap that will sort the genes by the different contrasts they correspond to using the plotContrastHeatmap function. The genes (rows) will be grouped by what contrast they are with. The option nBlankLines controls the space between the groups of genes from each contrast. We also give the argument whichCluster="primary" to indicate that the contrasts were created with the primary clustering (this means that the legend will put in the names of the clusters rather than their (internal) numeric id).

plotContrastHeatmap(rsecFluidigm, signif=pairsAllRSEC,nBlankLines=40,whichCluster="primary")
## Warning in sort(as.numeric(internalNames)): NAs introduced by coercion

3 Overview of the clustering workflow

We give an overview here of the steps used by the RSEC wrapper and more generally in the clusterExperiment package. The section The clustering workflow goes over these steps and the possible arguments in greater details.

The standard clustering workflow steps are the following:

These clustering steps are done in one function call by RSEC, described above, which is most straightforward usage. However, to understand the parameters available in RSEC it is useful to go through the steps individually. Furthermore RSEC has streamlined the workflow to concentrate on using the workflow with subsampling and sequential strategies, but going through the individual steps demonstrates how the user can make different choices.

Therefore in this section, we will go through these steps, but instead of using the parameters of RSEC that involve subsampling and are more computationally intensive, we will run through the same steps, only using just basic PAM clustering with no subsampling or sequential clustering. This is simply for the purpose of briefly understanding the intermediate steps that RSEC follows. Later sections will go through these steps in more detail and discuss the particular choices embedded in RSEC.

3.1 Step 1: Clustering with clusterMany

clusterMany lets the user quickly pick between many clustering options and run all of the clusterings in one single command. In the quick start we pick a simple set of clusterings based on varying the dimensionality reduction options. The way to designate which options to vary is to give multiple values to an argument.

Here is our call to clusterMany:

ce<-clusterMany(se, clusterFunction="pam",ks=5:10, minSizes=5,
      isCount=TRUE,reduceMethod=c("PCA","var"),nFilterDims=c(100,500,1000),
      nReducedDims=c(5,15,50),run=TRUE)
## 36 parameter combinations, 0 use sequential method, 0 use subsampling method
## Running Clustering on Parameter Combinations...
## done.

In this call to clusterMany we made the follow choices about what to vary:

  • set reduceMethod=c("PCA", "var") meaning run the clustering algorithm trying two different methods for dimensionality reduction: the top principal components and filtering to the top most variable genes
  • For PCA reduction, choose 5,15, and 50 principal components for the reduced data set (set nReducedDims=c(5,15,50))
  • For most variable genes reduction, we choose 100, 500, and 1000 most variable genes (set nFilterDims=c(100,500,1000))
  • For the number of clusters, vary from \(k=5,\ldots,10\) (set ks=5:10)

By giving only a single value to the relative argument, we keep the other possible options fixed, for example:

  • we used ‘pam’ for all clustering (clusterFunction="pam")
  • we set minSizes=5. This is an argument that allows the user to set a minimum required size and clusters of size less than that value will be ignored and samples assigned to them given the unassigned value of -1. The default of 1 would mean that this option is not used.

We also set isCount=TRUE to indicate that our input data are counts. This means that operations for clustering and visualizations will internally transform the data as \(log_2(x+1)\) (We could have alternatively explicitly set a transformation by giving a function to the transFun argument, for example if we wanted \(\sqrt(x)\) or \(log(x+\epsilon)\) or just identity).

We can visualize the resulting clusterings using the plotClusters command as we did for the RSEC results.

defaultMar<-par("mar")
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", colData=c("Biological_Condition","Published2"), axisLine=-1)

We can see that some clusters are fairly stable across different choices of dimensions while others can vary dramatically.

Notice that again some samples are white (i.e the value -1, meaning they were not clustered). This is from our choices to require at least 5 samples to make a cluster.

We have set whichClusters="workflow" to only plot clusters created from the workflow. Right now that’s all there are anyway, but as commands get rerun with different options, other clusterings can build up in the object (see discussion in this section about how multiple calls to workflow are stored). So setting whichClusters="workflow" means that we are only going to see our most recent calls to the workflow (so far we only have the 1 step, which is the clusterMany step). We seen already that whichClusters can be set to limit to specific clusterings or specific steps in the workflow .

Using whichClusters to reorder the clusterings in plotClusters We can also give to the whichClusters an vector argument corresponding to the indices of clusters stored in the ClusterExperiment object. This can allow us to show the clusters in a different order. Here we’ll pick an order which corresponds to varying the number of dimensions, rather than varying \(k\).

We first find the parameters for each clustering using the getClusterManyParams

cmParams<-getClusterManyParams(ce)
head(cmParams)
##                                                     clusteringIndex  k
## k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA                1  5
## k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA                2  6
## k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA                3  7
## k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA                4  8
## k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA                5  9
## k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA               6 10
##                                                     reduceMethod nReducedDims
## k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           PCA            5
## k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           PCA            5
## k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           PCA            5
## k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           PCA            5
## k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           PCA            5
## k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA          PCA            5
##                                                     nFilterDims
## k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           NA
## k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           NA
## k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           NA
## k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           NA
## k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA           NA
## k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA          NA

getClusterManyParams returns the parameter values, as well as the index of the corresponding clustering (i.e. what column it is in the matrix of clusterings). It is important to note that the index will change if we add additional clusterings, as we will do later, in which case we’d need to recall getClusterManyParams.

We will set an order to show first the PCA, ordered by number of dimensions, then the Var, ordered by number of diminsions

ord<-order(cmParams[,"reduceMethod"],cmParams[,"nReducedDims"])
ind<-cmParams[ord,"clusteringIndex"]
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters=ind, colData=c("Biological_Condition","Published2"), axisLine=-1)

We see that the order in which the clusters are given to plotClusters changes the plot greatly, since the order of each subsequent clustering depends on the order of the line above it.

The labels shown are those given automatically by clusterMany (and can be accessed with clusterLabels) but can be a bit much for plotting. We choose to remove “Features” as being too wordy:

clOrig<-clusterLabels(ce)
clOrig<-gsub("Features","",clOrig)
par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters=ind, clusterLabels=clOrig[ind], colData=c("Biological_Condition","Published2"), axisLine=-1)

We could also permanently assign new labels in our ClusterExperiment object if we prefer, for example to be more succinct, by changing the clusterLabels of the object (clusterLabels(ce)<- ...).

There are many different options for how to run plotClusters discussed in in the detailed section on plotClusters, but for now, this plot is good enough for a quick visualization.

3.2 Step 2: Find a consensus with makeConsensus

To find a consensus clustering across the many different clusterings created by clusterMany the function makeConsensus can be used next.

ce<-makeConsensus(ce,proportion=1)

The proportion argument indicates the minimum proportion of times samples should be with other samples in the cluster they are assigned to in order to be clustered together in the final assignment. Notice we get a warning that we did not specify any clusters to combine, so it is using the default – those from the clusterMany call.

If we look at the clusterMatrix of the returned ce object, we see that the new cluster from makeConsensus has been added to the existing clusterings. This is the basic strategy of all of these functions in this package. Any clustering function that is applied to an existing ClusterExperiment object adds the new clustering to the set of existing clusterings, so the user does not need to keep track of past clusterings and can easily compare what has changed.

We can again run plotClusters, which will now also show the result of makeConsensus. Instead, we’ll use plotClustersWorkflow which is nicer for looking specifically at the results of makeConsensus

head(clusterMatrix(ce)[,1:3])
##            makeConsensus k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA
## SRR1275356            -1                                                  4
## SRR1275251            -1                                                  5
## SRR1275287            -1                                                  1
## SRR1275364            -1                                                  4
## SRR1275269            -1                                                  1
## SRR1275263            -1                                                  5
##            k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA
## SRR1275356                                                  3
## SRR1275251                                                  5
## SRR1275287                                                  4
## SRR1275364                                                  3
## SRR1275269                                                  6
## SRR1275263                                                  5
par(mar=plotCMar)
plotClustersWorkflow(ce)

The choices of proportion=1 in makeConsensus is not usually a great choice, and certainly isn’t helpful here. The clustering from the default makeConsensus leaves most samples unassigned (white in the above plot). This is because we requires samples to be in the same cluster in every clustering in order to be assigned to a cluster together. This is quite stringent. We can vary this by setting the proportion argument to be lower. Explicit details on how makeConsensus makes these clusters are discussed in the section on makeConsensus.

So let’s label the one we found as “makeConsensus,1” and then create a new one. (Making or changing the label to an informative label will make it easier to keep track of this particular clustering later, particularly if we make multiple calls to the workflow).

wh<-which(clusterLabels(ce)=="makeConsensus")
if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"makeConsensus,1"

Now we’ll rerun makeConsensus with proportion=0.7. This time, we will give it an informative label upfront in our call to makeConsensus.

ce<-makeConsensus(ce,proportion=0.7,clusterLabel="makeConsensus,0.7")
par(mar=plotCMar)
plotClustersWorkflow(ce)

We see that more clusters are detected. Those that are still not assigned a cluster from makeConsensus clearly vary across the clusterings as to whether the samples are clustered together or not. Varying the proportion argument will adjust whether some of the unclustered samples get added to a cluster. There is also a minSize parameter for makeConsensus, with the default of minSize=5. We could reduce that requirement as well and more of the unclustered samples would be grouped into a cluster. Here, we reduce it to minSize=3 (we’ll call this “makeConsensus,final”). We’ll also choose to show all of the different makeConsensus results in our call to plotClustersWorkflow:

ce<-makeConsensus(ce,proportion=0.7,minSize=3,clusterLabel="makeConsensus,final")
par(mar=plotCMar)
plotClustersWorkflow(ce,whichClusters=c("makeConsensus,final","makeConsensus,0.7","makeConsensus,1"),main="Min. Size=3")

As we did before for RSEC results, we can also visualize the proportion of times these clusters were together across these clusterings (this information was made and stored in the ClusterExperiment object when we called makeConsensus provided that proportion argument is <1):

plotCoClustering(ce)

This visualization can help in determining whether to change the value of proportion (though see makeConsensus for how -1 assignments affect makeConsensus).

3.3 Step 3: Merge clusters together with makeDendrogram and mergeClusters

Once you start varying the parameters, is not uncommon in practice to create forty or more clusterings with clusterMany. In which case the results of makeConsensus can often result in too many small clusters. We might wonder if they are necessary or could be logically combined together. We could change the value of proportion in our call to makeConsensus. But we have found that it is often after looking at the clusters, for example with a heatmap, and how different they look on individual genes that we best make this determination, rather than the proportion of times they are together in different clustering routines.

For this reason, we often find the need for an additional clustering step that merges clusters together that are not different, based on running tests of differential expression between the clusters found in makeConsensus. This is done by the function mergeClusters. We often display and use both sets of clusters side-by-side (that from makeConsensus and that from mergeClusters).

mergeClusters needs a hierarchical clustering of the clusters in order to merge clusters; it then goes progressively up that hierarchy, deciding whether two adjacent clusters can be merged. The function makeDendrogram makes such a hierarchy between clusters (by applying hclust to the medoids of the clusters). Because the results of mergeClusters are so dependent on that hierarchy, we require the user to call makeDendrogram rather than calling it automatically internally. This is because different options in makeDendrogram can affect how the clusters are hierarchically ordered, and we want to encourage the user make these choices.

As an example, here we use the 500 most variable genes to make the cluster hierarchy (note we can make different choices here than we did in the clustering).

ce<-makeDendrogram(ce,reduceMethod="var",nDims=500)
plotDendrogram(ce)

Notice that the relative sizes of the clusters are shown as well.

We can see that clusters 1 and 3 are most closely related, at least in the top 500 most variable genes.

Notice I don’t need to make the dendrogram again, because it’s saved in ce. If we look at the summary of ce, it now has ‘makeDendrogram’ marked as ‘Yes’.

ce
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: makeConsensus 
## Primary cluster label: makeConsensus,final 
## Table of clusters (of primary clustering):
##  -1 c01 c02 c03 c04 c05 c06 
##   8  15  14   9   8   8   3 
## Total number of clusterings: 39 
## Dendrogram run on 'makeConsensus,final' (cluster index: 1)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? No

Now we are ready to actually merge clusters together. We run mergeClusters that will go up this hierarchy and compare the level of differential expression (DE) in each pair. In other words, if we focus on the left side of the tree, DE tests are run, between 1 and 3, and between 6 and 8. If there is not enough DE between each of these (based on a cutoff that can be set by the user), then clusters 1 and 3 and/or 6 and 8 will be merged. And so on up the tree.

If the dataset it not too large, is can be useful to first run mergeClusters without actually saving the results so as to preview what the final clustering will be (and perhaps to help in setting the cutoff).

mergeClusters(ce,mergeMethod="adjP",DEMethod="edgeR",plotInfo="mergeMethod")

Notice that unlike our RSEC calls, we have to explicitly choose the DE method that is used in our call to mergeClusters. RSEC chooses the method by default based on the value of isCount argument (but the user can set it in RSEC with mergeDEMethod argument). Since our data is counts, we choose the DE method to be edgeR (which is also what RSEC chooses by default since we set isCount=TRUE).

The default cutoff is cutoff=0.1, meaning those nodes with less than 10% of genes estimated to be differentially expressed between its two children groupings of samples are merged. This is pretty stringent, and as we see it results in no clusterings being kept.

However, the plot tells us the estimate of that proportion for each node. We can decide on a better cutoff using that information. We choose instead cutoff=0.01:

ce<-mergeClusters(ce,mergeMethod="adjP",DEMethod="edgeR",cutoff=0.05)

Notice that now the plot has given the estimates from all of the methods (the default set by plotInfo argument), not just the adjP method. But the dotted lines of the dendrogram indicate the merging performed by the actual choices in merging made in the command (mergeMethod="adjP" and cutoff=0.01).

It can be interesting to visualize the clusterings both with the plotClustersWorkflow and the co-Proportion plot (a heatmap of the proportion of times the samples co-clustered):

par(mar=plotCMar)
plotClustersWorkflow(ce,whichClusters="workflow") 

plotCoClustering(ce,whichClusters=c("mergeClusters","makeConsensus"),
                 colData=c("Biological_Condition","Published2"),annLegend=FALSE)

Notice that mergeClusters combines clusters based on the actual values of the features, while the coClustering plot shows how often the samples clustered together. It is not uncommon that mergeClusters will merge clusters that don’t look “close” on the coClustering plot. This can be due to just the choices of the hierarchical clustering between the clusters, but can also be because the two merged clusters are not often confused for each other across the clustering algorithms, yet still don’t have strong differences on individual genes. This can be the case especially when the clustering is done on reduced PCA space, where an accumulation of small differences might consistently separate the samples (so that comparatively few clusterings are “confused” as to the samples). But because the differences are not strong on individual genes, mergeClusters combines them. These are ultimately different criteria.

Finally, we can do a heatmap visualizing this final step of clustering.

plotHeatmap(ce,clusterSamplesData="dendrogramValue",breaks=.99,
            colData=c("Biological_Condition", "Published1", "Published2"))

By choosing “dendrogramValue” for the clustering of the samples, we will be showing the clusters according to the hierarchical ordering of the clusters found by makeDendrogram. The argument breaks=0.99 means that the last color of the heatmap spectrum will be forced to be the top 1% of the data (rather than evenly spaced through the entire range of values). This can be helpful in making sure that rare extreme values in the upper range do not absorb too much space in the color spectrum. There are many more options for plotHeatmap, some of which are discussed in the section on plotHeatmap.

3.4 RSEC

The above explanation follows the simple example of PAM. The original RSEC result called RSEC which calls these steps internally. Many of the options described above can be set through a call to RSEC, but some are restricted for simplicity. A detail explanation of the differences can be found in the section RSEC below. But briefly, the following RSEC command, which uses most of the arguments of RSEC:

rsecOut<-RSEC(se, isCount=TRUE, reduceMethod="PCA", nReducedDims=c(50,10), k0s=4:15,
        alphas=c(0.1,0.2,0.3),betas=c(0.8,0.9), minSizes=c(1,5), clusterFunction="hierarchical01",
        consensusProportion=0.7, consensusMinSize=5,
        dendroReduce="mad",dendroNDims=500,
        mergeMethod="adjP",mergeCutoff=0.05,
        )

would be equivalent to the following individual steps:

ce<-clusterMany(se,ks=4:15,alphas=c(0.1,0.2,0.3),betas=c(0.8,0.9),minSizes=c(1,5),
    clusterFunction="hierarchical01", sequential=TRUE,subsample=TRUE,
         reduceMethod="PCA",nFilterDims=c(50,10),isCount=TRUE)
ce<-makeConsensus(ce, proportion=0.7, minSize=5)
ce<-makeDendrogram(ce,reduceMethod="mad",nDims=500)
ce<-mergeClusters(ce,mergeMethod="adjP",DEMethod="edgeR",cutoff=0.05,plot=FALSE)

Note that this mean the RSEC function always calls sequential and subsampling.

4 ClusterExperiment Objects

The ce object that we created by calling clusterMany is a ClusterExperiment object. The ClusterExperiment class is used by this package to keep the data and the clusterings together. It inherits from SingleCellExperiment class (which inherits from SummarizedExperiment) which means the data and colData and other information orginally in the fluidigm object are retained and can be accessed with the same functions as before. The ClusterExperiment object additionally stores clusterings and information about the clusterings along side the data. This helps keep everything together, and like the original SummarizedExperiment object we started with, allows for simple things like subsetting to a reduced set of subjects and being confident that the corresponding clusterings, colData, and so forth are similarly subset.

Typing the name at the control prompt results in a quick summary of the object.

ce
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##  -1 m01 m02 m03 m04 
##   8  27  14   8   8 
## Total number of clusterings: 40 
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? Yes

This summary tells us the total number of clusterings (40), and gives some indication as to what parts of the standard workflow have been completed and stored in this object. It also gives information regarding the primaryCluster of the object. The primaryCluster is just one of the clusterings that has been chosen to be the “primary” clustering, meaning that by default various functions will turn to this clustering as the desired clustering to use. clusterMany arbitrarily sets the ‘primaryCluster’ to the first one, and each later step of the workflow sets the primary index to the most recent, but the user can set a specific clustering to be the primaryCluster with primaryClusterIndex. Often, if a function is not given a specific clustering (usually via an option whichCluster or whichClusters) the “primary” cluster is taken by default.

There are also additional commands to access the clusterings and their related information (type help("ClusterExperiment-methods") for more).

The cluster assignments are stored in the clusterMatrix slot of ce, with samples on the rows and different clusterings on the columns. We saw that we can look at the cluster matrix and the primary cluster with the commands clusterMatrix and primaryCluster

head(clusterMatrix(ce))[,1:5]
##            mergeClusters makeConsensus,final makeConsensus,0.7 makeConsensus,1
## SRR1275356            -1                  -1                -1              -1
## SRR1275251            -1                  -1                -1              -1
## SRR1275287             1                   6                -1              -1
## SRR1275364             4                   5                 5              -1
## SRR1275269            -1                  -1                -1              -1
## SRR1275263            -1                  -1                -1              -1
##            k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA
## SRR1275356                                                  4
## SRR1275251                                                  5
## SRR1275287                                                  1
## SRR1275364                                                  4
## SRR1275269                                                  1
## SRR1275263                                                  5
primaryCluster(ce)
##  [1] -1 -1  1  4 -1 -1  1  1  1  1  2  2  4  2  1  3  1  1  1  1  2  2  2  1 -1
## [26]  1  1  1  3  2  2  4  3  1  1 -1  1  2 -1  3  1  1  1  4  1  1  4 -1  1  2
## [51]  2  2  1  2  1  3  4  1  1  4  3  3  2  4  3

Remember that we made multiple calls to makeConsensus: only the last such call will be shown when we use whichClusters="workflow" in our plotting (see this section for a discussion of how these repeated calls are handled.)

clusterLabels gives the column names of the clusterMatrix; clusterMany has given column names based on the parameter choices, and later steps in the workflow also give a name (or allow the user to set them).

head(clusterLabels(ce),10)
##  [1] "mergeClusters"                                      
##  [2] "makeConsensus,final"                                
##  [3] "makeConsensus,0.7"                                  
##  [4] "makeConsensus,1"                                    
##  [5] "k=5,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA" 
##  [6] "k=6,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA" 
##  [7] "k=7,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA" 
##  [8] "k=8,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA" 
##  [9] "k=9,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA" 
## [10] "k=10,reduceMethod=PCA,nReducedDims=5,nFilterDims=NA"

As we’ve seen, the user can also change these labels.

clusterTypes on the other hand indicates what call made the clustering. Unlike the labels, it is wise to not change the values of clusterTypes unless you are sure of what you are doing because these values are used to identify clusterings from different steps of the workflow.

head(clusterTypes(ce),10)
##  [1] "mergeClusters"   "makeConsensus"   "makeConsensus.2" "makeConsensus.1"
##  [5] "clusterMany"     "clusterMany"     "clusterMany"     "clusterMany"    
##  [9] "clusterMany"     "clusterMany"

The information that was in the original fluidigm object has also been preserved, like colData that contains information on each sample.

colData(ce)[,1:5]
## DataFrame with 65 rows and 5 columns
##               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER
##            <numeric> <numeric> <numeric> <numeric> <numeric>
## SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638
## SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827
## SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685
## SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284
## SRR1275269  10729700   7806230   72.7536   50.4285 0.0204349
## ...              ...       ...       ...       ...       ...
## SRR1275259   5949930   4181040   70.2705   52.5975 0.0205253
## SRR1275253  10319900   7458710   72.2747   54.9637 0.0205342
## SRR1275285   5300270   4276650   80.6873   41.6394 0.0227383
## SRR1275366   7701320   6373600   82.7600   68.9431 0.0266275
## SRR1275261  13425000   9554960   71.1727   62.0001 0.0200522

Another important slot in the ClusterExperiment object is the clusterLegend slot. This consists of a list, one element per column (or clustering) of clusterMatrix, that gives colors and names to each cluster within a clustering.

length(clusterLegend(ce))
## [1] 40
clusterLegend(ce)[1:2]
## $mergeClusters
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "m01"
## [3,] "2"        "#33A02C" "m02"
## [4,] "3"        "#FF7F00" "m03"
## [5,] "4"        "#6A3D9A" "m04"
## 
## $`makeConsensus,final`
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "c01"
## [3,] "2"        "#33A02C" "c02"
## [4,] "3"        "#B15928" "c03"
## [5,] "4"        "#FF7F00" "c04"
## [6,] "5"        "#6A3D9A" "c05"
## [7,] "6"        "#A6CEE3" "c06"

We can see that each element of clusterLegend consists of a matrix, with number of rows equal to the number of clusters in the clustering. The columns store information about that cluster. clusterIds is the internal id (integer) used in clusterMatrix to identify the cluster, name is a name for the cluster, and color is a color for that cluster. color is used in plotting and visualizing the clusters, and name is an arbitrary character string for a cluster. They are automatically given default values when the ClusterExperiment object is created, but we will see under the description of visualization methods how the user might want to manipulate these for better plotting results.

We can assign new values with a simple assignment operator, but we also provide the functions renameClusters and recolorClusters to help do this. Here we change the internal cluster names of the first clustering from lowercase to uppercase “M” using the function renameClusters:

newName<-gsub("m","M",clusterLegend(ce)[[1]][,"name"])
renameClusters(ce,whichCluster=1,value=newName)
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##  -1 M01 M02 M03 M04 
##   8  27  14   8   8 
## Total number of clusterings: 40 
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? Yes
print(ce)
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##  -1 m01 m02 m03 m04 
##   8  27  14   8   8 
## Total number of clusterings: 40 
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? Yes

Note that if you choose to not use these functions and instead replace the whole matrix (e.g. clusterLegend(ce)[[1]]<- ...) you should be careful not assign new values to clusterIds column, as the entries must exactly correspond to the internal ids of the clustering stored in the clustering matrix.

4.1 Subsetting ClusterExperiment objects

Like SummarizedExperiment or SingleCellExperiment classes, standard subsetting of a ClusterExeriment object will result in a new ClusterExperiment object with all of the relevant parts of the data similarly subsetted.

smallCe<-ce[1:5,1:10]
smallCe
## class: ClusterExperiment 
## dim: 5 10 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##  -1 m01 m04 
##   4   5   1 
## Total number of clusterings: 40 
## No dendrogram present
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? No 
## mergeClusters run? Yes

Notice from looking at the clusterMatrix below that the clustering results have been subset and that after subsetting, the internal cluster ids may change (because they are required to be consecutive).

clusterMatrix(smallCe)[,1:4]
##            mergeClusters makeConsensus,final makeConsensus,0.7 makeConsensus,1
## SRR1275356            -1                  -1                -1              -1
## SRR1275251            -1                  -1                -1              -1
## SRR1275287             1                   4                -1              -1
## SRR1275364             2                   3                 3              -1
## SRR1275269            -1                  -1                -1              -1
## SRR1275263            -1                  -1                -1              -1
## SRR1275338             1                   2                 2              -1
## SRR1274117             1                   1                 1               1
## SRR1274089             1                   1                 1               1
## SRR1274125             1                   1                 1               1
clusterMatrix(ce)[1:10,1:4]
##            mergeClusters makeConsensus,final makeConsensus,0.7 makeConsensus,1
## SRR1275356            -1                  -1                -1              -1
## SRR1275251            -1                  -1                -1              -1
## SRR1275287             1                   6                -1              -1
## SRR1275364             4                   5                 5              -1
## SRR1275269            -1                  -1                -1              -1
## SRR1275263            -1                  -1                -1              -1
## SRR1275338             1                   3                 3              -1
## SRR1274117             1                   1                 1               1
## SRR1274089             1                   1                 1               1
## SRR1274125             1                   1                 1               1

However, the names (and colors) of each cluster should stay the same, which we can see by looking at the clusterLegend information

clusterLegend(smallCe)[["mergeClusters"]]
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "m01"
## [3,] "2"        "#6A3D9A" "m04"
clusterLegend(ce)[["mergeClusters"]]
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "m01"
## [3,] "2"        "#33A02C" "m02"
## [4,] "3"        "#FF7F00" "m03"
## [5,] "4"        "#6A3D9A" "m04"

However subsetting will lose some saved information. In particular, the hierarchy of the clusters that you created with makeDendrogram will be deleted in the new object, as will any saved information about the merging in the mergeClusters step (since that depended on the dendrogram which is now gone).

nodeMergeInfo(ce)
##    NodeId                  Contrast isMerged mergeClusterId    Storey        PC
## 1 NodeId1 (X2+X4+X5)/3-(X1+X3+X6)/3    FALSE             NA 0.4335833 0.3619738
## 2 NodeId2              X2-(X4+X5)/2    FALSE             NA 0.3526666 0.3043091
## 3 NodeId3              X1-(X3+X6)/2     TRUE              1 0.3535153 0.2884220
## 4 NodeId4                     X3-X6     TRUE             NA 0.2601500 0.1922908
## 5 NodeId5                     X4-X5    FALSE             NA 0.2991937 0.2486148
##         adjP locfdr JC
## 1 0.09308247     NA NA
## 2 0.07879474     NA NA
## 3 0.04894610     NA NA
## 4 0.01556090     NA NA
## 5 0.05672655     NA NA
nodeMergeInfo(smallCe)
## NULL

The actual clustering created in the mergeClusters step, however, are retained as we’ve seen above.

Another useful type of subsetting can be to subset to only samples contained in a set of particular clusters within a clustering. This can be useful, for example, if you want to visualize the data in only those clusters. The function subsetByCluster allows you to do this, and it returns a new ClusterExperiment object with only those samples. The required input is to identify the values of the clusters you want to keep (by default matching to the clusters’ names)

subCe<-subsetByCluster(ce,whichCluster="mergeClusters",clusterValue=c("m1","m2"))
subCe
## class: ClusterExperiment 
## dim: 7069 0 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):< table of extent 0 >
## Total number of clusterings: 40 
## No dendrogram present
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? No 
## mergeClusters run? Yes

The object subCe now can be used for visualizing, or any other analysis.

This kind of subsetting can also be useful in comparing clusterings, where the user might want to subset to all of the samples assigned to Cluster 1 in one clustering and then see what clusters that corresponds to in the other clusterings.

4.2 Samples not assigned to a cluster (Negative Valued Cluster Assignments)

The different clusters are stored as consecutive integers, with ‘-1’ and ‘-2’ having special meaning.

Unassigned Samples (-1) ‘-1’ refers to samples that were not clustered by the clustering algorithm. In our example, we removed clusters that didn’t meet specific size criterion, so they were assigned ‘-1’.

Missing from Clustering Run (-2) ‘-2’ is for samples that were not included in the original input to the clustering. This is useful if, for example, you cluster on a subset of the samples, and then want to store this clustering with the clusterings done on all the data. You can create a vector of clusterings that give ‘-2’ to the samples not originally used and then add these clusterings to the ce object manually with addClusters.

We can also wish to go back and assign these samples to the best cluster possible. This can be done with the assignUnassigned function. This function will assign the samples with negative-valued cluster ids to the cluster to which the sample is the closest. Closest is determined by the euclidean distance between an unassigned sample and the median value of the samples assigned to the cluster. The data used by the function to determine the euclidean distances and medians of clusters can be determine by arguments like we see in RSEC and other functions.

ce<-assignUnassigned(ce,whichCluster="mergeClusters",reduceMethod="PCA",nDim=50,makePrimary=FALSE, filterIgnoresUnassigned=TRUE,clusterLabel="mergeClusters_AssignToCluster")
tableClusters(ce,whichCluster="mergeClusters_AssignToCluster")
## mergeClusters_AssignToCluster
##  1  2  3  4 
## 30 15 10 10

Note that we chose makePrimary=FALSE (not the default) so that our original mergeClusters clustering remains the primary one, and doesn’t affect our future calls.

plotClusters(ce,whichCluster=c("mergeClusters_AssignToCluster","mergeClusters"))

You can also create a new object where all of the samples that are not assigned are removed with the removeUnassigned function. This is just a special case of subsetByCluster (above) for the special case of subsetting down to those samples assigned to any cluster.

4.3 Dimensionality reduction and SingleCellExperiment Class

There are two varieties of dimensionality reduction supported in clusterExperiment package.

  1. reducing to a subset of features/genes based on the values of a filter statistic calculated for each gene or
  2. creation of a smaller number of new features that are functions of the original features, i.e. not a simple selection of existing variables, but rather create new variables to represent the data

For simplicity, we’ll refer to the first as filtering of the data and second as a dimensionality reduction. This is because in the first case, the reduced data set can be quickly recreated by subseting the original data so long as the per-gene statistics have been saved. This means only a single vector of the length of the number of genes needs to be stored for the first type of dimensionality reduction (filtering) while the second kind requires saving a matrix with a value for each observation for each new variable.

ClusterExperiment inherits from the standard Bioconductor SingleCellExperiment class. Briefly, the SingleCellExperiment class extends the SummarizedExperiment class to give a structure for saving the reduced matrices from dimensionality reductions as we described above. They are saved in the slot reducedDims, which is a SimpleList of datasets that have the same number of observations as the original data in the assay slot, but reduced features (?SingleCellExperiment). This gives a unified way to save the results of applying a dimensionality reduction method of the second type; the package also gives helper functions to access them, etc. Multiple such dimensionality reductions can be stored since it is a list, and the user gives them names, e.g. “PCA” or “tSNE”.

ClusterExperiment uses the slot reducedDims to both save the results of dimensionality reductions if they are calculated and and also to reuse them if the necessary ones have already been created. This allows clusterExperiment to make use of any dimensionality reduction method so long as the user saves it in the appropriate slot in a SingleCellExperiment object. The user can also choose like before to have the function (like clusterMany) do the dimensionality reduction (i.e. PCA) internally. The difference is that now the results of the PCA will be stored in the appropriate slot so that they will not need to be recalculated in the future.

We also added in clusterExperiment package a similar procedure for storing the filtering statistics (i.e. statistics calculated on each gene). An example is the the variance across observations, calculated for every gene. clusterExperiment when calculating statistics (like var or mad) will add the per-gene value of the statistic as a column of the rowData of the ClusterExperiment object. Similarly, if the user has already calculated a per-gene statistic and saved it as a column in the rowData slot, this user-defined statistic can be used for filtering. This means that the user is not limited to the built-in filtering functions provided in clusterExperiment.

The functions makeReducedDims and makeFilterStats calculate the dimensionality reduction and filtering statistics, respectively, provided by clusterExperiment and store them in the appropriate slot. To see the current list of built-in functions:

listBuiltInReducedDims()
## [1] "PCA"
listBuiltInFilterStats()
## [1] "var"    "abscv"  "mad"    "mean"   "iqr"    "median"

5 Visualizing the data

In the quick start (above)[#visualsummary] we ran through many plots available for visualizing the data in conjunction with the clusterings. In this section we do not go through all of these plots, but just highlight the details of some plots which are more complicated.

5.1 Cluster Alignment plot with plotClusters

We demonstrated during the quick start that we can visualize the alignment of multiple clusterings via a Cluster Alignment plot implemented in plotClusters or plotClustersWorkflow command. Since plotClustersWorkflow calls the plotClusters command and passes the arguments of plotClusters onward, we will focus mainly on the main plotClusters command, with the understanding that most of these arguments work for both.

Here is our basic call to plotClusters:

par(mar=plotCMar)
plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", 
             axisLine=-1)

We have seen that we can get very different plots depending on how we order the clusterings, and what clusterings are included. The argument whichClusters allows the user to choose different clusterings or provide an explicit ordering of the clusterings. For plotClustersWorkflow, whichClusters indicates the clusterings that are “highlighted” by being drawn separately from the results of clusterMany.

whichClusters can take either a single character value, or a vector of either characters or indices. If whichClusters matches either “all” or “workflow”, then the clusterings chosen are either all, or only those from the most recent calls to the workflow functions. Choosing “workflow” removes from the visualization both user-defined clusterings and also previous calls to the workflow that have since been rerun. Setting whichClusters="workflow" can be a useful if you have called a method like makeConsensus several times, as we did, only with different parameters. All of those runs are saved (unless eraseOld=TRUE), but you may not want to plot them.

If whichClusters is a character that is not one of these designated values, the entries should match a “clusterType” value (like “clusterMany”) or a “clusterLabel” value (with exact matching). Alternatively, the user can specify numeric indices corresponding to the columns of clusterMatrix that provide the order of the clusters.

par(mar=plotCMar)
plotClusters(ce,whichClusters="clusterMany",
               main="Only Clusters from clusterMany",axisLine=-1)

We can also add to our plot (categorical) information on each of our subjects from the colData of our SummarizedExperiment object (which is also retained in our ClusterExperiment object). This can be helpful to see if the clusters correspond to other features of the samples, such as sample batches. Here we add the values from the columns “Biological_Condition” and “Published2” that were present in the fluidigm object and given with the published data.

par(mar=plotCMar)
plotClusters(ce,whichClusters="workflow", colData=c("Biological_Condition","Published2"), 
               main="Workflow clusters plus other data",axisLine=-1)

This options of plotting the data in the colData, however, is not currently available with plotClustersWorkflow which only plots clusterings.

5.1.1 Manipulations of colors

In this section we will talk about a number of related ways to manipulate the colors assigned to clusters in conjunction with the plotClusters command.

We will turn off the plotting of the cluster labels to make the plot less cluttered using the option plot=FALSE.

5.1.1.1 Saving Assignment of colors from plotClusters

plotClusters invisibly returns a ClusterExperiment object. In our earlier calls to plotCluster, this would be the same as the input object and so there is no reason to save it. However, the alignment and color assignments created by plotClusters can be requested to be saved into the appropriate slots of the ClusterExperiment object in order to save the color and alignments of samples. This is done via the resetNames, resetColors and resetOrderSamples arguments. If any of these are set to TRUE, then the object returned will be different than those of the input.

Specifically, if resetColors=TRUE the colorLegend of the returned object will be changed so that the colors assigned to each cluster will be as were shown in the plot (and indeed this is done automatically by mergeClusters so that the makeConsensus and mergeClusters steps will have aligned color assignments). Similarly, if resetNames=TRUE the names of the clusters will be changed to be integer values, but now those integers will be aligned to try to be the same across clusters (and therefore will not be consecutive integers, which is why these are saved as names for the clusters and not the internal clusterIds). If resetOrderSamples=TRUE, then the order of the samples shown in the plot will be similarly saved in the slot orderSamples.

As an example, we will make a call to plotClusters, but now ask to reset everything to match this clusterAlignment.

First let’s look at what the object’s default colors are for the first two clusterings, accessed by clusterLegend function:

clusterLegend(ce)[c("mergeClusters","makeConsensus,final")]
## $mergeClusters
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "m01"
## [3,] "2"        "#33A02C" "m02"
## [4,] "3"        "#FF7F00" "m03"
## [5,] "4"        "#6A3D9A" "m04"
## 
## $`makeConsensus,final`
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "c01"
## [3,] "2"        "#33A02C" "c02"
## [4,] "3"        "#B15928" "c03"
## [5,] "4"        "#FF7F00" "c04"
## [6,] "5"        "#6A3D9A" "c05"
## [7,] "6"        "#A6CEE3" "c06"

The plotClusterLegend creates a quick legend plot of this information for a single clustering:

plotClusterLegend(ce,whichCluster="makeConsensus,final")

Now, we’ll run plotClusters and save the new colors. We’ll save this as a different object so that this is not a permanent change for the rest of the vignette, though in practice we would usually overwrite the existing ce to save memory on our computer and not have many versions floating around.

ce_temp<-plotClusters(ce,whichClusters="workflow", colData=c("Biological_Condition","Published2"), 
               main="Cluster Alignment of Workflow Clusters",clusterLabels=FALSE, axisLine=-1, resetNames=TRUE,resetColors=TRUE,resetOrderSamples=TRUE)

Now, the clusterLegend slot of the object no longer has the default color/name assignments, but it has names and colors that match across the clusters. Notice, that this means the prefix “m” or “c” that was previously given to distinguish the makeConsensus result from the mergeClusters result is now gone (the user could manually add them by changing the values in the clusterLegend). Instead, there are names that “match up” the clusters across the different clusterings.

clusterLegend(ce_temp)[c("mergeClusters","makeConsensus,final")]
## $mergeClusters
##      clusterIds color     name
## [1,] "-1"       "white"   "-1"
## [2,] "1"        "#E31A1C" "1" 
## [3,] "2"        "#1F78B4" "2" 
## [4,] "3"        "#33A02C" "3" 
## [5,] "4"        "#FF7F00" "4" 
## 
## $`makeConsensus,final`
##      clusterIds color     name
## [1,] "-1"       "white"   "-1"
## [2,] "1"        "#E31A1C" "1" 
## [3,] "2"        "#1F78B4" "2" 
## [4,] "3"        "#B15928" "6" 
## [5,] "4"        "#33A02C" "3" 
## [6,] "5"        "#FF7F00" "4" 
## [7,] "6"        "#A6CEE3" "7"

5.1.1.2 Manual Assignment of colors

A similar setting can be that we have colors we want to manually set for a particular cluster, but we want to have the other clusterings get aligned to the colors of that clustering. We can use plotClusters to assign colors to the other clusterings so that the other clusterings inherit the colors of the cluster of interest.

Let’s manually set the colors for the “mergeClusters” clustering. We’ll again create a (new) ce_temp object so again we don’t overwrite the previous colors for the rest of the vignette. Again, the color information is accessed with the clusterLegend command:

ce_temp<-ce
clusterLegend(ce_temp)[["mergeClusters"]]
##      clusterIds color     name 
## [1,] "-1"       "white"   "-1" 
## [2,] "1"        "#1F78B4" "m01"
## [3,] "2"        "#33A02C" "m02"
## [4,] "3"        "#FF7F00" "m03"
## [5,] "4"        "#6A3D9A" "m04"

We will just assign new colors to the color column with the recolorClusters function. We can also give them new names too with renameClusters.

newColors<-c("white","blue","green","cyan","purple")
newNames<-c("Not assigned","Cluster1","Cluster2","Cluster3","Cluster4")
#note we make sure they are assigned to the right cluster Ids by giving the 
#clusterIds as names to the new vectors
names(newColors)<-names(newNames)<-clusterLegend(ce_temp)[["mergeClusters"]][,"name"]
renameClusters(ce_temp,whichCluster="mergeClusters",value=newNames)
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##     Cluster1     Cluster2     Cluster3     Cluster4 Not assigned 
##           27           14            8            8            8 
## Total number of clusterings: 41 
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? Yes
recolorClusters(ce_temp,whichCluster="mergeClusters",value=newColors)
## class: ClusterExperiment 
## dim: 7069 65 
## reducedDimNames: PCA 
## filterStats: var var_makeConsensus.final 
## -----------
## Primary cluster type: mergeClusters 
## Primary cluster label: mergeClusters 
## Table of clusters (of primary clustering):
##  -1 m01 m02 m03 m04 
##   8  27  14   8   8 
## Total number of clusterings: 41 
## Dendrogram run on 'makeConsensus,final' (cluster index: 2)
## -----------
## Workflow progress:
## clusterMany run? Yes 
## makeConsensus run? Yes 
## makeDendrogram run? Yes 
## mergeClusters run? Yes

We use the plotClusterLegend to check that we assigned them as we expected:

plotClusterLegend(ce_temp,whichCluster="mergeClusters")

Now we will run the plotClusters alignment plot, but we will direct the alignment to use the cluster colors we just gave for the “mergeClusters” cluster. We do this by using the argument existingColors="firstOnly" – but as the argument option indicates it only works if the “mergeClusters” clustering is the first clustering in the plot.

plotClusters(ce_temp,whichClusters="workflow", colData=c("Biological_Condition","Published2"), clusterLabels=FALSE,
               main="Clusters from clusterMany, different order",axisLine=-1,existingColors="firstOnly")

This just created the visualization. We can also save the results of this as we did before with resetColors=TRUE, so that now all of our future clusters make use of this color information. We won’t reset the names, however. Note we can avoid making the plot again with the argument plot=FALSE.

ce_temp<-plotClusters(ce_temp,whichClusters="workflow", colData=c("Biological_Condition","Published2"), resetColors=TRUE,
               main="Clusters from clusterMany, different order",axisLine=-1, clusterLabels=FALSE, existingColors="firstOnly",plot=FALSE)

5.1.1.3 Using only the assigned colors

Once we start manually changing the colors, we will sometimes want to align our clusters, but use the existing cluster colors that are saved in the object. We can force plotClusters to use all the existing color assignments, rather than create its own, with the argument existingColors="all". This makes particular sense if you want to have continuity between plots – i.e. be sure that a particular cluster always has a certain color – but would like to do different variations of plotClusters to get a sense of how similar the clusters are.

For example, we set the colors above based on the cluster alignment produced in the above plotClusters where the clusterings were ordered according to the workflow and making use of the colors we manually assigned to “mergeClusters”. But now we want to plot only the clusters from clusterMany, yet keep the same colors that we just saved so we can compare them. We do this by setting the argument existingColors="all", meaning use all of the existing colors.

par(mfrow=c(1,2))
plotClusters(ce_temp, colData=c("Biological_Condition","Published2"),
             whichClusters="workflow", existingColors="all", clusterLabels=FALSE,
             main="All Workflow Clusters, use existing colors",axisLine=-1)

plotClusters(ce_temp, colData=c("Biological_Condition","Published2"), 
             existingColors="all", whichClusters="clusterMany", clusterLabels=FALSE,
             main="clusterMany Clusters, use existing colors",
             axisLine=-1)

We see that while the order of the samples has changed (because I have a different set of clusters to align) the colors assigned to each cluster have stayed the same, so I can more easily compare the plots.

Note that the use of existingColors="firstOnly" and existingColors="all" will not give the same color assignments, even if the colors have been previously aligned to be the same unless its the exact same set of clusterings in the same order. This is because with each set of ordered clusterings, the realignment between clusters changes and so the assignment of colors changes. Here we will make a ClusterAlignment plot for each of these three options of the argument to demonstrate the difference.

par(mfrow=c(2,2))
plotClusters(ce_temp, colData=c("Biological_Condition","Published2"), 
             existingColors="all", whichClusters="clusterMany", clusterLabels=FALSE,
             main="clusterMany Clusters, use existing colors",
             axisLine=-1)
plotClusters(ce_temp, colData=c("Biological_Condition","Published2"), 
               existingColors="firstOnly", whichClusters="clusterMany", clusterLabels=FALSE,
               main="clusterMany Clusters, use existing of first row only",
               axisLine=-1)
plotClusters(ce_temp, colData=c("Biological_Condition","Published2"), 
            existingColors="ignore", whichClusters="clusterMany", clusterLabels=FALSE,
            main="clusterMany Clusters, default\n(ignoring assigned colors)",
            axisLine=-1)

5.1.1.4 Choosing a different set of colors

plotClusters uses the set of colors defined in the variable massivePalette. This is a set of 484 colors. plotClusters draws from this set of colors sequentially as it goes down the clusterings each time it needs a new color. This is obviously a very large list of colors; the reason for such a long list is that plotClusters will error out if there are not enough colors, so we want to have a large list. If we are running RSEC with many clusterings, and each clustering has many clusters, you can quickly run through many. The first 56 colors are a smaller subset of colors saved as a variable bigPalette, and these have been hand-chosen so that the colors are vibrant and not too similar to each other; the order has also been chosen so that more similar colors are not next to each other. This are the colors that are most frequently seen. massivePalette consists of these colors, plus all of the non-grey colors in colors() that are not already contained in bigPalette

We can examine the colors in bigPalette with the command showPalette:

showPalette()