AbstractSymptomatic heterogeneity in complex diseases reveals differences in molecular states that need to be investigated. However, selecting the numerous parameters of an exploratory clustering analysis in RNA profiling studies requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent and further gene association analyses need to be performed independently. We have developed a suite of tools to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with four datasets characterised by different expression signal strengths. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Even in datasets with less clear biological distinctions, stable subgroups with different expression profiles and clinical associations were found.
Loading the library to access the functions and the two toy datasets: gene expressions and cluster memberships.
## R version 4.2.1 (2022-06-23) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 20.04.5 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so ## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so ## ## locale: ##  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ##  LC_TIME=en_GB LC_COLLATE=C ##  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ##  LC_PAPER=en_US.UTF-8 LC_NAME=C ##  LC_ADDRESS=C LC_TELEPHONE=C ##  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ##  stats graphics grDevices utils datasets methods base ## ## other attached packages: ##  omada_1.0.0 dplyr_1.0.10 glmnet_4.1-4 Matrix_1.5-1 ##  clValid_0.7 cluster_2.1.4 clusterCrit_1.2.8 reshape_0.8.9 ##  ggplot2_3.3.6 diceR_1.2.2 Rcpp_1.0.9 fpc_2.2-9 ##  kernlab_0.9-31 pdfCluster_1.0-3 ## ## loaded via a namespace (and not attached): ##  mclust_6.0.0 lattice_0.20-45 class_7.3-20 assertthat_0.2.1 ##  digest_0.6.30 foreach_1.5.2 utf8_1.2.2 R6_2.5.1 ##  plyr_1.8.7 magic_1.6-0 stats4_4.2.1 evaluate_0.17 ##  pillar_1.8.1 rlang_1.0.6 diptest_0.76-0 jquerylib_0.1.4 ##  rmarkdown_2.17 splines_4.2.1 stringr_1.4.1 munsell_0.5.0 ##  compiler_4.2.1 xfun_0.34 pkgconfig_2.0.3 shape_1.4.6 ##  htmltools_0.5.3 nnet_7.3-18 tidyselect_1.2.0 tibble_3.1.8 ##  codetools_0.2-18 fansi_1.0.3 withr_2.5.0 MASS_7.3-58.1 ##  grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 ##  DBI_1.1.3 magrittr_2.0.3 scales_1.2.1 cli_3.4.1 ##  stringi_1.7.8 cachem_1.0.6 flexmix_2.3-18 robustbase_0.95-0 ##  geometry_0.4.6.1 bslib_0.4.0 generics_0.1.3 vctrs_0.5.0 ##  iterators_1.0.14 tools_4.2.1 glue_1.6.2 DEoptimR_1.0-11 ##  purrr_0.3.5 survival_3.4-0 abind_1.4-5 parallel_4.2.1 ##  fastmap_1.1.0 yaml_2.3.6 colorspace_2.0-3 prabclus_2.3-2 ##  knitr_1.40 sass_0.4.2 modeltools_0.2-23
To investigate the clustering feasibility of a dataset this package provides two simulating functions of stability assessment that simulate a dataset of specific dimensions and calculate the dataset’s stabilities for a range of clusters.
feasibilityAnalysis() generates an idependent dataset for specific number of classes, samples and features while
feasibilityAnalysisDataBased() accepts an existing dataset extracting statistics(means and standard deviations) for a specific number of clusters. Note that these estimations only serve as an indication of the datasets fitness for the dowstream analysis and not as the actual measure of quality as they do not account for the actual signal in the data but only the relation between the number of samples, features and clusters.
# Selecting dimensions and number of clusters new.dataset.analysis <- feasibilityAnalysis(classes = 4, samples = 50, features = 15) # Basing the simulation on an existing dataset and selecting the number of clusters existing.dataset.analysis <- feasibilityAnalysisDataBased(data = toy_genes, classes = 3) # Extract results of either function average.sts.k <- get_average_stabilities_per_k(new.dataset.analysis) maximum.st <- get_max_stability(new.dataset.analysis) average.st <- get_average_stability(new.dataset.analysis) generated.ds <- get_generated_dataset(new.dataset.analysis)
omada() along with a gene expression dataframe and an upper k (number of clusters to be considered) we can run the whole analysis toolkit to automate clustering decision making and produce the estimated optimal clusters. Removal or imputation of NA values is required before running the any of the tools.
# Running the whole cascade of tools inputting an expression dataset # and the upper k (number of clusters) to be investigated omada.analysis <- omada(toy_genes, method.upper.k = 6) # Extract results pa.scores <- get_partition_agreement_scores(omada.analysis) fs.scores <- get_feature_selection_scores(omada.analysis) fs.optimal.features <- get_feature_selection_optimal_features(omada.analysis) fs.optimal.number.of.features <- get_feature_selection_optimal_number_of_features(omada.analysis) cv.scores <- get_cluster_voting_scores(omada.analysis) cv.memberships <- get_cluster_voting_memberships(omada.analysis) cv.metrics.votes <- get_cluster_voting_metric_votes(omada.analysis) cv.k.votes <- get_cluster_voting_k_votes(omada.analysis) sample.memberships <- get_sample_memberships(omada.analysis) sf.coefs <- get_signature_feature_coefs(omada.analysis) # Plot results plot_partition_agreement(omada.analysis)
To select the most appropriate clustering technique for our dataset we compare the internal partition agreement of three different approaches, namely spectral, k-means and hierarchical clustering using the
clusteringMethodSelection() function. We define the upper k to be considered as well as the number of internal comparisons per approach. Increased number of comparisons introduces more robustness and highest run times.
This suite also provides the function to individually calculate the partition agreement between two specific clustering approaches and parameter sets by utilizing function
partitionAgreement() which requires the selection of the 2 algorithms, measures and number of clusters.
To select the features that provide the most stable clusters the function
featureSelection() requires the minimum and maximum number of clusters(k) and the feature step that dictates the rate of each feature set’s increase. It is advised to use the algorithm the previous tools provide.
# Selecting minimum and maximum number of clusters and feature step feature.selection.results <- featureSelection(toy_genes, min.k = 3, max.k = 6, step = 3) # Extract results feature.selection.scores <- get_average_feature_k_stabilities(feature.selection.results) optimal.number.of.features <- get_optimal_number_of_features optimal.features <- get_optimal_features(feature.selection.results) # Plot results plot_average_stabilities(feature.selection.results)
To estimate the most appropriate number of clusters based on an ensemble of internal metrics function
clusterVoting() accepts the minimum and maximum number of clusters to be considered as well as the algorithm of choice (“sc” for spectral, “km” for kmeans and “hr” for hierachical clustering). It is advised to use the feature set and algorithm the previous tools provide.
# Selecting minimum and maximum number of clusters and algorithm to be used cluster.voting.results <- clusterVoting(toy_genes, 4,8,"sc") # Extract results internal.metric.scores <- get_internal_metric_scores(cluster.voting.results) cluster.memberships.k <- get_cluster_memberships_k(cluster.voting.results) metric.votes.k <- get_metric_votes_k(cluster.voting.results) vote.frequencies.k <- get_vote_frequencies_k(cluster.voting.results) # Plot results plot_vote_frequencies(cluster.voting.results)
Previous steps have provided every clustering parameter needed to go through with the partitioning utilising
optimalClustering(). This tool is using the dataset with the most stable feature set, number of clusters(k) and appropriate algorithm. This tool additionally runs through the possible algorithm parameters to retain the one with the highest stability.
# Running the clustering with specific number of clusters(k) and algorithm sample.memberships <- optimalClustering(toy_genes, 4, "spectral") # Extract results memberships <- get_optimal_memberships(sample.memberships) optimal.stability <- get_optimal_stability_score(sample.memberships) optimal.parameter <- get_optimal_parameter_used(sample.memberships)
The first meta analysis of the discovered clusters includes the generation of feature (or gene) signatures associated to each cluster. To acquire the signatures function
geneSignatures() provides the LASSO (regression analysis) coefficients of each gene as well as a plot of the highest 30% of coefficients per cluster. The previously generated cluster memberships and initial dataset are required.