if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("survClust")
survClust\(^1\) is an outcome-weighted, integrative, supervised clustering algorithm designed to classify patients according to their molecular profiles as well as a time-to-event endpoint of interest. Until now, sub-typing in cancer biology has relied heavily on clustering/mining of molecular data alone. We present classification of samples on molecular data supervised by time-to-event data such as Overall Survival (OS), Progression Free Survival (PFS), etc.
Below is the workflow of the proposed survClust method (a minimal sketch of these steps follows the list):
getDist
- Compute a weighted distance matrix based on outcome across the given m data types. Standardization across data types to facilitate the integration process, and accounting for non-overlapping samples, is also accomplished in this step.
combineDist
- Integrate the m data types by averaging over the m weighted distance matrices.
survClust and cv_survclust
- Cluster the integrated weighted distance matrices via survClust. The optimal k is estimated via cross-validation using cv_survclust. Cross-validated results are assessed with the following performance metrics: the logrank statistic, the standardized pooled within-cluster sum of squares (SPWSS), and the number of cluster solutions with class size of fewer than 5 samples.
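This sketch uses the UVM objects introduced later in the vignette; the getDist call matches the usage shown further down, while the combineDist and survClust argument order here is an assumption, so check their help pages. As the Note below stresses, running these steps without cv_survclust over-fits.
#minimal sketch of the workflow, without cross-validation (illustration only - this over-fits!)
#combineDist and survClust argument order is assumed; see ?combineDist and ?survClust
dists <- getDist(uvm_dat, uvm_survdat)   #one outcome-weighted distance matrix per data type
cdist <- combineDist(dists)              #average the weighted distance matrices
#fit <- survClust(cdist, uvm_survdat, k=4)   #cluster at a chosen k (assumed signature)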
Note:
The input data types need to be carefully pre-processed. See the data pre-processing section.
cv_survclust is a wrapper function that cross-validates and outputs cluster assignments. If you skip cross-validation and run the commands on their own (getDist, combineDist and survClust), you are over-fitting!
In this document, we use the TCGA UVM data set and a simulation example to demonstrate how to use survClust to perform integrative supervised clustering.
All TCGA data has been downloaded from the TCGA pan-cancer paper\(^2\).
The data and pre-processing steps largely follow the iClusterPlus\(^3\) manual (Mo Q, Shen R (2022). iClusterPlus: Integrative clustering of multi-type genomic data. R package version 1.32.0). The pre-processing steps that we used in the manuscript are described here.
Copy Number data was segmented using CBS\(^4\) and reduced to non-redundant regions of alteration using the CNregion function in iClusterPlus with the default epsilon of 0.001, keeping in mind that the total number of features should not exceed 10,000. See below; pre-processed data is provided with the package as uvm_dat.
See the Appendix for how the data attached to the package was processed.
For DNA methylation, mRNA expression and miRNA expression, if a feature had more than 20% missing data, it was removed and the remaining features were used for analysis.
For mRNA expression, we further removed genes whose mean expression fell below the lower 10% quantile of mean expression.
Similarly, methylation probes with mean beta values < 0.1 or > 0.9 were discarded.
For mutation data, we first filtered out variants classified as SILENT. Second, genes harboring mutations in less than 1% of the samples were also removed. For our UVM case study, we just removed singleton mutations. See below; pre-processed data is provided with the package as uvm_dat.
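The following is a minimal sketch of this kind of filtering, assuming a samples-by-features expression matrix expr, a beta-value matrix meth and a binary mutation matrix mut (all hypothetical names); it illustrates the thresholds described above rather than reproducing the exact manuscript pipeline.
#illustrative filtering sketch - expr, meth and mut are hypothetical input matrices (samples x features)
#drop features with >20% missing values
expr <- expr[, colMeans(is.na(expr)) <= 0.2]
#drop genes with mean expression below the lower 10% quantile of mean expression
gene_means <- colMeans(expr, na.rm = TRUE)
expr <- expr[, gene_means >= quantile(gene_means, 0.1)]
#drop methylation probes with extreme mean beta values
beta_means <- colMeans(meth, na.rm = TRUE)
meth <- meth[, beta_means >= 0.1 & beta_means <= 0.9]
#drop genes mutated in fewer than 1% of samples (singletons for a small cohort like UVM)
mut <- mut[, colSums(mut, na.rm = TRUE) >= ceiling(0.01 * nrow(mut))]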
library(survClust)
library(survival)
library(BiocParallel)
#mutation data
uvm_dat[[1]][1:5,1:5]
#> RYR2 OBSCN TTN DNAH17 PLCB4
#> TCGA-RZ-AB0B 0 0 0 0 0
#> TCGA-V3-A9ZX 0 0 0 0 0
#> TCGA-V3-A9ZY 0 0 0 0 0
#> TCGA-V4-A9E5 0 0 0 0 0
#> TCGA-V4-A9E7 0 0 0 0 0
#copy number data
uvm_dat[[2]][1:5,1:5]
#> chr1.3218610-4658538 chr1.4658538-4735056 chr1.4735056-4735908
#> TCGA-RZ-AB0B -0.3696 -0.3696 -0.3696
#> TCGA-V3-A9ZX 0.0366 0.0366 0.0366
#> TCGA-V3-A9ZY -0.0373 -0.0373 -0.0373
#> TCGA-V4-A9E5 -0.0614 -0.0614 -0.0614
#> TCGA-V4-A9E7 -1.0061 -1.0061 -1.0061
#> chr1.4735908-4736456 chr1.4736456-6876455
#> TCGA-RZ-AB0B -0.3696 -0.3696
#> TCGA-V3-A9ZX 0.0366 0.0366
#> TCGA-V3-A9ZY -0.0373 -0.0373
#> TCGA-V4-A9E5 -0.0614 -0.0614
#> TCGA-V4-A9E7 -0.8835 -0.8835
#TCGA UVM clinical data
head(uvm_survdat)
#> OS.time OS
#> TCGA-RZ-AB0B 149 1
#> TCGA-V3-A9ZX 470 0
#> TCGA-V3-A9ZY 459 0
#> TCGA-V4-A9E5 2499 0
#> TCGA-V4-A9E7 415 1
#> TCGA-V4-A9E8 808 1
To run the supervised integrative clustering analysis, we will call cv_survclust. Cross-validation takes time, so we will use the BiocParallel package and split the cross-validation across the candidate numbers of clusters k.
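Optionally, you can register a parallel backend before calling bplapply; the worker count below is just an assumption for a typical laptop.
#register a parallel backend for bplapply (optional; 4 workers assumed)
library(BiocParallel)
register(MulticoreParam(workers = 4))
#register(SnowParam(workers = 4))   #alternative, e.g. on Windows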
We will perform 3-fold cross-validation over 10 rounds as follows:
cv_rounds <- 10
#function to do cross validation
uvm_all_cvrounds <- function(kk){
  this.fold <- 3
  fit <- list()
  for (i in seq_len(cv_rounds)){
    fit[[i]] <- cv_survclust(uvm_dat, uvm_survdat, kk, this.fold)
    print(paste0("finished ", i, " rounds for k= ", kk))
  }
  return(fit)
}
We will use this code for both the UVM data and the simulation example discussed in the manuscript\(^1\).
Note that 10 rounds of cross-validation is not enough; we recommend at least 50 rounds of cross-validation to obtain some stability in the results.
ptm <- Sys.time()
cv.fit<-bplapply(2:7, uvm_all_cvrounds)
ptm2 <- Sys.time()
ptm
#> [1] "2022-09-05 20:54:21 EDT"
ptm2
#> [1] "2022-09-05 21:01:12 EDT"
Supervised integrative clustering was performed on the TCGA UVM cohort, consisting of 80 samples, with 87 genes in the mutation data and copy number data summarized over 749 segments.
lapply(uvm_dat, function(x) dim(x))
#> $uvm_mut
#> [1] 80 87
#>
#> $uvm_cn
#> [1] 80 749
The above process took about 7 minutes on a macOS Catalina machine with a 2.6 GHz Dual-Core Intel Core i5 and 8 GB RAM. If you wish to skip the run time, the output is available as uvm_survClust_cv.fit. Because of the 10 rounds of cross-validation, the results from your run might differ from what is provided.
The output is a list object consisting of 6 sub-lists for \(k = 2:7\); each sub-list contains 10 cv_survclust outputs (one for each round of cross-validation), and each output consists of cv.labels, cv.logrank and cv.spwss over the 3 folds.
#for k=2, 1st round of cross validation
names(uvm_survClust_cv.fit[[1]][[1]])
#> [1] "cv.labels" "cv.logrank" "cv.spwss"
Now, let’s use survClust::getStats to summarize and survClust::plotStats to plot some of the supervised integrative clustering metrics.
ss_stats <- getStats(uvm_survClust_cv.fit, kk=7, cvr=10)
plotStats(ss_stats, 2:7)
In the above plot, the top-left panel summarizes the logrank statistic over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see that logrank peaks at k=4.
The top-right panel summarizes SPWSS, or the Standardized Pooled Within-cluster Sum of Squares. Briefly, the pooled within-cluster sum of squares is standardized by the total sum of squares, similar to the methodology used in the gap statistic\(^5\) to select the appropriate number of clusters.
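One way to write this down, based on the description above (a hedged formalization, not a formula quoted from the manuscript), is:
\[
\mathrm{SPWSS}(k) \;=\; \frac{\sum_{r=1}^{k} \sum_{i \in C_r} d^2\!\left(x_i, \bar{x}_r\right)}{\sum_{i=1}^{n} d^2\!\left(x_i, \bar{x}\right)},
\]
where \(C_r\) is the r-th cluster, \(\bar{x}_r\) its centroid, \(\bar{x}\) the overall centroid, and \(d\) the distance used for clustering.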
Here, SPWSS decreases monotonically as the number of clusters k increases. The optimal number of clusters is where SPWSS is minimized and creates an “elbow”, or point of inflection, beyond which adding more clusters does not improve cluster separation. For example, here the plot elbows at k=4.
Another property of SPWSS is that it can be used to compare across different datasets, as it lies between 0 and 1 after standardization. This is useful for comparing survClust runs between individual data types and their integration.
The last plot, on the bottom-left, shows for each k how many class labels have <=5 samples over the 10 rounds of cross-validation. In our case, for k > 5 the number of classes with <=5 samples increases, so we choose k=4.
k4 <- cv_voting(uvm_survClust_cv.fit, getDist(uvm_dat, uvm_survdat), pick_k=4)
table(k4)
#> k4
#> 1 2 3 4
#> 23 13 18 26
plot(survfit(Surv(uvm_survdat[,1], uvm_survdat[,2])~k4), mark.time=TRUE, col=1:4)
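As an additional check (not part of the original workflow shown here), a log-rank test on the consensus labels can quantify the separation seen in the Kaplan-Meier curves:
#log-rank test of the k=4 consensus labels (survival package)
survdiff(Surv(uvm_survdat[,1], uvm_survdat[,2]) ~ k4)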
Let’s look at some of the differentiating features in the mutation data.
mut_k4_test <- apply(uvm_dat[[1]],2,function(x) fisher.test(x,k4)$p.value)
head(sort(p.adjust(mut_k4_test)))
#> SF3B1 GNAQ GNA11 BAP1 EIF1AX RYR2
#> 3.916339e-09 1.956023e-06 3.456072e-05 1.293877e-01 1.933046e-01 1.000000e+00
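To see how one of the top hits distributes across the four clusters, a simple cross-tabulation works (the exact counts will depend on your cross-validation run):
#counts of SF3B1 mutant (1) vs wild-type (0) samples per cluster
table(SF3B1 = uvm_dat[[1]][, "SF3B1"], cluster = k4)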
All the figures shown in the manuscript are plotted using panelmap, available on GitHub at https://github.com/arorarshi/panelmap
We will use it to see the distribution of these mutations across the 4 clusters from a previous run. An example from previous runs is shown below; this is for illustration only, and the cluster groups might differ from those in your latest run.
And for the Copy Number data, as follows:
cn_imagedata <- uvm_dat[[2]]
cn_imagedata[cn_imagedata < -1.5] <- -1.5
cn_imagedata[cn_imagedata > 1.5] <- 1.5
oo <- order(k4)
cn_imagedata <- cn_imagedata[oo,]
cn_imagedata <- cn_imagedata[,ncol(cn_imagedata):1]
#image(cn_imagedata,col=gplots::bluered(50),axes=F)
#image y labels - chr names
cnames <- colnames(cn_imagedata)
cnames <- unlist(lapply(strsplit(cnames, "\\."), function(x) x[1]))
tt <- table(cnames)
nn <- paste0("chr",1:22)
chr.labels <- rep(NA, length(cnames))
index <- 1
chr.labels[1] <- "1"
for(i in seq_len(length(nn)-1)) {
  index <- index + tt[nn[i]]
  chr.labels[index] <- gsub("chr","",nn[i+1])
}
idx <- which(!(is.na(chr.labels)))
image(cn_imagedata,col=gplots::bluered(50),axes=FALSE)
axis(2, at = 1 - (idx/length(cnames)), labels = chr.labels[idx], las=1, cex.axis=0.8)
abline(v = c(cumsum(prop.table(table(k4)))))
abline(h=c(0,1))
The simulation example is presented in the survClust manuscript\(^1\). See Figure S1 and the Supplementary Note. Below we provide code on how we generated the simulated dataset and how to run it via survClust.
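If you want a self-contained stand-in for simdat and simsurvdat, here is an illustrative sketch with three groups of 50 samples (matching the class sizes used below); the feature counts, effect sizes and hazard rates are assumptions, not the manuscript's exact simulation settings.
#illustrative simulation sketch - parameters are assumptions, not the manuscript's settings
set.seed(123)
groups <- rep(1:3, each = 50)
n <- length(groups)
#one continuous data type: 110 features, the first 10 shifted by a group-specific mean
x <- matrix(rnorm(n * 110), nrow = n, ncol = 110)
x[, 1:10] <- x[, 1:10] + c(-1, 0, 1)[groups]
rownames(x) <- paste0("sample", seq_len(n))
simdat <- list(x)
#group-dependent exponential survival times with independent censoring
surv_time <- rexp(n, rate = c(0.002, 0.004, 0.008)[groups])
cens_time <- rexp(n, rate = 0.001)
simsurvdat <- cbind(OS.time = pmin(surv_time, cens_time),
                    OS = as.numeric(surv_time <= cens_time))
rownames(simsurvdat) <- rownames(x)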
#function to do cross validation
sim_cvrounds <- function(kk){
  this.fold <- 3
  fit <- list()
  for (i in seq_len(cv_rounds)){
    fit[[i]] <- cv_survclust(simdat, simsurvdat, kk, this.fold)
    print(paste0("finished ", i, " rounds for k= ", kk))
  }
  return(fit)
}
ptm <- Sys.time()
sim_cv.fit<-bplapply(2:7, sim_cvrounds)
#> [1] "finished 1 rounds for k= 2"
#> [1] "finished 2 rounds for k= 2"
#> [1] "finished 3 rounds for k= 2"
#> [1] "finished 4 rounds for k= 2"
#> [1] "finished 5 rounds for k= 2"
#> [1] "finished 6 rounds for k= 2"
#> [1] "finished 7 rounds for k= 2"
#> [1] "finished 8 rounds for k= 2"
#> [1] "finished 9 rounds for k= 2"
#> [1] "finished 10 rounds for k= 2"
#> [1] "finished 1 rounds for k= 3"
#> [1] "finished 2 rounds for k= 3"
#> [1] "finished 3 rounds for k= 3"
#> [1] "finished 4 rounds for k= 3"
#> [1] "finished 5 rounds for k= 3"
#> [1] "finished 6 rounds for k= 3"
#> [1] "finished 7 rounds for k= 3"
#> [1] "finished 8 rounds for k= 3"
#> [1] "finished 9 rounds for k= 3"
#> [1] "finished 10 rounds for k= 3"
#> [1] "finished 1 rounds for k= 4"
#> [1] "finished 2 rounds for k= 4"
#> [1] "finished 3 rounds for k= 4"
#> [1] "finished 4 rounds for k= 4"
#> [1] "finished 5 rounds for k= 4"
#> [1] "finished 6 rounds for k= 4"
#> [1] "finished 7 rounds for k= 4"
#> [1] "finished 8 rounds for k= 4"
#> [1] "finished 9 rounds for k= 4"
#> [1] "finished 10 rounds for k= 4"
#> [1] "finished 1 rounds for k= 5"
#> [1] "finished 2 rounds for k= 5"
#> [1] "finished 3 rounds for k= 5"
#> [1] "finished 4 rounds for k= 5"
#> [1] "finished 5 rounds for k= 5"
#> [1] "finished 6 rounds for k= 5"
#> [1] "finished 7 rounds for k= 5"
#> [1] "finished 8 rounds for k= 5"
#> [1] "finished 9 rounds for k= 5"
#> [1] "finished 10 rounds for k= 5"
#> [1] "finished 1 rounds for k= 6"
#> [1] "finished 2 rounds for k= 6"
#> [1] "finished 3 rounds for k= 6"
#> [1] "finished 4 rounds for k= 6"
#> [1] "finished 5 rounds for k= 6"
#> [1] "finished 6 rounds for k= 6"
#> [1] "finished 7 rounds for k= 6"
#> [1] "finished 8 rounds for k= 6"
#> [1] "finished 9 rounds for k= 6"
#> [1] "finished 10 rounds for k= 6"
#> [1] "finished 1 rounds for k= 7"
#> [1] "finished 2 rounds for k= 7"
#> [1] "finished 3 rounds for k= 7"
#> [1] "finished 4 rounds for k= 7"
#> [1] "finished 5 rounds for k= 7"
#> [1] "finished 6 rounds for k= 7"
#> [1] "finished 7 rounds for k= 7"
#> [1] "finished 8 rounds for k= 7"
#> [1] "finished 9 rounds for k= 7"
#> [1] "finished 10 rounds for k= 7"
ptm2 <- Sys.time()
ptm
#> [1] "2024-10-29 19:48:34 EDT"
ptm2
#> [1] "2024-10-29 19:49:24 EDT"
The above process took about 50 seconds on a macOS Catalina machine with a 2.6 GHz Dual-Core Intel Core i5 and 8 GB RAM.
The output is a list object consisting of 6 sub-lists for \(k = 2:7\); each sub-list contains 10 cv_survclust outputs (one for each round of cross-validation), and each output consists of cv.labels, cv.logrank and cv.spwss over the 3 folds.
Now, let’s use survClust::getStats to summarize and survClust::plotStats to plot some of the supervised integrative clustering metrics.
ss_stats <- getStats(sim_cv.fit, kk=7, cvr=10)
plotStats(ss_stats, 2:7)
In the above plot, the top-left panel summarizes the logrank statistic over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see that logrank peaks at k=3.
The top-right panel summarizes SPWSS, or the Standardized Pooled Within-cluster Sum of Squares. The plot elbows at k=4.
The last plot, on the bottom-left, shows for each k how many class labels have <=5 samples over the 10 rounds of cross-validation. In the simulation example, the number of cluster solutions having <5 samples increases at k=7.
k3 <- cv_voting(sim_cv.fit, getDist(simdat, simsurvdat), pick_k=3)
sim_class_labels <- c(rep(1, 50), rep(2,50), rep(3,50))
table(k3, sim_class_labels)
#> sim_class_labels
#> k3 1 2 3
#> 1 50 0 0
#> 2 0 1 49
#> 3 0 49 1
plot(survfit(Surv(simsurvdat[,1], simsurvdat[,2]) ~ k3), mark.time=TRUE, col=1:3)
We see that the survClust solution recovers the simulated class labels with good concordance.
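If you want a single number for that concordance, the adjusted Rand index is a common choice; the example below assumes the mclust package is installed (it is not required by survClust).
#adjusted Rand index between recovered and true labels (requires mclust)
mclust::adjustedRandIndex(k3, sim_class_labels)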
survClust allows for the integration of one or more data types. The data can be either continuous (RNA, methylation, miRNA or protein expression, or copy number segmentation values) or binary (mutation status, wt=0, mut=1).
One can also run survClust on an individual data type alone. In this example, we will perform survClust on the TCGA UVM mutation data alone.
#function to do cross validation
cvrounds_mut <- function(kk){
  this.fold <- 3
  fit <- list()
  for (i in seq_len(cv_rounds)){
    fit[[i]] <- cv_survclust(uvm_mut_dat, uvm_survdat, kk, this.fold, type="mut")
    print(paste0("finished ", i, " rounds for k= ", kk))
  }
  return(fit)
}
#let's create a list object with just the mutation data
uvm_mut_dat <- list()
uvm_mut_dat[[1]] <- uvm_dat[[1]]
ptm <- Sys.time()
uvm_mut_cv.fit<-bplapply(2:7, cvrounds_mut)
#> [1] "finished 1 rounds for k= 4"
#> [1] "finished 2 rounds for k= 4"
#> [1] "finished 3 rounds for k= 4"
#> [1] "finished 4 rounds for k= 4"
#> [1] "finished 5 rounds for k= 4"
#> [1] "finished 6 rounds for k= 4"
#> [1] "finished 7 rounds for k= 4"
#> [1] "finished 8 rounds for k= 4"
#> [1] "finished 9 rounds for k= 4"
#> [1] "finished 10 rounds for k= 4"
#> [1] "finished 1 rounds for k= 5"
#> [1] "finished 2 rounds for k= 5"
#> [1] "finished 3 rounds for k= 5"
#> [1] "finished 4 rounds for k= 5"
#> [1] "finished 5 rounds for k= 5"
#> [1] "finished 6 rounds for k= 5"
#> [1] "finished 7 rounds for k= 5"
#> [1] "finished 8 rounds for k= 5"
#> [1] "finished 9 rounds for k= 5"
#> [1] "finished 10 rounds for k= 5"
#> [1] "finished 1 rounds for k= 2"
#> [1] "finished 2 rounds for k= 2"
#> [1] "finished 3 rounds for k= 2"
#> [1] "finished 4 rounds for k= 2"
#> [1] "finished 5 rounds for k= 2"
#> [1] "finished 6 rounds for k= 2"
#> [1] "finished 7 rounds for k= 2"
#> [1] "finished 8 rounds for k= 2"
#> [1] "finished 9 rounds for k= 2"
#> [1] "finished 10 rounds for k= 2"
#> [1] "finished 1 rounds for k= 3"
#> [1] "finished 2 rounds for k= 3"
#> [1] "finished 3 rounds for k= 3"
#> [1] "finished 4 rounds for k= 3"
#> [1] "finished 5 rounds for k= 3"
#> [1] "finished 6 rounds for k= 3"
#> [1] "finished 7 rounds for k= 3"
#> [1] "finished 8 rounds for k= 3"
#> [1] "finished 9 rounds for k= 3"
#> [1] "finished 10 rounds for k= 3"
#> [1] "finished 1 rounds for k= 6"
#> [1] "finished 2 rounds for k= 6"
#> [1] "finished 3 rounds for k= 6"
#> [1] "finished 4 rounds for k= 6"
#> [1] "finished 5 rounds for k= 6"
#> [1] "finished 6 rounds for k= 6"
#> [1] "finished 7 rounds for k= 6"
#> [1] "finished 8 rounds for k= 6"
#> [1] "finished 9 rounds for k= 6"
#> [1] "finished 10 rounds for k= 6"
#> [1] "finished 1 rounds for k= 7"
#> [1] "finished 2 rounds for k= 7"
#> [1] "finished 3 rounds for k= 7"
#> [1] "finished 4 rounds for k= 7"
#> [1] "finished 5 rounds for k= 7"
#> [1] "finished 6 rounds for k= 7"
#> [1] "finished 7 rounds for k= 7"
#> [1] "finished 8 rounds for k= 7"
#> [1] "finished 9 rounds for k= 7"
#> [1] "finished 10 rounds for k= 7"
ptm2 <- Sys.time()
ss_stats <- getStats(uvm_mut_cv.fit, kk=7, cvr=10)
plotStats(ss_stats, 2:7)