Contents

1 Install


if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("survClust")

2 Overview

2.1 The tldr version

survClust\(^1\) is an outcome weighted integrative supervised clustering algorithm, designed to classify patients according to their molecular as well as time-event or end point of interest. Until now, sub-typing in cancer biology has relied heavily upon clustering/mining of molecular data alone. We present classification of samples on molecular data supervised by time-event data like Overall Survival (OS), Progression Free Survival (PFS) etc.

Below is the workflow of proposed survClust method:

getDist - Compute a weighted distance matrix based on outcome across given m data types. Standardization across data types to facilitate the integration proces and accounting for non-overlapping samples is also accomplished in this step.

combineDist - Integrate m data types by averaging over m weighted distance matrices.

survClust and cv_survclust - Cluster integrated weighted distance matrices via survClust. Optimal k is estimated via cross-validation using cv_survclust. Cross-validated results are assessed over the following performance metrics - the logrank statistic, standardized pooled within-cluster sum of squares (SPWSS) and cluster solutions with class size less than 5 samples.

Note:

  1. The input datatypes needs to be carefully pre-processed. See the data pre-processing section.

  2. cv_survclust is a wrapper function that cross-validates and outputs cluster assignments. If you run without cross validation and just the commands on its own (getDist, combineDist and survClust), you are over-fitting!

In this document, we use the TCGA UVM data set and a simulation example to demonstrate how to use survClust to perform integrative supervised clustering.

All TCGA data has been downloaded from the TCGA pan-cancer paper\(^2\)

3 Data and Pre-processing

The data and pre-processing steps are largely followed from iCluster\(^3\) manual by - iCluster (Mo Q, Shen R (2022). iClusterPlus: Integrative clustering of multi-type genomic data. R package version 1.32.0.). The pre-processing steps that we used in the manuscript are described here.

library(survClust)
library(survival)
library(BiocParallel)

#mutation data
uvm_dat[[1]][1:5,1:5]
#>              RYR2 OBSCN TTN DNAH17 PLCB4
#> TCGA-RZ-AB0B    0     0   0      0     0
#> TCGA-V3-A9ZX    0     0   0      0     0
#> TCGA-V3-A9ZY    0     0   0      0     0
#> TCGA-V4-A9E5    0     0   0      0     0
#> TCGA-V4-A9E7    0     0   0      0     0

#copy number data
uvm_dat[[2]][1:5,1:5]
#>              chr1.3218610-4658538 chr1.4658538-4735056 chr1.4735056-4735908
#> TCGA-RZ-AB0B              -0.3696              -0.3696              -0.3696
#> TCGA-V3-A9ZX               0.0366               0.0366               0.0366
#> TCGA-V3-A9ZY              -0.0373              -0.0373              -0.0373
#> TCGA-V4-A9E5              -0.0614              -0.0614              -0.0614
#> TCGA-V4-A9E7              -1.0061              -1.0061              -1.0061
#>              chr1.4735908-4736456 chr1.4736456-6876455
#> TCGA-RZ-AB0B              -0.3696              -0.3696
#> TCGA-V3-A9ZX               0.0366               0.0366
#> TCGA-V3-A9ZY              -0.0373              -0.0373
#> TCGA-V4-A9E5              -0.0614              -0.0614
#> TCGA-V4-A9E7              -0.8835              -0.8835

#TCGA UVM clinical data
head(uvm_survdat)
#>              OS.time OS
#> TCGA-RZ-AB0B     149  1
#> TCGA-V3-A9ZX     470  0
#> TCGA-V3-A9ZY     459  0
#> TCGA-V4-A9E5    2499  0
#> TCGA-V4-A9E7     415  1
#> TCGA-V4-A9E8     808  1

4 Supervised integrative cluster analysis

To run supervised integrative clustering analysis - we will be calling cv_survclust. Cross-validation takes time and we will be using the BiocParallel package and splitting cross-validation across k clusters.

We will perform 3-fold cross-validation over 10 rounds as follows:


cv_rounds = 10

#function to do cross validation 
uvm_all_cvrounds<-function(kk){
    this.fold<-3
    fit<-list()
    for (i in seq_len(cv_rounds)){
        fit[[i]] <- cv_survclust(uvm_dat,uvm_survdat,kk,this.fold)
        print(paste0("finished ", i, " rounds for k= ", kk))
    }
    return(fit)
}

We will be using this code for both UVM data and then use this as a simulation example as discussed in the manuscript\(^1\).

Note that 10 rounds of cross-validation is not enough and we recommend at least 50 rounds of cross-validation to get some stability in results.

4.1 UVM data

ptm <- Sys.time()
cv.fit<-bplapply(2:7, uvm_all_cvrounds)
ptm2 <- Sys.time()

#> ptm
#[1] "2022-09-05 20:54:21 EDT"
#> ptm2
#[1] "2022-09-05 21:01:12 EDT"

Supervised integrative clustering was performed on TCGA UVM consisting of about 80 samples and 87 genes in the mutation data and CN data summarized over 749 segments.

lapply(uvm_dat, function(x) dim(x))
#> $uvm_mut
#> [1] 80 87
#> 
#> $uvm_cn
#> [1]  80 749

The above process took about ~7 minutes on a macOS Catalina with 2.6 GHz Dual-Core Intel Core i5 running on 8Gb RAM. If you wish to skip the run-time, output is available as uvm_survClust_cv.fit. Due to 10 rounds of cross validation, the results from your run might differ from what is provided.

The output is a list object consisting of 6 sub-lists for \(k = 2:7\), with 10 cv_survclust outputs (for each round of cross-validation), each consisting of cv.labels, cv.logrank, cv.spwss for 3 folds.


#for k=2, 1st round of cross validation
names(uvm_survClust_cv.fit[[1]][[1]])
#> [1] "cv.labels"  "cv.logrank" "cv.spwss"

Now, let’s use survClust::getStats to summarize and survClust::plotStats to plot some of the supervised integrative clustering metrics.


ss_stats <- getStats(uvm_survClust_cv.fit, kk=7, cvr=10)
plotStats(ss_stats, 2:7)
survClust analysis of TCGA UVM mutation and Copy Number data

Figure 1: survClust analysis of TCGA UVM mutation and Copy Number data

4.1.1 Picking k cluster solution

In the above plot, the topleft plot is summarizing logrank over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see how logrank peaks at k=4.

the topright plot is summarizing SPWSS or Standardized Pooled Within Sum of Squares. Briefly, pooled within-cluster sum of squares were standardized by the total sum of squares similar to methodology used in the gap statistic\(^5\) to select the appropriate number of clusters.

Here SPWSS decreases monotonically as the number of clusters k increases. The optimal number of clusters is where SPWSS is minimized and creates an “elbow” or a point of inflection, where addition of more clusters does not improve cluster separation. For example, here the plot elbows at k=4

Another property of SPWSS is that it can be used to compare among different datasets as it lies between 0 and 1 after standardization. This is useful for comparing survClust runs between individual data types and when we integrate them.

The last plot, on the bottomleft, shows for each k how many k class labels have <=5 samples in 10 rounds of cross validation. In our case here, for k >5 the number of classes with <=5 samples increases, so we can choose k=4.


k4 <- cv_voting(uvm_survClust_cv.fit, getDist(uvm_dat, uvm_survdat), pick_k=4)
table(k4)
#> k4
#>  1  2  3  4 
#> 23 13 18 26

plot(survfit(Surv(uvm_survdat[,1], uvm_survdat[,2])~k4), mark.time=TRUE, col=1:4)
survClust KM analysis for integrated TCGA UVM Mutation and Copy Number for k=4

Figure 2: survClust KM analysis for integrated TCGA UVM Mutation and Copy Number for k=4

Let’s see some of the differentiating features in mutation data.


mut_k4_test <- apply(uvm_dat[[1]],2,function(x) fisher.test(x,k4)$p.value)
head(sort(p.adjust(mut_k4_test)))
#>        SF3B1         GNAQ        GNA11         BAP1       EIF1AX         RYR2 
#> 3.916339e-09 1.956023e-06 3.456072e-05 1.293877e-01 1.933046e-01 1.000000e+00

All the figures as shown in the manuscript are plotted using panelmap. It is available on GitHub over here - https://github.com/arorarshi/panelmap

We will use it to see the distribution of these mutations across the 4 clusters from a previous run. An example from previous runs is shown below. This is only for illustration process and the cluster groups might differ between the latest run.

And for Copy Number data as follows -


cn_imagedata <- uvm_dat[[2]]
cn_imagedata[cn_imagedata < -1.5] <- -1.5
cn_imagedata[cn_imagedata > 1.5] <- 1.5

oo <- order(k4)
cn_imagedata <- cn_imagedata[oo,]
cn_imagedata <- cn_imagedata[,ncol(cn_imagedata):1]
#image(cn_imagedata,col=gplots::bluered(50),axes=F)

#image y labels - chr names
cnames <- colnames(cn_imagedata)
cnames <- unlist(lapply(strsplit(cnames, "\\."), function(x) x[1]))
tt <- table(cnames)
nn <- paste0("chr",1:22)

chr.labels <- rep(NA, length(cnames))

index <- 1
chr.labels[1] <- "1"

for(i in seq_len(length(nn)-1)) {
    index <- index + tt[nn[i]]
    chr.labels[index] <- gsub("chr","",nn[i+1])
}

idx <- which(!(is.na(chr.labels)))

image(cn_imagedata,col=gplots::bluered(50),axes=FALSE)

axis(2, at = 1 - (idx/length(cnames)), labels = chr.labels[idx], las=1, cex.axis=0.8)
abline(v = c(cumsum(prop.table(table(k4)))))
abline(h=c(0,1))
TCGA UVM Copy Number k=4

Figure 3: TCGA UVM Copy Number k=4

4.2 simulation example

Simulation example is presented in the survClust manuscript \(^1\).See Figure (S1) and Supplementary Note. Below we provide code on how we generated the simulated dataset and how to run it via survClust.


#function to do cross validation 
sim_cvrounds<-function(kk){
    this.fold<-3
    fit<-list()
    for (i in seq_len(cv_rounds)){
        fit[[i]] <- cv_survclust(simdat, simsurvdat,kk,this.fold)
        print(paste0("finished ", i, " rounds for k= ", kk))
    }
    return(fit)
}


ptm <- Sys.time()
sim_cv.fit<-bplapply(2:7, sim_cvrounds)
#> [1] "finished 1 rounds for k= 2"
#> [1] "finished 2 rounds for k= 2"
#> [1] "finished 3 rounds for k= 2"
#> [1] "finished 4 rounds for k= 2"
#> [1] "finished 5 rounds for k= 2"
#> [1] "finished 6 rounds for k= 2"
#> [1] "finished 7 rounds for k= 2"
#> [1] "finished 8 rounds for k= 2"
#> [1] "finished 9 rounds for k= 2"
#> [1] "finished 10 rounds for k= 2"
#> [1] "finished 1 rounds for k= 3"
#> [1] "finished 2 rounds for k= 3"
#> [1] "finished 3 rounds for k= 3"
#> [1] "finished 4 rounds for k= 3"
#> [1] "finished 5 rounds for k= 3"
#> [1] "finished 6 rounds for k= 3"
#> [1] "finished 7 rounds for k= 3"
#> [1] "finished 8 rounds for k= 3"
#> [1] "finished 9 rounds for k= 3"
#> [1] "finished 10 rounds for k= 3"
#> [1] "finished 1 rounds for k= 4"
#> [1] "finished 2 rounds for k= 4"
#> [1] "finished 3 rounds for k= 4"
#> [1] "finished 4 rounds for k= 4"
#> [1] "finished 5 rounds for k= 4"
#> [1] "finished 6 rounds for k= 4"
#> [1] "finished 7 rounds for k= 4"
#> [1] "finished 8 rounds for k= 4"
#> [1] "finished 9 rounds for k= 4"
#> [1] "finished 10 rounds for k= 4"
#> [1] "finished 1 rounds for k= 5"
#> [1] "finished 2 rounds for k= 5"
#> [1] "finished 3 rounds for k= 5"
#> [1] "finished 4 rounds for k= 5"
#> [1] "finished 5 rounds for k= 5"
#> [1] "finished 6 rounds for k= 5"
#> [1] "finished 7 rounds for k= 5"
#> [1] "finished 8 rounds for k= 5"
#> [1] "finished 9 rounds for k= 5"
#> [1] "finished 10 rounds for k= 5"
#> [1] "finished 1 rounds for k= 6"
#> [1] "finished 2 rounds for k= 6"
#> [1] "finished 3 rounds for k= 6"
#> [1] "finished 4 rounds for k= 6"
#> [1] "finished 5 rounds for k= 6"
#> [1] "finished 6 rounds for k= 6"
#> [1] "finished 7 rounds for k= 6"
#> [1] "finished 8 rounds for k= 6"
#> [1] "finished 9 rounds for k= 6"
#> [1] "finished 10 rounds for k= 6"
#> [1] "finished 1 rounds for k= 7"
#> [1] "finished 2 rounds for k= 7"
#> [1] "finished 3 rounds for k= 7"
#> [1] "finished 4 rounds for k= 7"
#> [1] "finished 5 rounds for k= 7"
#> [1] "finished 6 rounds for k= 7"
#> [1] "finished 7 rounds for k= 7"
#> [1] "finished 8 rounds for k= 7"
#> [1] "finished 9 rounds for k= 7"
#> [1] "finished 10 rounds for k= 7"
ptm2 <- Sys.time()

ptm
#> [1] "2024-10-29 19:48:34 EDT"
ptm2
#> [1] "2024-10-29 19:49:24 EDT"

The above process took about 50.0853891372681 on a macOS Catalina with 2.6 GHz Dual-Core Intel Core i5 running on 8Gb RAM.

The output is a list object consisting of 6 sub-lists for \(k = 2:7\), with 10 cv_survclust output (for each round of cross-validation), each consisting of cv.labels, cv.logrank, cv.spwss for 3 folds.

Now, let’s use survClust::getStats to summarize and survClust::plotStats to plot some of the supervised integrative clustering metrics.


ss_stats <- getStats(sim_cv.fit, kk=7, cvr=10)
plotStats(ss_stats, 2:7)
survClust analysis of simulated dataset

Figure 4: survClust analysis of simulated dataset

4.2.1 Picking k cluster solution

In the above plot, the topleft plot is summarizing logrank over 10 rounds of cross-validated class labels across 3-fold cross-validation. Here, we see that logrank peaks at k=3.

The topright plot is summarizing SPWSS or Standardized Pooled Within Sum of Squares. The plot elbows at k=4

The last plot, on the bottomleft, shows for each k how many k class labels have <=5 samples in 10 rounds of cross validation. In the simulation example, cluster solutions having <5 samples increases at k=7


k3 <- cv_voting(sim_cv.fit, getDist(simdat, simsurvdat), pick_k=3)

sim_class_labels <- c(rep(1, 50), rep(2,50), rep(3,50))

table(k3, sim_class_labels)
#>    sim_class_labels
#> k3   1  2  3
#>   1 50  0  0
#>   2  0  1 49
#>   3  0 49  1

plot(survfit(Surv(simsurvdat[,1], simsurvdat[,2]) ~ k3), mark.time=TRUE, col=1:3)
survClust k=3 class labels KM analysis for simulated dataset

Figure 5: survClust k=3 class labels KM analysis for simulated dataset

We see, that we are able to get the simulated class labels as survClust solution with good concordance.

4.3 Bonus - UVM mutation data alone

survClust allows for integration of one or more data types. The data can be either continuous (RNA, methylation, miRNA or protein expression or copy number segmentation values) or binary (mutation status, wt=0, mut =1).

One can perform survClust on individual data alone. In this example, we will perform survClust on TCGA UVM mutation data alone.


#function to do cross validation 
cvrounds_mut <- function(kk){
    this.fold<-3
    fit<-list()
    for (i in seq_len(cv_rounds)){
        fit[[i]] <- cv_survclust(uvm_mut_dat, uvm_survdat,kk,this.fold, type="mut")
        print(paste0("finished ", i, " rounds for k= ", kk))
    }
    return(fit)
}

#let's create a list object with just the mutation data 
uvm_mut_dat <- list()
uvm_mut_dat[[1]] <- uvm_dat[[1]]

ptm <- Sys.time()
uvm_mut_cv.fit<-bplapply(2:7, cvrounds_mut)
#> [1] "finished 1 rounds for k= 4"
#> [1] "finished 2 rounds for k= 4"
#> [1] "finished 3 rounds for k= 4"
#> [1] "finished 4 rounds for k= 4"
#> [1] "finished 5 rounds for k= 4"
#> [1] "finished 6 rounds for k= 4"
#> [1] "finished 7 rounds for k= 4"
#> [1] "finished 8 rounds for k= 4"
#> [1] "finished 9 rounds for k= 4"
#> [1] "finished 10 rounds for k= 4"
#> [1] "finished 1 rounds for k= 5"
#> [1] "finished 2 rounds for k= 5"
#> [1] "finished 3 rounds for k= 5"
#> [1] "finished 4 rounds for k= 5"
#> [1] "finished 5 rounds for k= 5"
#> [1] "finished 6 rounds for k= 5"
#> [1] "finished 7 rounds for k= 5"
#> [1] "finished 8 rounds for k= 5"
#> [1] "finished 9 rounds for k= 5"
#> [1] "finished 10 rounds for k= 5"
#> [1] "finished 1 rounds for k= 2"
#> [1] "finished 2 rounds for k= 2"
#> [1] "finished 3 rounds for k= 2"
#> [1] "finished 4 rounds for k= 2"
#> [1] "finished 5 rounds for k= 2"
#> [1] "finished 6 rounds for k= 2"
#> [1] "finished 7 rounds for k= 2"
#> [1] "finished 8 rounds for k= 2"
#> [1] "finished 9 rounds for k= 2"
#> [1] "finished 10 rounds for k= 2"
#> [1] "finished 1 rounds for k= 3"
#> [1] "finished 2 rounds for k= 3"
#> [1] "finished 3 rounds for k= 3"
#> [1] "finished 4 rounds for k= 3"
#> [1] "finished 5 rounds for k= 3"
#> [1] "finished 6 rounds for k= 3"
#> [1] "finished 7 rounds for k= 3"
#> [1] "finished 8 rounds for k= 3"
#> [1] "finished 9 rounds for k= 3"
#> [1] "finished 10 rounds for k= 3"
#> [1] "finished 1 rounds for k= 6"
#> [1] "finished 2 rounds for k= 6"
#> [1] "finished 3 rounds for k= 6"
#> [1] "finished 4 rounds for k= 6"
#> [1] "finished 5 rounds for k= 6"
#> [1] "finished 6 rounds for k= 6"
#> [1] "finished 7 rounds for k= 6"
#> [1] "finished 8 rounds for k= 6"
#> [1] "finished 9 rounds for k= 6"
#> [1] "finished 10 rounds for k= 6"
#> [1] "finished 1 rounds for k= 7"
#> [1] "finished 2 rounds for k= 7"
#> [1] "finished 3 rounds for k= 7"
#> [1] "finished 4 rounds for k= 7"
#> [1] "finished 5 rounds for k= 7"
#> [1] "finished 6 rounds for k= 7"
#> [1] "finished 7 rounds for k= 7"
#> [1] "finished 8 rounds for k= 7"
#> [1] "finished 9 rounds for k= 7"
#> [1] "finished 10 rounds for k= 7"
ptm2 <- Sys.time()

ss_stats <- getStats(uvm_mut_cv.fit, kk=7, cvr=10)
plotStats(ss_stats, 2:7)