GSgalgoR User Guide

Martin E. Guerrero-Gimenez1,2*, Juan Manuel Fernandez-Muñoz1,2 and Carlos A. Catania3**

1Laboratory of Oncology, Institute of Medicine and Experimental Biology of Cuyo (IMBECU), National Scientific and Technical Research Council (CONICET), Mendoza, Argentina.
2Institute of Biochemistry and Biotechnology, Medical School, National University of Cuyo, Mendoza, Argentina.
3LABSIN, Engineering School, National University of Cuyo, Mendoza, Argentina.

*mguerrero@mendoza-conicet.gob.ar
**harpo@ingenieria.uncuyo.edu.ar

2025-04-15

Abstract

We report a novel method to identify specific transcriptomic phenotypes based on an elitist non-dominated sorting genetic algorithm that combines the advantages of clustering methods and the exploratory properties of genetic algorithms to discover biologically and clinically relevant molecular subtypes in different cancers.

1 Overview
2 Algorithm
3 Installation
- 3.1 GSgalgoR library
- 3.2 Example datasets
4 Examples
5 Case study
6 Session info

1 Overview

In the new era of omics data, precision medicine has become the new paradigm of cancer treatment. Among all available omics techniques, gene expression profiling, in particular, has been increasingly used to classify tumor subtypes with different biological behavior. Cancer subtype discovery is usually approached from two possible perspectives:

- Using the molecular data alone with unsupervised techniques such as clustering analysis.
- Using supervised techniques focusing entirely on survival data.

The problem of finding patient subgroups with survival differences while maintaining cluster consistency can be viewed as a bi-objective problem, where there is a trade-off between the separability of the different groups and the ability of a given signature to consistently distinguish patients with different clinical outcomes. This gives rise to a set of optimal solutions, also known as Pareto-optimal solutions. To address this trade-off, we combined the advantages of clustering methods for grouping heterogeneous omics data and the search properties of genetic algorithms in GSgalgoR: a flexible yet robust multi-objective meta-heuristic for disease subtype discovery based on an elitist non-dominated sorting genetic algorithm (NSGA-II), driven by the underlying premise of maximizing survival differences between groups while achieving high consistency and robustness of the clusters obtained.

2 Algorithm

In the GSgalgoR package, the NSGA-II framework is used to find multiple Pareto-optimal solutions that classify patients according to their gene expression patterns. Basically, NSGA-II starts with a population of competing individuals, which are evaluated under a set of fitness functions that estimate the survival differences and cohesiveness of the different transcriptomic groups. Then, solutions are ranked and sorted according to their non-domination level, which affects the way they are chosen to be submitted to the so-called “evolutionary operators” such as crossover and mutation. Once a set of well-suited solutions is selected and reproduced, a new offspring of individuals composed of a mixture of the “genetic information” of the parents is obtained. Parents and offspring are pooled, and the best-ranked solutions are selected and passed to the next generation, which starts the same process over again.
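The ranking step described above can be illustrated with a minimal base R sketch. It is not the GSgalgoR implementation: the helper names `dominates()` and `nondominated_rank()` are hypothetical, the fitness values are made up, and we assume that both objectives are maximized.

```r
# TRUE when solution a dominates b: no worse in every objective and
# strictly better in at least one (both objectives assumed to be maximized)
dominates <- function(a, b) all(a >= b) && any(a > b)

# Rank 1 is the non-dominated front; it is set aside and the procedure is
# repeated on the remaining solutions to obtain rank 2, and so on
nondominated_rank <- function(fitness) {
    rank <- rep(NA_integer_, nrow(fitness))
    current <- 1L
    while (anyNA(rank)) {
        remaining <- which(is.na(rank))
        is_front <- vapply(remaining, function(i) {
            !any(vapply(remaining, function(j) {
                dominates(fitness[j, ], fitness[i, ])
            }, logical(1)))
        }, logical(1))
        rank[remaining[is_front]] <- current
        current <- current + 1L
    }
    rank
}

# Made-up fitness values: cluster cohesiveness and survival difference
fitness <- cbind(SC = c(0.18, 0.15, 0.04, 0.10, 0.03),
                 Surv = c(436, 696, 922, 500, 700))
ranks <- nondominated_rank(fitness)
print(ranks)
```

In this toy population the first three solutions trade cohesiveness against survival separation without one being better on both, so they share rank 1, while the last two are each dominated by a rank 1 solution.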

3 Installation

3.1 GSgalgoR library

To install the GSgalgoR package, start R and enter:


if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GSgalgoR")
library(GSgalgoR)

Alternatively, you can install GSgalgoR from GitHub using the devtools package:

devtools::install_github("https://github.com/harpomaxx/GSgalgoR")
library(GSgalgoR)

3.2 Example datasets

To standardize the structure of genomic data, we use the ExpressionSet structure for the examples given in this guide. The ExpressionSet objects are formed mainly by:

- A matrix of gene expression values, usually derived from microarray or RNAseq experiments.
- Phenotypic data, where we find information on the samples (condition, status, treatment, survival, and other covariates).
- Finally, these objects can also contain annotations and feature metadata.

To start testing GSgalgoR, we will use two breast cancer datasets, namely the UPP and the TRANSBIG datasets. Additionally, we will use PAM50 centroids to perform breast cancer sample classification. The datasets can be accessed from the following Bioconductor packages:


BiocManager::install("breastCancerUPP",version = "devel")
BiocManager::install("breastCancerTRANSBIG",version = "devel")


library(breastCancerTRANSBIG)
library(breastCancerUPP)

Also, some basic packages are needed to run the examples in this vignette:

library(GSgalgoR)
library(Biobase)
library(genefu)
library(survival)
library(survminer)
library(ggplot2)
data(pam50)

4 Examples

4.1 Loading data

To access the ExpressionSets we use:

data(upp)
Train<- upp
rm(upp)

data(transbig)
Test<- transbig
rm(transbig)

#To access gene expression data
train_expr<- exprs(Train)
test_expr<- exprs(Test)

#To access feature data
train_features<- fData(Train)
test_features<- fData(Test)

#To access clinical data
train_clinic <- pData(Train) 
test_clinic <- pData(Test)

4.2 Data tidying and preparation

Galgo can accept any numeric data, such as probe intensities from microarray experiments or normalized RNAseq counts. Nevertheless, features are expected to be scaled across the dataset before being plugged into the Galgo framework. For PAM50 classification, gene symbols are expected, so probesets are mapped to their respective gene symbols. Probesets that map to multiple genes are expanded, while genes mapped to multiple probes are collapsed by keeping, for each duplicated gene, the probe with the highest variance.
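As a small illustration of the collapsing rule, the following base R sketch works on a plain matrix instead of an ExpressionSet. The helper `collapse_by_variance()` and the toy values are hypothetical and are not part of GSgalgoR.

```r
# Keep one row per gene symbol: the probe with the highest variance.
# Probes without a gene symbol (NA) are dropped first.
collapse_by_variance <- function(expr, symbols) {
    keep <- !is.na(symbols)
    expr <- expr[keep, , drop = FALSE]
    symbols <- symbols[keep]
    v <- apply(expr, 1, var)
    # For every symbol, the row index of its most variable probe
    best <- vapply(split(seq_along(symbols), symbols),
                   function(idx) idx[which.max(v[idx])], integer(1))
    out <- expr[best, , drop = FALSE]
    rownames(out) <- names(best)
    out
}

# Toy data: probes p1 and p2 both map to ESR1, p4 has no annotation
expr <- rbind(p1 = c(1, 2, 3), p2 = c(1, 5, 9), p3 = c(4, 4, 5), p4 = c(0, 1, 0))
symbols <- c("ESR1", "ESR1", "KRT5", NA)
collapsed <- collapse_by_variance(expr, symbols)
print(collapsed)
```

Here the ESR1 row comes from p2, the more variable of the two ESR1 probes.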

4.2.1 Drop duplicates and NA’s


#Custom function to drop duplicated genes (keep genes with highest variance)

DropDuplicates<- function(eset, map= "Gene.symbol"){

    #Drop NA's (guarded: indexing with -integer(0) would drop every row)
    drop <- which(is.na(fData(eset)[,map]))
    if(length(drop) > 0) eset <- eset[-drop,]

    #Drop duplicates
    drop <- NULL
    Dup <- as.character(unique(fData(eset)[which(duplicated
            (fData(eset)[,map])),map]))
    Var <- apply(exprs(eset),1,var)
    for(j in Dup){
        pos <- which(fData(eset)[,map]==j)
        drop <- c(drop,pos[-which.max(Var[pos])])
    }

    if(length(drop) > 0) eset <- eset[-drop,]

    featureNames(eset) <- fData(eset)[,map]
    return(eset)
}

4.2.2 Expand probesets that map to multiple genes


# Custom function to expand probesets mapping to multiple genes
expandProbesets <- function (eset, sep = "///", map="Gene.symbol"){
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    y<- lapply(as.character(fData(eset)[,map]), function(x) strsplit(x,sep))
    eset <- eset[order(sapply(x, length)), ]
    x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
    y<- lapply(as.character(fData(eset)[,map]), function(x) strsplit(x,sep))
    idx <- unlist(sapply(1:length(x), function(i) rep(i,length(x[[i]]))))
    idy <- unlist(sapply(1:length(y), function(i) rep(i,length(y[[i]]))))
    xx <- !duplicated(unlist(x))
    idx <- idx[xx]
    idy <- idy[xx]
    x <- unlist(x)[xx]
    y <- unlist(y)[xx]
    eset <- eset[idx, ]
    featureNames(eset) <- x
    fData(eset)[,map] <- x
    fData(eset)$gene <- y
    return(eset)
}

Train=DropDuplicates(Train)
Train=expandProbesets(Train)
#Drop NAs in survival
Train <- Train[,!is.na(
    survival::Surv(time=pData(Train)$t.rfs,event=pData(Train)$e.rfs))] 

Test=DropDuplicates(Test)
Test=expandProbesets(Test)
#Drop NAs in survival
Test <- 
    Test[,!is.na(survival::Surv(
        time=pData(Test)$t.rfs,event=pData(Test)$e.rfs))] 

#Determine common probes (Genes)
Int= intersect(rownames(Train),rownames(Test))

Train= Train[Int,]
Test= Test[Int,]

identical(rownames(Train),rownames(Test))
#> [1] TRUE

For simplicity and speed, we will create a reduced expression matrix for the examples.


#First we will get PAM50 centroids from genefu package

PAM50Centroids <- pam50$centroids
PAM50Genes <- pam50$centroids.map$probe
PAM50Genes<- featureNames(Train)[ featureNames(Train) %in% PAM50Genes]

#Now we sample 200 random genes from expression matrix

Non_PAM50Genes<- featureNames(Train)[ !featureNames(Train) %in% PAM50Genes]
Non_PAM50Genes <- sample(Non_PAM50Genes,200, replace=FALSE)

reduced_set <- c(PAM50Genes, Non_PAM50Genes)

#Now we get the reduced training and test sets

Train<- Train[reduced_set,]
Test<- Test[reduced_set,]

4.2.3 Rescale expression matrix

Apply robust linear scaling to each gene (each row of the expression matrix) using genefu::rescale, as proposed in paper reference.


exprs(Train) <- t(apply(exprs(Train),1,genefu::rescale,na.rm=TRUE,q=0.05))
exprs(Test) <- t(apply(exprs(Test),1,genefu::rescale,na.rm=TRUE,q=0.05))

train_expr <- exprs(Train)
test_expr <- exprs(Test)
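To make the idea of robust linear scaling concrete, here is a minimal base R sketch. It is an assumption about the general technique, not a copy of `genefu::rescale`: the helper `robust_rescale()` is hypothetical, and the exact quantiles used by genefu may differ from the ones chosen here.

```r
# Linearly rescale one gene so that a lower and an upper quantile map to 0 and 1.
# Values beyond those quantiles fall outside [0, 1] instead of stretching the scale.
robust_rescale <- function(x, q = 0.05, na.rm = TRUE) {
    lo <- quantile(x, probs = q, na.rm = na.rm, names = FALSE)
    hi <- quantile(x, probs = 1 - q, na.rm = na.rm, names = FALSE)
    (x - lo) / (hi - lo)
}

set.seed(1)
# Toy expression matrix: 4 genes (rows) x 10 samples (columns)
expr <- matrix(rnorm(40, mean = 8, sd = 2), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:10)))

# apply() returns samples x genes, so transpose back to genes x samples
scaled <- t(apply(expr, 1, robust_rescale))
print(dim(scaled))
```

Because each gene is scaled independently, genes measured on very different intensity ranges become comparable before they enter the distance calculations.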

4.2.4 Survival Object

The ‘Surv’ object is created by the Surv() function of the survival package. This uses phenotypic data that are contained in the corresponding datasets, accessed by the pData command.

train_clinic <- pData(Train) 
test_clinic <- pData(Test)

train_surv <- survival::Surv(time=train_clinic$t.rfs,event=train_clinic$e.rfs)
test_surv <- survival::Surv(time=test_clinic$t.rfs,event=test_clinic$e.rfs)

4.3 Run galgo()

The main function in this package is galgo(). It accepts an expression matrix and survival object to find robust gene expression signatures related to a given outcome. This function contains some parameters that can be modified, according to the characteristics of the analysis to be performed.

4.3.1 Setting parameters

The principal parameters are:

population: a number indicating the number of solutions in the population of solutions that will be evolved
generations: a number indicating the number of iterations of the galgo algorithm
nCV: number of cross-validation sets
distancetype: character, it can be ‘pearson’ (centered pearson), ‘uncentered’ (uncentered pearson), ‘spearman’ or ‘euclidean’
TournamentSize: a number indicating the size of the tournaments for the selection procedure
period: a number indicating the outcome period over which the restricted mean survival time (RMST) is evaluated, in the same time units as the survival data

# For testing reasons it is set to a low number but ideally should be above 100
population <- 30 
# For testing reasons it is set to a low number but ideally should be above 150
generations <-15
nCV <- 5                      
distancetype <- "pearson"     
TournamentSize <- 2
period <- 3650
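The `period` parameter is the horizon over which the RMST is evaluated. As a rough sketch of what an RMST is, the following base R code computes the area under a Kaplan-Meier curve up to a horizon `tau`. The helper `rmst_km()` and the toy data are hypothetical; how galgo() combines RMST values into its survival fitness is not shown here.

```r
# Restricted mean survival time: area under the Kaplan-Meier curve from 0 to tau.
# time: follow-up times; event: 1 = event observed, 0 = censored.
# If follow-up ends before tau, the last survival estimate is carried forward.
rmst_km <- function(time, event, tau) {
    ev_times <- sort(unique(time[event == 1 & time <= tau]))
    surv <- 1
    prev_t <- 0
    area <- 0
    for (tt in ev_times) {
        area <- area + surv * (tt - prev_t)      # flat step before this event time
        n_risk <- sum(time >= tt)
        n_event <- sum(time == tt & event == 1)
        surv <- surv * (1 - n_event / n_risk)    # Kaplan-Meier update
        prev_t <- tt
    }
    area + surv * (tau - prev_t)                 # last step, up to the horizon
}

time  <- c(2, 4, 4, 6, 8, 10)
event <- c(1, 1, 0, 1, 0, 1)
print(rmst_km(time, event, tau = 8))
```

A group with better outcomes keeps a higher survival curve and therefore a larger RMST over the same period, which is what makes differences in RMST usable as a measure of survival separation between groups.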

4.3.2 Run Galgo algorithm

set.seed(264)
output <- GSgalgoR::galgo(generations = generations, 
                        population = population, 
                        prob_matrix = train_expr, 
                        OS = train_surv,
                        nCV = nCV, 
                        distancetype = distancetype,
                        TournamentSize = TournamentSize, 
                        period = period)

print(class(output))
#> [1] "galgo.Obj"
#> attr(,"package")
#> [1] "GSgalgoR"

4.3.3 Galgo Object

The output of the galgo() function is an object of type galgo.Obj that has two slots with the elements:

Solutions
ParetoFront.

4.3.3.1 Solutions

An l x (n + 5) matrix, where n is the number of features evaluated and l is the number of solutions obtained.

The l x n submatrix is a binary matrix where each row represents the chromosome of an evolved solution from the solution population, and each feature can be present (1) or absent (0) in the solution.
Column n+1 represents the number of clusters (k) for each solution
Column n+2 shows the SC Fitness
Column n+3 represents the Survival Fitness values
Column n+4 shows the solution rank
Column n+5 represents the crowding distance of the solution in the final Pareto front
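The layout above can be mimicked with a toy matrix to show how a chromosome translates into a gene signature. This is only a sketch: the values are made up and the column names are an assumption for readability, not necessarily those stored in a real galgo.Obj.

```r
# Toy stand-in for the Solutions matrix: l = 3 solutions, n = 4 features,
# followed by the 5 summary columns described above
genes <- c("ESR1", "KRT5", "MKI67", "BCL2")
n <- length(genes)
solutions <- rbind(c(1, 0, 1, 1, 2, 0.18, 436, 1, Inf),
                   c(0, 1, 1, 0, 3, 0.05, 710, 1, 0.8),
                   c(1, 1, 0, 0, 2, 0.10, 400, 2, 0.3))
colnames(solutions) <- c(genes, "k", "SC.Fit", "Surv.Fit", "Rank", "CrowD")

# The first n columns are the binary chromosomes
chromosomes <- solutions[, seq_len(n), drop = FALSE] == 1

# Gene signature of each solution: the features flagged as present
signatures <- lapply(seq_len(nrow(solutions)),
                     function(i) genes[chromosomes[i, ]])
print(signatures[[1]])

# Number of clusters of each solution (column n + 1)
k_values <- unname(solutions[, n + 1])
print(k_values)
```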

4.3.3.2 ParetoFront

A list whose length equals the number of generations run by the algorithm. Each element is an l x 2 matrix, where l is the number of solutions obtained and the columns are the SC Fitness and the Survival Fitness values, respectively.

For easier interpretation of the galgo.Obj, the output can be transformed into a list or a data.frame object.

4.4 to_list() function

This function restructures a galgo.Obj into a list that is easier to understand and use. This output is particularly useful if one wants to select a given solution and use its outputs in a new classifier. The output list has a length equal to the number of solutions obtained by the galgo algorithm.

Basically, this output is a list of lists, where each element is named after the solution (Solution.n, where n is the number assigned to that solution) and contains all the constituents of that solution with the following structure:

Solution.n$Genes: A vector of the features included in the solution
Solution.n$k: The number of partitions found in that solution
Solution.n$SC.Fit: The average silhouette coefficient of the partitions found
Solution.n$Surv.Fit: The survival fitness value
Solution.n$rank: The solution rank
Solution.n$CrowD: The solution crowding distance related to the rest of the solutions

outputList <- to_list(output)
head(names(outputList))
#> [1] "Solution.1" "Solution.2" "Solution.3" "Solution.4" "Solution.5"
#> [6] "Solution.6"

To evaluate the structure of the first solution we can run:

outputList[["Solution.1"]]
#> $Genes
#>   [1] "PHGDH"        "KRT5"         "RRM2"         "SFRP1"        "SLC39A6"     
#>   [6] "BIRC5"        "CDC20"        "CDH3"         "BCL2"         "MMP11"       
#>  [11] "CDC6"         "EXO1"         "MELK"         "KRT17"        "ESR1"        
#>  [16] "MAPT"         "CENPF"        "PGR"          "KRT14"        "KIF2C"       
#>  [21] "GRB7"         "FGFR4"        "MKI67"        "FOXC1"        "NAT1"        
#>  [26] "TYMS"         "MLPH"         "CEP55"        "ACTR3B"       "ORC6L"       
#>  [31] "CCNB1"        "BLVRA"        "MDM2"         "DNAJC11"      "POLD4"       
#>  [36] "TM9SF2"       "SEPT10"       "HNF1B"        "MRPL15"       "APOBEC3A"    
#>  [41] "KIR2DL5A"     "SUCLG1"       "XAGE1D"       "TMEM109"      "NFKBIL2"     
#>  [46] "HLA-DPB1"     "SEMA3A"       "ZNF688"       "TLK2"         "LOC100287328"
#>  [51] "CATSPER2"     "TNNI1"        "GPR89A"       "TAS2R3"       "NR1H4"       
#>  [56] "ANKRD53"      "CA3"          "DHX8"         "LASS2"        "C10orf10"    
#>  [61] "DUX4"         "MED13"        "B4GALT5"      "ADAMTS13"     "C4orf6"      
#>  [66] "IRAK3"        "FKBP4"        "CEP350"       "STATH"        "LOC440248"   
#>  [71] "DGCR5"        "FXYD3"        "CUEDC2"       "CLEC5A"       "RBL1"        
#>  [76] "CPZ"          "EZR"          "APOE"         "PRSS12"       "BBS1"        
#>  [81] "TMOD2"        "PDE1C"        "ARFGAP2"      "ACCN1"        "HSD17B10"    
#>  [86] "SPG21"        "GALNT1"       "RPE65"        "SIGLEC15"     "MT3"         
#>  [91] "PRAMEF2"      "ANXA9"        "AVEN"         "GTF2H3"       "PROM1"       
#>  [96] "C14orf166"    "TNFRSF11B"    "CD1D"         "TTBK2"        "CCDC86"      
#> [101] "SNX16"        "SLC7A10"      "CNKSR1"       "SLC22A11"     "HDHD3"       
#> [106] "BAT1"         "KPNA4"        "ITIH2"        "GIMAP4"       "TNFAIP2"     
#> [111] "AZU1"         "SLC25A13"     "PKD2L1"       "STARD13"      "FAM102A"     
#> [116] "SLC22A13"     "ACN9"         "MXI1"         "RAP2A"        "SSBP1"       
#> [121] "PRKCB"        "ACSL5"        "RAB40B"       "GMEB2"        "PLIN1"       
#> [126] "UGT1A6"       "ANAPC13"      "INSL3"        "KIAA0319"     "FPR1"        
#> [131] "CST3"         "PP14571"      "KIN"          "GINS3"        "SLC6A6"      
#> [136] "ARMC6"        "PLXNB1"       "BHMT"         "SMARCD2"      "SAMHD1"      
#> [141] "KIAA0240"     "KIF3B"        "GPATCH8"      "HNRNPM"       "RPL27"       
#> [146] "RPS11"        "PRSS1"        "VNN1"         "DCP2"         "AFP"         
#> [151] "ZNF74"        "AHNAK"        "ABCE1"        "ADAM15"       "UBA7"        
#> [156] "BAMBI"        "RPL12P11"     "DNAJC17"      "CRYBA2"       "LOC729353"   
#> 
#> $k
#> [1] 7
#> 
#> $SC.Fit
#> [1] 0.035148
#> 
#> $Surv.Fit
#> [1] 922.3855
#> 
#> $rank
#> [1] 1
#> 
#> $CrowD
#> [1] Inf

4.5 to_dataframe() function

This function restructures a galgo.Obj into a data.frame that is easier to understand and use. The output data frame has m x n dimensions, where the row names (m) are the solutions obtained by the galgo algorithm. The columns have the following structure:

Genes: The features included in each solution in form of a list
k: The number of partitions found in that solution
SC.Fit: The average silhouette coefficient of the partitions found
Surv.Fit: The survival fitness value
Rank: The solution rank
CrowD: The solution crowding distance related to the rest of the solutions

outputDF <- to_dataframe(output)
head(outputDF)
#>                    Genes k     SC.Fit Surv.Fit Rank     CrowD
#> Solutions.1 PHGDH, K.... 7 0.03514800 922.3855    1       Inf
#> Solutions.2 PHGDH, M.... 2 0.18375217 436.0488    1       Inf
#> Solutions.3 PHGDH, K.... 5 0.04822049 710.0283    1 0.8032286
#> Solutions.4 SLC39A6,.... 2 0.14718450 696.3643    1 0.8030291
#> Solutions.5 BIRC5, C.... 2 0.17779010 612.2822    1 0.3066921
#> Solutions.6 BIRC5, C.... 2 0.17187635 647.3320    1 0.2754388
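The `CrowD` column can be read with the usual NSGA-II crowding distance in mind. The base R sketch below shows that textbook calculation under our own assumptions (the helper `crowding_distance()` is hypothetical, and the normalization used inside GSgalgoR may differ): boundary solutions of each objective receive an infinite distance, which is consistent with the `Inf` values in the table above.

```r
# Textbook NSGA-II crowding distance for solutions of a single front.
# fitness: one row per solution, one column per objective.
crowding_distance <- function(fitness) {
    n <- nrow(fitness)
    dist <- numeric(n)
    for (m in seq_len(ncol(fitness))) {
        ord <- order(fitness[, m])
        span <- fitness[ord[n], m] - fitness[ord[1], m]
        dist[ord[c(1, n)]] <- Inf                 # extremes are always kept
        if (n > 2 && span > 0) {
            inner <- ord[2:(n - 1)]
            # normalized gap between the two neighbours along this objective
            dist[inner] <- dist[inner] +
                (fitness[ord[3:n], m] - fitness[ord[1:(n - 2)], m]) / span
        }
    }
    dist
}

# Toy front with three solutions and two objectives
front <- cbind(SC = c(0, 1, 2), Surv = c(2, 1, 0))
cd <- crowding_distance(front)
print(cd)
```

Larger distances mark solutions in sparsely populated regions of the front, which the selection step favors to preserve diversity.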

4.6 plot_pareto()

Once we obtain the galgo.Obj from the output of galgo(), we can plot the obtained Pareto front and see how it evolved through the tested number of generations.

plot_pareto(output)

5 Case study

Breast cancer (BRCA) is the most common neoplasm in women to date and one of the best-studied cancer types. Numerous molecular alterations are now well known for this type of cancer, and many transcriptomic signatures have been developed for it. In this regard, Perou et al. proposed one of the first molecular subtype classifications according to the transcriptomic profiles of the tumor, which recapitulates naturally occurring gene expression patterns that encompass different functional pathways and patient outcomes. These subtypes (LumA, LumB, Basal-like, HER2 and Normal-like) overlap strongly with the classical histopathological classification of BRCA tumors and might affect decision making when used to decide on chemotherapy in certain cases.

5.1 Data Preprocessing

To evaluate Galgo’s performance alongside the PAM50 classification, we will use the two already scaled and reduced BRCA gene expression datasets and compare Galgo with the widely used PAM50 intrinsic molecular subtype classification. Galgo performs feature selection by design, so the gene reduction step is not strictly necessary to use GSgalgoR (although it might speed up GSgalgoR runs). Appropriate gene expression scaling, however, is critical when running GSgalgoR.

5.2 Breast cancer classification

The scaled expression values of each patient are compared with the prototypical centroids using Spearman’s rank correlation coefficient (method = "spearman" in the calls below), and each patient is assigned the label of the closest centroid.

#The reduced UPP dataset will be used as training set 
train_expression <- exprs(Train) 
train_clinic<- pData(Train)
train_features<- fData(Train)
train_surv<- survival::Surv(time=train_clinic$t.rfs,event=train_clinic$e.rfs)

#The reduced TRANSBIG dataset will be used as test set 

test_expression <- exprs(Test) 
test_clinic<- pData(Test)
test_features<- fData(Test)
test_surv<- survival::Surv(time=test_clinic$t.rfs,event=test_clinic$e.rfs)


#PAM50 centroids
centroids<- pam50$centroids
#Extract features from both data.frames
inBoth<- Reduce(intersect, list(rownames(train_expression),rownames(centroids)))

#Classify samples 

PAM50_train<- cluster_classify(train_expression[inBoth,],centroids[inBoth,],
                            method = "spearman")
table(PAM50_train)
#> PAM50_train
#>  1  2  3  4  5 
#> 22 30 94 73 15

PAM50_test<- cluster_classify(test_expression[inBoth,],centroids[inBoth,],
                            method = "spearman")
table(PAM50_test)
#> PAM50_test
#>  1  2  3  4  5 
#> 45 26 80 44  3

# Classify samples using genefu
#annot<- fData(Train)
#colnames(annot)[3]="Gene.Symbol"
#PAM50_train<- molecular.subtyping(sbt.model = "pam50",
#         data = t(train_expression), annot = annot,do.mapping = TRUE)
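The nearest-centroid rule used above can be sketched in a few lines of base R. The helper `nearest_centroid()` and the toy centroids are hypothetical and are not the implementation of `cluster_classify()`. In particular, this sketch returns centroid names, whereas the tables above show integer class labels.

```r
# Assign each sample (column of expr) to the centroid (column of centroids)
# with which it has the highest correlation over the shared genes
nearest_centroid <- function(expr, centroids, method = "spearman") {
    common <- intersect(rownames(expr), rownames(centroids))
    corr <- cor(expr[common, , drop = FALSE],
                centroids[common, , drop = FALSE],
                method = method)                  # samples x centroids
    colnames(centroids)[max.col(corr, ties.method = "first")]
}

# Toy centroids with opposite expression profiles over four genes
centroids <- cbind(A = c(g1 = 1, g2 = 2, g3 = 3, g4 = 4),
                   B = c(g1 = 4, g2 = 3, g3 = 2, g4 = 1))
# Toy samples: s1 and s3 resemble A, s2 resembles B
expr <- cbind(s1 = c(g1 = 0.1, g2 = 0.4, g3 = 0.5, g4 = 0.9),
              s2 = c(g1 = 2.0, g2 = 1.5, g3 = 1.1, g4 = 0.2),
              s3 = c(g1 = 0.3, g2 = 0.2, g3 = 0.8, g4 = 0.7))
labels <- nearest_centroid(expr, centroids)
print(labels)
```

A rank correlation only compares the ordering of genes within a sample, so the assignment does not depend on the absolute scale of the expression values.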

Once the patients are classified according to their closest centroids, we can evaluate the survival curves of the different subtypes in each dataset.

5.2.1 Survival of UPP patients

surv_formula <- 
    as.formula("Surv(train_clinic$t.rfs,train_clinic$e.rfs)~ PAM50_train")
tumortotal1 <- surv_fit(surv_formula,data=train_clinic)
tumortotal1diff <- survdiff(surv_formula)
tumortotal1pval<- pchisq(tumortotal1diff$chisq, length(tumortotal1diff$n) - 1,
                         lower.tail = FALSE) 

p<-ggsurvplot(tumortotal1,
            data=train_clinic,
            risk.table=TRUE,
            pval=TRUE,
            palette="dark2",
            title="UPP breast cancer \n PAM50 subtypes survival",
            surv.scale="percent",
            conf.int=FALSE, 
            xlab="time (days)", 
            ylab="survival(%)", 
            xlim=c(0,3650),
            break.time.by = 365, 
            ggtheme = theme_minimal(), 
            risk.table.y.text.col = TRUE, 
            risk.table.y.text = FALSE,censor=FALSE)
print(p)

5.2.2 Survival of TRANSBIG patients

surv_formula <- 
    as.formula("Surv(test_clinic$t.rfs,test_clinic$e.rfs)~ PAM50_test")
tumortotal2 <- surv_fit(surv_formula,data=test_clinic)
tumortotal2diff <- survdiff(surv_formula)
tumortotal2pval<- pchisq(tumortotal2diff$chisq, length(tumortotal2diff$n) - 1,
                        lower.tail = FALSE) 

p<-ggsurvplot(tumortotal2,
            data=test_clinic,
            risk.table=TRUE,
            pval=TRUE,
            palette="dark2",
            title="TRANSBIG breast cancer \n PAM50 subtypes survival",
            surv.scale="percent",
            conf.int=FALSE,
            xlab="time (days)",
            ylab="survival(%)",
            xlim=c(0,3650),
            break.time.by = 365,
            ggtheme = theme_minimal(),
            risk.table.y.text.col = TRUE,
            risk.table.y.text = FALSE,
            censor=FALSE)
print(p)

5.3 Find breast cancer gene signatures with GSgalgoR

Now we run Galgo to find cohesive and clinically meaningful signatures for BRCA, using the UPP data as training set and the TRANSBIG data as test set.

5.3.1 Set configuration parameters

population <- 15             
generations <-5             
nCV <- 5                      
distancetype <- "pearson"     
TournamentSize <- 2
period <- 3650

Run Galgo on the training set

output= GSgalgoR::galgo(generations = generations,
                    population = population,
                    prob_matrix = train_expression,
                    OS=train_surv,
                    nCV= nCV, 
                    distancetype=distancetype,
                    TournamentSize=TournamentSize,
                    period=period)
print(class(output))

5.4 Analyzing Galgo results

5.4.1 Pareto front

plot_pareto(output)

5.4.2 Summary of the results


output_df<- to_dataframe(output)
NonDom_solutions<- output_df[output_df$Rank==1,]

# N of non-dominated solutions 
nrow(NonDom_solutions)
#> [1] 7

# N of partitions found
table(NonDom_solutions$k)
#> 
#> 2 4 6 9 
#> 3 1 2 1

#Average N of genes per signature
mean(unlist(lapply(NonDom_solutions$Genes,length)))
#> [1] 138.5714

#SC range
range(NonDom_solutions$SC.Fit)
#> [1] 0.02702423 0.15655657

# Survival fitness range
range(NonDom_solutions$Surv.Fit)
#> [1] 251.7850 849.6915

5.4.3 Select best performing solutions

Now we select the best-performing solution for each number of partitions (k) according to the C.Index.


RESULT<- non_dominated_summary(output=output,
                            OS=train_surv, 
                            prob_matrix= train_expression,
                            distancetype =distancetype 
                            )

best_sol=NULL
for(i in unique(RESULT$k)){
    best_sol=c(
    best_sol,
    RESULT[RESULT$k==i,"solution"][which.max(RESULT[RESULT$k==i,"C.Index"])])
}

print(best_sol)
#> [1] "Solutions.1" "Solutions.3" "Solutions.4" "Solutions.7"

5.4.4 Create prototypic centroids

Now we create the prototypic centroids of the selected solutions

CentroidsList <- create_centroids(output, 
                                solution_names = best_sol,
                                trainset = train_expression)

5.5 Test Galgo signatures in a test set

We will test the Galgo signatures found with the UPP training set in an independent test set (TRANSBIG)

5.5.1 Classify train and test set into GSgalgoR subtypes


train_classes<- classify_multiple(prob_matrix=train_expression,
                                centroid_list= CentroidsList, 
                                distancetype = distancetype)

test_classes<- classify_multiple(prob_matrix=test_expression,
                                centroid_list= CentroidsList, 
                                distancetype = distancetype)

5.5.2 Calculate train and test set C.Index

To calculate the train and test C.Index, risk coefficients are estimated for each subclass in the training set and then used to predict the risk of the different groups in the test set. This is particularly important for signatures with a high number of partitions, where the survival differences between groups might overlap and change their relative order, which strongly affects the C.Index calculation.
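To see why the relative order of the predicted risks matters, consider this minimal base R sketch of a Harrell-type concordance index for right-censored data. The helper `c_index()` is hypothetical and simplified. The `concordance.index()` function used in this vignette may treat ties and censoring differently.

```r
# Share of usable pairs in which the subject with the earlier observed event
# also has the higher predicted risk (ties in risk count as one half)
c_index <- function(risk, time, event) {
    concordant <- 0
    tied <- 0
    usable <- 0
    n <- length(time)
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            # Usable pair: subject i has an observed event before time[j]
            if (event[i] == 1 && time[i] < time[j]) {
                usable <- usable + 1
                if (risk[i] > risk[j]) {
                    concordant <- concordant + 1
                } else if (risk[i] == risk[j]) {
                    tied <- tied + 1
                }
            }
        }
    }
    (concordant + 0.5 * tied) / usable
}

time  <- c(1, 2, 3, 4)
event <- c(1, 1, 0, 1)
print(c_index(risk = c(4, 3, 2, 1), time, event))   # risks ordered like the outcomes
print(c_index(risk = c(1, 2, 3, 4), time, event))   # risks in the opposite order
```

If two subtypes swap their relative risk between the training and test sets, pairs that were concordant become discordant, which is why the risk coefficients are estimated once in the training set and then reused.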


Prediction.models<- list()

for(i in best_sol){

    OS<- train_surv
    predicted_class<- as.factor(train_classes[,i])
    predicted_classdf <- as.data.frame(predicted_class)
    colnames(predicted_classdf)<- i
    surv_formula <- as.formula(paste0("OS~ ",i))
    coxsimple=coxph(surv_formula,data=predicted_classdf)
    Prediction.models[[i]]<- coxsimple
}

5.5.3 Calculate C.Index for training and test set using the prediction models


C.indexes<- data.frame(train_CI=rep(NA,length(best_sol)),
                    test_CI=rep(NA,length(best_sol)))
rownames(C.indexes)<- best_sol

for(i in best_sol){
    predicted_class_train<- as.factor(train_classes[,i])
    predicted_class_train_df <- as.data.frame(predicted_class_train)
    colnames(predicted_class_train_df)<- i
    CI_train<- 
        concordance.index(predict(Prediction.models[[i]],
                                predicted_class_train_df),
                                surv.time=train_surv[,1],
                                surv.event=train_surv[,2],
                                outx=FALSE)$c.index
    C.indexes[i,"train_CI"]<- CI_train
    predicted_class_test<- as.factor(test_classes[,i])
    predicted_class_test_df <- as.data.frame(predicted_class_test)
    colnames(predicted_class_test_df)<- i
    CI_test<- 
        concordance.index(predict(Prediction.models[[i]],
                                predicted_class_test_df),
                                surv.time=test_surv[,1],
                                surv.event=test_surv[,2],
                                outx=FALSE)$c.index
    C.indexes[i,"test_CI"]<- CI_test
    }

print(C.indexes)
#>              train_CI   test_CI
#> Solutions.1 0.6395302 0.5682051
#> Solutions.3 0.6072689 0.5483629
#> Solutions.4 0.5887690 0.5661538
#> Solutions.7 0.6292469 0.5503748

best_signature<- best_sol[which.max(C.indexes$test_CI)]

print(best_signature)
#> [1] "Solutions.1"

5.5.4 Evaluate prediction survival of Galgo signatures

We evaluate the best Galgo signature on both the training and test sets.


train_class <- train_classes[,best_signature]

surv_formula <- 
    as.formula("Surv(train_clinic$t.rfs,train_clinic$e.rfs)~ train_class")
tumortotal1 <- surv_fit(surv_formula,data=train_clinic)
tumortotal1diff <- survdiff(surv_formula)
tumortotal1pval<- pchisq(tumortotal1diff$chisq,
                        length(tumortotal1diff$n) - 1,
                        lower.tail = FALSE) 

p<-ggsurvplot(tumortotal1,
            data=train_clinic,
            risk.table=TRUE,pval=TRUE,palette="dark2",
            title="UPP breast cancer \n Galgo subtypes survival",
            surv.scale="percent",
            conf.int=FALSE, xlab="time (days)", 
            ylab="survival(%)", xlim=c(0,3650),
            break.time.by = 365,
            ggtheme = theme_minimal(), 
            risk.table.y.text.col = TRUE, 
            risk.table.y.text = FALSE,censor=FALSE)
print(p)


test_class <- test_classes[,best_signature]

surv_formula <- 
    as.formula("Surv(test_clinic$t.rfs,test_clinic$e.rfs)~ test_class")
tumortotal1 <- surv_fit(surv_formula,data=test_clinic)
tumortotal1diff <- survdiff(surv_formula)
tumortotal1pval<- pchisq(tumortotal1diff$chisq,
                        length(tumortotal1diff$n) - 1,
                        lower.tail = FALSE) 

p<-ggsurvplot(tumortotal1,
            data=test_clinic,
            risk.table=TRUE,
            pval=TRUE,palette="dark2",
            title="TRANSBIG breast cancer \n Galgo subtypes survival",
            surv.scale="percent",
            conf.int=FALSE, 
            xlab="time (days)",
            ylab="survival(%)",
            xlim=c(0,3650),
            break.time.by = 365, 
            ggtheme = theme_minimal(), 
            risk.table.y.text.col = TRUE,
            risk.table.y.text = FALSE,
            censor=FALSE)
print(p)

5.6 Comparison of Galgo vs PAM50 classifier

Compare PAM50 classification vs Galgo classification in the TRANSBIG (test) dataset


surv_formula1 <- 
    as.formula("Surv(test_clinic$t.rfs,test_clinic$e.rfs)~ test_class")
tumortotal1 <- surv_fit(surv_formula1,data=test_clinic)
tumortotal1diff <- survdiff(surv_formula1)
tumortotal1pval<- pchisq(tumortotal1diff$chisq,
                        length(tumortotal1diff$n) - 1,
                        lower.tail = FALSE) 

surv_formula2 <- 
    as.formula("Surv(test_clinic$t.rfs,test_clinic$e.rfs)~ PAM50_test")
tumortotal2 <- surv_fit(surv_formula2,data=test_clinic)
tumortotal2diff <- survdiff(surv_formula2)
tumortotal2pval<- pchisq(tumortotal2diff$chisq,
                        length(tumortotal2diff$n) - 1,
                        lower.tail = FALSE) 

SURV=list(GALGO=tumortotal1,PAM50=tumortotal2 )
COLS=c(1:8,10)
par(cex=1.35, mar=c(3.8, 3.8, 2.5, 2.5) + 0.1)
p=ggsurvplot(SURV,
            combine=TRUE,
            data=test_clinic,
            risk.table=TRUE,
            pval=TRUE,
            palette="dark2",
            title="Galgo vs. PAM50 subtypes \n BRCA survival comparison",
            surv.scale="percent",
            conf.int=FALSE,
            xlab="time (days)",
            ylab="survival(%)",
            xlim=c(0,period),
            break.time.by = 365, 
            ggtheme = theme_minimal(),
            risk.table.y.text.col = TRUE,
            risk.table.y.text = FALSE,
            censor=FALSE)
print(p)

6 Session info

sessionInfo()
#> R version 4.5.0 beta (2025-04-02 r88102)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] survminer_0.5.0             ggpubr_0.6.0               
#>  [3] ggplot2_3.5.2               genefu_2.41.0              
#>  [5] AIMS_1.41.0                 e1071_1.7-16               
#>  [7] iC10_2.0.2                  biomaRt_2.65.0             
#>  [9] survcomp_1.59.0             prodlim_2024.06.25         
#> [11] survival_3.8-3              Biobase_2.69.0             
#> [13] BiocGenerics_0.55.0         generics_0.1.3             
#> [15] GSgalgoR_1.19.0             breastCancerUPP_1.45.0     
#> [17] breastCancerTRANSBIG_1.45.0 BiocStyle_2.37.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] jsonlite_2.0.0          magrittr_2.0.3          magick_2.8.6           
#>   [4] SuppDists_1.1-9.9       farver_2.1.2            rmarkdown_2.29         
#>   [7] vctrs_0.6.5             memoise_2.0.1           tinytex_0.57           
#>  [10] rstatix_0.7.2           htmltools_0.5.8.1       progress_1.2.3         
#>  [13] curl_6.2.2              broom_1.0.8             Formula_1.2-5          
#>  [16] sass_0.4.10             parallelly_1.43.0       KernSmooth_2.23-26     
#>  [19] bslib_0.9.0             httr2_1.1.2             impute_1.83.0          
#>  [22] zoo_1.8-14              cachem_1.1.0            commonmark_1.9.5       
#>  [25] iterators_1.0.14        lifecycle_1.0.4         pkgconfig_2.0.3        
#>  [28] Matrix_1.7-3            R6_2.6.1                fastmap_1.2.0          
#>  [31] GenomeInfoDbData_1.2.14 future_1.40.0           digest_0.6.37          
#>  [34] colorspace_2.1-1        AnnotationDbi_1.71.0    S4Vectors_0.47.0       
#>  [37] nsga2R_1.1              RSQLite_2.3.9           labeling_0.4.3         
#>  [40] filelock_1.0.3          km.ci_0.5-6             httr_1.4.7             
#>  [43] abind_1.4-8             compiler_4.5.0          proxy_0.4-27           
#>  [46] bit64_4.6.0-1           withr_3.0.2             doParallel_1.0.17      
#>  [49] backports_1.5.0         carData_3.0-5           DBI_1.2.3              
#>  [52] ggsignif_0.6.4          lava_1.8.1              rappdirs_0.3.3         
#>  [55] tools_4.5.0             iC10TrainingData_2.0.1  future.apply_1.11.3    
#>  [58] bootstrap_2019.6        glue_1.8.0              gridtext_0.1.5         
#>  [61] grid_4.5.0              cluster_2.1.8.1         gtable_0.3.6           
#>  [64] KMsurv_0.1-5            class_7.3-23            tidyr_1.3.1            
#>  [67] data.table_1.17.0       hms_1.1.3               xml2_1.3.8             
#>  [70] car_3.1-3               XVector_0.49.0          markdown_2.0           
#>  [73] foreach_1.5.2           pillar_1.10.2           stringr_1.5.1          
#>  [76] limma_3.65.0            splines_4.5.0           ggtext_0.1.2           
#>  [79] dplyr_1.1.4             BiocFileCache_2.17.0    lattice_0.22-7         
#>  [82] bit_4.6.0               tidyselect_1.2.1        Biostrings_2.77.0      
#>  [85] knitr_1.50              gridExtra_2.3           litedown_0.7           
#>  [88] bookdown_0.43           IRanges_2.43.0          pamr_1.57              
#>  [91] stats4_4.5.0            xfun_0.52               statmod_1.5.0          
#>  [94] stringi_1.8.7           UCSC.utils_1.5.0        yaml_2.3.10            
#>  [97] evaluate_1.0.3          codetools_0.2-20        tibble_3.2.1           
#> [100] BiocManager_1.30.25     cli_3.6.4               survivalROC_1.0.3.1    
#> [103] xtable_1.8-4            munsell_0.5.1           jquerylib_0.1.4        
#> [106] survMisc_0.5.6          Rcpp_1.0.14             GenomeInfoDb_1.45.0    
#> [109] rmeta_3.0               globals_0.16.3          dbplyr_2.5.0           
#> [112] png_0.1-8               parallel_4.5.0          mco_1.17               
#> [115] blob_1.2.4              prettyunits_1.2.0       mclust_6.1.1           
#> [118] listenv_0.9.1           scales_1.3.0            purrr_1.0.4            
#> [121] crayon_1.5.3            rlang_1.1.6             KEGGREST_1.49.0