1 Introduction

In this example, we will use clinical data and three types of ’omic data for binary classification of breast tumours. We also use several strategies and definitions of similarity to create features.

For this we will use data from the The Cancer Genome Atlas, and will integrate four types of -omic data:

gene expression from Agilent mRNA microarrays
DNA methylation (Illumina HumanMethylation 27K microarrays))
proteomic measures from reverse-phase protein arrays, and
miRNA sequencing

Figure 1 shows the rules for converting patient data into similarity networks, which serve as units of input (or “features”) for the model.

Gene expression: Features are defined at the level of pathways; i.e. a feature groups genes corresponding to the pathway. Similarity is defined as pairwise Pearson correlation
Clinical variables: Each variable is its own feature and similarity is defined as normalized difference.
Proteomic and methylation data: Features are defined at the level of the entire data layer; a single feature is created for all of proteomic data, and the same for methylation. Similarity is defined by pairwise Pearson correlation

Figure 1: Predictor design

The overall workflow for building and testing the predictor is shown in Figure 2.

We start with a dataset with 295 tumours. 90% of the samples are used to build the predictor (buildPredictor()) while 10% are randomly subsampled and set aside for independent validation (subsampleValidationData()). Building the predictor involves:

Splitting samples into 80% training and 20% test samples (proportions can be changed by setting parameters for buildPredictor())
Running feature selection using the training samples, so that features are scored from 0 to some user-specified max value.
Features passing cutoff are used to classify the 20% test samples.

This process is repeated with multiple random train/test splits to generate an average performance of the model on training data. Features with a consistent high score across all train/test splits are selected to build a final model.

The model is then evaluated on the independent validation set (here, the held-out 10% of samples), using only the consistently high scoring features (i.e. features passing selection). This steps ascertains the model’s test set performance.

Figure 2: Workflow

In this example, we build a minimal predictor to ensure the vignette builds in a feasible time. Here we use two train/test splits, score features from 0-2, call features scoring >0 selected features within a given train/test split. For building the final model, we choose features that score 1 or higher in at least half the splits. These are not realistic parameters because they do not sufficiently resample training data to ensure generalizability.

In practice reasonable values include 10 train/test splits, features scored at least out of 10, and the final model being built with features scoring 7 or higher. Try several designs to see which consistently generalizes to an independent test set (hyperparameter tuning).

2 Setup

Load the netDx package.

suppressWarnings(suppressMessages(require(netDx)))

3 Get and Prepare Data

In this example, we use curated multi-modal data from The Cancer Genome Atlas, gotten from the BioConductor curatedTCGAData package. Data for all cancer types profiled in TCGA are available through this package; see this tutorial for details.

suppressMessages(library(curatedTCGAData))

Let’s take a look at the available data for breast cancer, without downloading any (set dry.run=TRUE). Note that from BioC 3.13 onwards, users may fetch one of two versions of TCGA data.

curatedTCGAData(diseaseCode="BRCA", assays="*",dry.run=TRUE,version="1.1.38")

## See '?curatedTCGAData' for 'diseaseCode' and 'assays' inputs

## Querying EH with: BRCA_CNASeq-20160128

## Querying EH with: BRCA_CNASNP-20160128

## Querying EH with: BRCA_CNVSNP-20160128

## Querying EH with: BRCA_GISTIC_AllByGene-20160128

## Querying EH with: BRCA_GISTIC_Peaks-20160128

## Querying EH with: BRCA_GISTIC_ThresholdedByGene-20160128

## Querying EH with: BRCA_Methylation_methyl27-20160128_assays

## Querying EH with: BRCA_Methylation_methyl27-20160128_se

## Querying EH with: BRCA_Methylation_methyl450-20160128_assays

## Querying EH with: BRCA_Methylation_methyl450-20160128_se

## Querying EH with: BRCA_miRNASeqGene-20160128

## Querying EH with: BRCA_mRNAArray-20160128

## Querying EH with: BRCA_Mutation-20160128

## Querying EH with: BRCA_RNASeq2GeneNorm-20160128

## Querying EH with: BRCA_RNASeqGene-20160128

## Querying EH with: BRCA_RPPAArray-20160128

##     ah_id                                      title file_size
## 1   EH584                       BRCA_CNASeq-20160128      0 Mb
## 2   EH585                       BRCA_CNASNP-20160128    9.8 Mb
## 3   EH586                       BRCA_CNVSNP-20160128    2.8 Mb
## 4   EH588             BRCA_GISTIC_AllByGene-20160128    1.3 Mb
## 5  EH2121                 BRCA_GISTIC_Peaks-20160128      0 Mb
## 6   EH589     BRCA_GISTIC_ThresholdedByGene-20160128    0.4 Mb
## 7  EH2122  BRCA_Methylation_methyl27-20160128_assays   63.2 Mb
## 8  EH2123      BRCA_Methylation_methyl27-20160128_se    0.4 Mb
## 9  EH2124 BRCA_Methylation_methyl450-20160128_assays 2613.2 Mb
## 10 EH2125     BRCA_Methylation_methyl450-20160128_se    6.1 Mb
## 11  EH593                 BRCA_miRNASeqGene-20160128    0.6 Mb
## 12  EH594                    BRCA_mRNAArray-20160128   27.3 Mb
## 13  EH595                     BRCA_Mutation-20160128    4.5 Mb
## 14  EH596              BRCA_RNASeq2GeneNorm-20160128   64.5 Mb
## 15  EH597                   BRCA_RNASeqGene-20160128     30 Mb
## 16  EH598                    BRCA_RPPAArray-20160128    1.6 Mb
##                    rdataclass rdatadateadded rdatadateremoved
## 1            RaggedExperiment     2017-10-10             <NA>
## 2            RaggedExperiment     2017-10-10             <NA>
## 3            RaggedExperiment     2017-10-10             <NA>
## 4        SummarizedExperiment     2017-10-10             <NA>
## 5  RangedSummarizedExperiment     2019-01-09             <NA>
## 6        SummarizedExperiment     2017-10-10             <NA>
## 7        SummarizedExperiment     2019-01-09             <NA>
## 8        SummarizedExperiment     2019-01-09             <NA>
## 9            RaggedExperiment     2019-01-09             <NA>
## 10       SummarizedExperiment     2019-01-09             <NA>
## 11       SummarizedExperiment     2017-10-10             <NA>
## 12       SummarizedExperiment     2017-10-10             <NA>
## 13           RaggedExperiment     2017-10-10             <NA>
## 14       SummarizedExperiment     2017-10-10             <NA>
## 15       SummarizedExperiment     2017-10-10             <NA>
## 16       SummarizedExperiment     2017-10-10             <NA>

In this call we fetch only the gene expression, proteomic and methylation data; setting dry.run=FALSE initiates the fetching of the data.

brca <- suppressMessages(
   curatedTCGAData("BRCA",
               c("mRNAArray","miRNA*","Methylation_methyl27*"),
    dry.run=FALSE,version="1.1.38"))

This call returns a MultiAssayExperiment object. Recall that this is a container for storing multiple assays performed on the same set of samples. See this tutorial to learn more.

3.1 MultiAssayExperiment object

Skip this section if you are familiar with MultiAssayExperiment objects.

Let’s briefly explore the brca MultiAssayExperiment object.

brca

## A MultiAssayExperiment object of 3 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 3:
##  [1] BRCA_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 849 columns
##  [2] BRCA_mRNAArray-20160128: SummarizedExperiment with 17814 rows and 590 columns
##  [3] BRCA_Methylation_methyl27-20160128: SummarizedExperiment with 27578 rows and 343 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

assays() returns a list with all -omic data associated with this object.

summary(assays(brca))

## [1] "List object of length 3 with 0 metadata columns"

names() shows the datatypes in each slot of assays():

names(assays(brca))

## [1] "BRCA_miRNASeqGene-20160128"         "BRCA_mRNAArray-20160128"           
## [3] "BRCA_Methylation_methyl27-20160128"

So miRNA data is in slot 1, gene expression in slot 2, etc.,

We can subset the data to see what it looks like. Let’s do that for the miRNA data, looking at just the first five measures

mir <- assays(brca)[["BRCA_miRNASeqGene-20160128"]]
head(mir[,1:5])

##              TCGA-3C-AAAU-01A-11R-A41G-13 TCGA-3C-AALI-01A-11R-A41G-13
## hsa-let-7a-1                        95618                        49201
## hsa-let-7a-2                       189674                        98691
## hsa-let-7a-3                        96815                        49035
## hsa-let-7b                         264034                       148591
## hsa-let-7c                           3641                         5095
## hsa-let-7d                           4333                         3263
##              TCGA-3C-AALJ-01A-31R-A41G-13 TCGA-3C-AALK-01A-11R-A41G-13
## hsa-let-7a-1                        75342                        57278
## hsa-let-7a-2                       150472                       114320
## hsa-let-7a-3                        76206                        57540
## hsa-let-7b                          99938                       164553
## hsa-let-7c                           5799                        18464
## hsa-let-7d                           5658                         2114
##              TCGA-4H-AAAK-01A-12R-A41G-13
## hsa-let-7a-1                        67196
## hsa-let-7a-2                       134563
## hsa-let-7a-3                        67607
## hsa-let-7b                         136918
## hsa-let-7c                          20429
## hsa-let-7d                           2162

Patient metadata is contained in the colData() slot. Rows contain data for each patient and columns contain measures such as clinical characteristics:

pheno <- colData(brca)
colnames(pheno)[1:20]

##  [1] "patientID"                           
##  [2] "years_to_birth"                      
##  [3] "vital_status"                        
##  [4] "days_to_death"                       
##  [5] "days_to_last_followup"               
##  [6] "tumor_tissue_site"                   
##  [7] "pathologic_stage"                    
##  [8] "pathology_T_stage"                   
##  [9] "pathology_N_stage"                   
## [10] "pathology_M_stage"                   
## [11] "gender"                              
## [12] "date_of_initial_pathologic_diagnosis"
## [13] "days_to_last_known_alive"            
## [14] "radiation_therapy"                   
## [15] "histological_type"                   
## [16] "number_of_lymph_nodes"               
## [17] "race"                                
## [18] "ethnicity"                           
## [19] "admin.bcr"                           
## [20] "admin.day_of_dcc_upload"

head(pheno[,1:5])

## DataFrame with 6 rows and 5 columns
##                 patientID years_to_birth vital_status days_to_death
##               <character>      <integer>    <integer>     <integer>
## TCGA-A1-A0SB TCGA-A1-A0SB             70            0            NA
## TCGA-A1-A0SD TCGA-A1-A0SD             59            0            NA
## TCGA-A1-A0SE TCGA-A1-A0SE             56            0            NA
## TCGA-A1-A0SF TCGA-A1-A0SF             54            0            NA
## TCGA-A1-A0SG TCGA-A1-A0SG             61            0            NA
## TCGA-A1-A0SH TCGA-A1-A0SH             39            0            NA
##              days_to_last_followup
##                          <integer>
## TCGA-A1-A0SB                   259
## TCGA-A1-A0SD                   437
## TCGA-A1-A0SE                  1321
## TCGA-A1-A0SF                  1463
## TCGA-A1-A0SG                   434
## TCGA-A1-A0SH                  1437

3.2 Prepare data

This next code block prepares the TCGA data. This includes:

removing duplicate samples
reformatting patient IDs (e.g. removing spaces and hyphens)
creating an ID column in colData(brca), which contains unique patient IDs
creating a STATUS column in colData(brca) which contains the patient labels (i.e what we want netDx to classify).

In practice you would prepare the dataset once and save it to file, then separately load it before running netDx; i.e. decouple data processing and running the predictor. The data processing code has been moved into a supporting file, prepare_data.R. You can explore it after the lab to see how some things are achieved (e.g. removing duplicate samples). For now, let’s just run it.

source("prepare_data.R")
brca <- prepareData(brca,setBinary=TRUE)

## harmonizing input:
##   removing 28 sampleMap rows with 'colname' not in colnames of experiments

## harmonizing input:
##   removing 44 sampleMap rows with 'colname' not in colnames of experiments

## harmonizing input:
##   removing 19 sampleMap rows with 'colname' not in colnames of experiments

The important thing is to create ID and STATUS columns in the sample metadata slot. netDx uses these to get the patient identifiers and labels, respectively.

pID <- colData(brca)$patientID
colData(brca)$ID <- pID

3.3 Holdout validation set

The function subSampleValidationData partitions the TCGA data into two smaller datasets: a training and holdout validation dataset. This is to facilitate model validation after an initial model is built by the netDx algorithm - netDx will train a model on the training dataset through the buildPredictor() call, and this model can validate the held-out validation dataset through the predict() function call.

set.seed(123)
dsets <- subsampleValidationData(brca,pctValidation=0.1)
brca <- dsets$trainMAE
holdout <- dsets$validationMAE

4 Rules to create features (patient similarity networks)

Now let’s set up the data for input to netDx.

netDx allows the user to define how data is converted into patient similarity networks (or PSNs), which are the features that go into the model. This is done specifically by telling the model how to:

group different types of data and
define similarity for each of these (e.g. Pearson correlation, normalized difference, etc.,).

The relevant input parameters are:

groupList: sets of input data that would correspond to individual networks (e.g. genes grouped into pathways)
sims: a list specifying similarity metrics for each data layer

4.0.1 groupList: Grouping variables to define features

What is this: groupList tells netDx how to group measures within a layer, to generate a PSN. Measures could be individual genes, proteins, CpG bases (in DNA methylation data), clinical variables, etc.,

In this simple example we just create a single PSN for each datatype, containing all measures from that datatype.

In the code below, we fetch pathway definitions for January 2021 from https://downloads.res.oicr.on.ca/pailab/ and group gene expression data by pathways. To keep this example short, we limit to only three pathways, but in practice you would use all pathways meeting a size criterion; e.g. those containing between 10 and 500 genes (see the MIN_SIZE and MAX_SIZE parameters of readPathways()).

groupList <- list()

# genes in mRNA data are grouped by pathways
pathFile <- sprintf("%s/extdata/pathway_ex3.gmt", path.package("netDx"))
pathList <- suppressMessages(readPathways(pathFile))
groupList[["BRCA_mRNAArray-20160128"]] <- pathList

Let’s take a look at groupList. Here is the first tier, which currently only has gene expression data. You can see that groupList has three features for gene expression data (Length is 3).

summary(groupList)

##                         Length Class  Mode
## BRCA_mRNAArray-20160128 3      -none- list

Now we look at what comprises the pathway-features. Let’s look at the names of the pathways:

names(groupList[["BRCA_mRNAArray-20160128"]])

## [1] "STEARATE_BIOSYNTHESIS_I__ANIMALS_"      
## [2] "PUTRESCINE_DEGRADATION_III"             
## [3] "TRYPTOPHAN_DEGRADATION_III__EUKARYOTIC_"

How many genes are in the first pathway? Take a look at the genes in the pathway, using head()

length(groupList[["BRCA_mRNAArray-20160128"]][[1]])

## [1] 13

head(groupList[["BRCA_mRNAArray-20160128"]][[1]])

## [1] "ELOVL1" "ACOT7"  "ACSL1"  "ACSL5"  "ELOVL6" "ACSL4"

For clinical data, we do not group variables. Rather, we create one feature each for two variables:

groupList[["clinical"]] <- list(
    age="patient.age_at_initial_pathologic_diagnosis",
      stage="STAGE"
)

For miRNA sequencing, methylation, and proteomic data we create one feature each, where each feature contains all measures for that data type.

tmp <- list(rownames(experiments(brca)[[1]]));
names(tmp) <- names(brca)[1]
groupList[[names(brca)[[1]]]] <- tmp

tmp <- list(rownames(experiments(brca)[[2]]));
names(tmp) <- names(brca)[2]
groupList[[names(brca)[2]]] <- tmp

tmp <- list(rownames(experiments(brca)[[3]]));
names(tmp) <- names(brca)[3]
groupList[[names(brca)[3]]] <- tmp

4.0.2 `sims`: Define patient similarity for each network

What is this: sims is used to define similarity metrics for each layer. This is done by providing a single list - here, sims - that specifies the choice of similarity metric to use for each data layer. The names() for this list must match those in groupList. The corresponding value can either be a character if specifying a built-in similarity function, or a function. The latter is used if the user wishes to specify a custom similarity function.

The current available options for built-in similarity measures are:

pearsonCorr: Pearson correlation (n>5 measures in set)
normDiff: normalized difference (single measure such as age)
avgNormDiff: average normalized difference (small number of measures)
sim.pearscale: Pearson correlation followed by exponential scaling
sim.eucscale: Euclidean distance followed by exponential scaling

In this example, we choose Pearson correlation similarity for all data layers except for the single-variable features in the clinical layer. For that we use normalized difference.

sims <- list(
  a="pearsonCorr",
  b="normDiff",
  c="pearsonCorr",
  d="pearsonCorr"
  )

# map layer names to sims
names(sims) <- names(groupList)

5 Build predictor

Now we’re ready to train our model. netDx uses parallel processing to speed up compute time. Let’s use 75% available cores on the machine for this example. netDx also throws an error if provided an output directory that already has content, so let’s clean that up as well.

nco <- round(parallel::detectCores()*0.75) # use 75% available cores
message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))

## Using 54 of 72 cores

outDir <- paste(tempdir(),"pred_output",sep=getFileSep()) # use absolute path
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
numSplits <- 2L

Finally we call the function that builds the predictor. We provide:

patient data (dataList)
grouping rules (groupList)
function to create PSN from data, includes choice of similarity metric (makeNetFunc)
number of train/test splits over which to collect feature scores and average performance (numSplits),
maximum score for features in one round of feature selection (featScoreMax, set to 10)
threshold to call feature-selected networks for each train/test split (featSelCutoff); only features scoring this value or higher will be used to classify test patients,
number of cores to use for parallel processing (numCores).

The call below runs two train/test splits. Within each split, it:

splits data into train/test using the default split of 80:20 (trainProp=0.8)
score networks between 0 to 2 (i.e. featScoreMax=2L)
uses networks that score >=1 out of 2 (featSelCutoff=1L) to classify test samples for that split.

In practice a good starting point is featScoreMax=10, featSelCutoff=9 and numSplits=10L, but these parameters depend on the sample sizes in the dataset and heterogeneity of the samples.

t0 <- Sys.time()
model <- suppressMessages(
buildPredictor(
    dataList=brca,          ## your data
    groupList=groupList,    ## grouping strategy
    sims=sims,
    outDir=outDir,          ## output directory
    trainProp=0.8,          ## pct of samples to use to train model in
                                ## each split
    numSplits=2L,            ## number of train/test splits
    featSelCutoff=1L,       ## threshold for calling something
                                ## feature-selected
    featScoreMax=2L,    ## max score for feature selection
 numCores=nco,            ## set higher for parallelizing
 debugMode=FALSE,
 keepAllData=FALSE,     ## set to TRUE for debugging
 logging="none"     ## set to "default" for messages
  ))
t1 <- Sys.time()
print(t1-t0)

## Time difference of 15.3825 mins

6 Examine results

We now use getResults() to fetch the model performance for the various train/test splits as well as feature scores:

results <- getResults(
    model,
    unique(colData(brca)$STATUS),
    featureSelCutoff=2L,
    featureSelPct=0.50
  )

## Detected 2 splits and 2 classes

## * Plotting performance

## * Compiling feature scores and calling selected features

results contains performance, selectedFeatures for each patient label, and the table of feature scores.

summary(results)

##                  Length Class  Mode
## selectedFeatures 2      -none- list
## featureScores    2      -none- list
## performance      4      -none- list

Look at the performance:

results$performance

## $meanAccuracy
## [1] 0.8402542
## 
## $splitAccuracy
## [1] 0.8500000 0.8305085
## 
## $splitAUROC
## [1] 1 1
## 
## $splitAUPR
## [1] 0.9767442 0.9761905

Look at feature scores for all labels, across all train-test splits:

results$featureScores

## $Luminal.A
##                              Feature Split1 Split2
## 1 BRCA_Methylation_methyl27-20160128      1      1
## 2            BRCA_mRNAArray-20160128      2      2
## 3         BRCA_miRNASeqGene-20160128     NA      1
## 4                                age     NA      2
## 5                              stage      1     NA
## 
## $other
##                              Feature Split1 Split2
## 1 BRCA_Methylation_methyl27-20160128      2      2
## 2            BRCA_mRNAArray-20160128      2      2
## 3         BRCA_miRNASeqGene-20160128      1      1
## 4                                age      2      2
## 5                              stage      2      2

Let’s examine our confusion matrix:

confMat <- confusionMatrix(model)

Note: Rows of this matrix don’t add up to 100% because the matrix is an average of the confusion matrices from all of the train/test splits.

And here are selected features, which are those scoring 2 out of 2 in at least half of the splits. This threshold is simply for illustration. In practice we would run at least 10 train/test splits (ideally 100+), and look for features that score 7+ out of 10 in >70% splits.

results$selectedFeatures

## $Luminal.A
## [1] "BRCA_mRNAArray-20160128" "age"                    
## 
## $other
## [1] "BRCA_Methylation_methyl27-20160128" "BRCA_mRNAArray-20160128"           
## [3] "age"                                "stage"

7 Validate on independent samples

Now we use predict() to classify samples in the independent dataset. We provide the model with feature design rules in groupList, the list of selected features to use in featSelNet, the function to convert data into patient similarity networks in makeNets, as well as the original and validated datasets in brca and holdout respectively.

The training data needs to be provided because netDx creates a single patient similarity network with both training and test data. It then uses label propagation to “diffuse” patient labels from training samples to test samples, and labels the latter based on which class they are most similar to.

outDir <- paste(tempdir(), randAlphanumString(), 
  sep = getFileSep())
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)
dir.create(outDir)

predModel <- suppressMessages(
  predict(trainMAE=brca, testMAE=holdout, 
    groupList=groupList, 
    selectedFeatures=results$selectedFeatures, 
    sims=sims,
    outDir=outDir, verbose = FALSE)
)

##            TT_STATUS
## STATUS      TEST TRAIN
##   Luminal.A   16   214
##   other       16    81
##            PRED_CLASS
## STATUS      Luminal.A other
##   Luminal.A        15     1
##   other             0    16

8 Plot results of validation

Finally we examine how well our model performed, using getPerformance().

Compute performance:

perf <- getPerformance(predModel, 
  unique(colData(brca)$STATUS))
summary(perf)

##          Length Class       Mode   
## rocCurve 1      performance S4     
## prCurve  1      performance S4     
## auroc    1      -none-      numeric
## aupr     1      -none-      numeric
## accuracy 1      -none-      numeric

We plot the AUROC and AUPR curves using plotPerf_multi(). In this example we get perfect separation of the two classes.

plotPerf_multi(list(perf$rocCurve),
  plotTitle = sprintf(
    "BRCA Validation: %i samples", 
    nrow(colData(holdout))))

plotPerf_multi(list(perf$prCurve), 
  plotType = "PR",
  plotTitle = sprintf(
    "BRCA Validation: %i samples", 
    nrow(colData(holdout))))

8.1 Integrated patient similarity network

We finally get the integrated PSN and visualize it using a tSNE plot:

## this call doesn't work in Rstudio; for now we've commented this out and saved the PSN file. 
psn <- suppressMessages(getPSN(
  brca,
  groupList,
  sims=sims,
  selectedFeatures=results$selectedFeatures
))

## Warning in dir.create(paste(netDir, "profiles", sep = fsep)):
## '/tmp/RtmpiGvVno/profiles' already exists

We can plot a lower dimensional representation of the patient similarities using a tSNE plot. This call requires that you install the Rtsne package:

library(Rtsne)
tsne <- tSNEPlotter(
    psn$patientSimNetwork_unpruned, 
    colData(brca)
    )

## * Making symmetric matrix

## * Running tSNE

## * Plotting

9 sessionInfo

sessionInfo()

## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=C                 LC_NUMERIC=C              
##  [3] LC_TIME=C                  LC_COLLATE=C              
##  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Rtsne_0.17                  HDF5Array_1.33.0           
##  [3] DelayedArray_0.31.0         SparseArray_1.5.0          
##  [5] S4Arrays_1.5.0              abind_1.4-5                
##  [7] Matrix_1.7-0                rhdf5_2.49.0               
##  [9] curatedTCGAData_1.25.4      MultiAssayExperiment_1.31.0
## [11] SummarizedExperiment_1.35.0 Biobase_2.65.0             
## [13] GenomicRanges_1.57.0        GenomeInfoDb_1.41.0        
## [15] IRanges_2.39.0              S4Vectors_0.43.0           
## [17] BiocGenerics_0.51.0         MatrixGenerics_1.17.0      
## [19] matrixStats_1.3.0           netDx_1.17.0               
## [21] BiocStyle_2.33.0           
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.2               rlang_1.1.3             magrittr_2.0.3         
##  [4] compiler_4.4.0          RSQLite_2.3.6           png_0.1-8              
##  [7] vctrs_0.6.5             reshape2_1.4.4          combinat_0.0-8         
## [10] stringr_1.5.1           pkgconfig_2.0.3         shape_1.4.6.1          
## [13] crayon_1.5.2            fastmap_1.1.1           dbplyr_2.5.0           
## [16] XVector_0.45.0          labeling_0.4.3          utf8_1.2.4             
## [19] rmarkdown_2.26          pracma_2.4.4            UCSC.utils_1.1.0       
## [22] purrr_1.0.2             bit_4.0.5               xfun_0.43              
## [25] glmnet_4.1-8            zlibbioc_1.51.0         cachem_1.0.8           
## [28] jsonlite_1.8.8          blob_1.2.4              highr_0.10             
## [31] rhdf5filters_1.17.0     Rhdf5lib_1.27.0         uuid_1.2-0             
## [34] parallel_4.4.0          R6_2.5.1                bslib_0.7.0            
## [37] stringi_1.8.3           RColorBrewer_1.1-3      jquerylib_0.1.4        
## [40] Rcpp_1.0.12             bookdown_0.39           iterators_1.0.14       
## [43] knitr_1.46              BiocBaseUtils_1.7.0     splines_4.4.0          
## [46] igraph_2.0.3            tidyselect_1.2.1        yaml_2.3.8             
## [49] doParallel_1.0.17       codetools_0.2-20        curl_5.2.1             
## [52] lattice_0.22-6          tibble_3.2.1            plyr_1.8.9             
## [55] withr_3.0.0             KEGGREST_1.45.0         ROCR_1.0-11            
## [58] evaluate_0.23           survival_3.6-4          BiocFileCache_2.13.0   
## [61] ExperimentHub_2.13.0    Biostrings_2.73.0       pillar_1.9.0           
## [64] BiocManager_1.30.22     filelock_1.0.3          foreach_1.5.2          
## [67] generics_0.1.3          BiocVersion_3.20.0      ggplot2_3.5.1          
## [70] munsell_0.5.1           scales_1.3.0            glue_1.7.0             
## [73] tools_4.4.0             AnnotationHub_3.13.0    grid_4.4.0             
## [76] plotrix_3.8-4           bigmemory_4.6.4         AnnotationDbi_1.67.0   
## [79] colorspace_2.1-0        GenomeInfoDbData_1.2.12 cli_3.6.2              
## [82] rappdirs_0.3.3          bigmemory.sri_0.1.8     fansi_1.0.6            
## [85] dplyr_1.1.4             gtable_0.3.5            sass_0.4.9             
## [88] digest_0.6.35           farver_2.1.1            memoise_2.0.1          
## [91] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
## [94] mime_0.12               bit64_4.0.5

01. Build & test classifier with clinical and multi-omic data & pathway features

2024-05-01

Package

1 Introduction

2 Setup

3 Get and Prepare Data

3.1 MultiAssayExperiment object

3.2 Prepare data

3.3 Holdout validation set

4 Rules to create features (patient similarity networks)

4.0.1 groupList: Grouping variables to define features

4.0.2 `sims`: Define patient similarity for each network

5 Build predictor

6 Examine results

7 Validate on independent samples

8 Plot results of validation

8.1 Integrated patient similarity network

9 sessionInfo

01. Build & test classifier with clinical and multi-omic data & pathway features

2024-05-01

Package

1 Introduction

2 Setup

3 Get and Prepare Data

3.1 MultiAssayExperiment object

3.2 Prepare data

3.3 Holdout validation set

4 Rules to create features (patient similarity networks)

4.0.1 groupList: Grouping variables to define features

4.0.2 sims: Define patient similarity for each network

5 Build predictor

6 Examine results

7 Validate on independent samples

8 Plot results of validation

8.1 Integrated patient similarity network

9 sessionInfo

4.0.2 `sims`: Define patient similarity for each network