1 Package overview

This overview describes the datasets available in curatedPCaData (R package version 1.5.0), which are distributed via ExperimentHub cloud services. The main data class is a MultiAssayExperiment (MAE) object compatible with numerous Bioconductor packages.

Three different omics base data types and accompanying clinical/phenotype data are currently available:

gex.* assays contain gene expression values, with the suffix wildcard indicating unit or method for gene expression
cna.* assays contain copy number values, with the suffix wildcard indicating method for copy number alterations
mut assays contain somatic mutation calls
MultiAssayExperiment::colData(maeobj) contains the clinical metadata curated based on a pre-defined template

Their availability depends on the study in question, and the coverage of each omics type per study is summarized below. Furthermore, derived variables based on these base data types are provided in the constructed MultiAssayExperiment (MAE) class objects.

For a comprehensive guide on how to handle such MAE objects, refer to the MultiAssayExperiment user guide (or cheat-sheets): MAE User Guide (https://www.bioconductor.org/packages/devel/bioc/vignettes/MultiAssayExperiment/inst/doc/MultiAssayExperiment.html).

2 R-package

The curatedPCaData package contains a collection of manually curated datasets concerning patients diagnosed with prostate cancer. The datasets within this package follow uniform processing and naming conventions, so that users can more easily reproduce similar analyses across datasets and spend less time harmonizing data from different sources.

3 Downloading data from ExperimentHub or loading them from local cache

For a full list of available datasets, see the documentation of the getPCa function, or query ExperimentHub directly for the components used to construct the MultiAssayExperiments for the studies. Note, however, that getPCa is designed to provide readily usable, multi-omics-compatible MAE objects.

3.1 All available datasets

Fetching all datasets available in curatedPCaData:

library(curatedPCaData)

# Use a function to extract all known study short identifiers
studies <- curatedPCaData::getPCaStudies()
studies
#>  [1] "abida"        "baca"         "barbieri"     "barwick"      "chandran"    
#>  [6] "friedrich"    "hieronymus"   "icgcca"       "igc"          "kim"         
#> [11] "kunderfranco" "ren"          "sun"          "taylor"       "tcga"        
#> [16] "true"         "wallace"      "wang"         "weiner"

# List apply across studies to extract all MAE objects corresponding to the
# short identifiers
maes <- lapply(studies, FUN = \(id) {
    curatedPCaData::getPCa(id)
})
names(maes) <- studies
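
Because each call may trigger a download, a single failure would abort the whole loop above. The following is a minimal sketch of a more tolerant loop, not part of the package: fetch_all and the stand-in fake_fetch are hypothetical helpers introduced here for illustration, and in practice curatedPCaData::getPCa would be passed as the fetch argument.

```r
# Hypothetical helper: apply `fetch` to each id, keeping failures as NULL
# instead of stopping at the first error.
fetch_all <- function(ids, fetch) {
    out <- lapply(ids, function(id) {
        tryCatch(fetch(id), error = function(e) {
            message("Could not fetch '", id, "': ", conditionMessage(e))
            NULL
        })
    })
    names(out) <- ids
    out
}

# Stand-in fetcher used only for demonstration; replace it with
# curatedPCaData::getPCa when the package is available.
fake_fetch <- function(id) {
    if (id == "broken") stop("simulated download failure")
    paste("MAE for", id)
}

res <- fetch_all(c("taylor", "broken", "tcga"), fake_fetch)

# Keep only the studies that were fetched successfully
res_ok <- Filter(Negate(is.null), res)
names(res_ok)
```

With the real function, the call would read fetch_all(studies, curatedPCaData::getPCa).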

3.1.1 Dataset criteria

The datasets were manually selected based on various criteria, such as:

Primary data availability (preferably raw data available)
Data platform types and their overlap (gene expression, copy number alteration, mutation data, …)
End points (e.g. recurrence, Gleason, …)
Clinical metadata availability and reliability
Design of the study

3.1.2 Studies

The function getPCa uses each study’s short name to identify which data to extract. An overview of the main datasets is as follows:

# Create a summary table depicting key features in available studies
studytable <- curatedPCaData::getPCaSummaryStudies(maes)

Table 1: Key study characteristics
Study short id	Sample types	GEX/CNA/MUT platform(s)	Notes	Data source	Reference(s)
abida	metastatic: 444			cBioPortal	Abida et al.
baca	metastatic: 2, primary: 55			cBioPortal	Baca et al.
barbieri	primary: 123			cBioPortal	Barbieri et al.
barwick	primary: 146	Custom DASL		GEO	Barwick et al.
chandran	metastatic: 25, normal: 81, primary: 65	GPL8300 [HG_U95Av2]		GEO	Chandran et al., Yu et al.
friedrich	BPH: 39, normal: 52, primary: 164	Custom Agilent array		GEO	Friedrich et al.
hieronymus	primary: 104	GPL8737 Agilent-021529 Human CGH	CNA only	GEO	Hieronymus et al.
icgcca	primary: 213		Canadian data from International Cancer Genome Collaboratory	ICGC Data Portal (PRAD-CA)	PRAD-CA in Zhang et al.
igc	primary: 83	GPL570 [HG-U133_Plus_2]		GEO	GEO accession code GSE2109
kim	primary: 266	GPL5188 [HuEx-1_0-st]		GEO	Kim et al.
kunderfranco	normal: 14, primary: 53	GPL887 Agilent-012097 Human 1A Microarray (V2)		GEO	Kunderfranco et al., Peraldo-Neia et al., Longoni et al.
ren	primary: 65			cBioPortal	Ren et al.
sun	primary: 79	GPL96 [HG-U133A]		GEO	Sun et al.
taylor	metastatic: 37, normal: 29, primary: 181	GEX: GPL5188 [HuEx-1_0-st], CNA: GPL4091 Agilent CGH	Also known as MSKCC	GEO	Taylor et al.
tcga	metastatic: 1, normal: 52, primary: 498			Xenabrowser	Cancer Genome Atlas Research Network, Goldman et al.
true	primary: 32	GPL3834 FHCRC Human Prostate PEDB cDNA v3 / v4		GEO	True et al.
wallace	normal: 20, primary: 69	GPL571 [HG-U133A_2]		GEO	Wallace et al.
wang	BPH: 55, atrophic: 21, primary: 60	GPL96 [HG-U133A]		GEO	Wang et al., Jia et al.
weiner	primary: 838	GPL5175 [HuEx-1_0-st]		GEO	Weiner et al.

Please note that the TCGA PCa dataset is a subset of the TCGA pan-cancer initiative. For a package focused exclusively on TCGA beyond the PRAD subset, see the Bioconductor package curatedTCGAData (https://bioconductor.org/packages/release/data/experiment/html/curatedTCGAData.html).

3.1.3 Citations

The use of curatedPCaData ought to be cited with (Laajala et al. 2023).

For the individual datasets therein, the following citations are suggested:

Abida et al. : (Abida et al. 2019)
Baca et al. : (Baca et al. 2013)
Barbieri et al. : (Barbieri et al. 2012)
Barwick et al. : (Barwick et al. 2010)
Chandran et al. : (Chandran et al. 2007)
Friedrich et al. : (Friedrich et al. 2020)
Hieronymus et al. : (Hieronymus et al. 2014)
ICGC (Canadian) : PRAD-CA in (Zhang et al. 2019)
IGC : GEO accession GSE2109
Kim et al. : (Kim et al. 2018)
Kunderfranco et al. : (Kunderfranco et al. 2010), (Peraldo-Neia et al. 2011), (Longoni et al. 2012)
Ren et al. : (Ren et al. 2018)
Sun et al. : (Sun and Goodison 2009)
Taylor et al. : (Taylor et al. 2010)
TCGA : (Abeshouse et al. 2015), (Goldman et al. 2020)
True et al. : (True et al. 2006)
Wallace et al. : (Wallace et al. 2008)
Wang et al. : (Wang et al. 2010), (Jia et al. 2011)
Weiner et al. : (Weiner et al. 2021)

3.1.4 Curated clinical variables

The curatedPCaData package has been curated with an emphasis on the following primary clinical metadata, which were extracted and cleaned whenever available:

Table 2: Template for prostate adenocarcinoma clinical metadata
col.name	var.class	uniqueness	requiredness	allowedvalues	description
study_name	character	non-unique	required	*	The study name that will link this information to the study meta-data.
patient_id	character	non-unique	required	*	A unique identifier for the patient. A single patient may have more than one sample taken.
sample_name	character	unique	required	*	Primary sample identifier
alt_sample_name	character	unique	optional	*	If an alternative identifier is available, for example in supplemental tables or any of the repositories, it is reported here.
overall_survival_status	integer	non-unique	optional	1 \| 0	Binarized status of the patient, where 1 represents death and 0 represents no reported death.
days_to_overall_survival	numeric	non-unique	optional	[1-10000]	Time to death or last follow-up in days.
age_at_initial_diagnosis	integer	non-unique	optional	[1-9][0-9]	Age at diagnosis in years.
year_diagnosis	integer	non-unique	optional	[1900-2010]	The year in which the patient was diagnosed with PCa.
gleason_grade	integer	non-unique	optional	2 \| 3 \| 4 \| 5 \| 6 \| 7 \| 8 \| 9 \| 10	Gleason grade: sum of Gleason major and minor.
gleason_major	integer	non-unique	optional	2 \| 3 \| 4 \| 5	Gleason grade for the major site score.
gleason_minor	integer	non-unique	optional	2 \| 3 \| 4 \| 5	Gleason grade for the minor site score.
source_of_gleason	character	non-unique	optional	biopsy \| prostatectomy \| tissue_block	Source from which the pathologist performed Gleason grading.
grade_group	character	non-unique	optional	<=6 \| 3+4 \| 4+3 \| 7 \| >=8	Separation of the Gleason grades into groups that show different prognosis; if the major+minor separation is not available, just the sum is shown (i.e. 7).
T_pathological	integer	non-unique	optional	1 \| 2 \| 3 \| 4	Pathological T stage (assessment made after surgery), based on tumor only.
T_substage_pathological	character	non-unique	optional	a \| b \| c	Pathological T substage (assessment made after surgery), based on tumor only.
T_clinical	integer	non-unique	optional	1 \| 2 \| 3 \| 4	Clinical T stage (at time of diagnosis) based on tumor only.
T_substage_clinical	character	non-unique	optional	a \| b \| c	Clinical T substage (at time of diagnosis) based on tumor only.
ERG_fusion_CNA	integer	non-unique	optional	1 \| 0	Presence of TMPRSS2:ERG gene fusion in prostate tumor as determined by copy number alteration analysis.
ERG_fusion_IHC	integer	non-unique	optional	1 \| 0	Presence of TMPRSS2:ERG gene fusion in prostate tumor as determined by immunohistochemistry.
ERG_fusion_GEX	integer	non-unique	optional	1 \| 0	Presence of TMPRSS2:ERG gene fusion in prostate tumor as determined by gene expression.
disease_specific_recurrence_status	integer	non-unique	optional	1 \| 0	Binarized status of the patient, where 1 represents a recurrence and 0 represents no reported recurrence.
days_to_disease_specific_recurrence	numeric	non-unique	optional	[1-10000]	Time to recurrence or last follow-up in days.
metastasis_occurrence_status	integer	non-unique	optional	1 \| 0	Binarized status of the patient, where 1 represents a metastatic occurrence and 0 represents no reported metastatic occurrence.
days_to_metastatic_occurrence	numeric	non-unique	optional	[1-10000]	Time to metastatic occurrence or last follow-up in days.
psa	numeric	non-unique	optional	[0-10000]	Prostate specific antigen level (numeric) at diagnosis.
race	character	non-unique	optional	caucasian \| african_american \| asian \| other	Ethnicity of the patient.
smoking_status	integer	non-unique	optional	1 \| 0	Binarized indicator for smoker (past or current) or never-smoker.
extraprostatic_extension	integer	non-unique	optional	1 \| 0	Spread of prostate cancer out of the prostate gland. Denotes a later stage of prostate cancer (NOTE: We do not currently distinguish between focal, established and multifocal, which are all currently translated into this template as y).
perineural_invasion	integer	non-unique	optional	1 \| 0	Cancer spreading to the space surrounding a nerve.
seminal_vesicle_invasion	integer	non-unique	optional	1 \| 0	Cancer has spread to the seminal vesicles.
angiolymphatic_invasion	integer	non-unique	optional	1 \| 0	Cancer has spread to blood vessels and lymph vessels.
androgen_ablation	integer	non-unique	optional	1 \| 0	Medical treatment to suppress or block the production of male sex hormones.
capsule	character	non-unique	optional	extensive \| focal \| intact	Status of the prostate capsule.
M_stage	character	non-unique	optional	X \| 0 \| 1	Metastasis status at the time of surgery. X: cannot evaluate distant metastasis, 0: there is no distant metastasis, 1: there is distant metastasis.
M_substage	character	non-unique	optional	[abc]	Letter substage in M1[abc]; M1a: the cancer has spread to lymph nodes beyond the regional ones, M1b: the cancer has spread to bone, M1c: the cancer has spread to other sites (regardless of bone involvement).
other_patient	character	non-unique	optional	*	A character string that captures any additional patient information, features separated by bar (e.g. feature_1=value \| feature_2=value \| feature_3=value).
sample_type	character	non-unique	optional	primary \| metastatic \| normal \| BPH \| atrophic \| cell.line \| xenograft	Type of tissue isolated from the patient and used for further -omic profiling. Normal samples include both healthy individuals and healthy tissue from PCa patients, and metastatic indicates non-primary tumors.
sample_paired	integer	non-unique	optional	1 \| 0	Whether the platform produces paired sample (e.g. 2-channel relative intensities); the paired sample can be a patient specific paired sample or a generic panel of normals.
genomic_alterations	character	non-unique	optional	*	Character string with the list of reported alterations in the sample, in the format, gene:event, separated by a bar (eg. TP53:mutation \| ETV1:fusion \| PTEN:deletion).
tumor_margins_positive	integer	non-unique	optional	1 \| 0	Histologically altered cells in any surgical margins.
tissue_source	character	non-unique	optional	biopsy \| TURP \| prostatectomy \| prostatectomy_and_TURP \| autopsy \| cystoprostatectomy \| other	The source of the sample.
metastatic_site	character	non-unique	optional	prostate \| liver \| lung \| bone \| brain \| lymph_node \| soft_tissue \| adrenal_gland \| other	Site where the metastatic sample was taken from.
microdissected	integer	non-unique	optional	1 \| 0	Microdissected or not.
frozen_ffpe	character	non-unique	optional	frozen \| FFPE	Frozen or FFPE
other_feature	character	non-unique	optional	CRPC \| cribriform \| neuroendocrine	Other descriptions of the sample.
batch	character	non-unique	optional	*	A way that describes a batch and provides an effect that can be modeled (can be numeric or categorical).
other_sample	character	non-unique	optional	*	A character string that captures any additional sample information, features separated by bar (e.g. feature_1=value \| feature_2=value \| feature_3=value).
tumor_purity_pathology	integer	non-unique	optional	[0-100]	Estimate of the tumor purity according to pathological assessment.
tumor_purity_demixt	integer	non-unique	optional	[0-100]	Estimate of the tumor purity in the sample using the DeMixT method.
tumor_purity_absolute	integer	non-unique	optional	[0-100]	Estimate of the tumor purity in the sample using the Absolute method.
tumor_purity_ascat	numeric	non-unique	optional	[0-100]	Estimate of the tumor purity in the sample using ASCAT (Allele-Specific Copy number Analysis of Tumours).
zone_of_origin	character	non-unique	optional	transitional \| peripheral \| mixed \| central	Zone of origin assessed through tissue pathology.
zone_of_origin_estimated	character	non-unique	optional	transitional \| peripheral	Estimate of zone of origin using the method from Sinnott et al.
mutational_signatures	character	non-unique	optional	[]	Estimate of mutational signatures using the deconstructSigs method.
neoantigen_load	character	non-unique	optional	[]	Estimate of mutational load using the NetMHCPan method.
AR_activity	integer	non-unique	optional	[0-100]	AR 20-gene signature as provided by original authors.
prolaris	numeric	non-unique	optional	*	Prolaris risk score as provided by original authors.
decipher	numeric	non-unique	optional	*	Decipher risk score as provided by original authors.
oncotypedx	numeric	non-unique	optional	*	Oncotype DX risk score as provided by original authors.
N_stage	character	non-unique	optional	X \| 0 \| 1 \| 2 \| 3	Regional lymph node status at the time of surgery. X: cannot be measured, 0: no cancer in nearby lymph nodes, 1,2,3: the number and location of lymph nodes that contain cancer.
N_substage	character	non-unique	optional	[abc]	1a: the cancer has spread to lymph nodes beyond the regional ones, 1b: the cancer has spread to bone, M1c: the cancer has spread to other sites (regardless of bone involvement).
therapy_radiation_initial	integer	non-unique	optional	1 \| 0	If radiation given as a primary therapy.
therapy_radiation_salvage	integer	non-unique	optional	1 \| 0	If radiation given after relapse from surgery.
therapy_surgery_initial	integer	non-unique	optional	1 \| 0	If surgery was performed as a primary therapy.
therapy_hormonal_initial	integer	non-unique	optional	1 \| 0	If hormonal therapy given as a primary therapy.
other_treatment	character	non-unique	optional	fish_oil \| no_neoadjuvant \| prednisone \| selenium \| vitaminE \| taxane \| other	Any other treatments
psa_category	character	non-unique	optional	Normal \| Elevated	If PSA was normal or elevated in the patient at baseline.
genome_altered	numeric	non-unique	optional	[0-x]	Numeric value depicting the fraction of genome altered or the absolute number of mutations called, with the former preferred.
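
The allowed values in the template lend themselves to simple programmatic checks. The sketch below is not the package's own validation code: check_template is a hypothetical helper, only three template fields are covered, and the toy data frame is made up for illustration.

```r
# Allowed values copied from the template above (subset of fields)
allowed <- list(
    gleason_grade = 2:10,
    grade_group   = c("<=6", "3+4", "4+3", "7", ">=8"),
    sample_type   = c("primary", "metastatic", "normal", "BPH", "atrophic",
                      "cell.line", "xenograft")
)

# Hypothetical helper: for each templated column present in `df`, return the
# row indices whose non-missing values fall outside the allowed set.
check_template <- function(df, allowed) {
    cols <- intersect(names(allowed), names(df))
    res <- lapply(cols, function(col) {
        which(!is.na(df[[col]]) & !(df[[col]] %in% allowed[[col]]))
    })
    names(res) <- cols
    res
}

# Toy data, not taken from any study
toy <- data.frame(
    sample_name   = c("s1", "s2", "s3"),
    gleason_grade = c(7L, 11L, NA),
    grade_group   = c("3+4", ">=8", "5+5"),
    sample_type   = c("primary", "normal", "primary"),
    stringsAsFactors = FALSE
)

violations <- check_template(toy, allowed)
violations
```

For a real dataset, the data frame could be obtained with as.data.frame(colData(maeobj)); missing values are deliberately not flagged, since most template fields are optional.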

3.1.5 Clinical end-points

Three primary clinical end-points were utilized and are offered in the clinical metadata in colData for the MAE-objects, if available:

Gleason grade/Grade group(s)
Biochemical Recurrence (BCR)
Overall Survival (OS)

Below are summaries for each of these endpoints for each study. Of note, OS had very few events, thus survival modelling for this end-point may be considered unreliable.

Gleason grades:

# Create a summary table of Gleason grades
gleasons <- curatedPCaData::getPCaSummaryTable(maes, var.name = "gleason_grade",
    vals = 5:10)

Table 3: Gleason grades across datasets in curatedPCaData
	5	6	7	8	9	10	Other	N/A
abida	-	29 (7%)	107 (24%)	69 (16%)	128 (29%)	24 (5%)	NA	NA
baca	-	8 (14%)	35 (61%)	8 (14%)	4 (7%)	-	NA	NA
barbieri	-	13 (11%)	87 (71%)	8 (7%)	4 (3%)	-	NA	NA
barwick	2 (1%)	39 (27%)	93 (64%)	5 (3%)	7 (5%)	-	NA	NA
chandran	2 (1%)	16 (9%)	27 (16%)	7 (4%)	12 (7%)	-	NA	NA
friedrich	2 (1%)	47 (18%)	54 (21%)	68 (27%)	43 (17%)	2 (1%)	NA	NA
hieronymus	-	16 (15%)	78 (75%)	4 (4%)	6 (6%)	-	NA	NA
icgcca	-	12 (6%)	58 (27%)	5 (2%)	-	-	NA	NA
igc	-	-	-	-	-	-	NA	NA
kim	1 (0%)	198 (74%)	65 (24%)	-	-	-	NA	NA
kunderfranco	1 (1%)	9 (13%)	32 (48%)	6 (9%)	5 (7%)	-	NA	NA
ren	-	-	-	-	-	-	NA	NA
sun	-	-	-	-	-	-	NA	NA
taylor	-	53 (21%)	105 (43%)	18 (7%)	19 (8%)	-	NA	NA
tcga	-	50 (9%)	288 (52%)	67 (12%)	142 (26%)	4 (1%)	NA	NA
true	-	4 (12%)	22 (69%)	1 (3%)	5 (16%)	-	NA	NA
wallace	2 (2%)	22 (25%)	59 (66%)	1 (1%)	2 (2%)	-	NA	NA
wang	-	-	-	-	-	-	NA	NA
weiner	-	-	-	-	-	-	0 (0%)	838 (100%)
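
To make the cell format of the table above concrete, here is a minimal base R sketch of a "count (percent)" tabulation. It is not the implementation of getPCaSummaryTable: count_percent is a hypothetical helper, the toy vector is made up, and the percentage denominator is assumed to be the total number of samples, including those with missing values.

```r
# Hypothetical helper: counts and rounded percentages of `x` over `vals`,
# with "-" for values that do not occur.
count_percent <- function(x, vals) {
    n <- length(x)  # assumption: missing values stay in the denominator
    counts <- vapply(vals, function(v) sum(x == v, na.rm = TRUE), integer(1))
    out <- ifelse(counts == 0, "-",
        paste0(counts, " (", round(100 * counts / n), "%)"))
    names(out) <- as.character(vals)
    out
}

# Toy vector of Gleason sums, not taken from any study
toy_gleason <- c(6, 6, 7, 7, 7, 7, 8, 9, NA, NA)
gleason_summary <- count_percent(toy_gleason, vals = 5:10)
gleason_summary
```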

Grade groups:

# Create a summary table of grade groups
gradegroups <- curatedPCaData::getPCaSummaryTable(maes, var.name = "grade_group",
    vals = c("<=6", "3+4", "4+3", "7", ">=8"))

Table 4: Grade groups across datasets in curatedPCaData
	<=6	3+4	4+3	7	>=8	Other	N/A
abida	-	-	-	-	-	NA	NA
baca	8 (14%)	23 (40%)	12 (21%)	-	12 (21%)	NA	NA
barbieri	13 (11%)	58 (47%)	29 (24%)	-	12 (10%)	NA	NA
barwick	-	-	-	-	-	NA	NA
chandran	19 (11%)	-	-	27 (16%)	19 (11%)	NA	NA
friedrich	49 (19%)	-	-	54 (21%)	113 (44%)	NA	NA
hieronymus	16 (15%)	56 (54%)	22 (21%)	-	10 (10%)	NA	NA
icgcca	12 (6%)	37 (17%)	21 (10%)	-	5 (2%)	NA	NA
igc	27 (33%)	-	-	40 (48%)	14 (17%)	NA	NA
kim	199 (75%)	65 (24%)	-	-	-	NA	NA
kunderfranco	10 (15%)	29 (43%)	3 (4%)	-	11 (16%)	NA	NA
ren	5 (8%)	23 (35%)	13 (20%)	-	14 (22%)	NA	NA
sun	-	-	-	-	-	NA	NA
taylor	53 (21%)	72 (29%)	33 (13%)	-	37 (15%)	NA	NA
tcga	50 (9%)	172 (31%)	116 (21%)	-	213 (39%)	NA	NA
true	4 (12%)	12 (38%)	10 (31%)	-	6 (19%)	NA	NA
wallace	24 (27%)	-	-	59 (66%)	3 (3%)	NA	NA
wang	-	-	-	-	-	NA	NA
weiner	65 (8%)	419 (50%)	183 (22%)	-	171 (20%)	0 (0%)	0 (0%)
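
The relation between Gleason scores and grade groups described in the clinical template can be sketched as follows. This is an illustration of the template's grade_group definition, not the curation code used for the package; derive_grade_group is a hypothetical helper and the input values are made up.

```r
# Hypothetical helper: map Gleason major/minor (or just the sum) to the
# grade group labels used in the template.
derive_grade_group <- function(major, minor, total = major + minor) {
    gg <- rep(NA_character_, length(total))
    gg[!is.na(total) & total <= 6] <- "<=6"
    gg[!is.na(total) & total >= 8] <- ">=8"
    is7 <- !is.na(total) & total == 7
    gg[is7] <- "7"  # fallback when the major/minor split is unknown
    gg[is7 & !is.na(major) & major == 3] <- "3+4"
    gg[is7 & !is.na(major) & major == 4] <- "4+3"
    gg
}

# Toy input; the last sample only has the Gleason sum available
groups <- derive_grade_group(
    major = c(3, 3, 4, 4, NA),
    minor = c(3, 4, 3, 5, NA),
    total = c(6, 7, 7, 9, 7)
)
groups
```

The "7" label explains why some studies in the table report counts in the "7" column instead of "3+4" and "4+3".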

Biochemical recurrences:

# Create a summary table of biochemical recurrences
recurrences <- curatedPCaData::getPCaSummarySurv(maes, event.name = "disease_specific_recurrence_status",
    time.name = "days_to_disease_specific_recurrence")

Table 5: Disease recurrence end point across datasets in curatedPCaData
	0 (no event)	1 (event)	N/A (event)	Time (days, quantiles)	N/A (time)
abida	-	-	444 (100%)	-	444 (100%)
baca	-	-	57 (100%)	-	57 (100%)
barbieri	-	-	123 (100%)	-	123 (100%)
barwick	113 (77%)	33 (23%)	0 (0%)	[92,276,702,1700,2928]	0 (0%)
chandran	-	-	171 (100%)	-	171 (100%)
friedrich	-	-	255 (100%)	-	255 (100%)
hieronymus	-	-	104 (100%)	-	104 (100%)
icgcca	-	-	213 (100%)	-	213 (100%)
igc	-	-	83 (100%)	-	83 (100%)
kim	-	-	266 (100%)	-	266 (100%)
kunderfranco	-	-	67 (100%)	-	67 (100%)
ren	-	-	65 (100%)	-	65 (100%)
sun	40 (51%)	39 (49%)	0 (0%)	-	79 (100%)
taylor	136 (55%)	60 (24%)	51 (21%)	[3,710,1360,1955,4909]	51 (21%)
tcga	448 (81%)	103 (19%)	0 (0%)	[0,398,782,1363,5024]	4 (1%)
true	-	-	32 (100%)	-	32 (100%)
wallace	-	-	89 (100%)	-	89 (100%)
wang	-	-	148 (100%)	-	148 (100%)
weiner	-	-	838 (100%)	-	838 (100%)

Overall survival:

# Create a summary table of overall survival
survivals <- curatedPCaData::getPCaSummarySurv(maes, event.name = "overall_survival_status",
    time.name = "days_to_overall_survival")

Table 6: Overall survival end point across datasets in curatedPCaData
	0 (no event)	1 (event)	N/A (event)	Time (days, quantiles)	N/A (time)
abida	52 (12%)	84 (19%)	308 (69%)	[51,326,605,898,2104]	308 (69%)
baca	-	-	57 (100%)	-	57 (100%)
barbieri	-	-	123 (100%)	-	123 (100%)
barwick	-	-	146 (100%)	-	146 (100%)
chandran	-	-	171 (100%)	-	171 (100%)
friedrich	230 (90%)	25 (10%)	0 (0%)	[641,3005,3614,4301,6771]	91 (36%)
hieronymus	96 (92%)	8 (8%)	0 (0%)	[295,1576,2139,2895,3758]	0 (0%)
icgcca	198 (93%)	8 (4%)	7 (3%)	[1460,2190,2920,3650,4745]	1 (0%)
igc	-	-	83 (100%)	-	83 (100%)
kim	-	-	266 (100%)	-	266 (100%)
kunderfranco	-	-	67 (100%)	-	67 (100%)
ren	-	-	65 (100%)	-	65 (100%)
sun	-	-	79 (100%)	-	79 (100%)
taylor	-	-	247 (100%)	-	247 (100%)
tcga	541 (98%)	10 (2%)	0 (0%)	[23,543,930,1446,5024]	0 (0%)
true	-	-	32 (100%)	-	32 (100%)
wallace	-	-	89 (100%)	-	89 (100%)
wang	-	-	148 (100%)	-	148 (100%)
weiner	-	-	838 (100%)	-	838 (100%)
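
The columns of the two end-point tables can be mimicked with a few lines of base R. The sketch below is not the implementation of getPCaSummarySurv: summarize_endpoint is a hypothetical helper, the data are made up, and the five numbers in the "Time (days, quantiles)" column are assumed to be the minimum, the three quartiles, and the maximum.

```r
# Hypothetical helper: summarize a binary event indicator and its follow-up
# time in the spirit of the end-point tables above.
summarize_endpoint <- function(status, time) {
    probs <- c(0, 0.25, 0.5, 0.75, 1)  # assumption: min, quartiles, max
    list(
        no_event = sum(status == 0, na.rm = TRUE),
        event = sum(status == 1, na.rm = TRUE),
        na_event = sum(is.na(status)),
        time_quantiles = if (all(is.na(time))) {
            NA
        } else {
            unname(quantile(time, probs = probs, na.rm = TRUE))
        },
        na_time = sum(is.na(time))
    )
}

# Toy data, not taken from any study
toy_status <- c(0, 1, 0, NA, 1, 0)
toy_time <- c(100, 200, 300, NA, 400, 500)
endpoint <- summarize_endpoint(toy_status, toy_time)
endpoint
```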

3.2 Querying datasets

The function getPCa serves as the primary interface for building MAE objects, either by live download from ExperimentHub or by loading them from the local cache if the datasets have been downloaded previously.

The syntax for the function getPCa(dataset, assays, timestamp, verbose, ...) consists of the following parameters:

dataset: Primary indicator for which study to query from ExperimentHub; notice that this may only be one of the allowed values.
assays: This indicates which MAE-assays are fetched from the candidate ExperimentList. Two names are always required (and are filled if missing): colData which contains information on the clinical metadata, and sampleMap which maps the rownames of the metadata to columns in the fetched assay data.
timestamp: When data is deposited in the ExperimentHub resources, they are time stamped to avoid ambiguity. The timestamps provided in this parameter are resolved from left to right, and the first deposit stamp is "20230215".
verbose: Logical indicator whether additional information should be printed by getPCa.
...: Further custom parameters passed on to getPCa.
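
Two of the parameter rules above can be illustrated in isolation. The sketch below shows one plausible reading of them and is not the code inside getPCa: fill_required and resolve_timestamp are hypothetical helpers, and the timestamp "20991231" is a made-up value that stands for a deposit that does not exist.

```r
# Hypothetical helper: make sure the two always-required names are present
fill_required <- function(assays, required = c("colData", "sampleMap")) {
    union(assays, required)
}

# Hypothetical helper: walk the requested timestamps from left to right and
# return the first one that is actually available.
resolve_timestamp <- function(requested, available) {
    hits <- requested[requested %in% available]
    if (length(hits) == 0) {
        stop("none of the requested timestamps is available")
    }
    hits[1]
}

assays_filled <- fill_required(c("gex.rsem.log", "xcell"))
assays_filled

# "20991231" is a made-up stamp; "20230215" is the first deposit stamp
stamp <- resolve_timestamp(c("20991231", "20230215"), available = "20230215")
stamp
```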

As an example, consider querying the TCGA dataset, and suppose we only wish to extract the gene expression data and the immune deconvolution results derived by the xCell method. Further, we’ll request the slot containing the risk and AR scores. This subset could be retrieved with:

tcga_subset <- getPCa(dataset = "tcga", assays = c("gex.rsem.log", "xcell", "scores"),
    timestamp = "20230215")

tcga_subset
#> A MultiAssayExperiment object of 3 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 3:
#>  [1] gex.rsem.log: matrix with 19658 rows and 461 columns
#>  [2] xcell: matrix with 39 rows and 461 columns
#>  [3] scores: matrix with 4 rows and 461 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

The standard way of extracting the latest MAE-object with all available assays is done via querying with just the dataset name:

mae_tcga <- getPCa("tcga")
mae_taylor <- getPCa("taylor")

3.2.1 Accessing primary data

The primary assay names in the MAE objects for gene expression and copy number alteration consist of two parts:

Prefix indicating data type, either “gex.” or “cna.”.
Suffix indicating unit and processing for the data; for example, a gene expression dataset (gex) may have a suffix of “rma” for RMA-processed data, “fpkm” for processed RNA-seq data, “relz” for relative z-score normalized expression values for tumor-normal gene expression pairs, or “logq” for logarithmic quantile-normalized data. The main suffix for copy number alteration is “gistic”, denoting discretized GISTIC alteration calls with values {-2,-1,0,1,2}, although earlier versions also provided log-ratios (“logr”).
Mutation data is provided as RaggedExperiment objects as “mut”.
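
The naming convention can be applied programmatically, for example to pick out all gene expression assays from a vector of assay names. The sketch below uses base R only; classify_assay is a hypothetical helper, and names that match none of the three base types (such as "xcell") are simply labeled as derived or other.

```r
# Hypothetical helper: split assay names into data type and suffix
classify_assay <- function(x) {
    type <- ifelse(startsWith(x, "gex."), "gene expression",
        ifelse(startsWith(x, "cna."), "copy number alteration",
            ifelse(x == "mut", "mutation", "derived/other")))
    has_prefix <- grepl("^(gex|cna)\\.", x)
    suffix <- ifelse(has_prefix, sub("^(gex|cna)\\.", "", x), NA_character_)
    data.frame(assay = x, type = type, suffix = suffix,
        stringsAsFactors = FALSE)
}

assay_info <- classify_assay(
    c("gex.rma", "gex.rsem.log", "cna.gistic", "mut", "xcell")
)
assay_info
```

For a real object, the input vector could be obtained with names(maeobj).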

A data slot in an MAE object can be accessed, for example, via:

mae_taylor[["gex.rma"]][1:5, 1:5]
#>          PCA0001   PCA0002   PCA0003   PCA0005   PCA0007
#> A1BG    7.957155  7.966502  8.237386  7.863266  7.799571
#> A1CF    4.479802  4.565785  4.719560  4.453506  4.521248
#> A2M    11.603692 11.145376 10.560886 10.587447 10.069504
#> A2ML1   4.670252  5.001697  4.970436  4.737025  4.756805
#> A4GALT  7.945086  8.228138  8.377694  8.163593  8.071018

The corresponding clinical variables have an accessor function colData provided by the MultiAssayExperiment-package:

MultiAssayExperiment::colData(mae_tcga)[1:2, ]
#> DataFrame with 2 rows and 68 columns
#>                  study_name   patient_id     sample_name        alt_sample_name
#>                 <character>  <character>     <character>            <character>
#> TCGA.2A.A8VL.01        TCGA TCGA.2A.A8VL TCGA.2A.A8VL.01 F9F392D3-E3C0-4CF2-A..
#> TCGA.2A.A8VO.01        TCGA TCGA.2A.A8VO TCGA.2A.A8VO.01 0BD35529-3416-42DD-A..
#>                 overall_survival_status days_to_overall_survival
#>                               <integer>                <numeric>
#> TCGA.2A.A8VL.01                       0                      621
#> TCGA.2A.A8VO.01                       0                     1701
#>                 age_at_initial_diagnosis year_diagnosis gleason_grade
#>                                <integer>      <integer>     <integer>
#> TCGA.2A.A8VL.01                       51           2010             6
#> TCGA.2A.A8VO.01                       57           2010             6
#>                 gleason_major gleason_minor source_of_gleason grade_group
#>                     <integer>     <integer>       <character> <character>
#> TCGA.2A.A8VL.01             3             3                NA         <=6
#> TCGA.2A.A8VO.01             3             3                NA         <=6
#>                 T_pathological T_substage_pathological T_clinical
#>                      <integer>             <character>  <integer>
#> TCGA.2A.A8VL.01              2                       b         NA
#> TCGA.2A.A8VO.01              3                       a          1
#>                 T_substage_clinical ERG_fusion_CNA ERG_fusion_IHC
#>                         <character>      <integer>      <integer>
#> TCGA.2A.A8VL.01                  NA             NA             NA
#> TCGA.2A.A8VO.01                   c             NA             NA
#>                 ERG_fusion_GEX disease_specific_recurrence_status
#>                      <numeric>                          <integer>
#> TCGA.2A.A8VL.01              1                                  0
#> TCGA.2A.A8VO.01              0                                  0
#>                 days_to_disease_specific_recurrence
#>                                           <integer>
#> TCGA.2A.A8VL.01                                 621
#> TCGA.2A.A8VO.01                                1701
#>                 metastasis_occurrence_status days_to_metastatic_occurrence
#>                                    <integer>                     <numeric>
#> TCGA.2A.A8VL.01                           NA                            NA
#> TCGA.2A.A8VO.01                           NA                            NA
#>                       psa        race smoking_status extraprostatic_extension
#>                 <numeric> <character>      <integer>                <integer>
#> TCGA.2A.A8VL.01      0.05   caucasian             NA                       NA
#> TCGA.2A.A8VO.01      0.05   caucasian             NA                       NA
#>                 perineural_invasion seminal_vesicle_invasion
#>                           <integer>                <integer>
#> TCGA.2A.A8VL.01                  NA                        0
#> TCGA.2A.A8VO.01                  NA                        1
#>                 angiolymphatic_invasion androgen_ablation     capsule   M_stage
#>                               <integer>         <integer> <character> <numeric>
#> TCGA.2A.A8VL.01                       0                NA          NA         0
#> TCGA.2A.A8VO.01                       0                NA          NA         0
#>                  M_substage other_patient sample_type sample_paired
#>                 <character>     <logical> <character>     <integer>
#> TCGA.2A.A8VL.01                        NA     primary            NA
#> TCGA.2A.A8VO.01                        NA     primary            NA
#>                 genomic_alterations tumor_margins_positive tissue_source
#>                           <logical>              <integer>   <character>
#> TCGA.2A.A8VL.01                  NA                     NA        biopsy
#> TCGA.2A.A8VO.01                  NA                     NA        biopsy
#>                 metastatic_site microdissected frozen_ffpe other_feature
#>                     <character>      <integer> <character>   <character>
#> TCGA.2A.A8VL.01              NA             NA          NA            NA
#> TCGA.2A.A8VO.01              NA             NA          NA            NA
#>                       batch other_sample tumor_purity_pathology
#>                 <character>    <logical>            <character>
#> TCGA.2A.A8VL.01          NA           NA                 31-40%
#> TCGA.2A.A8VO.01          NA           NA                 66-75%
#>                 tumor_purity_demixt tumor_purity_absolute tumor_purity_ascat
#>                           <numeric>             <numeric>          <numeric>
#> TCGA.2A.A8VL.01                0.49                  0.50                 NA
#> TCGA.2A.A8VO.01                0.55                  0.55                 NA
#>                 zone_of_origin zone_of_origin_estimated mutational_signatures
#>                    <character>              <character>             <logical>
#> TCGA.2A.A8VL.01     peripheral                       NA                    NA
#> TCGA.2A.A8VO.01          mixed                       NA                    NA
#>                 neoantigen_load AR_activity  prolaris  decipher oncotypedx
#>                       <logical>   <numeric> <numeric> <numeric>  <numeric>
#> TCGA.2A.A8VL.01              NA        9.17        NA        NA         NA
#> TCGA.2A.A8VO.01              NA        0.69        NA        NA         NA
#>                   N_stage  N_substage therapy_radiation_initial
#>                 <numeric> <character>                 <numeric>
#> TCGA.2A.A8VL.01         0          NA                         0
#> TCGA.2A.A8VO.01        NA          NA                         0
#>                 therapy_radiation_salvage therapy_surgery_initial
#>                                 <integer>               <integer>
#> TCGA.2A.A8VL.01                        NA                      NA
#> TCGA.2A.A8VO.01                        NA                      NA
#>                 therapy_hormonal_initial other_treatment psa_category
#>                                <integer>     <character>    <logical>
#> TCGA.2A.A8VL.01                       NA                           NA
#> TCGA.2A.A8VO.01                       NA                           NA
#>                 genome_altered
#>                      <numeric>
#> TCGA.2A.A8VL.01         0.0300
#> TCGA.2A.A8VO.01         0.0211

Although explicit namespaces are good practice, the pkgName:: prefix can be omitted, because curatedPCaData imports the necessary packages, such as MultiAssayExperiment, and their functions should therefore be available in the workspace.

3.2.2 ExperimentHub data listing

In order to access the latest listing of curatedPCaData related resources available in ExperimentHub, consult the metadata.csv file delivered with the package:

metadata <- read.csv(system.file("extdata", "metadata.csv", package = "curatedPCaData"))
head(metadata)
#>                       Title
#> 1 abida_cna.gistic_20230215
#> 2   abida_gex.relz_20230215
#> 3        abida_mut_20230215
#> 4  abida_cibersort_20230215
#> 5      abida_xcell_20230215
#> 6       abida_epic_20230215
#>                                                                                                 Description
#> 1    abida_cna.gistic_20230215 Copy number alteration GISTIC data of abida cohort in curatedPCaData package
#> 2 abida_gex.relz_20230215 Gene expression (Relative z-score) data of abida cohort in curatedPCaData package
#> 3                                abida_mut_20230215 Mutation data of abida cohort in curatedPCaData package
#> 4    abida_cibersort_20230215 Deconvolution using CIBERSORTx data of abida cohort in curatedPCaData package
#> 5             abida_xcell_20230215 Deconvolution using xCell data of abida cohort in curatedPCaData package
#> 6               abida_epic_20230215 Deconvolution using EPIC data of abida cohort in curatedPCaData package
#>   BiocVersion Genome SourceType
#> 1        3.17     NA        TXT
#> 2        3.17     NA        TXT
#> 3        3.17     NA        TXT
#> 4        3.17     NA        TXT
#> 5        3.17     NA        TXT
#> 6        3.17     NA        TXT
#>                                                    SourceUrl SourceVersion
#> 1 https://www.cbioportal.org/study/summary?id=prad_su2c_2019          <NA>
#> 2 https://www.cbioportal.org/study/summary?id=prad_su2c_2019          <NA>
#> 3 https://www.cbioportal.org/study/summary?id=prad_su2c_2019          <NA>
#> 4 https://www.cbioportal.org/study/summary?id=prad_su2c_2019          <NA>
#> 5 https://www.cbioportal.org/study/summary?id=prad_su2c_2019          <NA>
#> 6 https://www.cbioportal.org/study/summary?id=prad_su2c_2019          <NA>
#>        Species TaxonomyId Coordinate_1_based DataProvider
#> 1 Homo sapiens       9606                 NA        MSKCC
#> 2 Homo sapiens       9606                 NA        MSKCC
#> 3 Homo sapiens       9606                 NA        MSKCC
#> 4 Homo sapiens       9606                 NA        MSKCC
#> 5 Homo sapiens       9606                 NA        MSKCC
#> 6 Homo sapiens       9606                 NA        MSKCC
#>                             Maintainer       RDataClass DispatchClass
#> 1 Teemu Daniel Laajala <teelaa@utu.fi>           matrix           Rds
#> 2 Teemu Daniel Laajala <teelaa@utu.fi>           matrix           Rds
#> 3 Teemu Daniel Laajala <teelaa@utu.fi> RaggedExperiment           Rds
#> 4 Teemu Daniel Laajala <teelaa@utu.fi>           matrix           Rds
#> 5 Teemu Daniel Laajala <teelaa@utu.fi>           matrix           Rds
#> 6 Teemu Daniel Laajala <teelaa@utu.fi>           matrix           Rds
#>                    ResourceName                                    RDataPath
#> 1 abida_cna.gistic_20230215.Rds curatedPCaData/abida_cna.gistic_20230215.Rds
#> 2   abida_gex.relz_20230215.Rds   curatedPCaData/abida_gex.relz_20230215.Rds
#> 3        abida_mut_20230215.Rds        curatedPCaData/abida_mut_20230215.Rds
#> 4  abida_cibersort_20230215.Rds  curatedPCaData/abida_cibersort_20230215.Rds
#> 5      abida_xcell_20230215.Rds      curatedPCaData/abida_xcell_20230215.Rds
#> 6       abida_epic_20230215.Rds       curatedPCaData/abida_epic_20230215.Rds
#>         Tags
#> 1 cna.gistic
#> 2   gex.relz
#> 3        mut
#> 4  cibersort
#> 5      xcell
#> 6       epic
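
The listing can be filtered like any other data frame. The sketch below uses a small hand-made stand-in for the metadata table (only the Title and Tags columns, with values copied from the output above), so it runs without the package installed. The object names metadata_toy, is_base and base_resources are illustrative; with the real file, the same calls would apply to the metadata object.

```r
# Toy stand-in for the metadata table; values copied from the listing above
metadata_toy <- data.frame(
    Title = c("abida_cna.gistic_20230215", "abida_gex.relz_20230215",
        "abida_mut_20230215", "abida_cibersort_20230215"),
    Tags = c("cna.gistic", "gex.relz", "mut", "cibersort"),
    stringsAsFactors = FALSE
)

# The study ID is the part of the title before the first underscore
metadata_toy$study <- sub("_.*$", "", metadata_toy$Title)

# Keep only resources holding base omics (gene expression, copy number,
# mutations) and drop derived variables such as deconvolution results
is_base <- grepl("^(gex|cna)\\.|^mut$", metadata_toy$Tags)
base_resources <- metadata_toy[is_base, c("study", "Tags")]
base_resources
```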

3.3 Omics sample count and overlap

# Retrieve sample counts for each unique assay name, as well as sample counts
# for the overlaps between omics
samplecounts <- curatedPCaData::getPCaSummarySamples(maes)

The sample counts for each omics type are listed separately below:

Table 7: Sample N counts in each omics for every MAE object
	cibersort	cna.gistic	cna.logr	epic	estimate	gex.logq	gex.logr	gex.relz	gex.rma	gex.rsem.log	mcp	mut	quantiseq	scores	xcell
abida	266	444		266	266			266			266	444	266	266	266
baca		56										57
barbieri	31	109		31	31			31			31	112	31	31	31
barwick	146				146	146					146		146	146
chandran	171			171	171				171		171		171	171	171
friedrich	255			255	255	255					255		255	255	255
hieronymus		104
icgcca	213			213	213				213		213		213	213	213
igc	83			83	83				83		83		83	83	83
kim	266			266	266				266		266		266	266	266
kunderfranco	67			67	67		67				67		67	67	67
ren	65			65	65			65			65	65	65	65	65
sun	79			79	79				79		79		79	79	79
taylor	179	194	218	179	179				179		179	43	179	179	179
tcga	461	492		461	461					461	461	495	461	461	461
true	32			32	32		32				18		32	32
wallace	89			89	89				89		89		89	89	89
wang	148			148	148				148		148		148	148	148
weiner	838			838	838				838		838		838	838	838

However, taking intersections between the different omics shows that different samples were analyzed on different platforms. The effective N for analyzing multiple omics platforms simultaneously is therefore smaller. The overlaps between gene expression (GEX), copy number alteration (CNA), and mutations (MUT) are shown below:

Table 8: Sample N counts for intersections between different omics
	GEX	CNA	MUT	GEX & CNA	GEX & MUT	CNA & MUT	GEX & CNA & MUT
abida	266	444	444	266	266	444	266
baca	0	56	57	0	0	56	0
barbieri	31	109	112	20	20	109	20
barwick	146	0	0	0	0	0	0
chandran	171	0	0	0	0	0	0
friedrich	255	0	0	0	0	0	0
icgcca	213	0	0	0	0	0	0
igc	83	0	0	0	0	0	0
kim	266	0	0	0	0	0	0
kunderfranco	67	0	0	0	0	0	0
ren	65	0	65	0	65	0	0
sun	79	0	0	0	0	0	0
taylor	179	194	43	128	37	41	35
tcga	461	492	495	403	405	489	400
true	32	0	0	0	0	0	0
wallace	89	0	0	0	0	0	0
wang	148	0	0	0	0	0	0
weiner	838	0	0	0	0	0	0
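
The overlap columns are plain set intersections of the samples available on each platform. The following is a minimal sketch of that logic with hypothetical patient identifiers. It is not the implementation of getPCaSummarySamples, and it assumes that the samples of each omics have already been mapped to a shared identifier.

```r
# Hypothetical patient identifiers profiled on each platform
gex_ids <- c("P01", "P02", "P03", "P04")
cna_ids <- c("P02", "P03", "P04", "P05", "P06")
mut_ids <- c("P03", "P04", "P06")

# Sample counts per platform and for each intersection
overlap_counts <- c(
    GEX = length(gex_ids),
    CNA = length(cna_ids),
    MUT = length(mut_ids),
    `GEX & CNA` = length(intersect(gex_ids, cna_ids)),
    `GEX & MUT` = length(intersect(gex_ids, mut_ids)),
    `CNA & MUT` = length(intersect(cna_ids, mut_ids)),
    `GEX & CNA & MUT` = length(Reduce(intersect,
        list(gex_ids, cna_ids, mut_ids)))
)
overlap_counts
```

A three-way count can never exceed the smallest pairwise count, which is why the last column of Table 8 is the most restrictive.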

4 Derived variables

In curatedPCaData, derived variables are downstream variables that have been computed from the primary data. In most cases, this was done by extracting key gene information from the gex.* assays and pre-computing informative downstream markers as described in their primary publications.

4.1 Immune deconvolution

Tumor progression depends on the immune cell composition in the tumor microenvironment. The ‘immunedeconv’ package collects several computational methods for estimating immune cell content from gene expression data. In addition, CIBERSORTx results are provided externally, as this method requires registered access. For user convenience, it has been run separately and its results are provided as a slot in the MAE objects. The other methods have been run using the immunedeconv package (Sturm et al. 2019), and code for reproducing these derived variables is provided alongside the package.

In this package, we provide estimates of immune cell content from the following deconvolution methods:

quanTIseq
xCell
EPIC
MCP counter
CIBERSORT(x)
ESTIMATE

The estimates from each of these methods are stored in the MAE object as a separate assay, as shown here for the Taylor dataset:

mae_taylor
#> A MultiAssayExperiment object of 11 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 11:
#>  [1] cna.gistic: matrix with 17832 rows and 194 columns
#>  [2] cna.logr: matrix with 18062 rows and 218 columns
#>  [3] gex.rma: matrix with 17410 rows and 179 columns
#>  [4] mut: RaggedExperiment with 90 rows and 43 columns
#>  [5] cibersort: matrix with 22 rows and 179 columns
#>  [6] xcell: matrix with 39 rows and 179 columns
#>  [7] epic: matrix with 8 rows and 179 columns
#>  [8] quantiseq: matrix with 11 rows and 179 columns
#>  [9] mcp: matrix with 11 rows and 179 columns
#>  [10] estimate: matrix with 4 rows and 179 columns
#>  [11] scores: matrix with 4 rows and 179 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

To access the CIBERSORTx results for the Taylor et al. dataset, the pre-computed values can be obtained from the corresponding slot in the MAE object:

head(mae_taylor[["cibersort"]])[1:5, 1:3]
#>                      PCA0001    PCA0002    PCA0003
#> B cells naive     0.08544376 0.13683292 0.03620644
#> B cells memory    0.03357079 0.00000000 0.03631384
#> Plasma cells      0.07151722 0.03991641 0.00000000
#> T cells CD8       0.20657390 0.20360110 0.23907130
#> T cells CD4 naive 0.00000000 0.00000000 0.00000000

Similarly, results from the other immune deconvolution methods are available as the following assays/experiments:

head(mae_taylor[["quantiseq"]])[1:5, 1:3]
#>                   PCA0001    PCA0002    PCA0003
#> B cell        0.100063011 0.08171362 0.09675483
#> Macrophage M1 0.006654406 0.01532479 0.01988157
#> Macrophage M2 0.085216324 0.08214651 0.07174060
#> Monocyte      0.000000000 0.00000000 0.00000000
#> Neutrophil    0.135286167 0.12045115 0.10142068
head(mae_taylor[["xcell"]])[1:5, 1:3]
#>                                      PCA0001    PCA0002      PCA0003
#> Myeloid dendritic cell activated 0.002601576 0.06149682 2.298156e-02
#> B cell                           0.013211541 0.01934287 0.000000e+00
#> T cell CD4+ memory               0.029688627 0.03778152 6.288573e-03
#> T cell CD4+ naive                0.002133965 0.00685892 3.883571e-18
#> T cell CD4+ (non-regulatory)     0.000000000 0.00000000 0.000000e+00
head(mae_taylor[["epic"]])[1:5, 1:3]
#>                                 PCA0001    PCA0002    PCA0003
#> B cell                       0.02249750 0.01935547 0.02441322
#> Cancer associated fibroblast 0.01962203 0.01828512 0.02019139
#> T cell CD4+                  0.18739469 0.20292446 0.18976862
#> T cell CD8+                  0.06283695 0.04923331 0.04992732
#> Endothelial cell             0.10510564 0.09707150 0.08088879
head(mae_taylor[["mcp"]])[1:5, 1:3]
#>                     PCA0001  PCA0002  PCA0003
#> T cell             6.518296 6.873941 6.771791
#> T cell CD8+        4.090739 4.168020 4.355332
#> cytotoxicity score 5.974354 6.218115 6.462306
#> NK cell            5.768176 5.857110 6.348053
#> B cell             6.655998 6.431666 6.711109

Each row of the deconvolution matrix represents the content of a certain immune cell type and the columns represent the patient sample IDs. The variables on the rows are specific for each method. Further, it should be noted that not all methods could be run on all datasets due to lack of overlap in genes of interest.
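
Because cell types are on the rows, the matrix usually has to be transposed before it can be combined with per-sample clinical data. The sketch below illustrates this on a toy matrix built from the first three CIBERSORT rows and samples printed above, rounded to three decimals. The object names decon and decon_df are illustrative.

```r
# Toy deconvolution matrix: cell types on rows, samples on columns
# (matrix() fills column by column, so each line below is one sample)
decon <- matrix(
    c(0.085, 0.034, 0.072,
      0.137, 0.000, 0.040,
      0.036, 0.036, 0.000),
    nrow = 3,
    dimnames = list(
        c("B cells naive", "B cells memory", "Plasma cells"),
        c("PCA0001", "PCA0002", "PCA0003")
    )
)

# Transpose so that samples are on rows and cell types become columns
decon_df <- as.data.frame(t(decon))
decon_df

# Average estimated content of each cell type across samples
rowMeans(decon)
```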

4.2 Risk scores and other metrics

The slot scores is used to provide key risk scores or other informative metrics based on the primary data. These scores can be accessed as a matrix as if they were variables on an assay with this name:

mae_tcga[["scores"]][, 1:4]
#>          TCGA.G9.6348.01 TCGA.CH.5766.01 TCGA.EJ.A65G.01 TCGA.EJ.5527.01
#> decipher     -0.08265918      0.01072473      -0.2997221       -0.086624
#> oncotype     -3.47973684     -3.54594586      -3.9224176       -3.461314
#> prolaris     -4.95408262     -5.76978328      -4.0708355       -6.194230
#> ar_score     -0.08262097      0.65489227      12.4016220        5.029849

The following PCa risk scores are offered:

Decipher (rowname: decipher) (Herlemann et al. 2019)
Oncotype DX (rowname: oncotype) (Knezevic et al. 2013)
Prolaris (rowname: prolaris) (NICE Advice 2018)

Further, the 20-gene Androgen Receptor (AR) score is calculated as described in the TCGA’s Cell 2015 paper:

AR score (rowname: ar_score) (Abeshouse et al. 2015)
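
Since the scores are stored with one row per score, a single score can be pulled out as a named vector and, for example, split into groups. The sketch below rebuilds the four TCGA columns printed above as a toy matrix and dichotomizes the AR score at its median. The median split is only an illustrative choice and not something the package prescribes.

```r
# Toy copy of the scores matrix printed above (filled column by column)
scores <- matrix(
    c(-0.08265918, -3.47973684, -4.95408262, -0.08262097,
       0.01072473, -3.54594586, -5.76978328,  0.65489227,
      -0.2997221,  -3.9224176,  -4.0708355,  12.4016220,
      -0.086624,   -3.461314,   -6.194230,    5.029849),
    nrow = 4,
    dimnames = list(
        c("decipher", "oncotype", "prolaris", "ar_score"),
        c("TCGA.G9.6348.01", "TCGA.CH.5766.01",
            "TCGA.EJ.A65G.01", "TCGA.EJ.5527.01")
    )
)

# Extract one score as a vector named by sample
ar <- scores["ar_score", ]

# Illustrative median split into "high" and "low" groups
ar_group <- ifelse(ar > median(ar), "high", "low")
ar_group
```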

5 Single study example for Taylor et al.

Here, a brief example on how to download and process a single study is provided. The example data is of Taylor et al. (Taylor et al. 2010), also known as the MSKCC dataset.

5.1 Downloading the MAE-object

A character vector with the short study ID is used to download the MAE object; we will focus only on the primary prostate cancer samples and CNA (GISTIC) and GEX:

taylor <- getPCa("taylor", assays = c("gex.rma", "cna.gistic"), sampletypes = "primary")

class(taylor)
#> [1] "MultiAssayExperiment"
#> attr(,"package")
#> [1] "MultiAssayExperiment"

taylor
#> A MultiAssayExperiment object of 2 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 2:
#>  [1] cna.gistic: matrix with 17832 rows and 157 columns
#>  [2] gex.rma: matrix with 17410 rows and 131 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

One typical end-point is an object of type Surv, exported from the survival-package. We will create this end-point for biochemical recurrence:

library(survival)
# BCR events
taylor_bcr <- colData(taylor)$disease_specific_recurrence_status
# BCR events / censoring follow-up time
taylor_fu <- colData(taylor)$days_to_disease_specific_recurrence

taylor_surv <- Surv(event = taylor_bcr, time = taylor_fu)

class(taylor_surv)
#> [1] "Surv"

head(taylor_surv)
#> [1]  564  1770+ 2841+ 4653+ 3846+ 4909+

With the response vector of type Surv, one can plot and analyze multiple survival modelling related tasks.
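
To make the Surv notation concrete, the following self-contained sketch builds a response from hypothetical follow-up times and event indicators (the first four pairs mirror the Taylor output above; the last two are invented). A trailing + in the printed object marks a censored observation.

```r
library(survival)

# Hypothetical follow-up times in days and event indicators
# (1 = recurrence observed, 0 = censored)
fu_days <- c(564, 1770, 2841, 4653, 300, 900)
bcr_event <- c(1, 0, 0, 0, 1, 1)

toy_surv <- Surv(time = fu_days, event = bcr_event)
toy_surv

# Kaplan-Meier estimate for the whole toy cohort, without grouping
toy_fit <- survfit(toy_surv ~ 1)

# Event-free proportion just after each observed event time
summary(toy_fit)$surv
```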

5.2 Kaplan-Meier curves

One of the most common ways to depict survival (here, freedom from BCR) is a Kaplan-Meier (KM) curve. Gleason grade is known to be a good prognostic factor for BCR, so a KM curve stratified by biopsy Gleason grade shows differences in prognosis.

For this visualization, package survminer offers a variety of functions building on top of ggplot:

library(survminer)

taylor_bcr_gleason <- data.frame(bcr = taylor_surv, gleason = colData(taylor)$gleason_grade)

fit <- survfit(bcr ~ gleason, data = taylor_bcr_gleason)
ggsurvplot(fit, data = taylor_bcr_gleason, ylab = "Biochemical recurrence free proportion",
    risk.table = TRUE, size = 1, pval = TRUE, ggtheme = theme_bw())

For more settings for the KM plots, see documentation for survminer::ggsurvplot.

5.3 Cox regression

Functions longFormat and wideFormat from MultiAssayExperiment are essential for extracting multi-omics data and metadata in the right format. For the purposes of Cox regression, we will utilize wideFormat:

taylor_coxdat <- MultiAssayExperiment::wideFormat(taylor["PTEN", , ], colDataCols = c("age_at_initial_diagnosis",
    "gleason_grade", "disease_specific_recurrence_status", "days_to_disease_specific_recurrence"))

taylor_coxdat <- as.data.frame(taylor_coxdat)

taylor_coxdat$y <- Surv(time = taylor_coxdat$days_to_disease_specific_recurrence,
    event = taylor_coxdat$disease_specific_recurrence_status)

head(taylor_coxdat)
#>   primary age_at_initial_diagnosis gleason_grade
#> 1 PCA0001                       NA             7
#> 2 PCA0002                       NA             8
#> 3 PCA0003                       NA             7
#> 4 PCA0004                       NA             7
#> 5 PCA0005                       NA             7
#> 6 PCA0006                       NA             6
#>   disease_specific_recurrence_status days_to_disease_specific_recurrence
#> 1                                  1                                 564
#> 2                                  0                                1770
#> 3                                  0                                2841
#> 4                                  0                                4653
#> 5                                  0                                3846
#> 6                                  0                                4909
#>   cna.gistic_PTEN gex.rma_PTEN     y
#> 1               0     7.311643   564
#> 2               0     7.304157 1770+
#> 3               0     7.205469 2841+
#> 4               0           NA 4653+
#> 5               0     7.059756 3846+
#> 6              -1           NA 4909+

We’ll construct a simple Cox proportional hazards model with a few variables. PTEN is a known tumor suppressor gene, so changes in its copy number or gene expression levels could also play a role in biochemical recurrence.

coxmodel <- coxph(y ~ cna.gistic_PTEN + gex.rma_PTEN + gleason_grade, data = taylor_coxdat)
coxmodel
#> Call:
#> coxph(formula = y ~ cna.gistic_PTEN + gex.rma_PTEN + gleason_grade, 
#>     data = taylor_coxdat)
#> 
#>                    coef exp(coef) se(coef)      z        p
#> cna.gistic_PTEN -0.8227    0.4392   0.3613 -2.277   0.0228
#> gex.rma_PTEN     1.6762    5.3450   0.8432  1.988   0.0468
#> gleason_grade    1.0904    2.9755   0.2307  4.727 2.28e-06
#> 
#> Likelihood ratio test=27.76  on 3 df, p=4.086e-06
#> n= 108, number of events= 24 
#>    (71 observations deleted due to missingness)

In this case we used the GISTIC copy number call for PTEN as well as its RMA-normalized gene expression. Both coefficients are statistically significant at the 0.05 level (p = 0.023 and p = 0.047, respectively), alongside Gleason grade (p < 0.001).
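
Hazard ratios are often reported with confidence intervals, which the default print method does not show. The sketch below shows one way to extract them from a fitted coxph object. To stay runnable without downloads, it uses the lung dataset that ships with the survival package; that dataset is unrelated to prostate cancer, and the same calls would apply to coxmodel.

```r
library(survival)

# `lung` is bundled with survival and is used only as a stand-in dataset
fit_lung <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hazard ratios with 95% confidence intervals
hr_table <- summary(fit_lung)$conf.int[,
    c("exp(coef)", "lower .95", "upper .95")]
hr_table

# The hazard ratio is the exponentiated model coefficient
all.equal(unname(exp(coef(fit_lung))), unname(hr_table[, "exp(coef)"]))
```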

6 Session info

sessionInfo()
#> R version 4.5.0 beta (2025-04-02 r88102)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] survminer_0.5.0             ggpubr_0.6.0               
#>  [3] ggplot2_3.5.2               survival_3.8-3             
#>  [5] curatedPCaData_1.5.0        RaggedExperiment_1.33.0    
#>  [7] MultiAssayExperiment_1.35.0 SummarizedExperiment_1.39.0
#>  [9] Biobase_2.69.0              GenomicRanges_1.61.0       
#> [11] GenomeInfoDb_1.45.0         IRanges_2.43.0             
#> [13] MatrixGenerics_1.21.0       matrixStats_1.5.0          
#> [15] S4Vectors_0.47.0            BiocGenerics_0.55.0        
#> [17] generics_0.1.3              BiocStyle_2.37.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3               gridExtra_2.3           formatR_1.14           
#>   [4] rlang_1.1.6             magrittr_2.0.3          compiler_4.5.0         
#>   [7] RSQLite_2.3.9           png_0.1-8               vctrs_0.6.5            
#>  [10] reshape2_1.4.4          stringr_1.5.1           pkgconfig_2.0.3        
#>  [13] crayon_1.5.3            fastmap_1.2.0           magick_2.8.6           
#>  [16] backports_1.5.0         dbplyr_2.5.0            XVector_0.49.0         
#>  [19] labeling_0.4.3          KMsurv_0.1-5            rmarkdown_2.29         
#>  [22] markdown_2.0            UCSC.utils_1.5.0        tinytex_0.57           
#>  [25] purrr_1.0.4             bit_4.6.0               xfun_0.52              
#>  [28] cachem_1.1.0            litedown_0.7            jsonlite_2.0.0         
#>  [31] blob_1.2.4              DelayedArray_0.35.0     broom_1.0.8            
#>  [34] R6_2.6.1                bslib_0.9.0             stringi_1.8.7          
#>  [37] car_3.1-3               jquerylib_0.1.4         Rcpp_1.0.14            
#>  [40] bookdown_0.43           knitr_1.50              zoo_1.8-14             
#>  [43] BiocBaseUtils_1.11.0    Matrix_1.7-3            splines_4.5.0          
#>  [46] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
#>  [49] ggtext_0.1.2            codetools_0.2-20        curl_6.2.2             
#>  [52] lattice_0.22-7          tibble_3.2.1            plyr_1.8.9             
#>  [55] withr_3.0.2             KEGGREST_1.49.0         evaluate_1.0.3         
#>  [58] BiocFileCache_2.17.0    xml2_1.3.8              survMisc_0.5.6         
#>  [61] ExperimentHub_2.17.0    Biostrings_2.77.0       pillar_1.10.2          
#>  [64] BiocManager_1.30.25     filelock_1.0.3          carData_3.0-5          
#>  [67] BiocVersion_3.22.0      commonmark_1.9.5        munsell_0.5.1          
#>  [70] scales_1.3.0            xtable_1.8-4            glue_1.8.0             
#>  [73] tools_4.5.0             AnnotationHub_3.17.0    data.table_1.17.0      
#>  [76] ggsignif_0.6.4          grid_4.5.0              tidyr_1.3.1            
#>  [79] AnnotationDbi_1.71.0    colorspace_2.1-1        GenomeInfoDbData_1.2.14
#>  [82] Formula_1.2-5           cli_3.6.4               km.ci_0.5-6            
#>  [85] rappdirs_0.3.3          S4Arrays_1.9.0          dplyr_1.1.4            
#>  [88] gtable_0.3.6            rstatix_0.7.2           sass_0.4.10            
#>  [91] digest_0.6.37           SparseArray_1.9.0       farver_2.1.2           
#>  [94] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
#>  [97] httr_1.4.7              mime_0.13               gridtext_0.1.5         
#> [100] bit64_4.6.0-1

References

Abeshouse, Adam, Jaeil Ahn, Rehan Akbani, Adrian Ally, Samirkumar Amin, Christopher D. Andry, Matti Annala, et al. 2015. “The Molecular Taxonomy of Primary Prostate Cancer.” Cell 163 (4): 1011–25. https://doi.org/10.1016/j.cell.2015.10.025.

Abida, Wassim, Joanna Cyrta, Glenn Heller, Davide Prandi, Joshua Armenia, Ilsa Coleman, Marcin Cieslik, et al. 2019. “Genomic Correlates of Clinical Outcome in Advanced Prostate Cancer.” Proceedings of the National Academy of Sciences 116 (23): 11428–36. https://doi.org/10.1073/pnas.1902651116.

Baca, Sylvan C., Davide Prandi, Michael S. Lawrence, Juan Miguel Mosquera, Alessandro Romanel, Yotam Drier, Kyung Park, et al. 2013. “Punctuated Evolution of Prostate Cancer Genomes.” Cell 153 (3): 666–77. https://doi.org/10.1016/j.cell.2013.03.021.

Barbieri, Christopher E, Sylvan C Baca, Michael S Lawrence, Francesca Demichelis, Mirjam Blattner, Jean-Philippe Theurillat, Thomas A White, et al. 2012. “Exome Sequencing Identifies Recurrent SPOP, FOXA1 and MED12 Mutations in Prostate Cancer.” Nature Genetics 44 (6): 685–89. https://doi.org/10.1038/ng.2279.

Barwick, B G, M Abramovitz, M Kodani, C S Moreno, R Nam, W Tang, M Bouzyk, A Seth, and B Leyland-Jones. 2010. “Prostate Cancer Genes Associated with TMPRSS2ERG Gene Fusion and Prognostic of Biochemical Recurrence in Multiple Cohorts.” British Journal of Cancer 102 (3): 570–76. https://doi.org/10.1038/sj.bjc.6605519.

Chandran, Uma R, Changqing Ma, Rajiv Dhir, Michelle Bisceglia, Maureen Lyons-Weiler, Wenjing Liang, George Michalopoulos, Michael Becich, and Federico A Monzon. 2007. “Gene Expression Profiles of Prostate Cancer Reveal Involvement of Multiple Molecular Pathways in the Metastatic Process.” BMC Cancer 7 (1). https://doi.org/10.1186/1471-2407-7-64.

Friedrich, Maik, Karolin Wiedemann, Kristin Reiche, Sven-Holger Puppel, Gabriele Pfeifer, Ivonne Zipfel, Stefanie Binder, et al. 2020. “The Role of lncRNAs TAPIR-1 and -2 as Diagnostic Markers and Potential Therapeutic Targets in Prostate Cancer.” Cancers 12 (5): 1122. https://doi.org/10.3390/cancers12051122.

Goldman, Mary J., Brian Craft, Mim Hastie, Kristupas Repečka, Fran McDade, Akhil Kamath, Ayan Banerjee, et al. 2020. “Visualizing and Interpreting Cancer Genomics Data via the Xena Platform.” Nature Biotechnology 38 (6): 675–78. https://doi.org/10.1038/s41587-020-0546-8.

Herlemann, Annika, Huei-Chung Huang, Ridwan Alam, Jeffery J. Tosoian, Hyung L. Kim, Eric A. Klein, Jeffry P. Simko, et al. 2019. “Decipher Identifies Men with Otherwise Clinically Favorable-Intermediate Risk Disease Who May Not Be Good Candidates for Active Surveillance.” Prostate Cancer and Prostatic Diseases 23 (1): 136–43. https://doi.org/10.1038/s41391-019-0167-9.

Hieronymus, Haley, Nikolaus Schultz, Anuradha Gopalan, Brett S. Carver, Matthew T. Chang, Yonghong Xiao, Adriana Heguy, et al. 2014. “Copy Number Alteration Burden Predicts Prostate Cancer Relapse.” Proceedings of the National Academy of Sciences 111 (30): 11139–44. https://doi.org/10.1073/pnas.1411446111.

Jia, Zhenyu, Yipeng Wang, Anne Sawyers, Huazhen Yao, Farahnaz Rahmatpanah, Xiao-Qin Xia, Qiang Xu, et al. 2011. “Diagnosis of Prostate Cancer Using Differentially Expressed Genes in Stroma.” Cancer Research 71 (7): 2476–87. https://doi.org/10.1158/0008-5472.can-10-2585.

Kim, Hyung L., Ping Li, Huei-Chung Huang, Samineh Deheshi, Tara Marti, Beatrice Knudsen, Hatem Abou-Ouf, et al. 2018. “Validation of the Decipher Test for Predicting Adverse Pathology in Candidates for Prostate Cancer Active Surveillance.” Prostate Cancer and Prostatic Diseases 22 (3): 399–405. https://doi.org/10.1038/s41391-018-0101-6.

Knezevic, Dejan, Audrey D Goddard, Nisha Natraj, Diana B Cherbavaz, Kim M Clark-Langone, Jay Snable, Drew Watson, et al. 2013. “Analytical Validation of the Oncotype DX Prostate Cancer Assay a Clinical RT-PCR Assay Optimized for Prostate Needle Biopsies.” BMC Genomics 14 (1): 690. https://doi.org/10.1186/1471-2164-14-690.

Kunderfranco, Paolo, Maurizia Mello-Grand, Romina Cangemi, Stefania Pellini, Afua Mensah, Veronica Albertini, Anastasia Malek, Giovanna Chiorino, Carlo V. Catapano, and Giuseppina M. Carbone. 2010. “ETS Transcription Factors Control Transcription of EZH2 and Epigenetic Silencing of the Tumor Suppressor Gene Nkx3.1 in Prostate Cancer.” Edited by Chad Creighton. PLoS ONE 5 (5): e10547. https://doi.org/10.1371/journal.pone.0010547.

Laajala, Teemu D., Varsha Sreekanth, Alex C. Soupir, Jordan H. Creed, Anni S. Halkola, Federico C. F. Calboli, Kalaimathy Singaravelu, et al. 2023. “A Harmonized Resource of Integrated Prostate Cancer Clinical, -Omic, and Signature Features.” Scientific Data 10 (1). https://doi.org/10.1038/s41597-023-02335-4.

Longoni, N, P Kunderfranco, S Pellini, D Albino, M Mello-Grand, S Pinton, G DAmbrosio, et al. 2012. “Aberrant Expression of the Neuronal-Specific Protein DCDC2 Promotes Malignant Phenotypes and Is Associated with Prostate Cancer Progression.” Oncogene 32 (18): 2315–24. https://doi.org/10.1038/onc.2012.245.

NICE Advice. 2018. “Prolaris Gene Expression Assay for Assessing Long-Term Risk of Prostate Cancer Progression.” BJU International 122 (2): 173–80. https://doi.org/10.1111/bju.14452.

Peraldo-Neia, Caterina, Giorgia Migliardi, Maurizia Mello-Grand, Filippo Montemurro, Raffaella Segir, Ymera Pignochino, Giuliana Cavalloni, et al. 2011. “Epidermal Growth Factor Receptor (EGFR) Mutation Analysis, Gene Expression Profiling and EGFR Protein Expression in Primary Prostate Cancer.” BMC Cancer 11 (1). https://doi.org/10.1186/1471-2407-11-31.

Ren, Shancheng, Gong-Hong Wei, Dongbing Liu, Liguo Wang, Yong Hou, Shida Zhu, Lihua Peng, et al. 2018. “Whole-Genome and Transcriptome Sequencing of Prostate Cancer Identify New Genetic Alterations Driving Disease Progression.” European Urology 73 (3): 322–39. https://doi.org/10.1016/j.eururo.2017.08.027.

Sturm, Gregor, Francesca Finotello, Florent Petitprez, Jitao David Zhang, Jan Baumbach, Wolf H Fridman, Markus List, and Tatsiana Aneichyk. 2019. “Comprehensive Evaluation of Transcriptome-Based Cell-Type Quantification Methods for Immuno-Oncology.” Bioinformatics 35 (14): i436–i445. https://doi.org/10.1093/bioinformatics/btz363.

Sun, Yijun, and Steve Goodison. 2009. “Optimizing Molecular Signatures for Predicting Prostate Cancer Recurrence.” The Prostate 69 (10): 1119–27. https://doi.org/10.1002/pros.20961.

Taylor, Barry S., Nikolaus Schultz, Haley Hieronymus, Anuradha Gopalan, Yonghong Xiao, Brett S. Carver, Vivek K. Arora, et al. 2010. “Integrative Genomic Profiling of Human Prostate Cancer.” Cancer Cell 18 (1): 11–22. https://doi.org/10.1016/j.ccr.2010.05.026.

True, Lawrence, Ilsa Coleman, Sarah Hawley, Ching-Ying Huang, David Gifford, Roger Coleman, Tomasz M. Beer, et al. 2006. “A Molecular Correlate to the Gleason Grading System for Prostate Adenocarcinoma.” Proceedings of the National Academy of Sciences 103 (29): 10991–6. https://doi.org/10.1073/pnas.0603678103.

Wallace, Tiffany A., Robyn L. Prueitt, Ming Yi, Tiffany M. Howe, John W. Gillespie, Harris G. Yfantis, Robert M. Stephens, Neil E. Caporaso, Christopher A. Loffredo, and Stefan Ambs. 2008. “Tumor Immunobiological Differences in Prostate Cancer Between African-American and European-American Men.” Cancer Research 68 (3): 927–36. https://doi.org/10.1158/0008-5472.can-07-2608.

Wang, Yipeng, Xiao-Qin Xia, Zhenyu Jia, Anne Sawyers, Huazhen Yao, Jessica Wang-Rodriquez, Dan Mercola, and Michael McClelland. 2010. “In Silico Estimates of Tissue Components in Surgical Samples Based on Expression Profiling Data.” Cancer Research 70 (16): 6448–55. https://doi.org/10.1158/0008-5472.can-10-0021.

Weiner, Adam B., Thiago Vidotto, Yang Liu, Adrianna A. Mendes, Daniela C. Salles, Farzana A. Faisal, Sanjana Murali, et al. 2021. “Plasma Cells Are Enriched in Localized Prostate Cancer in Black Men and Are Associated with Improved Outcomes.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-021-21245-w.

Zhang, Junjun, Rosita Bajari, Dusan Andric, Francois Gerthoffert, Alexandru Lepsa, Hardeep Nahal-Bose, Lincoln D. Stein, and Vincent Ferretti. 2019. “The International Cancer Genome Consortium Data Portal.” Nature Biotechnology 37 (4): 367–69. https://doi.org/10.1038/s41587-019-0055-9.

Overview to curatedPCaData

17 April 2025
