R version: R version 3.6.0 (2019-04-26)
Bioconductor version: 3.9
Package version: 1.0.1

1 Introduction

The development of microarrays and more recently the rapid uptake of RNA-sequencing technologies have provided a platform to examine the transcriptional profile (or transcriptome) of biological samples (Cieślik and Chinnaiyan 2018). Transcriptomic analyses have traditionally focused on ‘differential expression’ of genes between sets of samples, however with the rapid growth of publicly available RNA data there has been increasing usage of ‘relative approaches’, which quantify the relative concordance of a sample or samples with a specific gene signature (Cieślik and Chinnaiyan 2018). While sequencing of genomic mutations has been important for classifying different tumour subsets based upon the presence of mutations or fusion genes, and identifying genetic lesions which may act as drivers of cancer progression, transcriptomic profiling can provide further information on the state or phenotype of cells carrying these mutations.

Cancers are a heterogeneous set of diseases with a number of clinical and pathological subtypes. In diseases such as breast cancer the primary clinical classifications relate to the expression of hormone receptors (estrogen receptor: ER; and progesterone receptor: PR) or the overexpression of Erb-B2 receptor tyrosine kinase (HER2), as these features can be directly targeted with therapeutic agents. A common example of transcriptomic or gene expression data informing clinical practice is the use of prediction analysis of microarray 50 (PAM50) signatures for distinguishing the intrinsic breast cancer subtypes (Parker et al. 2009, Cieślik and Chinnaiyan (2018)). For many other cancers, subtype classification has largely relied upon identifying sets of recurrent mutations across large patient cohorts, with whole genome or whole exome sequencing studies helping to resolve the clinically significant subtypes (Cancer Genome Atlas Research Network 2013, Papaemmanuil et al. (2016)).

Perhaps the most well-known ‘relative approach’ is single-sample gene set enrichment analysis (ssGSEA) (Barbie et al. 2009), often used through the GenePattern web-tool. Another common approach is gene set variation analysis (GSVA) (Hänzelmann, Castelo, and Guinney 2013) which is available as an R/Bioconductor package that also includes functionality for ssGSEA, an alternative approach known as PLAGE (Tomfohr, Lu, and Kepler 2005), and a z-score based approach (Lee et al. 2008). Both ssGSEA and GSVA use a Kolmogorov-Smirnov like random-walk statistic to convert normalised gene ranks to the resulting score, however this normalisation procedure means that the scores are not truly ‘single-sample’, and variations in the overall sample composition for a study (e.g. variations in the presence or relative frequency of different cancer subtypes) can lead to unexpected changes in sample scores. Furthermore, the resultant scores from these methods can vary in their range and absolute value, making them difficult to interpret without further processing. To overcome this, we have developed a single-sample gene set scoring method singscore (Foroutan et al. 2018) which simply uses the ranks of genes within a given set, normalised relative to the maximum and minimum theoretical scores for a gene set of a given size.

Through large scale efforts such as The Cancer Genome Atlas (TCGA), transcriptomic data are available for thousands of clinical samples, often together with corresponding genomic or epigenomic (often DNA methylation) data. These transcriptomic data can help to characterise the functional effects of corresponding mutations, and provide a window to study the heterogeneity which arises within different subtypes of cancer due to epigenetic and transcriptional regulatory programs which can also influence cell behaviour. Here, we demonstrate that the single-sample gene set scoring method singscore (Foroutan et al. 2018) can be used to classify TCGA AML samples using transcriptional ‘gene signatures’ for the NPM1c mutation, KMT2A (MLL) gene fusions, and PML-RARA gene fusions that were derived from independent studies. Without any need for parameter fitting or estimation, we show that gene set scoring with singscore can distinguish samples carrying these mutations. The case studies we present demonstrate the application of gene set scoring to examine not only the differences, but also the relative similarities between established subtypes of AML that impact clinical outcome.

2 Description of the biological problem

As with most cancers, acute myeloid leukemia (AML) is a heterogeneous disease with a number of classified subtypes. Analysis of TCGA AML genomic data identified a number of subtypes based upon the presence or absence of specific ‘driver mutations’; recapitulating and expanding upon previously identified clinical subsets (Cancer Genome Atlas Research Network 2013). A more recent study which focused primarily on genomic data has further refined the clinically significant AML subtypes (Papaemmanuil et al. 2016), highlighting a number of co-occurring as well as mutually exclusive mutations.

Of note for this work, one of the most common mutations in clinical AML samples is a frameshift mutation within exon 12 of the nucleophosmin (NPM1) gene (Papaemmanuil et al. 2016). This mutation leads to aberrant localisation of nucleophosmin with cytoplasmic accumulation rather than localising to the nucleolus, and accordingly this mutation is often referred to as the NPM1c mutation (Brunetti et al. 2018). As noted by Verhaak et al. (2005), the NPM1c mutation is associated with dysregulated activity of the homeobox domain (Hox) family of transcription factors which are essential for developmental patterning. The effects of this mutation in disease progression have been further demonstrated in recent work which showed that loss of NPM1c leads to differentiation of AML cells (Brunetti et al. 2018).

Further recurrent genetic lesions in AML relevant for this work include lysine methyl transferase 2A (KMT2A; previously known as MLL) fusion genes, partial tandem duplications within KMT2A (KMT2A-PTD), and fusion genes between promyelocytic leukemia and retinoic acid receptor alpha (PML-RARA). Given the role of NPM1c in dysregulating the Hox gene family, it is interesting to note that AML samples with MLL fusion genes also show dysregulated expression of Hox family genes (Hess 2004, Ross et al. (2004)); however, samples with MLL-PTD appear to show a relatively distinct phenotype from MLL-fusion samples (Ross et al. 2004). While there is good evidence demonstrating the role of NPM1c mutations and other genetic lesions in blocking AML cell differentiation, the PML-RARA fusion subset is diagnostic for a specific subset of AML known as acute promyelocytic leukemia (APL). This clinically distinct subtype of AML is associated with a specific morphology under the French-American-British (FAB) classification of AML, FAB-M3, with cells showing a distinct morphology due to a differentiation block at the promyelocyte stage (Thé et al. 1991).

In this workflow we demonstrate the ability of the singscore method for single sample gene set scoring (Foroutan et al. 2018) to classify tumour ‘driver mutations’ from transcriptomic data. We use a previously identified gene signature for the NPM1c mutation (Verhaak et al. 2005). We also use signatures for PML-RARA gene fusions and MLL-fusions that were derived using pediatric AML samples but shown to work well for classifying adult AML samples with similar lesions (Ross et al. 2004), although we note that there is evidence of relatively large differences in the mutational profiles of adult and pediatric cancers (Ma et al. 2018). Using these signatures, which are included within the molecular signatures database (MSigDB) (Liberzon et al. 2015), we demonstrate that a bi-directional scoring approach can classify TCGA AML samples with corresponding mutations with a good precision and recall. A particularly useful feature of gene set scoring is the ability to project samples onto 2D or higher-order landscapes defined by corresponding phenotypic signatures. Accordingly, by comparing scores for both the NPM1c and KMT2A-/MLL-fusion signatures, we show that this classification likely arises through the shared downstream biological effects of Hox family dysregulation. We also compare the NPM1c mutation signature to the PML-RARA signature and show a clear separation of these subtypes reflecting their divergent phenotypes and the mutually exclusive nature of these mutations.

3 Downloading and preparing the data

Data from TCGA project is made available through the Genomic Data Commons (GDC). Open access data from the project can be accessed in multiple pre-processed formats. Transcriptomic data can be downloaded either at the count level or as FPKM transformed abundance, before or after upper quantile normalisation. Other pre-processed version can be found from sources such as the cBioPortal and FireBrowse. The GDC data used STAR to perform a two-pass alignment followed by quantification using HTSeq. Data from the GDC can be downloaded using the GDC data transfer tool which allows users to select the specific files of interest using the GDC portal. These files then have to be read, merged, annotated and processed into a data structure that simplifies downstream analysis. Alternatively, all the above mentioned steps, including the download, can be performed using the R package TCGAbiolinks (Colaprico et al. 2015, Colaprico et al. (2019)). The package supports data download using the GDC API and the GDC client. We will use the TCGAbiolinks package to download, annotate and process the data into a SummarizedExperiment R object.

The following steps need to be performed to prepare the data:

  1. Create a query to select the files to download
  2. Execute the query and download the data
  3. Read the data into R
  4. Filter out genes with low expression
  5. Normalise the data for compositional bias and transform to account for gene-length biases as outlined in the singscore manuscript (Foroutan et al. 2018)

3.1 Querying the GDC database

The first step in any analysis should be to determine and report the data version and the service used to download the data. The getGDCInfo() function returns the release date of all data on the GDC along with a version.

library(SingscoreAMLMutations)
library(TCGAbiolinks)

#get GDC version information
gdc_info = getGDCInfo()
gdc_info
## $commit
## [1] "3e22a4257d5079ae9f7193950b51ed9dfc561ed1"
## 
## $data_release
## [1] "Data Release 17.0 - June 05, 2019"
## 
## $status
## [1] "OK"
## 
## $tag
## [1] "1.21.0"
## 
## $version
## [1] 1

A query then needs to be run, using the GDC to identify the specific files for download. This step is similar to generating a MANIFEST file using the GDC portal. The first parameter of the query specifies the project - available projects can be accessed using getGDCprojects() or from https://portal.gdc.cancer.gov/projects. The TCGA acute myeloid leukemia data is part of the TCGA-LAML project. Following this, the data category, data type and workflow type need to be specified. The query formed below selects files containing the count level transcriptomic measurements. Values for different parameters of the query can be identified from “searching arguments” section of the “query” vignette: vignette("query", package = "TCGAbiolinks"). The result of this query will be a dataframe containing filenames and additional annotations related to the files.

Read count level data are selected instead of the processed FPKM data as one of the downstream pre-processing analysis results in filtering out of genes. A general recommendation is to compute FPKM values after filtering genes out so as to ensure counts are normalised by the corresponding library sizes. In cases where count-level data is not available, filtering can be performed directly on FPKM values, provided that the library size is large enough.

#form a query for the RNAseq data
query_rna = GDCquery(
  #getGDCprojects()
  project = 'TCGA-LAML',
  #TCGAbiolinks:::getProjectSummary('TCGA-LAML')
  data.category = 'Transcriptome Profiling',
  data.type = 'Gene Expression Quantification',
  workflow.type = 'HTSeq - Counts'
)

#extract results of the query
rnaseq_res = getResults(query_rna)
dim(rnaseq_res)
## [1] 151  28
colnames(rnaseq_res)
##  [1] "data_release"              "data_type"                
##  [3] "updated_datetime"          "file_name"                
##  [5] "submitter_id"              "file_id"                  
##  [7] "file_size"                 "cases"                    
##  [9] "id"                        "created_datetime"         
## [11] "md5sum"                    "data_format"              
## [13] "access"                    "state"                    
## [15] "version"                   "data_category"            
## [17] "type"                      "experimental_strategy"    
## [19] "project"                   "analysis_id"              
## [21] "analysis_updated_datetime" "analysis_created_datetime"
## [23] "analysis_submitter_id"     "analysis_state"           
## [25] "analysis_workflow_link"    "analysis_workflow_type"   
## [27] "analysis_workflow_version" "tissue.definition"

3.2 Downloading the TCGA AML RNA-seq read counts

The GDCdownload function then executes the query on the GDC database and begins downloading the data using the GDC API. The download method should be changed to “client”, if the size of the data is expected to be large, e.g for RNA-seq read data or methylation data. It is good practice to specify the directory for data storage - we store all the data in the “GDCdata” directory in the temporary directory. Users should store their data in a permanent storage to retain the data. The function downloads the data and organises it into the folder based on parameters specified in the query. This allows multiple different levels and types of data to be stored in the same directory structure. Files with counts are stored at TEMPDIR/GDCdata/TCGA-LAML/harmonized/Transcriptome_Profiling/Gene_Expression_Quantification/.

datapath = file.path(tempdir(), 'GDCdata')
GDCdownload(query_rna, directory = datapath) #(size: 39MB)

3.3 Reading count-level data into R

The GDCprepare function reads and processes the downloaded data into a RangedSummarizedExperiment object from the SummarizedExperiment package (Morgan et al. 2019) which allows patient annotations, gene annotations and count data to be stored in one object. Patient annotations are downloaded upon calling this function and subsequently mapped and attached to the resulting object. A RangedSummarizedExperiment object is similar to an ExpressionSet object but provides added functionality such as indexing with genomic coordinates and storing multiple data matrices with the same structure. Feature annotations used to annotate the data are stored in an RDA/RDATA file.

aml_se = GDCprepare(query_rna, directory = datapath)

The object contains data for 56,925 features and 151 samples. The original data files contain 60,483 features, some of which (3,881) could not be mapped to ENSEMBL GRCh38.p12. Feature and sample annotations can be accessed using rowData(se) and colData(se), respectively, and the counts data can be accessed using assay(se). TCGA data usually contains some formalin-fixed paraffin-embedded (FFPE) samples which should be discarded from the analysis as the protocol introduces biological artefacts. This procedure is only performed on solid tumours and not leukemias, therefore, no filtering is required for this data set.

aml_se
## class: RangedSummarizedExperiment 
## dim: 56602 151 
## metadata(1): data_release
## assays(1): HTSeq - Counts
## rownames(56602): ENSG00000000003 ENSG00000000005 ... ENSG00000281912
##   ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id
## colnames(151): TCGA-AB-2921-03A-01T-0740-13
##   TCGA-AB-2932-03A-01T-0740-13 ... TCGA-AB-2910-03A-01T-0740-13
##   TCGA-AB-3012-03A-01T-0736-13
## colData names(61): sample patient ... name is_ffpe

3.4 Filter out genes with low counts

The edgeR package (Chen et al. 2019) contains methods that assist in the data normalisation and transformation required for filtering and subsequent steps. The methods require a DGEList object therefore we begin by creating a DGEList for the AML data from the SummarizedExperiment.

library(SummarizedExperiment)
library(edgeR)

aml_dge = DGEList(counts = assay(aml_se), genes = rowData(aml_se))

Genes with low counts across most samples are discarded from the analysis. This is a standard step in differential expression analysis as inclusion of such genes in the analysis could skew estimates of dispersion. It is also motivated in rank-based analysis, such as with singscore, to avoid rank duplication. Rank duplication reduces the discriminant power of scores as the number of unique ranks is reduced. A commonly used filter is to select only those genes that have CPMs above a certain threshold across a proportion of samples. Filtering is performed on the CPMs rather than raw counts as the former accounts for variation in library sizes, therefore, is unbiased. For instance, a CPM of 1 would equate to read counts between 19 and 50 for samples in the AML data where library sizes vary between 18.6 and 49.7 million reads. Here, we retain genes that have a CPM > 1 across more than 50% of the samples. Other methods to filter out genes with low counts exist and may be preferable in specific applications. Chen, Lun, and Smyth (2016) and Law et al. (2016) filter genes based on the experimental design whereby the proportion of samples with enough read counts are evaluated per experimental group. As the AML data have many samples, filtering is performed across all samples rather than within sub-groups. Group specific filtering would be recommended for the study of rare groups. The distribution of logCPMs is much closer to the expected log-normal distribution after filtering out genes with low counts as seen in Figure 1.

prop_expressed = rowMeans(cpm(aml_dge) > 1)
keep = prop_expressed > 0.5

op = par(no.readonly = TRUE)
par(mfrow = c(1, 2))
hist(cpm(aml_dge, log = TRUE), main = 'Unfiltered', xlab = 'logCPM')
abline(v = log(1), lty = 2, col = 2)
hist(cpm(aml_dge[keep, ], log = TRUE), main = 'Filtered', xlab = 'logCPM')
abline(v = log(1), lty = 2, col = 2)
Histogram of logCPM values for the AML data before and after filtering. Filtering results in fewer zeros in the data. Most genes with CPM less than 1, logCPM < 0; (red line) across the majority of samples get discarded, resulting in an approximately log-normal distribution.

Figure 1: Histogram of logCPM values for the AML data before and after filtering
Filtering results in fewer zeros in the data. Most genes with CPM less than 1, logCPM < 0; (red line) across the majority of samples get discarded, resulting in an approximately log-normal distribution.

par(op)
#subset the data
aml_dge = aml_dge[keep, , keep.lib.sizes = FALSE]
aml_se = aml_se[keep, ]

3.5 Transformation to FPKM values and normalisation

Singscore requires gene expression measurements to be comparable between genes within a sample, therefore, correction for gene length bias needs to be performed (Oshlack and Wakefield 2009). Transformations such as transcripts per million (TPM) and reads/fragments per kilobase per million (RPKM/FPKM), that normalise by gene length, may be used. Both- TPM and RPKM/FPKM values should produce similar results when applying singscore provided that the library size is large enough, which they are here. RPKM values are generally computed after correcting for compositional biases. The calcNormFactors function in edgeR provides three methods to do so, TMM normalisation being the default. Chen, Lun, and Smyth (2016) and Law et al. (2016) discuss the implications of normalisation prior to down-stream processing such as differential expression analysis. Normlisation is generally performed for cross-sample analysis where samples need to be comparable. Singscores are invariant to data normalisation as the analysis is contained within the sample of interest. The idea extends to any transformation that preserves the relative ranks of genes within a sample such as a log transformation. Here, we use TMM normalisation solely for visualisation purposes.

Data transformation to TPM or RPKM/FPKM requires the lengths for all genes to be calculated. Gene lengths need to be computed based on the alignment and quantification parameters. The TCGA transcriptomic data has been aligned using STAR and quantified using HTSeq (details of the pipeline available at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/). HTSeq quantifies reads mapping to the exons of each gene, therefore, effective gene lengths can be calculated as the sum of all exons spanning the gene. The GENCODE v22 annotation file was used for quantification therefore the same file needs to be used to compute gene lengths.

#download v22 of the GENCODE annotation
gencode_file = 'gencode.v22.annotation.gtf.gz'
gencode_link = paste(
  'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22',
  gencode_file,
  sep = '/'
  )
download.file(gencode_link, gencode_file, method = 'libcurl') #(size: 39MB)

The rtracklayer R package (Lawrence, Carey, and Gentleman 2019) provides functions to help parse GTF files.

library(rtracklayer)
library(plyr)

gtf = import.gff(gencode_file, format = 'gtf', genome = 'GRCm38.71', feature.type = 'exon')
#split records by gene to group exons of the same gene
grl = reduce(split(gtf, elementMetadata(gtf)$gene_id))
gene_lengths = ldply(grl, function(x) {
  #sum up the length of individual exons
    return(c('gene_length' = sum(width(x))))
}, .id = 'ensembl_gene_id')

Genes are also annotated with their biotype for further analysis. The annotation file uses Ensembl IDs with versions as keys to records, which then need to be converted to Ensembl IDs. This is simply achieved by truncating the trailing version number.

#extract information on gene biotype
genetype = unique(elementMetadata(gtf)[, c('gene_id', 'gene_type')])
colnames(genetype)[1] = 'ensembl_gene_id'
gene_lengths = merge(genetype, gene_lengths)

#remove ENSEMBL ID version numbers
gene_lengths$ensembl_gene_id = gsub('\\.[0-9]*', '', gene_lengths$ensembl_gene_id)
saveRDS(gene_lengths, file = 'gene_lengths_HTSeq_gencodev22.rds')
gene_lengths
## DataFrame with 60483 rows and 3 columns
##       ensembl_gene_id      gene_type gene_length
##           <character>    <character>   <integer>
## 1     ENSG00000000003 protein_coding        4535
## 2     ENSG00000000005 protein_coding        1610
## 3     ENSG00000000419 protein_coding        1207
## 4     ENSG00000000457 protein_coding        6883
## 5     ENSG00000000460 protein_coding        5967
## ...               ...            ...         ...
## 60479 ENSGR0000275287       misc_RNA         290
## 60480 ENSGR0000276543          miRNA          68
## 60481 ENSGR0000277120          miRNA          64
## 60482 ENSGR0000280767        lincRNA         515
## 60483 ENSGR0000281849      antisense         484

The SummarizedExperiment object allows feature annotations to be stored, therefore, information on gene length and biotypes should be added to the existing annotations. Similarly, annotations need to be added to the DGEList object. The column containing lengths should include “length” in its name.

#allocate rownames for ease of indexing
rownames(gene_lengths) = gene_lengths$ensembl_gene_id
rowData(aml_se)$gene_length = gene_lengths[rownames(aml_se), 'gene_length']
rowData(aml_se)$gene_biotype = gene_lengths[rownames(aml_se), 'gene_type']

#annotate gene lengths for the DGE object
aml_dge$genes$length = gene_lengths[rownames(aml_dge), 'gene_length']

RPKM/FPKM values can now be calculated with the computed gene lengths after computing the normalisation factors. The SummarizedExperiment object can store multiple levels of the data simultaneously, provided that the number of features and samples remains the same across measurements. As such, FPKM values are appended to the existing object.

aml_dge_tmm = calcNormFactors(aml_dge, method = 'TMM')

#compute FPKM values and append to assays
assay(aml_se, 'logFPKM_TMM') = rpkm(aml_dge_tmm, log = TRUE)
aml_se
## class: RangedSummarizedExperiment 
## dim: 17425 151 
## metadata(1): data_release
## assays(2): HTSeq - Counts logFPKM_TMM
## rownames(17425): ENSG00000000419 ENSG00000000457 ... ENSG00000281772
##   ENSG00000281896
## rowData names(5): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id gene_length gene_biotype
## colnames(151): TCGA-AB-2921-03A-01T-0740-13
##   TCGA-AB-2932-03A-01T-0740-13 ... TCGA-AB-2910-03A-01T-0740-13
##   TCGA-AB-3012-03A-01T-0736-13
## colData names(61): sample patient ... name is_ffpe

3.6 Annotate samples with mutation data

For this analysis we have used the curated mutation list from the original TCGA AML publication (Cancer Genome Atlas Research Network 2013) (Supplemental Table 01 at https://gdc.cancer.gov/node/876) rather than variant calls from the standard TCGA pipeline (available through the National Cancer Institute Genomic Data Commons) and readers should note that there are some discrepancies between these. For genetic lesions of interest (NPM1c, KMT2A-MLL, KMT2A-PTD and PML-RARA), patients were identified by the following criteria:

  • Patient ID: The ‘TCGA Patient ID’ column was extracted directly
  • NPM1c: TRUE if the ‘NPM1’ column contains the strings ‘p.W287fs’ or ‘p.W288fs’
  • KMT2A-fusion: TRUE if the ‘MLL-partner’ column contains the string ‘MLL-’ or ‘-MLL’ (note that the official gene symbol for MLL is now KMT2A)
  • KMT2A-PTD: TRUE if the ‘MLL-PTD’ column contains the string ‘exons’
  • PML-RARA: TRUE if the ‘PML-RARA’ column contains the string ‘PML-RARA’
data(AMLPatientMutationsTCGA)
patient_mutations = AMLPatientMutationsTCGA
patient_mutations = patient_mutations[colnames(aml_se), ] # order samples
aml_mutations = colnames(patient_mutations) # get mutation labels
colData(aml_se) = cbind(colData(aml_se), patient_mutations)
colData(aml_se)[, aml_mutations]
## DataFrame with 151 rows and 4 columns
##                              NPM1c.Mut KMT2A.Fusion KMT2A.PTD  PML.RARA
##                              <logical>    <logical> <logical> <logical>
## TCGA-AB-2921-03A-01T-0740-13     FALSE        FALSE     FALSE     FALSE
## TCGA-AB-2932-03A-01T-0740-13      TRUE        FALSE     FALSE     FALSE
## TCGA-AB-2971-03A-01T-0734-13     FALSE        FALSE     FALSE     FALSE
## TCGA-AB-2987-03A-01T-0734-13      TRUE        FALSE     FALSE     FALSE
## TCGA-AB-2928-03A-01T-0740-13     FALSE        FALSE     FALSE     FALSE
## ...                                ...          ...       ...       ...
## TCGA-AB-3002-03A-01T-0736-13     FALSE        FALSE     FALSE     FALSE
## TCGA-AB-2915-03A-01T-0740-13     FALSE        FALSE     FALSE     FALSE
## TCGA-AB-2927-03A-01T-0740-13     FALSE        FALSE     FALSE     FALSE
## TCGA-AB-2910-03A-01T-0740-13     FALSE        FALSE     FALSE     FALSE
## TCGA-AB-3012-03A-01T-0736-13     FALSE        FALSE     FALSE      TRUE

3.7 Map Ensembl IDs to Entrez IDs

Ensembl annotations (Ensembl IDs) have higher coverage of the genome which may be useful in applications such as variant calling and similar exploratory analysis (Zhao and Zhang 2015). However, RefSeq annotations (Entrez IDs) may be better suited to RNA-seq analyses which require a stable reference annotation for comparison (Wu, Phan, and Wang 2013). As such, we choose to map Ensembl IDs to Entrez IDs and discard any unmapped features.

Mapping can be performed using the Ensembl Biomart service, which can be queried using the biomaRt bioconductor package. This would provide the most up to date annotations. Alternatively, mapping could be performed with the bi-annually updated org.Hs.eg.db annotation R package (Carlson 2019) which provides a stable set of annotations, thereby enhancing reproducibility. Mapping is performed with the mapIds function in the AnnotationDbi R package (Pagès et al. 2019).

library(org.Hs.eg.db)

rowData(aml_se)$entrezgene = mapIds(
  org.Hs.eg.db,
  keys = rownames(aml_se),
  keytype = 'ENSEMBL',
  column = 'ENTREZID',
  multiVals = 'asNA'
  )
gene_annot = rowData(aml_se)

Multimapped Ensembl IDs are replaced by NAs, then discarded to enforce unique mapping. Similarly, Entrez IDs that map to multiple Ensembl IDs are identified from the mapping, and discarded. Only features with unique Ensembl ID to Entrez ID mappings remain.

#select genes with mapped Entrez IDs
keep = !is.na(gene_annot$entrezgene)

#select genes with unique Entrez IDs
dup_entrez = gene_annot$entrezgene[duplicated(gene_annot$entrezgene)]
keep = keep & !gene_annot$entrezgene %in% dup_entrez

#Biotype of discarded genes (due to non-unique mapping)
head(sort(table(gene_annot[!keep, 'gene_biotype']), decreasing = TRUE), n = 10)
## 
##               processed_pseudogene                          antisense 
##                               1017                                907 
##                            lincRNA                                TEC 
##                                451                                263 
##                     sense_intronic                     protein_coding 
##                                252                                139 
##             unprocessed_pseudogene transcribed_unprocessed_pseudogene 
##                                128                                 93 
##               processed_transcript   transcribed_processed_pseudogene 
##                                 72                                 72
#subset the data
aml_se = aml_se[keep, ]

4 Transcriptional signatures to predict mutation status

The signature by Verhaak et al. (2005) is now used to predict the mutation status of the NPM1c mutation. This is done by quantifying the concordance of genes in the signature with their expression in each sample. As such, high expression of up-regulated genes and low expression of down-regulated genes would result in higher scores. This single value can then be used to predict the mutation status of individual samples if these data were unavailable.

The signatures of interest are first downloaded from the MSigDB and read into GeneSet objects from the GSEABase R package (Morgan, Falcon, and Gentleman 2019). We then use the singscore R/Bioconductor package to quantify each sample for the Verhaak signature. Some of the visualisation and diagnostic tools within the singscore package are used to interpret the signatures and scores. Finally, we use a simple logistic regression model on the scores to predict the mutation status.

4.1 Download signature and load into R

The Verhaak et al. (2005) signature is composed of an up-regulated and a down-regulated gene set. Many signatures are developed in such a manner to improve discrimination of samples. MSigDB stores such signatures using names with suffixes “_UP" and “_DN" representing the independent components of the signature. Here, we form the download links for the signature with the base name “VERHAAK_AML_WITH_NPM1_MUTATED”.

#create signature names
verhaak_names = paste('VERHAAK_AML_WITH_NPM1_MUTATED', c('UP', 'DN'), sep = '_')
verhaak_names
## [1] "VERHAAK_AML_WITH_NPM1_MUTATED_UP" "VERHAAK_AML_WITH_NPM1_MUTATED_DN"

The signatures are then downloaded using the links, resulting in an XML file for each component of the signature. The mapply function is used to run the download function on all pairs of link-output arguments.

#generate URLs
verhaak_links = paste0(
  'http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=',
  verhaak_names,
  '&fileType=xml'
  )

#download files
verhaak_files = paste0(verhaak_names, '.xml')
mapply(download.file, verhaak_links, verhaak_files, method = 'libcurl')

Functions in the GSEABase package help with reading, parsing and processing the signatures. Signatures from an MSigDB XML file can be read using the getBroadSets function which results in a GeneSet object. Gene symbols, Entrez IDs and affymetrix chip IDs from the original experiment (HG-U133A in this case) are stored in the XML file. Entrez IDs are read from the file as these can be mapped directly to our data. Conversions to other identifiers can be achieved using the mapIdentifiers function from GSEABase and an annotation package that contains the mappings. The advantage of using this function instead of the mapIds function from the AnnotationDbi package is that the former retains the GeneSet object after conversion of IDs.

library(GSEABase)

verhaak_sigs = getBroadSets(verhaak_files, membersId = 'MEMBERS_EZID')
verhaak_sigs
## GeneSetCollection
##   names: VERHAAK_AML_WITH_NPM1_MUTATED_UP, VERHAAK_AML_WITH_NPM1_MUTATED_DN (2 total)
##   unique identifiers: 10051, 10135, ..., 9828 (429 total)
##   types in collection:
##     geneIdType: SymbolIdentifier (1 total)
##     collectionType: BroadCollection (1 total)

To make data indexing easier during signature scoring, row names of the SummarisedExperiment object are changed to Entrez IDs which are already part of the row annotations.

rownames(aml_se) = rowData(aml_se)$entrezgene

4.2 Score TCGA AML samples using the Verhaak signature

Singscore is a rank based metric of gene set enrichment in single samples. Scores for multiple signatures make use of the same ranked expression per sample. As such, it makes sense to compute the ranks only once and re-use them for scoring different signatures. The implementation separates these two phases of the analysis to reduce the computational cost of scoring. The rankGenes function will compute ranks from expression data in the form of either a numeric matrix, numeric data frame, ExpressionSet object, DGEList object or a SummarizedExperiment object. Users also have to specify what method should be used to break ties. The default is ‘min’ and we recommend this be used for RNA-seq data which may have many genes with zero counts. This will reduce the effect of zeros in the scores, however, appropriate pre-filtering of genes with low counts will still be required.

library(singscore)

#apply the rankGenes method to each version of the dataset, excluding counts
aml_ranked = rankGenes(assay(aml_se, 'logFPKM_TMM'))

Singscores can be computed using three modes, depending on the properties of the gene signature. The first mode of operation is applied when two directed gene sets (expected up- and down-regulated gene sets) form the transcriptomic signature. Many signatures in the MSigDB, including the Verhaak et al. (2005) signature come in such pairs. This mode can be invoked by passing the up- and down-regulated gene sets to the arguments upSet and downSet respectively. In some cases, only one set of genes forms the signature. If all genes in the gene set are up-regulated or all down-regulated, the second mode of operation applies and is invoked by passing the gene set to the upSet argument. For sets of down-regulated genes, the score would simply be inverted (-score if scores are centered, 1 - score otherwise). Finally, if the user is unsure of the composition of genes in the gene-set, such that, the gene set may contain both up- and down- regulated genes, the final mode of singscore applies. The gene set is passed to the upSet argument similar to the previous mode with the additional argument knownDirection set to FALSE.

By default, singscores are centered such that the range of scores is \([-1, 1]\) and \([-0.5, 0.5]\) for the first two modes respectively. Negative scores indicate an inverse enrichment of signatures, that is, expected up-regulated genes are in fact down-regulated and vice-versa. Scores from the last mode can not be centered and have the range \([0, 1]\). In this mode, high scores are obtained when ranks of genes are distant from the median and low scores obtained when ranks converge to the median rank. If scores are centered in this scenario, it would lead to the conclusion that a negative score shows inverted enrichment, which is not the case. Score centering only serves the purpose of easing interpretation for users, a simple linear transformation is applied to achieve it.

Scores for the NPM1c mutation signature are computed using the default settings, with the first mode of operation being used due to the presence of an up- and down- regulated gene set. The function returns a data frame reporting the score and dispersion of ranks for the up-regulated gene set, down-regulated gene set and the combination of both. Dispersion of the combined gene set in this mode is simply the mean of the independent dispersion estimates. If any gene names/IDs are present in the signature but missing in the expression data, a warning will be reported.

#apply the scoring function
verhaak_scores = simpleScore(aml_ranked,
                             upSet = verhaak_sigs[[1]],
                             downSet = verhaak_sigs[[2]])
## Warning in checkGenes(upSet, rownames(rankData)): 22 genes missing: 10265,
## 108, 10924, 11025, 11026, 1646, 1672, 200315, 2215, 3215, 3216, 3569, 3627,
## 3759, 50486, 6346, 6364, 861, 8843, 9103, 9518, 9627
## Warning in checkGenes(downSet, rownames(rankData)): 27 genes missing:
## 100505650, 10232, 10267, 2122, 221981, 2258, 23532, 23630, 24141, 25907,
## 2697, 28517, 28639, 3386, 3848, 3934, 4070, 445, 4680, 4681, 5457, 5790,
## 6091, 7441, 8277, 862, 8788

It should be noted that singscores are composed of two components, an enrichment score and a dispersion estimate of ranks. The quantity of interest in gene set enrichment is the distribution of the expression or ranks of genes in the signature. In an ideal scenario, all expected up-regulated genes would have high expression therefore higher values of ranks. As such, ranks would be distributed on the higher end of the entire rank spectrum. Singscore aims to quantify this distribution of ranks, therefore, computes and reports the average and dispersion of ranks of genes in the signature relative to all other genes. The first quantity is similar to scores computed from all other single sample scoring methods. We determined a two component score to be a more appropriate and accurate representation of the distribution of ranks of signature genes. The default and recommended measure of dispersion is the median absolute deviation (MAD) due to its non-parametric property. Other appropriate measure of dispersion could be the inter-quartile range (IQR) and can be used by passing the IQR function as an argument to the dispersionFun argument.

head(verhaak_scores)
##                              TotalScore TotalDispersion    UpScore
## TCGA-AB-2921-03A-01T-0740-13  0.1703684        4799.176  0.1207420
## TCGA-AB-2932-03A-01T-0740-13  0.4294443        2539.694  0.2502181
## TCGA-AB-2971-03A-01T-0734-13  0.3347857        3778.406  0.2187510
## TCGA-AB-2987-03A-01T-0734-13  0.3310311        3266.168  0.2491866
## TCGA-AB-2928-03A-01T-0740-13 -0.1317468        5916.315 -0.1301935
## TCGA-AB-2884-03A-01T-0735-13  0.3299788        3748.754  0.2202622
##                              UpDispersion    DownScore DownDispersion
## TCGA-AB-2921-03A-01T-0740-13     4517.482  0.049626338       5080.870
## TCGA-AB-2932-03A-01T-0740-13     2145.322  0.179226193       2934.065
## TCGA-AB-2971-03A-01T-0734-13     2727.984  0.116034790       4828.828
## TCGA-AB-2987-03A-01T-0734-13     1942.206  0.081844444       4590.130
## TCGA-AB-2928-03A-01T-0740-13     4960.780 -0.001553308       6871.851
## TCGA-AB-2884-03A-01T-0735-13     3071.947  0.109716528       4425.561

4.3 Diagnostics of the Verhaak signature

The singscore package provides a set of visualisation tools that enable diagnostics of the gene signature. For instance, these tools may be used to determine the importance of each component for a bidirectional signature (up- and down-regulated gene sets) to the total score, determine the importance of each gene of a signature in discriminating between the classes of interest, and to investigate the relationship between the final score and the dispersion of signature gene ranks. Annotations of interest can be overlayed on each plot. Singscore supports both continuous and categorical annotations, which can either be input as a vector, or as a string specifying a column within the score data frame. We begin by investigating the relationship between the score and dispersion of ranks for the up-regulated gene signature, down-regulated gene signature and the full signature. The plotDispersion functions generates a diagnostic plot with annotations overlaid. Annotations can be discrete or continuous, and can be passed as independent variables, or as a column name when the data is appended to the score data frame. It should be noted that all plotting functions in singscore can be made interactive by setting the isInteractive argument to TRUE.

#relative size of text in the figure
relSize = 1.2

#create annotation
mutated_gene = rep('Other', ncol(aml_se))
mutated_gene[aml_se$NPM1c.Mut] = 'NPM1c Mut'
mutated_gene[aml_se$KMT2A.Fusion | aml_se$KMT2A.PTD] = 'MLL Fusion/PTD'
p1 = plotDispersion(verhaak_scores, annot = mutated_gene, textSize = relSize)
p1