1 Introduction

In cancer studies, transcriptional signatures are investigated because they can capture tumor activities as they occur, and they are considered potentially useful for guiding therapeutic decisions and monitoring interventions. Signatures derived from bulk RNA-seq experiments are also used to assess the complex relationships between the tumor and its microenvironment.

Transcriptional signatures are based on the expression of a specific gene set and are summarized in a score designed to provide single-sample predictions. They usually consist of a list of genes and an algorithm that uses the gene expression values, and possibly a set of coefficients that weight each gene's contribution, to compute a single-sample prediction score.
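As a minimal sketch of the "gene list plus coefficients" idea, the base R snippet below computes a weighted sum per sample. The gene names, coefficients and expression values are invented for illustration, and the weighted sum is only one of several scoring schemes that published signatures use.

```r
# Toy expression matrix: genes in rows, samples in columns (made-up values).
expr <- matrix(
    c(5.1, 6.3, 4.8,
      2.0, 1.5, 3.2,
      7.4, 7.9, 6.6,
      0.9, 1.1, 0.7),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                    c("sample1", "sample2", "sample3"))
)

# Hypothetical signature: a gene list with one coefficient per gene.
toy_signature <- c(GENE_A = 0.5, GENE_C = -0.2, GENE_D = 1.0)

# Weighted sum of the signature genes, giving one score per sample.
weighted_score <- function(expr, coefs) {
    genes <- intersect(names(coefs), rownames(expr))
    # Each row (gene) is multiplied by its own coefficient, then summed by column.
    colSums(expr[genes, , drop = FALSE] * coefs[genes])
}

scores <- weighted_score(expr, toy_signature)
scores
```

The result is a named numeric vector with one value per sample, which is the shape of output a single-sample signature score takes.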

Signatures reflect cancer activities in patients and can be used for patient stratification. The combined analysis of multiple signatures may reveal correlations between different tumor processes and allow patients to be stratified using a broader range of information. However, despite much evidence that computational implementations improve data reproducibility, applicability and dissemination, the vast majority of signatures are not published along with their computational code, and only a few have been implemented in software. Notable exceptions are the R package consensusOV, dedicated to the TCGA ovarian cancer signature, and the R package genefu, which hosts some of the most popular breast cancer signatures.

signifinder has been developed to provide easy and fast computation of several published signatures. Because it is compatible with Bioconductor data structures and procedures, signifinder can be used downstream of the most popular expression data analysis packages to complement their results and improve data interpretation.

Several visualization functions are implemented to display the scores obtained from signatures. These help in interpreting the results: users can browse single signatures independently and also compare them with each other.

2 Installation

To install this package:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("signifinder")

3 Criteria for gene expression signature inclusion

Stringent criteria were established for the inclusion of signatures: (i) signatures are all based on a cancer topic, and were developed and used on cancer samples; (ii) all signatures include the gene list and the method to calculate an expression-based score; (iii) signatures are based exclusively on transcriptomic data (exceptions have been made for combinations of transcriptomic data with survival or histopathological data); (iv) all signatures have been developed for bulk tumor expression experiments. Additionally, the included signatures clearly state the method, the type of input data and the set of genes considered. Signature genes have an official gene symbol (HUGO consortium) or an unambiguous mapping to one. Genes without an official gene symbol were removed, and signatures in which more than 5% of the gene names could not be translated were not included.
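The 5% rule on untranslatable gene names can be illustrated with a few lines of base R. This is only a sketch of the arithmetic: the gene names and the list of official symbols are placeholders, and it is not the curation procedure the authors actually ran.

```r
# Hypothetical signature gene list; "LOC_UNKNOWN" stands in for a name
# that cannot be mapped to an official gene symbol.
sig_genes <- c(paste0("GENE", 1:24), "LOC_UNKNOWN")

# Placeholder for a reference list of official gene symbols.
official_symbols <- paste0("GENE", 1:50)

is_official <- sig_genes %in% official_symbols

# Fraction of gene names that could not be translated (1 of 25 here).
untranslatable_fraction <- mean(!is_official)

# Genes without an official symbol are dropped from the signature.
kept_genes <- sig_genes[is_official]

# The signature is excluded only if more than 5% of names are untranslatable.
include_signature <- untranslatable_fraction <= 0.05
include_signature
```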

4 How to use signifinder R package

4.1 The Input Expression Data

The input expression dataset must contain normalized RNA-seq counts (or a normalized data matrix from microarrays) and can be provided as a matrix, a data frame or a SummarizedExperiment. Regardless of the input type, the output is a SummarizedExperiment with the computed signature scores added to the colData section.

Gene lists of signatures reported in the literature are typically given as gene symbols, but signifinder can use gene symbols, NCBI Entrez IDs or Ensembl gene IDs. Users specify which of the three identifiers their data use (SYMBOL, ENTREZID or ENSEMBL) through the nametype argument of the signature functions, and the package converts the signature gene lists to match. When a signature is computed, a message reports the percentage of the original signature genes that were used in the calculation. There is no minimum number of genes required for a signature to be computed, but a warning is given if fewer than 30% of the signature genes are available. After a signature has been calculated, the expression of its genes can be inspected visually using geneHeatmapSignPlot (see Gene Expression Heatmap).
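The coverage message and the 30% warning described above can be mimicked in base R as follows. The function name signature_coverage, its arguments and its message text are hypothetical and written for this illustration; they do not reproduce the internal code of signifinder.

```r
# Hypothetical helper: report which fraction of a signature's genes is
# present in the dataset, and warn below a chosen threshold.
signature_coverage <- function(signature_genes, dataset_genes,
                               min_fraction = 0.30) {
    found <- intersect(signature_genes, dataset_genes)
    fraction <- length(found) / length(signature_genes)
    message(sprintf("toy signature is using %.0f%% of signature genes",
                    100 * fraction))
    if (fraction < min_fraction) {
        warning("fewer than 30% of the signature genes are available")
    }
    invisible(list(found = found, fraction = fraction))
}

# Two of the four signature genes are present in the dataset.
cov <- signature_coverage(c("A", "B", "C", "D"), c("A", "B", "X"))
cov$fraction
```

Note that the score is still computed when coverage is low; the warning only signals that the result rests on a small part of the original gene list.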

Furthermore, the original works that propose the signatures also specify the type of expression value (e.g. normalized value, TPM (transcripts per million), log(TPM), etc.) that should be used to compute the signature. Therefore, during signature computation, the data should be converted where necessary, as reported in the original work. When using signifinder, users must supply the input data as normalized counts (or normalized arrays) and, for the signatures that require it, a data transformation step is performed automatically. The transformed data matrix is included in the output as an additional assay named after the conversion (i.e. “TPM”, “CPM” or “FPKM”). Alternatively, if the input is a SummarizedExperiment object that already contains, in addition to the normalized counts, an assay of transformed data, that assay is used directly; note that to be used it must be called “TPM”, “CPM” or “FPKM”. It is also crucial to specify the type of data used, “microarray” or “rnaseq”, through the inputType argument of the signature functions. Finally, the included signatures have been developed from both array and RNA-seq data. In signifinder, signatures developed for microarray can be applied to RNA-seq data, but not vice versa, because of the input type conversions.
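To make the idea of a transformation step concrete, the sketch below derives counts per million (CPM) and a log2(count + 1) matrix from a toy count matrix in base R. The values are invented, and the way signifinder performs its own conversions may differ from this textbook calculation.

```r
# Toy normalized count matrix: genes in rows, samples in columns.
counts <- matrix(
    c(10, 0, 90,
      200, 300, 500),
    nrow = 3,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                    c("sample1", "sample2"))
)

# CPM: rescale each sample (column) so that its values sum to one million.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Log transformation with a pseudocount of 1 to avoid log2(0).
log_counts <- log2(counts + 1)

cpm
```

TPM and FPKM additionally require gene lengths, which is why they are not shown in this short example.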

4.2 Computation of the Signatures

In the following we use an example expression dataset of ovarian cancer to show how to use signifinder with a standard workflow.

# loading packages
library(SummarizedExperiment)
library(signifinder)
library(dplyr)
data(ovse)
ovse
## class: SummarizedExperiment 
## dim: 1456 40 
## metadata(0):
## assays(4): norm_expr TPM CPM FPKM
## rownames(1456): ACOT7 ADORA3 ... TMSB4Y USP9Y
## rowData names(0):
## colnames(40): sample1 sample2 ... sample39 sample40
## colData names(40): OV_subtype os ... DNArep_Kang IPSOV_Shen

We can check all the signatures available in the package with the function availableSignatures.

availSigns <- availableSignatures()

The function returns a data frame with all the signatures included in the package and for each signature the following information:

  • signature: name of the signature
  • scoreLabel: label of the signature when computed and inserted inside results
  • functionName: name of the function to use to compute the signature
  • topic: general cancer topic
  • tumor: tumor type for which the signature was developed
  • tissue: tumor tissue for which the signature was developed
  • requiredInput: tumor data with which the signature was developed
  • transformationStep: data transformation step performed inside the function starting from the user’s ‘normArray’ or ‘normCounts’ data
  • author: first author of the work in which the signature is described
  • reference: reference of the work
  • description: brief description of the signature and how to evaluate its score

knitr::kable(t(availSigns[1,]), caption = 'Fields of one signature')

Table 1: Fields of one signature
signature EMT_Miow
scoreLabel EMT_Miow_Epithelial, EMT_Miow_Mesenchymal
functionName EMTSign
topic epithelial to mesenchymal
tumor ovarian cancer
tissue ovary
requiredInput microarray, rnaseq
transformationStep normArray, normCounts
author Miow
reference Miow Q. et al. Oncogene (2015)
description Double score obtained with ssGSEA to establish the epithelial- and the mesenchymal-like status in ovarian cancer patients.

We can also interrogate the table asking which signatures are available for a specific tissue (e.g. ovary).

ovary_signatures <- availableSignatures(tissue = "ovary", 
                                        description = FALSE)
knitr::kable(ovary_signatures, 
             caption = 'Signatures developed for ovary.') %>% 
    kableExtra::kable_paper() %>% 
    kableExtra::scroll_box(width = "82%", height = "500px")

Table 2: Signatures developed for ovary.
row | signature | scoreLabel | functionName | topic | tumor | tissue | requiredInput | transformationStep | author | reference
1 | EMT_Miow | EMT_Miow_Epithelial, EMT_Miow_Mesenchymal | EMTSign | epithelial to mesenchymal | ovarian cancer | ovary | microarray, rnaseq | normArray, normCounts | Miow | Miow Q. et al. Oncogene (2015)
4 | Pyroptosis_Ye | Pyroptosis_Ye | pyroptosisSign | pyroptosis | ovarian cancer | ovary | rnaseq | FPKM | Ye | Ye Y. et al. Cell Death Discov. (2021)
8 | Ferroptosis_Ye | Ferroptosis_Ye | ferroptosisSign | ferroptosis | ovarian cancer | ovary | microarray, rnaseq | normArray, FPKM | Ye | Ye Y. et al. Front. Mol. Biosci. (2021)
12 | LipidMetabolism_Zheng | LipidMetabolism_Zheng | lipidMetabolismSign | metabolism | epithelial ovarian cancer | ovary | rnaseq | normCounts | Zheng | Zheng M. et al. Int. J. Mol. Sci. (2020)
14 | ImmunoScore_Hao | ImmunoScore_Hao | immunoScoreSign | immune system | epithelial ovarian cancer | ovary | microarray, rnaseq | normArray, log2(FPKM+0.01) | Hao | Hao D. et al. Clin Cancer Res (2018)
16 | ConsensusOV_Chen | ConsensusOV_Chen_IMR, ConsensusOV_Chen_DIF, ConsensusOV_Chen_PRO, ConsensusOV_Chen_MES | consensusOVSign | ovarian subtypes | high-grade serous ovarian carcinoma | ovary | microarray, rnaseq | normArray, normCounts | Chen | Chen G.M. et al. Clin Cancer Res (2018)
18 | Matrisome_Yuzhalin | Matrisome_Yuzhalin | matrisomeSign | extracellular matrix | ovarian cystadenocarcinoma, gastric adenocarcinoma, colorectal adenocarcinoma, lung adenocarcinoma | ovary, lung, stomach, colon | microarray, rnaseq | normArray, normCounts | Yuzhalin | Yuzhalin A. et al. Br J Cancer (2018)
43 | HRDS_Lu | HRDS_Lu | HRDSSign | chromosomal instability | ovarian cancer, breast cancer | ovary, breast | microarray, rnaseq | normArray, normCounts | Lu | Lu J. et al. J Mol Med (2014)
45 | DNArep_Kang | DNArep_Kang | DNArepSign | chromosomal instability | serous ovarian cystadenocarcinoma | ovary | microarray, rnaseq | normArray, log2(normCount+1) | Kang | Kang J. et al. JNCI (2012)
46 | IPSOV_Shen | IPSOV_Shen | IPSOVSign | immune system | ovarian cancer | ovary | microarray, rnaseq | normArray, log2(normCount+1) | Shen | Shen S. et al. EBiomed (2019)
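Because the signature table is an ordinary data frame, it can also be filtered with standard R subsetting. The sketch below rebuilds a few rows of Table 2 by hand (so that it runs without the package) and keeps the signatures whose requiredInput includes microarray data; the object names toy_table and array_ok are illustrative.

```r
# A hand-copied subset of Table 2, limited to three columns.
toy_table <- data.frame(
    signature = c("EMT_Miow", "Pyroptosis_Ye",
                  "LipidMetabolism_Zheng", "HRDS_Lu"),
    functionName = c("EMTSign", "pyroptosisSign",
                     "lipidMetabolismSign", "HRDSSign"),
    requiredInput = c("microarray, rnaseq", "rnaseq",
                      "rnaseq", "microarray, rnaseq"),
    stringsAsFactors = FALSE
)

# Keep the signatures that list "microarray" among their accepted inputs.
array_ok <- toy_table[grepl("microarray", toy_table$requiredInput,
                            fixed = TRUE), ]
array_ok$signature
```

The same pattern applies to any other column of the table, for example topic or author.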

Once we have found a signature of interest, we can compute it with the corresponding function (indicated in the functionName field of the availableSignatures table). All the signature functions require the expression data and the type of input data (inputType equal to “rnaseq” or “microarray”). The data are expected to be normalized expression values in the form of a data frame or a matrix with genes in rows and samples in columns. Alternatively, the input can be a SummarizedExperiment object containing an assay called ‘norm_expr’ in which rows correspond to genes and columns correspond to samples.

ovse <- ferroptosisSign(dataset = ovse,
                        inputType = "rnaseq")
## ferroptosisSignYe is using 100% of signature genes

Signatures are often grouped in the same function by cancer topic, even if they deal with different cancer types and computation approaches. We can unambiguously choose the one we are interested in by giving the first author of the signature (indicated in the author field of the availableSignatures table). For example, there are currently three different epithelial to mesenchymal transition (EMT) signatures implemented inside the EMTSign function (“Miow”, “Mak” or “Cheng”). We can choose which one to compute by setting the author argument:

ovse <- EMTSign(dataset = ovse,
                inputType = "rnaseq",
                author = "Miow")
## EMTSignMiow is using 96% of epithelial signature genes
## EMTSignMiow is using 91% of mesenchymal signature genes
## Warning in .filterFeatures(expr, method): 1 genes with constant expression
## values throuhgout the samples.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."

In this way, “EMT_Miow” is computed. Regardless of the expression input type, the output of all the signature functions is a SummarizedExperiment with the original expression data in the assay and the computed signature scores in the colData. Thus, the returned object can be passed as input to another signature function, which returns it with the new signature added to the colData.

We can also compute multiple signatures at once with the function multipleSign. If we supply only the expression dataset and the input type, all the signatures are computed. Otherwise, we can select a subgroup of signatures with the arguments tissue, tumor and/or topic, each of which narrows the signature list further. Alternatively, we can name exactly the signatures we want with the whichSign argument. For example, below we compute all the available signatures for ovary and pan-tissue:

ovse <- multipleSign(dataset = ovse, 
                     inputType = "rnaseq",
                     tissue = c("ovary", "pan-tissue"))
## EMTSignMiow is using 96% of epithelial signature genes
## EMTSignMiow is using 91% of mesenchymal signature genes
## Warning in .filterFeatures(expr, method): 1 genes with constant expression
## values throuhgout the samples.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
## EMTSignMak is using 96% of epithelial signature genes
## EMTSignMak is using 100% of mesenchymal signature genes
## pyroptosisSignYe is using 86% of signature genes
## ferroptosisSignYe is using 100% of signature genes
## lipidMetabolismSign is using 100% of signature genes
## hypoxiaSign is using 92% of signature genes
## immunoScoreSignHao is using 100% of signature genes
## immunoScoreSignRoh is using 100% of signature genes
## 'select()' returned 1:1 mapping between keys and columns
## Loading training data
## Training Random Forest...
## IPSSign is using 98% of signature genes
## matrisomeSign is using 100% of signature genes
## mitoticIndexSign is using 100% of signature genes
## ImmuneCytSignRooney is using 100% of signature genes
## IFNSign is using 100% of signature genes
## expandedImmuneSign is using 100% of signature genes
## TinflamSign is using 100% of signature genes
## CINSign is using 96% of signature genes
## CINSign is using 94% of signature genes
## cellCycleSignLundberg is using 93% of signature genes
## cellCycleSignDavoli is using 100% of signature genes
## ASCSign is using 92% of signature genes
## ImmuneCytSignDavoli is using 100% of signature genes
## ChemokineSign is using 100% of signature genes
## ECMSign is using 100% of up signature genes
## ECMSign is using 93% of down signature genes
## Warning in .filterFeatures(expr, method): 1 genes with constant expression
## values throuhgout the samples.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."
## HRDSSign is using 89% of signature genes
## VEGFSign is using 100% of signature genes
## DNArepSign is using 87% of signature genes
## IPSOVSign is using 100% of signature genes
## Warning in .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq,
## abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.
## [1] "Calculating ranks..."
## [1] "Calculating absolute values from ranks..."

4.3 Visualization of the Signatures

4.3.1 The Signature Distribution Plot

Each computed signature can be explored with the oneSignPlot function, which shows both the scores and their density distribution.

oneSignPlot(data = ovse, 
            whichSign = "Hypoxia_Buffa")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
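For readers who want the same two views outside the package, the base R sketch below ranks a vector of scores and estimates their density. The scores are simulated, and this is not a description of how oneSignPlot builds its figure.

```r
# Simulated signature scores for 40 samples (illustrative values only).
set.seed(42)
scores <- setNames(rnorm(40), paste0("sample", 1:40))

# View 1: scores ordered from lowest to highest.
ranked <- sort(scores)

# View 2: kernel density estimate of the score distribution.
dens <- density(scores)

# Both objects can be drawn with base graphics, for example:
# plot(ranked, ylab = "score"); plot(dens)
head(ranked, 3)
```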

4.3.2 The Gene Expression Heatmap

Users may also be interested in visually exploring the expression values of the genes involved in a signature. In this case, we can use geneHeatmapSignPlot, which generates a heatmap of the expression values with genes in rows and samples in columns. The function is not restricted to a single signature: we can plot the expression values of the genes of multiple signatures and, at the same time, evaluate the intersections between their gene lists.

geneHeatmapSignPlot(data = ovse, 
                    whichSign = "LipidMetabolism_Zheng", 
                    logCount = TRUE)

geneHeatmapSignPlot(data = ovse, 
                    whichSign = c("IFN_Ayers", "Tinflam_Ayers"), 
                    logCount = TRUE)
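The gene list intersections mentioned above can also be checked directly. The following base R sketch uses two invented gene lists; sigA, sigB and the gene names are placeholders, not the actual IFN_Ayers or Tinflam_Ayers gene sets.

```r
# Two hypothetical signature gene lists.
sig_lists <- list(
    sigA = c("G1", "G2", "G3", "G4"),
    sigB = c("G3", "G4", "G5")
)

# Genes shared by all the signatures in the list.
shared <- Reduce(intersect, sig_lists)

# Membership matrix: one row per gene, one logical column per signature.
all_genes <- unique(unlist(sig_lists))
membership <- sapply(sig_lists, function(genes) all_genes %in% genes)
rownames(membership) <- all_genes

shared
membership
```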

4.3.3 The Correlation Plot

To investigate the relationships among multiple signatures, signifinder provides the function correlationSignPlot, which shows the pairwise correlations between signatures. The whichSign argument can be set to specify which signatures should be plotted; when it is not given, all the signatures in the SummarizedExperiment are used. Green-blue colors represent anticorrelations, while the orange-red scale is for positive correlations. Signatures are then clustered so that the most closely related ones are grouped together.

sign_cor <- correlationSignPlot(data = ovse)
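The underlying calculation can be sketched in base R with simulated scores: a pairwise correlation matrix followed by hierarchical clustering. The column names are invented, and the use of 1 - correlation as a distance with the default hclust linkage is an assumption made for this example, not necessarily what correlationSignPlot does internally.

```r
# Simulated scores for three hypothetical signatures over 40 samples.
set.seed(1)
n <- 40
base <- rnorm(n)
scores <- data.frame(
    sig_pos  = base + rnorm(n, sd = 0.1),
    sig_pos2 = base + rnorm(n, sd = 0.1),
    sig_neg  = -base + rnorm(n, sd = 0.1)
)

# Pairwise Pearson correlations between signatures.
cor_mat <- cor(scores)

# Cluster signatures, treating 1 - correlation as a distance (assumption).
hc <- hclust(as.dist(1 - cor_mat))

round(cor_mat, 2)
hc$labels[hc$order]
```

With this distance, the two positively correlated signatures are merged first and the anticorrelated one joins last.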