Gene Ontologies (GO) are often used to guide the interpretation of high-throughput omics experiments, with lists of differentially regulated genes being summarized into sets of genes with a common functional representation. Due to the hierarchical nature of Gene Ontologies, the resulting lists of enriched sets are usually redundant and difficult to interpret.
rrvgo aims to reduce the redundancy of GO sets by grouping similar terms
based on their semantic similarity. It also provides some plots to help with
interpreting the summarized terms.
This software is heavily influenced by REVIGO. It mimics
a good part of its core functionality, and even some of the outputs are similar.
Without aiming to compete, rrvgo tries to offer a programmatic interface using
available annotation databases and semantic similarity methods implemented in the
rrvgo does not care about genes, but about GO terms. The input is a vector of enriched
GO terms, along with (recommended, but not mandatory) a vector of scores. If scores
are not provided, rrvgo can take the GO term (set) size as a score, thus favoring
broader terms.
The first step is to compute the similarity matrix between terms. The
calculateSimMatrix function takes a list of GO terms for which the semantic
similarity is to be calculated, an OrgDb object for an organism, the ontology of
interest and the method used to calculate the similarity scores.
    library(rrvgo)
    go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))
    simMatrix <- calculateSimMatrix(go_analysis$ID,
                                    orgdb="org.Hs.eg.db",
                                    ont="BP",
                                    method="Rel")
The semdata parameter (see ?calculateSimMatrix) is not mandatory, as it is
calculated on demand. If the function needs to run several times with the same
organism, it’s advisable to save the GOSemSim::godata(orgdb, ont=ont) object
in order to reuse it between calls and speed up the calculation of the similarity
matrix.
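As a sketch of the reuse described above, the semantic data can be built once and passed to several calls through the semdata argument. The variable names (semdata, simMatrixRel, simMatrixLin) are illustrative, and the code assumes that rrvgo, GOSemSim and org.Hs.eg.db are installed.

```r
library(rrvgo)

go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))

# Build the semantic data once; this is the part worth caching.
semdata <- GOSemSim::godata(org.Hs.eg.db::org.Hs.eg.db, ont="BP")

# Reuse the same object in every call for this organism and ontology.
simMatrixRel <- calculateSimMatrix(go_analysis$ID,
                                   orgdb="org.Hs.eg.db",
                                   ont="BP",
                                   method="Rel",
                                   semdata=semdata)
simMatrixLin <- calculateSimMatrix(go_analysis$ID,
                                   orgdb="org.Hs.eg.db",
                                   ont="BP",
                                   method="Lin",
                                   semdata=semdata)
```

The semdata object is specific to one organism and one ontology, so a separate object would be needed for, say, ont="MF".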
From the similarity matrix one can group terms based on similarity.
rrvgo provides the reduceSimMatrix function for that. It takes as arguments i) the
similarity matrix, ii) an optional named vector of scores associated to each
GO term, iii) a similarity threshold used for grouping terms, and iv) an orgdb
object.
    scores <- setNames(-log10(go_analysis$qvalue), go_analysis$ID)
    reducedTerms <- reduceSimMatrix(simMatrix,
                                    scores,
                                    threshold=0.7,
                                    orgdb="org.Hs.eg.db")
reduceSimMatrix groups terms according to the similarity threshold, and selects
as the group representative the term with the highest score within the group.
If the vector of scores is not available, reduceSimMatrix can use either the
uniqueness of a term (default) or the GO term size. In the case of size,
rrvgo will fetch the GO term size from the OrgDb object and use it as the
score, thus favoring broader terms.
Please note that scores are interpreted so that higher is better; therefore,
if you use p-values as scores, minus log-transform them first.
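To illustrate the score direction, the following minimal sketch builds a named score vector from adjusted p-values. The data frame, its GO IDs and its q-values are made-up placeholders, not results from the example data.

```r
# Hypothetical enrichment results: GO IDs with adjusted p-values (made up).
enrichment <- data.frame(
  ID     = c("GO:0006915", "GO:0008219", "GO:0007049"),
  qvalue = c(1e-8, 1e-3, 0.04),
  stringsAsFactors = FALSE
)

# Smaller q-values are more significant, so -log10 makes higher = better.
scores <- setNames(-log10(enrichment$qvalue), enrichment$ID)

# The most significant term now carries the highest score.
best <- names(scores)[which.max(scores)]
```

Passing raw p-values instead would invert the ranking, so the least significant term in each group would be picked as representative.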
rrvgo uses the similarity between pairs of terms to compute a distance
matrix, defined as (1-simMatrix). The terms are then hierarchically clustered
using complete linkage, and the tree is cut at the desired threshold, picking
the term with the highest score as the representative of each group.
Therefore, higher thresholds lead to fewer groups, and the threshold governs how readily terms are merged into the same group, rather than setting a minimum similarity between group representatives.
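The grouping procedure described above can be sketched in base R on a toy example. This is not rrvgo's actual implementation: the similarity values, term names and scores are made up, and cutting the distance tree at a height equal to the threshold is an assumption based on the description above.

```r
# Toy symmetric similarity matrix for four hypothetical terms (made up).
terms <- c("termA", "termB", "termC", "termD")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(terms, terms))

# Higher score = better (e.g. -log10 of a q-value); values are made up.
scores <- c(termA = 2, termB = 5, termC = 4, termD = 1)

# Distance = 1 - similarity, clustered with complete linkage.
tree <- hclust(as.dist(1 - sim), method = "complete")

# Cut the tree at the chosen height to obtain the groups.
threshold <- 0.7
groups <- cutree(tree, h = threshold)

# Within each group, the term with the highest score is the representative.
representative <- tapply(names(scores), groups[names(scores)],
                         function(ids) ids[which.max(scores[ids])])

# Number of groups at a lower and at a higher cut height.
nGroupsLow  <- length(unique(cutree(tree, h = 0.15)))
nGroupsHigh <- length(unique(cutree(tree, h = 0.95)))
```

In this toy case the cut at 0.7 yields two groups, represented by termB and termC. Lowering the cut to 0.15 gives three groups and raising it to 0.95 gives one, which matches the statement that higher thresholds lead to fewer groups.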
rrvgo provides several methods for plotting and interpreting the results.
Plot the similarity matrix as a heatmap, with clustering of columns and rows turned on by default (thus arranging similar terms together).
heatmapPlot(simMatrix, reducedTerms, annotateParent=TRUE, annotationLabel="parentTerm", fontsize=6)
The function internally uses
and further parameters can be passed to this function.
Plot GO terms as scattered points. Distances between points represent the similarity between terms, and the axes are the first 2 components obtained by applying a PCoA to the (dis)similarity matrix. The size of each point represents the provided score or, in the absence of scores, the number of genes the GO term contains.
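The coordinates behind such a scatter plot can be sketched with classical multidimensional scaling (PCoA) in base R. This uses stats::cmdscale on a made-up similarity matrix and is only an illustration of the idea, not necessarily how rrvgo computes its layout.

```r
# Toy symmetric similarity matrix for four hypothetical terms (made up).
terms <- c("termA", "termB", "termC", "termD")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(terms, terms))

# PCoA on the dissimilarity (1 - similarity), keeping the first 2 components.
coords <- cmdscale(as.dist(1 - sim), k = 2)
colnames(coords) <- c("PC1", "PC2")

# Distances between the embedded points, for comparison with 1 - sim.
embedded <- as.matrix(dist(coords))
```

Similar terms (termA and termB here) end up close together in the two-dimensional layout, while dissimilar ones are placed far apart. The coordinates can then be drawn with any plotting function, with point size mapped to the score.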