GREAT (Genomic Regions Enrichment of Annotations Tool) is a popular web-based tool for associating biological functions with genomic regions. However, being an online tool brings several limitations:

  1. A limited number of supported organisms. The current version 4.0.4 supports only four genome assemblies of two organisms (human and mouse): “hg38”, “hg19”, “mm10” and “mm9”.
  2. A limited number of ontologies (or gene set collections). The current version 4.0.4 supports only seven ontologies (https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655442/Version+History).
  3. A maximum of 500,000 test regions (https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655402/File+Size).
  4. Updates of the annotation databases are controlled solely by the GREAT developers.

Since version 1.99.0, rGREAT implements the GREAT algorithms locally and integrates seamlessly with the Bioconductor annotation ecosystem. In theory, this makes it possible to perform GREAT analysis with any organism and with any type of gene set collection or ontology. Another advantage is that Bioconductor annotation packages are well maintained and regularly updated, so the data sources of your analysis can be kept up to date.

First let’s load the rGREAT package and generate a random set of regions:

library(rGREAT)

set.seed(123)
gr = randomRegions(nr = 1000, genome = "hg19")

Perform local GREAT with great()

The function great() is the core function to perform local GREAT analysis. You can either use built-in annotations or use self-provided annotations.

With annotation packages supported by default

great() integrates many annotation databases covering many organisms. Users simply specify the name of the gene set collection (via the second argument) and the source of TSS (via the third argument).

res = great(gr, "MSigDB:H", "txdb:hg19")
res
## 916 regions are associated to 1400 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: MSigDB:H
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

The following gene set collections are supported. The first category is GO gene sets, with data taken from the GO.db package.

  • "GO:BP": Biological Process.
  • "GO:CC": Cellular Component.
  • "GO:MF": Molecular Function.

The prefix GO: can be omitted when it is specified in great().

The second category of gene sets is from MSigDB. Note these are only available for human:

  • "msigdb:H" Hallmark gene sets.
  • "msigdb:C1" Positional gene sets.
  • "msigdb:C2" Curated gene sets.
  • "msigdb:C2:CGP" C2 subcategory: chemical and genetic perturbations gene sets.
  • "msigdb:C2:CP" C2 subcategory: canonical pathways gene sets.
  • "msigdb:C2:CP:BIOCARTA" C2 subcategory: BioCarta subset of CP.
  • "msigdb:C2:CP:KEGG" C2 subcategory: KEGG subset of CP.
  • "msigdb:C2:CP:PID" C2 subcategory: PID subset of CP.
  • "msigdb:C2:CP:REACTOME" C2 subcategory: REACTOME subset of CP.
  • "msigdb:C2:CP:WIKIPATHWAYS" C2 subcategory: WIKIPATHWAYS subset of CP.
  • "msigdb:C3" Regulatory target gene sets.
  • "msigdb:C3:MIR:MIRDB" miRDB of microRNA targets gene sets.
  • "msigdb:C3:MIR:MIR_LEGACY" MIR_Legacy of MIRDB.
  • "msigdb:C3:TFT:GTRD" GTRD transcription factor targets gene sets.
  • "msigdb:C3:TFT:TFT_LEGACY" TFT_Legacy.
  • "msigdb:C4" Computational gene sets.
  • "msigdb:C4:CGN" C4 subcategory: cancer gene neighborhoods gene sets.
  • "msigdb:C4:CM" C4 subcategory: cancer modules gene sets.
  • "msigdb:C5" Ontology gene sets.
  • "msigdb:C5:GO:BP" C5 subcategory: BP subset.
  • "msigdb:C5:GO:CC" C5 subcategory: CC subset.
  • "msigdb:C5:GO:MF" C5 subcategory: MF subset.
  • "msigdb:C5:HPO" C5 subcategory: human phenotype ontology gene sets.
  • "msigdb:C6" Oncogenic signature gene sets.
  • "msigdb:C7" Immunologic signature gene sets.
  • "msigdb:C7:IMMUNESIGDB" ImmuneSigDB subset of C7.
  • "msigdb:C7:VAX" C7 subcategory: vaccine response gene sets.
  • "msigdb:C8" Cell type signature gene sets.

The prefix msigdb: can be omitted when specified in great(), and the names of MSigDB collections are case-insensitive.

rGREAT supports TSS from several sources. The value of argument tss_source should be encoded in a special format:

  • Name of a TxDb.* package, e.g. TxDb.Hsapiens.UCSC.hg19.knownGene. Supported packages are in rGREAT:::BIOC_ANNO_PKGS$txdb.
  • Genome version of the organism, e.g. “hg19”. Then the corresponding TxDb package will be used.
  • In a format of RefSeq:$genome where $genome is the genome version of an organism. RefSeq genes will be used.
  • In a format of RefSeqCurated:$genome where $genome is the genome version of an organism. RefSeqCurated subset will be used.
  • In a format of RefSeqSelect:$genome where $genome is the genome version of an organism. RefSeqSelect subset will be used.
  • In a format of Gencode_v$version where $version is Gencode version, such as 19 (for human) or M21 for mouse. Gencode protein coding genes will be used.
  • In a format of GREAT:$genome, where $genome can only be “mm9”, “mm10”, “hg19”, “hg38”. TSS from GREAT will be used. The data is downloaded from https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655445/Genes.

The difference between RefSeqCurated and RefSeqSelect is explained at https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite.

Some examples are:

great(gr, "GO:BP", "hg19")
great(gr, "GO:BP", "TxDb.Hsapiens.UCSC.hg19.knownGene")
great(gr, "GO:BP", "RefSeq:hg19")
great(gr, "GO:BP", "GREAT:hg19")
great(gr, "GO:BP", "Gencode_v19")

Manually set gene sets

If users have their own gene sets, they can be provided as a named list of vectors, where each vector corresponds to one gene set. Please note that the genes in the gene sets must be Entrez IDs. In the following example, we use a gene set collection from DSigDB. This collection (FDA approved kinase inhibitors) only contains 28 gene sets.

Here read_gmt() is a simple function that reads a gmt file as a list of vectors and also performs gene ID conversion.

gs = read_gmt(url("http://dsigdb.tanlab.org/Downloads/D2_LINCS.gmt"), 
    from = "SYMBOL", to = "ENTREZ", orgdb = "org.Hs.eg.db")
gs[1:2]
## $GSK429286A
##  [1] "5566"  "9475"  "81788" "6093"  "5562"  "26524" "5592"  "5979"  "640"  
## [10] "5567"  "3717"  "5613"  "23012" "695"   "3718" 
## 
## $`BS-181`
##  [1] "122011" "1452"   "1022"   "7272"   "1454"   "1453"   "65975"  "147746"
##  [9] "1859"   "1195"   "1196"   "9149"   "57396"  "5261"
great(gr, gs, "hg19")
## 916 regions are associated to 1400 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp
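A gene set collection does not have to come from a gmt file. As a minimal sketch, the same structure can be written by hand in base R. The set names set_A and set_B are hypothetical; the IDs are simply copied from the output of gs[1:2] above.

```r
# Hand-built gene set collection: a named list of character vectors of Entrez IDs.
# The names "set_A" and "set_B" are made up for illustration.
gs_manual = list(
    set_A = c("5566", "9475", "81788", "6093", "5562", "26524"),
    set_B = c("122011", "1452", "1022", "7272", "1454", "1453")
)

# Entrez IDs are best stored as character strings; also drop duplicates within each set
gs_manual = lapply(gs_manual, function(x) unique(as.character(x)))

# number of genes per set
lengths(gs_manual)
```

Such a list has the same form as gs, so it should be usable in the same place as the second argument of great().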

Manually set TSS

Users may have their own set of genes/TSS, as in the following example:

df = read.table(url("https://jokergoo.github.io/rGREAT_suppl/data/GREATv4.genes.hg19.tsv"))
# note there must be a 'gene_id' column
tss = GRanges(seqnames = df[, 2], ranges = IRanges(df[, 3], df[, 3]), 
    strand = df[, 4], gene_id = df[, 5])
head(tss)
## GRanges object with 6 ranges and 1 metadata column:
##       seqnames    ranges strand |     gene_id
##          <Rle> <IRanges>  <Rle> | <character>
##   [1]     chr1     69090      + |       OR4F5
##   [2]     chr1    367639      + |      OR4F29
##   [3]     chr1    622053      - |      OR4F16
##   [4]     chr1    861117      + |      SAMD11
##   [5]     chr1    894670      - |       NOC2L
##   [6]     chr1    895966      + |      KLHL17
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

In this case, users must manually generate an “extended TSS” with the extendTSS() function. They should also explicitly specify the gene ID type in extendTSS() so that great() can correctly map to the genes in gene sets.

In the example, IDs for genes in tss are symbols, thus, gene_id_type must be set to "SYMBOL" so that the correct gene ID type will be selected for internal gene sets.

et = extendTSS(tss, genome = "hg19", gene_id_type = "SYMBOL")
great(gr, "msigdb:h", extended_tss = et)
## 916 regions are associated to 1405 genes' extended TSSs.
##   TSS source: self-provided
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: msigdb:h
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

If the gene ID type in tss is Ensembl, RefSeq or Entrez ID, the gene_id_type argument can be omitted because the ID type can be inferred automatically from the format of the gene IDs. It is nevertheless a good idea to specify it explicitly when the data is self-provided.

Manually set gene sets and transcriptome annotations

If your organism is not supported by default, you can always use extendTSS() to construct the extended TSS manually. Note that in the GREAT algorithm, TSSs are first extended by a rule (e.g. basal plus extension). extendTSS() accepts a GRanges object of genes or TSSs and returns a new GRanges object of extended TSSs.
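To make the basal-plus-extension rule more concrete, here is a toy sketch in base R on a single chromosome. It is not rGREAT's implementation: the function name extend_tss_toy, the toy coordinates, and the choice to measure the distal extension from the edges of the basal domain are all assumptions made for illustration, and the exact boundary conventions of extendTSS() may differ.

```r
# Toy basal-plus-extension on one chromosome (illustration only, not rGREAT code).
extend_tss_toy = function(tss, strand, chr_len,
                          upstream = 5000, downstream = 1000, extension = 1000000) {
    o = order(tss)
    tss = tss[o]
    strand = strand[o]

    # basal domain: "upstream" and "downstream" are relative to the gene's strand
    basal_start = pmax(ifelse(strand == "+", tss - upstream, tss - downstream), 1)
    basal_end   = pmin(ifelse(strand == "+", tss + downstream, tss + upstream), chr_len)

    n = length(tss)
    prev_end   = c(1, basal_end[-n])          # basal end of the left neighbour
    next_start = c(chr_len, basal_start[-1])  # basal start of the right neighbour

    # extend by at most `extension` bp, stop at the neighbour's basal domain,
    # and never shrink below the gene's own basal domain
    ext_start = pmin(basal_start, pmax(basal_start - extension, prev_end))
    ext_end   = pmax(basal_end, pmin(basal_end + extension, next_start))

    data.frame(tss = tss, strand = strand, start = ext_start, end = ext_end)
}

# three hypothetical genes on a 3 Mb chromosome
toy = extend_tss_toy(tss = c(100000, 200000, 2500000),
                     strand = c("+", "-", "+"),
                     chr_len = 3000000)
toy
```

In this toy case the first two genes are close together, so their extended regions stop at each other's basal domains, while the gap before the third gene is large enough for the 1 Mb limit to apply.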

In the following example, since I do not have data for an organism that rGREAT does not support by default, I simply use human data to demonstrate how to construct the extended TSS manually.

extendTSS() takes two objects: the genes (or the TSSs) and the chromosome lengths. The gene object must have a metadata column named “gene_id”, which stores gene IDs of a specific type (this ID type is matched against the genes in the gene sets). The chromosome length object is a named vector, and it also determines the set of chromosomes used in the analysis.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
gene = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
gene = gene[seqnames(gene) %in% paste0("chr", c(1:22, "X", "Y"))]
head(gene)
## GRanges object with 6 ranges and 1 metadata column:
##             seqnames              ranges strand |     gene_id
##                <Rle>           <IRanges>  <Rle> | <character>
##           1    chr19   58858172-58874214      - |           1
##          10     chr8   18248755-18258723      + |          10
##         100    chr20   43248163-43280376      - |         100
##        1000    chr18   25530930-25757445      - |        1000
##       10000     chr1 243651535-244006886      - |       10000
##   100008586     chrX   49217763-49233491      + |   100008586
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
gl = seqlengths(gene)[paste0("chr", c(1:22, "X", "Y"))]  # restrict to normal chromosomes
head(gl)
##      chr1      chr2      chr3      chr4      chr5      chr6 
## 249250621 243199373 198022430 191154276 180915260 171115067

Simply send gene and gl to extendTSS():

et = extendTSS(gene, gl)
head(et)
## GRanges object with 6 ranges and 3 metadata columns:
##             seqnames        ranges strand |     gene_id tss_position
##                <Rle>     <IRanges>  <Rle> | <character>    <integer>
##   100287102     chr1       1-28962      * |   100287102        11874
##      653635     chr1   12873-64091      * |      653635        29961
##       79501     chr1  34961-139567      * |       79501        69091
##      729737     chr1  70090-713069      * |      729737       140566
##   100288069     chr1 145566-757971      * |   100288069       714068
##      643837     chr1 719068-763970      * |      643837       762971
##              tss_strand
##             <character>
##   100287102           +
##      653635           -
##       79501           +
##      729737           -
##   100288069           -
##      643837           +
##   -------
##   seqinfo: 24 sequences from an unspecified genome

We can also manually construct the gene sets object, which is simply a named list of vectors where each vector contains genes in a gene set. Here we directly use the object gs generated before which is a gene set collection from DSigDB.

Please note again that the gene IDs in gs must be of the same ID type as those in et (Entrez IDs in this example).
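A quick sanity check for matching ID types is to look at how many genes of each gene set can be found among the gene IDs of the extended TSS. The sketch below uses base R and hypothetical toy data (gs_toy and tss_ids are made up; the Entrez IDs are taken from the head(gene) output above, and one Ensembl ID is added on purpose as a mismatch).

```r
# Hypothetical toy data: one gene set mixes in an Ensembl ID by mistake
gs_toy = list(
    set_A = c("1", "10", "100"),
    set_B = c("1000", "ENSG00000141510")
)
tss_ids = c("1", "10", "100", "1000", "10000")  # stands in for the gene_id column of et

# fraction of each gene set that is found among the TSS gene IDs
covered = sapply(gs_toy, function(g) mean(g %in% tss_ids))
covered
```

A fraction close to zero for most gene sets is a strong hint that the two objects use different ID types.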

Now gs and et can be sent to great() to perform local GREAT with the annotation data you manually provided.

great(gr, gene_sets = gs, extended_tss = et)
## 1000 regions are associated to 1490 genes' extended TSSs.
##   TSS source: self-provided
##   Genome: unknown
##   Gene sets: self-provided
##   Background: whole genome
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

Use BioMart genes and genesets

rGREAT also supports GO gene sets for a huge number of organisms retrieved from Ensembl BioMart. A specific organism can be set with the biomart_dataset argument:

# Giant panda
gr_panda = randomRegionsFromBioMartGenome("amelanoleuca_gene_ensembl")
great(gr_panda, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")
## 960 regions are associated to 1589 genes' extended TSSs.
##   TSS source: amelanoleuca_gene_ensembl
##   Genome: Giant panda genes (ASM200744v2)
##   Gene sets: GO:BP
##   Background: whole genome
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

Note both TSS and gene sets are from BioMart. The value for gene sets (the second argument) can only be one of "GO:BP", "GO:CC" and "GO:MF".

rGREAT now supports 556 organisms. The complete list can be found at https://jokergoo.github.io/rGREAT_genesets/. Just make sure the genome version is the same as the genome of your input regions.

Set background regions

In the online GREAT tool, if background regions are set, a different test is actually used for the enrichment analysis. In GREAT, when a background is set, the input regions must be an exact subset of the background regions. For example, suppose a background region list contains five regions: [1, 10], [15, 23], [34, 38], [40, 49], [54, 63]. The input regions can only be a subset of these five regions, which means they can include [15, 23] and [40, 49], but not [16, 20] or [39, 51]. In this setting, regions are treated as single units and Fisher’s exact test is applied to calculate the enrichment (by testing the number of regions in the 2x2 contingency table).
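The region-level test described above can be sketched with base R. All counts below are hypothetical, and the layout of the 2x2 table is one reasonable reading of the description rather than a statement of how online GREAT is implemented.

```r
# Hypothetical counts, with regions treated as single units
n_background    = 10000  # all background regions
n_background_gs = 500    # background regions associated with the gene set
n_input         = 1000   # input regions (an exact subset of the background)
n_input_gs      = 80     # input regions associated with the gene set

# 2x2 contingency table: input / not input vs. in gene set / not in gene set
tab = matrix(
    c(n_input_gs, n_input - n_input_gs,
      n_background_gs - n_input_gs,
      (n_background - n_input) - (n_background_gs - n_input_gs)),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("input", "not_input"), c("in_set", "not_in_set"))
)

# one-sided Fisher's exact test for over-representation
p_fisher = fisher.test(tab, alternative = "greater")$p.value

# the same p-value written as an upper-tail hypergeometric probability
p_hyper = phyper(n_input_gs - 1, n_background_gs,
                 n_background - n_background_gs, n_input, lower.tail = FALSE)

c(fisher = p_fisher, hypergeometric = p_hyper)
```

The two numbers agree because the one-sided Fisher's exact test is the upper tail of the hypergeometric distribution.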

This can be useful in certain cases. For example, for a specific transcription factor (TF), we can take the union of ChIP-seq peaks of this TF from all tissues as the background set and the peaks from one specific tissue as the input region set, and then test the enrichment of TF peaks in that tissue compared to the “background”. However, this definition of “background” may not be what many users expect. They may instead think of the “background” as the set of regions within which GREAT (the binomial method) should be performed. For example, they may want to exclude “gap regions / unsequenced regions” from the analysis, because the null hypothesis of the binomial test is that the input regions are uniformly distributed in the genome. Since unsequenced regions can never be measured, they should be excluded from the analysis. As another example, the background can be set to regions showing similar GC content or CpG density to the input regions.

great() supports two arguments, background and exclude, for setting a proper background. If either of the two is set, the input regions and the extended TSS regions are intersected with the background, and the GREAT algorithm is applied only to the reduced regions.

When the whole genome is used as background, denote \(N_1\) as the total number of input regions, \(p_1\) as the fraction of the genome that overlaps the extended TSSs of genes in a certain gene set, and \(K_1\) as the number of input regions that overlap the regions associated with the gene set. The enrichment test is then based on the random variable \(K_1\), which follows the binomial distribution \(K_1 \sim B(p_1, N_1)\).

Similarly, when background regions are set, denote \(N_2\) as the total number of input regions that overlap the background, \(p_2\) as the fraction of the background that overlaps the extended TSSs of genes in a certain gene set, and \(K_2\) as the number of input regions that overlap both the regions associated with the gene set and the background. The enrichment test is then based on the random variable \(K_2\), which follows the binomial distribution \(K_2 \sim B(p_2, N_2)\).
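As a worked example of the binomial test, the numbers printed elsewhere in this document for HALLMARK_APOPTOSIS (genome fraction 0.015572219, 22 observed region hits) can be plugged into base R. This is a sketch under assumptions: the p-value is taken to be the upper-tail binomial probability, and the number of regions, 916, is inferred from the printed great() output because it reproduces the reported fold enrichment.

```r
N1 = 916           # number of input regions (assumed, see the note above)
p1 = 0.015572219   # fraction of the genome associated with the gene set
K1 = 22            # observed number of input regions hitting the gene set

# fold enrichment: observed hits divided by expected hits
fold_enrichment = K1 / (N1 * p1)

# upper-tail binomial probability P(K >= K1) with size N1 and probability p1
p_value = pbinom(K1 - 1, size = N1, prob = p1, lower.tail = FALSE)

c(fold_enrichment = fold_enrichment, p_value = p_value)
```

The same two lines apply when a background is set, with \(N_2\), \(p_2\) and \(K_2\) in place of \(N_1\), \(p_1\) and \(K_1\).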

In fact, the native hypergeometric method in GREAT can be approximated by the binomial method here. Moreover, the binomial method is more general, because it does not carry the restriction of the hypergeometric method that input regions must be exact subsets of the background regions.
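The approximation can be checked numerically in base R. The counts below are hypothetical and chosen so that the input set is small relative to the background, which is the situation in which sampling without replacement (hypergeometric) and with replacement (binomial) behave similarly.

```r
# Hypothetical region counts
M = 100000  # background regions
m = 2000    # background regions associated with the gene set
n = 1000    # input regions
k = 30      # input regions associated with the gene set

# hypergeometric upper tail: regions drawn without replacement from the background
p_hyper = phyper(k - 1, m, M - m, n, lower.tail = FALSE)

# binomial upper tail: each input region hits the gene set with probability m / M
p_binom = pbinom(k - 1, size = n, prob = m / M, lower.tail = FALSE)

c(hypergeometric = p_hyper, binomial = p_binom)
```

As the input set grows towards the size of the background, the two p-values are expected to drift further apart.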

In the following example, getGapFromUCSC() is used to retrieve gap regions from the UCSC Table Browser.

gap = getGapFromUCSC("hg19", paste0("chr", c(1:22, "X", "Y")))
great(gr, "MSigDB:H", "hg19", exclude = gap)

Note that, as in GREAT, rGREAT excludes gap regions from the analysis by default.

Alternatively, background and exclude can also be set to a vector of chromosome names, in which case the selected chromosomes are included in or excluded from the analysis as a whole.

great(gr, "GO:BP", "hg19", background = paste0("chr", 1:22))
great(gr, "GO:BP", "hg19", exclude = c("chrX", "chrY"))

Get enrichment table

Simply use the getEnrichmentTable() function.

tb = getEnrichmentTable(res)
head(tb)
##                                   id genome_fraction observed_region_hits
## 1                 HALLMARK_APOPTOSIS     0.015572219                   22
## 2           HALLMARK_SPERMATOGENESIS     0.013943482                   20
## 3       HALLMARK_IL2_STAT5_SIGNALING     0.021152759                   25
## 4      HALLMARK_BILE_ACID_METABOLISM     0.010077105                   13
## 5        HALLMARK_HEDGEHOG_SIGNALING     0.006965189                    9
## 6 HALLMARK_OXIDATIVE_PHOSPHORYLATION     0.012476423                   14
##   fold_enrichment    p_value  p_adjust mean_tss_dist observed_gene_hits
## 1        1.542328 0.03308353 0.7148313        250612                 18
## 2        1.565897 0.03574156 0.7148313        233929                 18
## 3        1.290261 0.12179347 0.8728821        189866                 22
## 4        1.408355 0.14054771 0.8728821        167309                 12
## 5        1.410633 0.19372944 0.8728821        307394                  7
## 6        1.225018 0.25884518 0.8728821        148492                 13
##   gene_set_size fold_enrichment_hyper p_value_hyper p_adjust_hyper
## 1           158             1.5105515    0.05180621      0.3453747
## 2           134             1.7810981    0.01192743      0.2197825
## 3           196             1.4882872    0.03956849      0.3320113
## 4           112             1.4206378    0.13766929      0.5506772
## 5            36             2.5781944    0.01648369      0.2197825
## 6           199             0.8661845    0.74420267      0.9737239

In getEnrichmentTable(), you can use the argument min_region_hits to set the minimal number of input regions that must hit the regions associated with a gene set. Note that the adjusted p-values are recalculated for the filtered table.
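The effect of filtering on the adjusted p-values can be illustrated with base R. The table below is hypothetical (tb_toy and its values are made up), and the Benjamini-Hochberg method is an assumption used for illustration rather than a statement about rGREAT's internals.

```r
# Hypothetical enrichment table
tb_toy = data.frame(
    id = c("set_A", "set_B", "set_C", "set_D"),
    observed_region_hits = c(22, 3, 25, 1),
    p_value = c(0.03, 0.2, 0.12, 0.5)
)

# adjusted p-values over all four gene sets
p_adjust_all = p.adjust(tb_toy$p_value, method = "BH")

# keep gene sets hit by at least 5 input regions, then adjust again
min_region_hits = 5
kept = tb_toy[tb_toy$observed_region_hits >= min_region_hits, ]
kept$p_adjust = p.adjust(kept$p_value, method = "BH")

kept
```

Because fewer tests remain after filtering, the adjusted p-values of the retained gene sets become smaller than in the unfiltered table.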

There are also columns for hypergeometric test on the numbers of genes, the same as in the original GREAT method.

There is a new column “mean_tss_dist” in the result table, which is the mean absolute distance of the input regions to the TSSs of genes in a gene set. Please note that the larger the distance to the TSS, the more careful we need to be about the reliability of the associations between input regions and genes.

Make volcano plot

In differential gene expression analysis, a volcano plot is used to visualize the relation between log2 fold change and (adjusted) p-values. Similarly, we can use a volcano plot to visualize the relation between fold enrichment and (adjusted) p-values in the enrichment analysis. The plot is made by the function plotVolcano():

plotVolcano(res)

As the enrichment analysis basically only looks for over-representation, the plot is effectively half of a volcano.
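The “half volcano” shape can be seen by computing the plot coordinates by hand. The sketch below uses base R graphics and the six rows printed in the enrichment table above; the choice of log2 fold enrichment and -log10 p-value as axes is an assumption for illustration and may not match the exact axes drawn by plotVolcano().

```r
# Values copied from the first six rows of the enrichment table above
fold_enrichment = c(1.542328, 1.565897, 1.290261, 1.408355, 1.410633, 1.225018)
p_value = c(0.03308353, 0.03574156, 0.12179347, 0.14054771, 0.19372944, 0.25884518)

x = log2(fold_enrichment)  # positive when a gene set is over-represented
y = -log10(p_value)        # larger means more significant

plot(x, y, pch = 16,
     xlab = "log2 fold enrichment", ylab = "-log10(p-value)")
abline(v = 0, lty = 2)     # fold enrichment of 1
```

All six of the top-ranked gene sets fall to the right of the dashed line, which is the over-representation side of the plot.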

Get region-gene associations

plotRegionGeneAssociations() generates three plots similar to those produced by online GREAT. getRegionGeneAssociations() returns a GRanges object containing the associations between regions and genes.

plotRegionGeneAssociations(res)