Contents

What is RcisTarget?

RcisTarget is an R-package to identify transcription factor (TF) binding motifs over-represented on a gene list.

RcisTarget is based on the methods previously implemented in i-cisTarget (web interface, region-based) and iRegulon (Cytoscape plug-in, gene-based).

If you use RcisTarget in your research, please cite:

## Aibar et al. (2017) SCENIC: single-cell regulatory network 
##   inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463

Overview of the workflow

The function cisTarget() allows to perform the motif-enrichment analysis on a gene list. The main input parameters are the gene list and the motif databases, which should be chosen depending on the organism and the search space around the TSS of the genes. This is a sample on how to run the analysis (see the following sections for details):

library(RcisTarget)

# Load gene sets to analyze. e.g.:
geneList1 <- c("gene1", "gene2", "gene3")
geneLists <- list(geneListName=geneList1)
# geneLists <- GSEABase::GeneSet(genes, setName="geneListName") # alternative

# Select motif database to use (i.e. organism and distance around TSS)
data(motifAnnotations_hgnc)
motifRankings <- importRankings("hg19-500bp-upstream-7species.mc9nr.feather")

# Motif enrichment analysis:
motifEnrichmentTable_wGenes <- cisTarget(geneLists, motifRankings,
                               motifAnnot=motifAnnotations_hgnc)

Advanced use: The cisTarget() function is enough for most simple analyses. However, for further flexibility (e.g. removing unnecessary steps on bigger analyses), RcisTarget also provides the possibility to run the inner functions individually. Running cisTarget() is equivalent to running this code:

# 1. Calculate AUC
motifs_AUC <- calcAUC(geneLists, motifRankings)

# 2. Select significant motifs, add TF annotation & format as table
motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, 
                          motifAnnot=motifAnnotations_hgnc)

# 3. Identify significant genes for each motif
# (i.e. genes from the gene set in the top of the ranking)
# Note: Method 'iCisTarget' instead of 'aprox' is more accurate, but slower
motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable, 
                                                   geneSets=geneLists,
                                                   rankings=motifRankings, 
                                                   nCores=1,
                                                   method="aprox")

Before starting

Setup

RcisTarget uses species-specific databases which are provided as independent R-packages. Prior to running RcisTarget, you will need to download and install the databases for the relevant organism (more details in the “Motif databases” section).

In addition, some extra packages can be installed to run RcisTarget in parallel or run the interactive examples in this tutorial:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "visNetwork"))

Some tips…

Help

At any time, remember you an access the help files for any function (i.e. ?cisTarget), and the other tutorials in the package with the following commands:

# Explore tutorials in the web browser:
browseVignettes(package="RcisTarget") 

# Commnad line-based:
vignette(package="RcisTarget") # list
vignette("RcisTarget") # open

Reports

To generate an HTML report with your own data and comments, you can use the .Rmd file of this tutorial as template (i.e. copy the .Rmd file, and edit it as R notebook in RStudio).

vignetteFile <- paste(file.path(system.file('doc', package='RcisTarget')), 
                      "RcisTarget.Rmd", sep="/")
# Copy to edit as markdown
file.copy(vignetteFile, ".")
# Alternative: extract R code
Stangle(vignetteFile)

Running RcisTarget

Input: Gene sets

The main input to RcisTarget is the gene set(s) to analyze. The gene sets can be provided as a ‘named list’ in which each element is a gene-set (character vector containing gene names) or as a GSEABase::GeneSet. The gene-set name will be used as ID in the following steps.

library(RcisTarget)
geneSet1 <- c("gene1", "gene2", "gene3")
geneLists <- list(geneSetName=geneSet1)
# or: 
# geneLists <- GSEABase::GeneSet(geneSet1, setName="geneSetName") 

Some extra help:

class(geneSet1)
class(geneLists)
geneSet2 <- c("gene2", "gene4", "gene5", "gene6")
geneLists[["anotherGeneSet"]] <- geneSet2
names(geneLists)
geneLists$anotherGeneSet
lengths(geneLists)

In this example we will be using a list of genes up-regulated in MCF7 cells under hypoxia conditions (PMID:16565084).

The original work highlighted the role of hypoxia-inducible factor (HIF1-alpha or HIF1A) pathways under hypoxia. This gene list is also used for the turorials in iRegulon (http://iregulon.aertslab.org/tutorial.html).

txtFile <- paste(file.path(system.file('examples', package='RcisTarget')),
                 "hypoxiaGeneSet.txt", sep="/")
geneLists <- list(hypoxia=read.table(txtFile, stringsAsFactors=FALSE)[,1])
head(geneLists$hypoxia)
## [1] "ADM"     "ADORA2B" "AHNAK2"  "AK4"     "AKAP12"  "ALDOC"

Required databases

To analyze the gene list, RcisTarget needs two types of databases: - 1. Gene-motif rankings: which provides the rankings (~score) of all the genes for each motif. - 2. The annotation of motifs to transcription factors.

Gene-motif rankings

The score of each pair of gene-motif can be performed using different parameters. Therefore, we provide multiple gene-rankings, according to several possibilities:

  • Species: Species of the input gene set. Available values: Human (Homo sapiens), mouse (Mus musculus) or fly (Drosophila melanogaster)
  • Scoring/search space: determine the search space around the transcription start site (TSS). Available values: 500 bp uptream the TSS, 5kbp or 10kbp around the TSS (e.g. 10kbp upstream and 10kbp downstream of the TSS).
  • Number of orthologous species taken into account to score the motifs (e.g. conservation of regions). Available values: 7 or 10 species

If you don’t know which database to choose, we would suggest using the 500bp upstream TSS, and a larger search space (e.g. TSS+/-5kbp or TSS+/-10kbp) with 7 species. Of course, selecting Human or Mouse depending on your input gene list.

For each ranking there are two files: one with extension .feather (which contains the main ranking, aprox 1GB), and one with extension .descr which contains the description of the file.

You can download the files using your preferred method. For example, to download within R:

featherURL <- "http://pyscenic.aertslab.org/databases/hg19-500bp-upstream-7species.mc9nr.feather" # .feather
download.file(featherURL, destfile=basename(featherURL)) # saved in current dir
featherURL <- "http://pyscenic.aertslab.org/databases/hg19-500bp-upstream-7species.mc9nr.descr" # .descr
download.file(featherURL, destfile=basename(featherURL))

Download URLs for the current databases (version ‘mc9nr’: including 24453 motifs) [Aprox file size: 1GB]:

Species Search space #ort File
Human 500bp upstream TSS 7 hg19-500bp-upstream-7species.mc9nr
Human TSS+/-5kbp 7 hg19-tss-centered-5kb-7species.mc9nr
Human TSS+/-10kbp 7 hg19-tss-centered-10kb-7species.mc9nr
Human 500bp upstream TSS 10 hg19-500bp-upstream-10species.mc9nr
Human TSS+/-5kbp 10 hg19-tss-centered-5kb-10species.mc9nr
Human TSS+/-10kbp 10 hg19-tss-centered-10kb-10species.mc9nr
Mouse 500bp upstream TSS 7 mm9-500bp-upstream-7species.mc9nr
Mouse TSS+/-5kbp 7 mm9-tss-centered-5kb-7species.mc9nr
Mouse TSS+/-10kbp 7 mm9-tss-centered-10kb-7species.mc9nr
Mouse 500bp upstream TSS 10 mm9-500bp-upstream-10species.mc9nr
Mouse TSS+/-5kbp 10 mm9-tss-centered-5kb-10species.mc9nr
Mouse TSS+/-10kbp 10 mm9-tss-centered-10kb-10species.mc9nr
Fly Coming soon… (Previous version: dm6-5kb-upstream-full-tx-11species.mc8nr

Once you have both files, they can be loaded with importRankings():

# Search space: 10k bp around TSS - HUMAN
motifRankings <- importRankings("hg19-500bp-upstream-7species.mc9nr.feather")
# Load the annotation to human transcription factors
data(motifAnnotations_hgnc)

Motif annotations

All the calculations performed by RcisTarget are motif-based. However, most users will be interested on TFs potentially regulating the gene-set. The association of motif to transcription factors are provided in an independent file. In addition to the motifs annotated by the source datatabse (i.e. direct annotation), we have also inferred some further annotations based on motif similarity and gene ortology (e.g. similarity to other genes annotated to the motif). These annotations are typically used with the functions cisTarget() or addMotifAnnotation().

For the motifs in version ‘mc9nr’ of the rankings (24453 motifs), these annotations are already included in RcisTarget package and can be loaded with the command:

# mouse:
# data(motifAnnotations_mgi)

# human:
data(motifAnnotations_hgnc)
motifAnnotations_hgnc[199:202,]
##           motif     TF directAnnotation inferred_Orthology
## 1: bergman__tin NKX2-8            FALSE               TRUE
## 2: bergman__tll   DUX4            FALSE              FALSE
## 3: bergman__tll  NR2E1            FALSE               TRUE
## 4: bergman__usp   RXRA            FALSE               TRUE
##    inferred_MotifSimil                       annotationSource
## 1:                TRUE inferredBy_MotifSimilarity_n_Orthology
## 2:                TRUE             inferredBy_MotifSimilarity
## 3:               FALSE                   inferredBy_Orthology
## 4:               FALSE                   inferredBy_Orthology
##                                                                                                                                             description
## 1: gene is orthologous to FBgn0261930 in D. melanogaster (identity = 39%) which is annotated for similar motif bergman__vnd ('vnd'; q-value = 0.000831)
## 2:                                                     gene is annotated for similar motif transfac_pro__M06988 ('V$DUX4_01: DUX4'; q-value = 0.000745)
## 3:                                         gene is orthologous to FBgn0003720 in D. melanogaster (identity = 41%) which is directly annotated for motif
## 4:                                         gene is orthologous to FBgn0003964 in D. melanogaster (identity = 45%) which is directly annotated for motif

For other versions of the rankings, the function importAnnotations allows to import the annotations from the source file.

These annotations can be easily queried to obtain further information about specific motifs or TFs:

motifAnnotations_hgnc[(directAnnotation==TRUE) 
                               & 
                               (TF %in% c("HIF1A")), ]
##                               motif    TF directAnnotation
##  1:                    cisbp__M3388 HIF1A             TRUE
##  2:                    cisbp__M3389 HIF1A             TRUE
##  3:                    cisbp__M3390 HIF1A             TRUE
##  4:                    cisbp__M3391 HIF1A             TRUE
##  5:                    cisbp__M6275 HIF1A             TRUE
##  6: hocomoco__HIF1A_HUMAN.H11MO.0.C HIF1A             TRUE
##  7:          homer__TACGTGCV_HIF-1a HIF1A             TRUE
##  8:      swissregulon__hs__HIF1A.p2 HIF1A             TRUE
##  9:            transfac_pro__M00466 HIF1A             TRUE
## 10:            transfac_pro__M00797 HIF1A             TRUE
## 11:            transfac_pro__M00976 HIF1A             TRUE
## 12:            transfac_pro__M02012 HIF1A             TRUE
## 13:            transfac_pro__M02378 HIF1A             TRUE
## 14:            transfac_pro__M07043 HIF1A             TRUE
## 15:            transfac_pro__M07384 HIF1A             TRUE
##     inferred_Orthology inferred_MotifSimil annotationSource
##  1:              FALSE               FALSE directAnnotation
##  2:              FALSE               FALSE directAnnotation
##  3:              FALSE               FALSE directAnnotation
##  4:              FALSE               FALSE directAnnotation
##  5:              FALSE               FALSE directAnnotation
##  6:              FALSE               FALSE directAnnotation
##  7:              FALSE               FALSE directAnnotation
##  8:              FALSE               FALSE directAnnotation
##  9:              FALSE               FALSE directAnnotation
## 10:              FALSE               FALSE directAnnotation
## 11:              FALSE               FALSE directAnnotation
## 12:              FALSE               FALSE directAnnotation
## 13:              FALSE               FALSE directAnnotation
## 14:              FALSE               FALSE directAnnotation
## 15:              FALSE               FALSE directAnnotation
##                    description
##  1: gene is directly annotated
##  2: gene is directly annotated
##  3: gene is directly annotated
##  4: gene is directly annotated
##  5: gene is directly annotated
##  6: gene is directly annotated
##  7: gene is directly annotated
##  8: gene is directly annotated
##  9: gene is directly annotated
## 10: gene is directly annotated
## 11: gene is directly annotated
## 12: gene is directly annotated
## 13: gene is directly annotated
## 14: gene is directly annotated
## 15: gene is directly annotated

Database example (subset)

In addition to the full versions of the databases (20k motifs), we also provide a subset containing only the 4.6k motifs from cisbp (human only: RcisTarget.hg19.motifDBs.cisbpOnly.500bp). These subsets are available in Bioconductor for demonstration purposes. They will provide the same AUC score for the existing motifs. However, we highly recommend to use the full version (~20k motifs) for more accurate results, since the normalized enrichment score (NES) of the motif depends on the total number of motifs in the database.

To install this package:

# From Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("RcisTarget.hg19.motifDBs.cisbpOnly.500bp")

For this vignette (demonstration purposes), we will use this database:

library(RcisTarget.hg19.motifDBs.cisbpOnly.500bp)
# Rankings
data(hg19_500bpUpstream_motifRanking_cispbOnly)
motifRankings <- hg19_500bpUpstream_motifRanking_cispbOnly
motifRankings
## Rankings for RcisTarget.
##   Organism: human (hg19)
##   Number of genes: 22285 (22285 available in the full DB)
##   Number of MOTIFS: 4687
## ** This database includes rankings up to 5050
## 
## Subset (4.6k cisbp motifs) of the Human database scoring motifs 500bp upstream the TSS (hg19-500bp-upstream-7species.mc9nr)
# Annotations
data(hg19_motifAnnotation_cisbpOnly) 
motifAnnotations_hgnc <- hg19_motifAnnotation_cisbpOnly

Running the analysis

Once the gene lists and databases are loaded, they can be used by cisTarget(). cisTarget() runs sequentially the steps to perform the (1) motif enrichment analysis, (2) motif-TF annotation, and (3) selection of significant genes.

It is also possible to run these steps as individual commands. For example, to skip steps, for analyses in which the user is interested in one of the outputs, or to optimize the workflow to run it on multiple gene lists (see Advanced section for details).

motifEnrichmentTable_wGenes <- cisTarget(geneLists, 
         motifRankings,
         motifAnnot=motifAnnotations_hgnc)

Advanced: Execute step by step

1. Calculate enrichment

The first step to estimate the over-representation of each motif on the gene-set is to calculate the Area Under the Curve (AUC) for each pair of motif-geneSet. This is calculated based on the recovery curve of the gene-set on the motif ranking (genes ranked decreasingly by the score of motif in its proximity, as provided in the motifRanking database).

motifs_AUC <- calcAUC(geneLists, motifRankings, nCores=1)

The AUC is provided as a matrix of Motifs by GeneSets. In principle, the AUC is mostly meant as input for the next step. However, it is also possible to explore the distribution of the scores, for example in a gene-set of interest:

auc <- getAUC(motifs_AUC)["hypoxia",]
hist(auc, main="hypoxia", xlab="AUC histogram",
     breaks=100, col="#ff000050", border="darkred")
nes3 <- (3*sd(auc)) + mean(auc)
abline(v=nes3, col="red")

2. Select significant motifs and/or annotate to TFs

The selection of significant motifs is done based on the Normalized Enrichment Score (NES). The NES is calculated -for each motif- based on the AUC distribution of all the motifs for the gene-set [(x-mean)/sd]. Those motifs that pass the given threshold (3.0 by default) are considered significant.

Furthermore, this step also allows to add the TFs annotated to the motif.

motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, nesThreshold=3,
     motifAnnot=motifAnnotations_hgnc,
     highlightTFs=list(hypoxia="HIF1A"))
class(motifEnrichmentTable)
## [1] "data.table" "data.frame"
dim(motifEnrichmentTable)
## [1] 17  8
head(motifEnrichmentTable[,-"TF_lowConf", with=FALSE])
##    geneSet        motif  NES    AUC highlightedTFs TFinDB
## 1: hypoxia cisbp__M6275 4.41 0.0966          HIF1A     **
## 2: hypoxia cisbp__M0062 3.57 0.0841          HIF1A       
## 3: hypoxia cisbp__M6279 3.56 0.0840          HIF1A       
## 4: hypoxia cisbp__M6212 3.49 0.0829          HIF1A       
## 5: hypoxia cisbp__M2332 3.48 0.0828          HIF1A       
## 6: hypoxia cisbp__M0387 3.41 0.0817          HIF1A       
##                  TF_highConf
## 1: HIF1A (directAnnotation).
## 2:                          
## 3: HMGA1 (directAnnotation).
## 4: EPAS1 (directAnnotation).
## 5:                          
## 6:

The cathegories considered high/low confidence can me modified with the arguments motifAnnot_highConfCat and motifAnnot_lowConfCat.

3. Identify the genes with the best enrichment for each Motif

Since RcisTarget searches for enrichment of a motif within a gene list, finding a motif ‘enriched’ does not imply that all the genes in the gene-list have a high score for the motif. In this way, the third step of the workflow is to identify which genes (of the gene-set) are highly ranked for each of the significant motifs.

There are two methods to identify these genes: (1) equivalent to the ones used in iRegulon and i-cisTarget (method="iCisTarget", recommended if running time is not an issue), and (2) a faster implementation based on an approximate distribution using the average at each rank (method="aprox", useful to scan multiple gene sets).

IMPORTANT: Make sure that the motifRankings are the same as in Step 1.

motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable,
                      rankings=motifRankings, 
                      geneSets=geneLists)
dim(motifEnrichmentTable_wGenes)
## [1] 17 11

Plot for a few motifs:

geneSetName <- "hypoxia"
selectedMotifs <- c("cisbp__M6275", sample(motifEnrichmentTable$motif, 2))
par(mfrow=c(2,2))
getSignificantGenes(geneLists[[geneSetName]], 
                    motifRankings,
                    signifRankingNames=selectedMotifs,
                    plotCurve=TRUE, maxRank=5000-20, genesFormat="none",
                    method="aprox")