BgeeDB, an R package for retrieval of curated expression datasets and for gene list enrichment tests

Andrea Komljenovic, Julien Roux, Marc Robinson-Rechavi, Frederic Bastian

2017-10-30

BgeeDB is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment test based on expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from human and many other animal species.

The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the quantitative data and expression calls produced by the Bgee pipeline. The package also allows to run GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by expression patterns, based on the topGO package. This is the same as the TopAnat web-service available at (http://bgee.org/?page=top_anat#/), but with more flexibility in the choice of parameters and developmental stages.

In summary, the BgeeDB package allows to: * 1. List annotation of RNA-seq and microarray data available the Bgee database * 2. Download the processed gene expression data available in the Bgee database * 3. Download the gene expression calls and use them to perform TopAnat analyses

Installation

In R:

#source("https://bioconductor.org/biocLite.R")
#biocLite("BgeeDB")

How to use BgeeDB package

Load the package

library(BgeeDB)

Running example: downloading and formatting processed RNA-seq data

List available species in Bgee

The listBgeeSpecies() function allows to retrieve available species in the Bgee database, and which data types are available for each species.

listBgeeSpecies()
## 
## Querying Bgee to get release information...
## 
## Building URL to query species in Bgee release 14...
## 
## Submitting URL to Bgee webservice... (http://bgee.org/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.4.0)
## 
## Query to Bgee webservice successful!
##       ID           GENUS     SPECIES_NAME         COMMON_NAME AFFYMETRIX
## 1   6239  Caenorhabditis          elegans            nematode       TRUE
## 2   7217      Drosophila        ananassae                          FALSE
## 3   7227      Drosophila     melanogaster           fruit fly       TRUE
## 4   7230      Drosophila       mojavensis                          FALSE
## 5   7237      Drosophila    pseudoobscura                          FALSE
## 6   7240      Drosophila         simulans                          FALSE
## 7   7244      Drosophila          virilis                          FALSE
## 8   7245      Drosophila           yakuba                          FALSE
## 9   7955           Danio            rerio           zebrafish       TRUE
## 10  8364         Xenopus       tropicalis western clawed frog      FALSE
## 11  9031          Gallus           gallus             chicken      FALSE
## 12  9258 Ornithorhynchus         anatinus            platypus      FALSE
## 13  9365       Erinaceus        europaeus            hedgehog      FALSE
## 14  9544          Macaca          mulatta             macaque       TRUE
## 15  9593         Gorilla          gorilla             gorilla      FALSE
## 16  9597             Pan         paniscus              bonobo      FALSE
## 17  9598             Pan      troglodytes          chimpanzee      FALSE
## 18  9606            Homo          sapiens               human       TRUE
## 19  9615           Canis lupus familiaris                 dog      FALSE
## 20  9685           Felis            catus                 cat      FALSE
## 21  9796           Equus         caballus               horse      FALSE
## 22  9823             Sus           scrofa                 pig      FALSE
## 23  9913             Bos           taurus              cattle      FALSE
## 24  9986     Oryctolagus        cuniculus              rabbit      FALSE
## 25 10090             Mus         musculus               mouse       TRUE
## 26 10116          Rattus       norvegicus                 rat       TRUE
## 27 10141           Cavia        porcellus          guinea pig      FALSE
## 28 13616     Monodelphis        domestica             opossum      FALSE
## 29 28377          Anolis     carolinensis         green anole      FALSE
##      EST IN_SITU RNA_SEQ
## 1  FALSE    TRUE    TRUE
## 2  FALSE   FALSE    TRUE
## 3   TRUE    TRUE    TRUE
## 4  FALSE   FALSE    TRUE
## 5  FALSE   FALSE    TRUE
## 6  FALSE   FALSE    TRUE
## 7  FALSE   FALSE    TRUE
## 8  FALSE   FALSE    TRUE
## 9   TRUE    TRUE    TRUE
## 10  TRUE    TRUE    TRUE
## 11 FALSE   FALSE    TRUE
## 12 FALSE   FALSE    TRUE
## 13 FALSE   FALSE    TRUE
## 14 FALSE   FALSE    TRUE
## 15 FALSE   FALSE    TRUE
## 16 FALSE   FALSE    TRUE
## 17 FALSE   FALSE    TRUE
## 18  TRUE   FALSE    TRUE
## 19 FALSE   FALSE    TRUE
## 20 FALSE   FALSE    TRUE
## 21 FALSE   FALSE    TRUE
## 22 FALSE   FALSE    TRUE
## 23 FALSE   FALSE    TRUE
## 24 FALSE   FALSE    TRUE
## 25  TRUE    TRUE    TRUE
## 26 FALSE   FALSE    TRUE
## 27 FALSE   FALSE    TRUE
## 28 FALSE   FALSE    TRUE
## 29 FALSE   FALSE    TRUE

It is possible to list all species from a specific release of Bgee with the release argument (see listBgeeRelease() function), and order the species according to a specific columns with the ordering argument. For example:

listBgeeSpecies(release = "13.2", order = 2)
## 
## Querying Bgee to get release information...
## 
## Building URL to query species in Bgee release 13_2...
## 
## Submitting URL to Bgee webservice... (http://bgee.org/bgee13/?page=species&display_type=tsv&source=BgeeDB_R_package&source_version=2.4.0)
## 
## Query to Bgee webservice successful!
##       ID           GENUS SPECIES_NAME COMMON_NAME AFFYMETRIX   EST IN_SITU
## 17 28377          Anolis carolinensis      anolis      FALSE FALSE   FALSE
## 13  9913             Bos       taurus         cow      FALSE FALSE   FALSE
## 1   6239  Caenorhabditis      elegans   c.elegans       TRUE FALSE    TRUE
## 3   7955           Danio        rerio   zebrafish       TRUE  TRUE    TRUE
## 2   7227      Drosophila melanogaster    fruitfly       TRUE  TRUE    TRUE
## 5   9031          Gallus       gallus     chicken      FALSE FALSE   FALSE
## 8   9593         Gorilla      gorilla     gorilla      FALSE FALSE   FALSE
## 11  9606            Homo      sapiens       human       TRUE  TRUE   FALSE
## 7   9544          Macaca      mulatta     macaque      FALSE FALSE   FALSE
## 16 13616     Monodelphis    domestica     opossum      FALSE FALSE   FALSE
## 14 10090             Mus     musculus       mouse       TRUE  TRUE    TRUE
## 6   9258 Ornithorhynchus     anatinus    platypus      FALSE FALSE   FALSE
## 9   9597             Pan     paniscus      bonobo      FALSE FALSE   FALSE
## 10  9598             Pan  troglodytes  chimpanzee      FALSE FALSE   FALSE
## 15 10116          Rattus   norvegicus         rat      FALSE FALSE   FALSE
## 12  9823             Sus       scrofa         pig      FALSE FALSE   FALSE
## 4   8364         Xenopus   tropicalis     xenopus      FALSE  TRUE    TRUE
##    RNA_SEQ
## 17    TRUE
## 13    TRUE
## 1     TRUE
## 3    FALSE
## 2    FALSE
## 5     TRUE
## 8     TRUE
## 11    TRUE
## 7     TRUE
## 16    TRUE
## 14    TRUE
## 6     TRUE
## 9     TRUE
## 10    TRUE
## 15    TRUE
## 12    TRUE
## 4     TRUE

Create a new Bgee object

In the following example we will choose to focus on mouse (“Mus_musculus”) RNA-seq. Species can be specified using their name or their NCBI taxonomic IDs. To specify that RNA-seq data want to be downloaded, the dataType argument is set to “rna_seq”. To download Affymetrix microarray data, set this argument to “affymetrix”.

bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")
## 
## Querying Bgee to get release information...
## 
## Building URL to query species in Bgee release 14_0...
## 
## Submitting URL to Bgee webservice... (http://bgee.org/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.4.0)
## 
## Query to Bgee webservice successful!
## 
## API key built: eb4b101199c0eeda36b735165d43f3e3bce04cac64ebb394f2e847cef08e7773870daacdf7a46df4ef7c215fc02c4d432dd5de10417ced2f0fa49433cb555af1
## 
## IMPORTANT INFORMATION: For now, only the processed data download part of the package is using data from Bgee release 14. The TopAnat part of the package is still based on data from Bgee release 13.2, and we are working on updating it to use data from Bgee release 14.

Note 1: It is possible to work with data from a specific release of Bgee by specifying the release argument, see listBgeeRelease() function.

Note 2: The functions of the package will store the downloaded files in a versioned folder created by default in the working directory. These cache files allow faster re-access to the data. The directory where data are stored can be changed with the pathToData argument.

Retrieve the annotation of mouse RNA-seq datasets

The getAnnotation() function will output the list of RNA-seq experiments and libraries available in Bgee for mouse.

annotation_bgee_mouse <- getAnnotation(bgee)
## 
## Downloading annotation files...
## 
## Saved annotation files in /tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Mus_musculus_Bgee_14_0 folder.
# list the first experiments and libraries
lapply(annotation_bgee_mouse, head)
## $sample.annotation
##   Experiment.ID Library.ID Anatomical.entity.ID Anatomical.entity.name
## 1      GSE30617  ERX012344       UBERON:0000948                  heart
## 2      GSE30617  ERX012348       UBERON:0000948                  heart
## 3      GSE30617  ERX012362       UBERON:0000948                  heart
## 4      GSE30617  ERX012363       UBERON:0000948                  heart
## 5      GSE30617  ERX012374       UBERON:0000948                  heart
## 6      GSE30617  ERX012378       UBERON:0000948                  heart
##         Stage.ID      Stage.name  Sex       Strain
## 1 MmusDv:0000052 8 weeks (mouse) <NA> DBAxC57BL/6J
## 2 MmusDv:0000052 8 weeks (mouse) <NA> DBAxC57BL/6J
## 3 MmusDv:0000052 8 weeks (mouse) <NA> DBAxC57BL/6J
## 4 MmusDv:0000052 8 weeks (mouse) <NA> DBAxC57BL/6J
## 5 MmusDv:0000052 8 weeks (mouse) <NA> DBAxC57BL/6J
## 6 MmusDv:0000052 8 weeks (mouse) <NA> DBAxC57BL/6J
##                    Platform.ID Library.type Library.orientation
## 1 Illumina Genome Analyzer IIx       paired                  NA
## 2 Illumina Genome Analyzer IIx       paired                  NA
## 3 Illumina Genome Analyzer IIx       paired                  NA
## 4 Illumina Genome Analyzer IIx       paired                  NA
## 5 Illumina Genome Analyzer IIx       paired                  NA
## 6 Illumina Genome Analyzer IIx       paired                  NA
##   TPM.expression.threshold FPKM.expression.threshold Read.count
## 1                 0.024357                  0.023059   30075234
## 2                 0.042566                  0.047824    8605668
## 3                 0.024598                  0.028740   29498377
## 4                 0.025378                  0.024916   31000737
## 5                 0.027712                  0.026467   31160789
## 6                 0.025381                  0.025119   27824366
##   Mapped.read.count Min.read.length Max.read.length
## 1          26055677               0               0
## 2           7347266               0               0
## 3          25708709               0               0
## 4          26507831               0               0
## 5          27397595               0               0
## 6          23808859               0               0
##   All.genes.percent.present Protein.coding.genes.percent.present
## 1                     40.42                                69.20
## 2                     25.95                                51.95
## 3                     43.40                                69.54
## 4                     42.31                                70.52
## 5                     40.92                                68.87
## 6                     40.36                                68.93
##   Intergenic.regions.percent.present   Run.IDs Data.source
## 1                              14.44 ERR032229         GEO
## 2                               6.46 ERR032228         GEO
## 3                              17.08 ERR032238         GEO
## 4                              15.54 ERR032227         GEO
## 5                              14.66 ERR032231         GEO
## 6                              14.40 ERR032230         GEO
##                                               Data.source.URL
## 1 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012344
## 2 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012348
## 3 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012362
## 4 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012363
## 5 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012374
## 6 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012378
##                                                                                                                 Bgee.normalized.data.URL
## 1 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
## 2 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
## 3 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
## 4 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
## 5 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
## 6 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
##                                                 Raw.file.URL
## 1 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX012344
## 2 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX012348
## 3 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX012362
## 4 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX012363
## 5 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX012374
## 6 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX012378
## 
## $experiment.annotation
##   Experiment.ID
## 1      GSE30617
## 2      GSE41637
## 3      GSE30352
## 4     SRP022612
## 5      GSE36026
## 6      GSE43520
##                                                               Experiment.name
## 1                                            [E-MTAB-599] Mouse Transcriptome
## 2   Evolutionary dynamics of gene and isoform regulation in mammalian tissues
## 3                 The evolution of gene expression levels in mammalian organs
## 4 RNA-seq of Danio rerio and Mus musculus skin for three different age groups
## 5                                                    RNA-seq from ENCODE/LICR
## 6    The evolution of lncRNA repertoires and expression patterns in tetrapods
##   Library.count Condition.count Organ.stage.count Organ.count Stage.count
## 1            36               6                 6           6           1
## 2            26              26                 9           9           1
## 3            17              17                 6           6           1
## 4            15               2                 3           1           3
## 5            12              12                12          12           3
## 6             9               7                 5           4           2
##   Sex.count Strain.count Data.source
## 1         0            1         GEO
## 2         1            6         GEO
## 3         2            2         GEO
## 4         1            2         GEO
## 5         2            1         GEO
## 6         2            2         GEO
##                                               Data.source.URL
## 1  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617
## 2  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637
## 3  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30352
## 4 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=SRP022612
## 5  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36026
## 6  http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43520
##                                                                                                                  Bgee.normalized.data.URL
## 1  ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip
## 2  ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE41637.tsv.zip
## 3  ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30352.tsv.zip
## 4 ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP022612.tsv.zip
## 5  ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE36026.tsv.zip
## 6  ftp://bgee.org/bgee_v14/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE43520.tsv.zip
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Experiment.description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Sequencing the transcriptome of DBAxC57BL/6J mice. To study the regulation of transcription, splicing and RNA turnover we have sequenced the transcriptomes of tissues collected DBAxC57BL/6J mice.
## 2                                                                                                                                                                                                                                                                                                                                                      Most mammalian genes produce multiple distinct mRNAs through alternative splicing, but the extent of splicing conservation is not clear. To assess tissue-specific transcriptome variation across mammals, we sequenced cDNA from 9 tissues from 4 mammals and one bird in biological triplicate, at unprecedented depth. We find that while tissue-specific gene expression programs are largely conserved, alternative splicing is well conserved in only a subset of tissues and is frequently lineage-specific. Thousands of novel, lineage-specific and conserved alternative exons were identified; widely conserved alternative exons had signatures of binding by MBNL, PTB, RBFOX, STAR and TIA family splicing factors, implicating them as ancestral mammalian splicing regulators. Our data also indicates that alternative splicing is often used to alter protein phosphorylatability, delimiting the scope of kinase signaling.
## 3 Changes in gene expression are thought to underlie many of the phenotypic differences between species. However, large-scale analyses of gene expression evolution were until recently prevented by technological limitations. Here we report the sequencing of polyadenylated RNA from six organs across ten species that represent all major mammalian lineages (placentals, marsupials and monotremes) and birds (the evolutionary outgroup), with the goal of understanding the dynamics of mammalian transcriptome evolution. We show that the rate of gene expression evolution varies among organs, lineages and chromosomes, owing to differences in selective pressures: transcriptome change was slow in nervous tissues and rapid in testes, slower in rodents than in apes and monotremes, and rapid for the X chromosome right after its formation. Although gene expression evolution in mammals was strongly shaped by purifying selection, we identify numerous potentially selectively driven expression switches, which occurred at different rates across lineages and tissues and which probably contributed to the specific organ biology of various mammals. Our transcriptome data provide a valuable resource for functional and evolutionary analyses of mammalian genomes.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Comparison of temporal gene expression profiles (www.jenage.de) The RNA-seq data comprises 3 age groups: 2, 15 and 30 months for mouse skin; 5, 24 and 42 months for zebrafish skin. Illumina 50bp single-stranded single-read RNA sequencing Overall design: 15 samples for mouse: 5 biological replicates for 2 months, 6 biological replicates for 15 months and 4 biological replicates for 30 months; 20 samples for zebrafish: 9 biological replicates for 5 months, 6 biological replicates for 24 months and 5 biological replicates for 42 months
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Using RNA-Seq (Mortazavi et al., 2008), high-resolution genome-wide maps of the mouse transcriptome across multiple mouse (C57Bl/6) tissues and primary cells were generated.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                To broaden our understanding of lncRNA evolution, we used an extensive RNA-seq dataset to establish lncRNA repertoires and homologous gene families in 11 tetrapod species. We analyzed the poly- adenylated transcriptomes of 8 organs (cortex/whole brain without cerebellum, cerebellum, heart, kidney, liver, placenta, ovary and testis) and 11 species (human, chimpanzee, bonobo, gorilla, orangutan, macaque, mouse, opossum, platypus, chicken and the frog Xenopus tropicalis), which shared a common ancestor ~370 millions of years (MY) ago. Our dataset included 47 strand-specific samples, which allowed us to confirm the orientation of gene predictions and to address the evolution of sense-antisense transcripts. See also GSE43721 (Soumillon et al, Cell Reports, 2013) for three strand-specific samples for mouse brain, liver and testis.

Download the processed mouse RNA-seq data

The getData() function will download processed RNA-seq data from all mouse experiments in Bgee as a list.

# download all RNA-seq experiments from mouse
data_bgee_mouse <- getData(bgee)
## 
## The experiment is not defined. Hence taking all rna_seq experiments available for Mus_musculus.
## 
## Downloading expression data...
## 
## Saved expression data file in /tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Mus_musculus_Bgee_14_0 folder. Now unzipping file...
## 
Read 39.0% of 1718244 rows
Read 83.8% of 1718244 rows
Read 1718244 rows and 16 (of 16) columns from 0.284 GB file in 00:00:04
## 
Read 77.4% of 1240954 rows
Read 1240954 rows and 16 (of 16) columns from 0.213 GB file in 00:00:03
## 
## Saving all data in .rds file...
# number of experiments downloaded
length(data_bgee_mouse)
## [1] 9
# check the downloaded data
lapply(data_bgee_mouse, head)
## [[1]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE43721  SRX219282       single ENSMUSG00000000001
## 2      GSE43721  SRX219282       single ENSMUSG00000000003
## 3      GSE43721  SRX219282       single ENSMUSG00000000028
## 4      GSE43721  SRX219282       single ENSMUSG00000000031
## 5      GSE43721  SRX219282       single ENSMUSG00000000037
## 6      GSE43721  SRX219282       single ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000473                 testis MmusDv:0000055
## 2       UBERON:0000473                 testis MmusDv:0000055
## 3       UBERON:0000473                 testis MmusDv:0000055
## 4       UBERON:0000473                 testis MmusDv:0000055
## 5       UBERON:0000473                 testis MmusDv:0000055
## 6       UBERON:0000473                 testis MmusDv:0000055
##         Stage.name  Sex  Strain Read.count       TPM      FPKM
## 1 11 weeks (mouse) male C57BL/6       3077 11.356800 12.591405
## 2 11 weeks (mouse) male C57BL/6          0  0.000000  0.000000
## 3 11 weeks (mouse) male C57BL/6       2491 17.577187 19.488013
## 4 11 weeks (mouse) male C57BL/6         74  0.748961  0.830381
## 5 11 weeks (mouse) male C57BL/6       1659  6.053025  6.711053
## 6 11 weeks (mouse) male C57BL/6       4060 54.832049 60.792873
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4        present      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[2]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE43013  SRX211656       paired ENSMUSG00000000001
## 2      GSE43013  SRX211656       paired ENSMUSG00000000003
## 3      GSE43013  SRX211656       paired ENSMUSG00000000028
## 4      GSE43013  SRX211656       paired ENSMUSG00000000031
## 5      GSE43013  SRX211656       paired ENSMUSG00000000037
## 6      GSE43013  SRX211656       paired ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 2       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 3       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 4       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 5       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 6       UBERON:0000082 adult mammalian kidney UBERON:0000113
##                  Stage.name  Sex    Strain Read.count       TPM      FPKM
## 1 post-juvenile adult stage male wild-type       1525 46.583577 37.122217
## 2 post-juvenile adult stage male wild-type          0  0.000000  0.000000
## 3 post-juvenile adult stage male wild-type         20  0.889774  0.709056
## 4 post-juvenile adult stage male wild-type          2  0.269064  0.214416
## 5 post-juvenile adult stage male wild-type          5  0.110121  0.087755
## 6 post-juvenile adult stage male wild-type        136 11.615857  9.256618
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4        present      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[3]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE41338  SRX191152       paired ENSMUSG00000000001
## 2      GSE41338  SRX191152       paired ENSMUSG00000000003
## 3      GSE41338  SRX191152       paired ENSMUSG00000000028
## 4      GSE41338  SRX191152       paired ENSMUSG00000000031
## 5      GSE41338  SRX191152       paired ENSMUSG00000000037
## 6      GSE41338  SRX191152       paired ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 2       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 3       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 4       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 5       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 6       UBERON:0000082 adult mammalian kidney UBERON:0000113
##                  Stage.name   Sex  Strain Read.count       TPM      FPKM
## 1 post-juvenile adult stage mixed C57BL/6        567 14.501564 16.562366
## 2 post-juvenile adult stage mixed C57BL/6          0  0.000000  0.000000
## 3 post-juvenile adult stage mixed C57BL/6         21  0.877042  1.001678
## 4 post-juvenile adult stage mixed C57BL/6          2  0.065568  0.074886
## 5 post-juvenile adult stage mixed C57BL/6          0  0.000000  0.000000
## 6 post-juvenile adult stage mixed C57BL/6        192 12.685089 14.487754
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4         absent      high quality                         Part of a call
## 5         absent      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[4]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE36026  SRX147592       single ENSMUSG00000000001
## 2      GSE36026  SRX147592       single ENSMUSG00000000003
## 3      GSE36026  SRX147592       single ENSMUSG00000000028
## 4      GSE36026  SRX147592       single ENSMUSG00000000031
## 5      GSE36026  SRX147592       single ENSMUSG00000000037
## 6      GSE36026  SRX147592       single ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1           CL:0000057             fibroblast MmusDv:0000052
## 2           CL:0000057             fibroblast MmusDv:0000052
## 3           CL:0000057             fibroblast MmusDv:0000052
## 4           CL:0000057             fibroblast MmusDv:0000052
## 5           CL:0000057             fibroblast MmusDv:0000052
## 6           CL:0000057             fibroblast MmusDv:0000052
##        Stage.name  Sex  Strain Read.count        TPM        FPKM
## 1 8 weeks (mouse) male C57BL/6       5275  26.901318   66.348954
## 2 8 weeks (mouse) male C57BL/6          0   0.000000    0.000000
## 3 8 weeks (mouse) male C57BL/6        380   3.636568    8.969169
## 4 8 weeks (mouse) male C57BL/6      78239 965.590380 2381.515721
## 5 8 weeks (mouse) male C57BL/6        102   0.602173    1.485190
## 6 8 weeks (mouse) male C57BL/6          4   0.109412    0.269851
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4        present      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[5]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE30617  ERX012344       paired ENSMUSG00000000001
## 2      GSE30617  ERX012344       paired ENSMUSG00000000003
## 3      GSE30617  ERX012344       paired ENSMUSG00000000028
## 4      GSE30617  ERX012344       paired ENSMUSG00000000031
## 5      GSE30617  ERX012344       paired ENSMUSG00000000037
## 6      GSE30617  ERX012344       paired ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000948                  heart MmusDv:0000052
## 2       UBERON:0000948                  heart MmusDv:0000052
## 3       UBERON:0000948                  heart MmusDv:0000052
## 4       UBERON:0000948                  heart MmusDv:0000052
## 5       UBERON:0000948                  heart MmusDv:0000052
## 6       UBERON:0000948                  heart MmusDv:0000052
##        Stage.name Sex       Strain Read.count       TPM      FPKM
## 1 8 weeks (mouse)  NA DBAxC57BL/6J       1029 14.732067 13.946815
## 2 8 weeks (mouse)  NA DBAxC57BL/6J          0  0.000000  0.000000
## 3 8 weeks (mouse)  NA DBAxC57BL/6J         61  1.938954  1.835604
## 4 8 weeks (mouse)  NA DBAxC57BL/6J         94  3.661912  3.466724
## 5 8 weeks (mouse)  NA DBAxC57BL/6J         23  0.212652  0.201317
## 6 8 weeks (mouse)  NA DBAxC57BL/6J          8  0.669763  0.634063
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4        present      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[6]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE43520  SRX217693       single ENSMUSG00000000001
## 2      GSE43520  SRX217693       single ENSMUSG00000000003
## 3      GSE43520  SRX217693       single ENSMUSG00000000028
## 4      GSE43520  SRX217693       single ENSMUSG00000000031
## 5      GSE43520  SRX217693       single ENSMUSG00000000037
## 6      GSE43520  SRX217693       single ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000473                 testis UBERON:0000113
## 2       UBERON:0000473                 testis UBERON:0000113
## 3       UBERON:0000473                 testis UBERON:0000113
## 4       UBERON:0000473                 testis UBERON:0000113
## 5       UBERON:0000473                 testis UBERON:0000113
## 6       UBERON:0000473                 testis UBERON:0000113
##                  Stage.name  Sex  Strain Read.count       TPM      FPKM
## 1 post-juvenile adult stage male C57BL/6       1240 11.779760 11.467125
## 2 post-juvenile adult stage male C57BL/6          0  0.000000  0.000000
## 3 post-juvenile adult stage male C57BL/6        585 10.046090  9.779467
## 4 post-juvenile adult stage male C57BL/6         71  2.325751  2.264026
## 5 post-juvenile adult stage male C57BL/6        718  7.448692  7.251004
## 6 post-juvenile adult stage male C57BL/6        782 27.757432 27.020750
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4        present      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[7]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE30352  SRX081914       single ENSMUSG00000000001
## 2      GSE30352  SRX081914       single ENSMUSG00000000003
## 3      GSE30352  SRX081914       single ENSMUSG00000000028
## 4      GSE30352  SRX081914       single ENSMUSG00000000031
## 5      GSE30352  SRX081914       single ENSMUSG00000000037
## 6      GSE30352  SRX081914       single ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 2       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 3       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 4       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 5       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 6       UBERON:0000082 adult mammalian kidney UBERON:0000113
##                  Stage.name    Sex  Strain Read.count       TPM      FPKM
## 1 post-juvenile adult stage female C57BL/6       2870 39.095102 39.816500
## 2 post-juvenile adult stage female C57BL/6          0  0.000000  0.000000
## 3 post-juvenile adult stage female C57BL/6         51  1.343029  1.367811
## 4 post-juvenile adult stage female C57BL/6         15  0.682456  0.695049
## 5 post-juvenile adult stage female C57BL/6         11  0.253984  0.258671
## 6 post-juvenile adult stage female C57BL/6          2  0.077435  0.078864
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4        present      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[8]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1      GSE41637  SRX196275       paired ENSMUSG00000000001
## 2      GSE41637  SRX196275       paired ENSMUSG00000000003
## 3      GSE41637  SRX196275       paired ENSMUSG00000000028
## 4      GSE41637  SRX196275       paired ENSMUSG00000000031
## 5      GSE41637  SRX196275       paired ENSMUSG00000000037
## 6      GSE41637  SRX196275       paired ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 2       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 3       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 4       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 5       UBERON:0000082 adult mammalian kidney UBERON:0000113
## 6       UBERON:0000082 adult mammalian kidney UBERON:0000113
##                  Stage.name  Sex  Strain Read.count       TPM      FPKM
## 1 post-juvenile adult stage <NA> C57BL/6       8990 38.639422 31.498242
## 2 post-juvenile adult stage <NA> C57BL/6          0  0.000000  0.000000
## 3 post-juvenile adult stage <NA> C57BL/6        218  1.402498  1.143294
## 4 post-juvenile adult stage <NA> C57BL/6          9  0.043174  0.035195
## 5 post-juvenile adult stage <NA> C57BL/6         26  0.123947  0.101040
## 6 post-juvenile adult stage <NA> C57BL/6        788  8.513097  6.939742
##   Detection.flag Detection.quality                          State.in.Bgee
## 1        present      high quality                         Part of a call
## 2         absent      high quality Result excluded, reason: pre-filtering
## 3        present      high quality                         Part of a call
## 4         absent      high quality                         Part of a call
## 5        present      high quality                         Part of a call
## 6        present      high quality                         Part of a call
## 
## [[9]]
##   Experiment.ID Library.ID Library.type            Gene.ID
## 1     SRP022612  SRX277661       single ENSMUSG00000000001
## 2     SRP022612  SRX277661       single ENSMUSG00000000003
## 3     SRP022612  SRX277661       single ENSMUSG00000000028
## 4     SRP022612  SRX277661       single ENSMUSG00000000031
## 5     SRP022612  SRX277661       single ENSMUSG00000000037
## 6     SRP022612  SRX277661       single ENSMUSG00000000049
##   Anatomical.entity.ID Anatomical.entity.name       Stage.ID
## 1       UBERON:0000014           zone of skin MmusDv:0000062
## 2       UBERON:0000014           zone of skin MmusDv:0000062
## 3       UBERON:0000014           zone of skin MmusDv:0000062
## 4       UBERON:0000014           zone of skin MmusDv:0000062
## 5       UBERON:0000014           zone of skin MmusDv:0000062
## 6       UBERON:0000014           zone of skin MmusDv:0000062
##                  Stage.name  Sex  Strain Read.count         TPM
## 1 2 month-old stage (mouse) male C57BL/6       5956   67.499246
## 2 2 month-old stage (mouse) male C57BL/6          0    0.000000
## 3 2 month-old stage (mouse) male C57BL/6        154    3.071474
## 4 2 month-old stage (mouse) male C57BL/6      40562 1047.442212
## 5 2 month-old stage (mouse) male C57BL/6        244    3.770360
## 6 2 month-old stage (mouse) male C57BL/6          3    0.218505
##          FPKM Detection.flag Detection.quality
## 1   69.611952        present      high quality
## 2    0.000000         absent      high quality
## 3    3.167611        present      high quality
## 4 1080.226848        present      high quality
## 5    3.888371        present      high quality
## 6    0.225344        present      high quality
##                            State.in.Bgee
## 1                         Part of a call
## 2 Result excluded, reason: pre-filtering
## 3                         Part of a call
## 4                         Part of a call
## 5                         Part of a call
## 6                         Part of a call
# isolate the first experiment
data_bgee_experiment1 <- data_bgee_mouse[[1]]

The result of the getData() function is, for each experiment, a data frame with the different samples listed in rows, one after the other. Each row is a gene and the expression levels are displayed as raw read counts or RPKMs. A detection flag indicates is the gene is significantly expressed above background level of expression.

Note 1: An additional column in the data frame including expression in the TPM unit will be available from Bgee release 14 (planned for the end of 2016).

Note 2: If microarray data are downloaded, rows correspond to probesets and log2 of expression intensities are available instead of read counts/RPKMs.

Alternatively, you can choose to download data from only one particular RNA-seq experiment from Bgee with the experimentId parameter:

# download data for GSE30617
data_bgee_mouse_gse30617 <- getData(bgee, experimentId = "GSE30617")
## 
## Downloading expression data for the experiment GSE30617 ...
## 
## Saved expression data file in /tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Mus_musculus_Bgee_14_0 folder. Now unzipping /tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Mus_musculus_Bgee_14_0/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.zip file...
## 
Read 61.7% of 1718244 rows
Read 1718244 rows and 16 (of 16) columns from 0.284 GB file in 00:00:03
## 
## Saving all data in .rds file...

Format the RNA-seq data

It is sometimes easier to work with data organized as a matrix, where rows represent genes or probesets and columns represent different samples. The formatData() function reformats the data into an ExpressionSet object including: * An expression data matrix, with genes or probesets as rows, and samples as columns (assayData slot). The stats argument allows to choose if the matrix should be filled with read counts, RPKMs (and soon TPMs) for RNA-seq data. For microarray data the matrix is filled with log2 expression intensities. * A data frame listing the samples and their anatomical structure and developmental stage annotation (phenoData slot) * For microarray data, the mapping from probesets to Ensembl genes (featureData slot)

The callType argument allows to retain only actively expressed genes or probesets, if set to “present” or “present high quality”. Genes or probesets that are absent in a given sample are given NA values.

# use only present calls and fill expression matric with RPKM values
gene.expression.mouse.rpkm <- formatData(bgee, data_bgee_mouse_gse30617, callType = "present", stats = "rpkm")
## 
## Extracting expression data matrix...
##   Keeping only present genes.
## 
## Extracting features information...
## 
## Extracting samples information...
gene.expression.mouse.rpkm 
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 47729 features, 36 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: ERX012344 ERX012345 ... ERX012379 (36 total)
##   varLabels: Library.ID Anatomical.entity.ID ... Stage.name (5
##     total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000109578 (47729 total)
##   fvarLabels: Gene.ID
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Running example: TopAnat gene expression enrichment analysis

For some documentation on the TopAnat analysis, please refer to our publications, or to the web-tool page (http://bgee.org/?page=top_anat#/).

Create a new Bgee object

Similarly to the quantitative data download example above, the first step of a topAnat analysis is to built an object from the Bgee class. For this example, we will focus on zebrafish:

# Creating new Bgee class object
bgee <- Bgee$new(species = "Danio_rerio")
## 
## NOTE: You did not specify any data type. The argument dataType will be set to c("rna_seq","affymetrix","est","in_situ") for the next steps.
## 
## Querying Bgee to get release information...
## 
## NOTE: the file describing Bgee species information for release 14_0 was found in the download directory /tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes. Data will not be redownloaded.
## 
## API key built: eb4b101199c0eeda36b735165d43f3e3bce04cac64ebb394f2e847cef08e7773870daacdf7a46df4ef7c215fc02c4d432dd5de10417ced2f0fa49433cb555af1
## 
## IMPORTANT INFORMATION: For now, only the processed data download part of the package is using data from Bgee release 14. The TopAnat part of the package is still based on data from Bgee release 13.2, and we are working on updating it to use data from Bgee release 14.

Note : We are free to specify any data type of interest using the dataType argument among rna_seq, affymetrix, est or in_situ, or even a combination of data types. If nothing is specified, as in the above example, all data types available for the targeted species are used. This equivalent to specifying dataType=c("rna_seq","affymetrix","est","in_situ").

Download the data allowing to perform TopAnat analysis

The loadTopAnatData() function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.

# Loading calls of expression
myTopAnatData <- loadTopAnatData(bgee)
## 
## Building URLs to retrieve organ relationships from Bgee.........
##    URL successfully built (http://bgee.org/?page=r_package&action=get_anat_entity_relations&display_type=tsv&species_list=7955&attr_list=SOURCE_ID&attr_list=TARGET_ID&api_key=eb4b101199c0eeda36b735165d43f3e3bce04cac64ebb394f2e847cef08e7773870daacdf7a46df4ef7c215fc02c4d432dd5de10417ced2f0fa49433cb555af1&source=BgeeDB_R_package&source_version=2.4.0)
##    Submitting URL to Bgee webservice (can be long)
##    Got results from Bgee webservice. Files are written in "/tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Danio_rerio_Bgee_14_0"
## 
## Building URLs to retrieve organ names from Bgee.................
##    URL successfully built (http://bgee.org/?page=r_package&action=get_anat_entities&display_type=tsv&species_list=7955&attr_list=ID&attr_list=NAME&api_key=eb4b101199c0eeda36b735165d43f3e3bce04cac64ebb394f2e847cef08e7773870daacdf7a46df4ef7c215fc02c4d432dd5de10417ced2f0fa49433cb555af1&source=BgeeDB_R_package&source_version=2.4.0)
##    Submitting URL to Bgee webservice (can be long)
##    Got results from Bgee webservice. Files are written in "/tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Danio_rerio_Bgee_14_0"
## 
## Building URLs to retrieve mapping of gene to organs from Bgee...
##    URL successfully built (http://bgee.org/?page=r_package&action=get_expression_calls&display_type=tsv&species_list=7955&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&api_key=eb4b101199c0eeda36b735165d43f3e3bce04cac64ebb394f2e847cef08e7773870daacdf7a46df4ef7c215fc02c4d432dd5de10417ced2f0fa49433cb555af1&source=BgeeDB_R_package&source_version=2.4.0&data_qual=SILVER)
##    Submitting URL to Bgee webservice (can be long)
##    Got results from Bgee webservice. Files are written in "/tmp/RtmpE4Myub/Rbuild79615266fb54/BgeeDB/vignettes/Danio_rerio_Bgee_14_0"
## 
## Parsing the results.............................................
## 
## Adding BGEE:0 as unique root of all terms of the ontology.......
## 
## WARNING: some organs names appear to be missing. There might be some problem with the ontology data.
## 
## Done.
# Look at the data
## str(myTopAnatData)

The strigency on the quality of expression calls can be changed with the confidence argument. Finally, if you are interested in expression data coming from a particular developmental stage or a group of stages, please specify the a Uberon stage Id in the stage argument.

## Loading only high-quality expression calls from affymetrix data made on embryonic samples only 
## This is just given as an example, but is not run in this vignette because only few data are returned
bgee <- Bgee$new(species = "Danio_rerio", dataType="affymetrix")
myTopAnatData <- loadTopAnatData(bgee, stage="UBERON:0000068", confidence="high_quality")

Note: As mentioned above, the downloaded data files are stored in a versioned folder that can be set with the pathToData argument when creating the Bgee class object (default is the working directory). If you query again Bgee with the exact same parameters, these cached files will be read instead of querying the web-service again. It is thus important, if you plan to reuse the same data for multiple parallel topAnat analyses, to plan to make use of these cached files instead of redownloading them for each analysis. The cached files also give the possibility to repeat analyses offline.

Prepare a topAnatData object allowing to perform TopAnat analysis with topGO

First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata object in the topGO package: a vector with background genes as names, and 0 or 1 values depending if a gene is in the foreground or not. In this example we will look at genes, annotated with “spermatogenesis” or children terms in the Gene Ontology. We use the biomaRt package to retrieve this list of genes. We expect them to be enriched for expression in male tissues, notably the testes. The background list of genes is set to all genes annotated to at least one Gene Ontology term, allowing to account for biases in which types of genes are more likely to receive Gene Ontology annotation.

# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("drerio_gene_ensembl", mart=ensembl)

# Foreground genes are those with GO annotation "spermatogenesis" or childrem terms 
myGenes <- getBM(attributes= "ensembl_gene_id", filters=c("go_parent_term"), values=list(c("GO:0007283")), mart=ensembl)

# Background are all genes with GO annotation
universe <- getBM(attributes= "ensembl_gene_id", filters=c("with_go"), values=list(c(TRUE)), mart=ensembl)

# Prepare the gene list vector 
geneList <- factor(as.integer(universe[,1] %in% myGenes[,1]))
names(geneList) <- universe[,1]
head(geneList)
summary(geneList == 1)

# Prepare the topGO object
myTopAnatObject <-  topAnat(myTopAnatData, geneList)

The above code using the biomaRt package is not executed in this vignette to prevent building issues of our package in case of biomaRt downtime. Instead we use a geneList object saved in the data/ folder that we built using pre-downloaded data.

load("../data/geneList.RData")
myTopAnatObject <-  topAnat(myTopAnatData, geneList)
## 
## Checking topAnatData object.............
## 
## Checking gene list......................
## 
## WARNING: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis. 21801 genes in background will be kept.
## 
## Building most specific Ontology terms...  (  1270  Ontology terms found. )
## 
## Building DAG topology...................  (  2163  Ontology terms and  4126  relations. )
## 
## Annotating nodes (Can be long)..........  (  21801  genes annotated to the Ontology terms. )

Warning: This can be long, especially if the gene list is large, since the Uberon anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.

Launch the enrichment test

For this step, see the vignette of the topGO package for more details, as you have to directly use the tests implemented in the topGO package, as shown in this example:

results <- runTest(myTopAnatObject, algorithm = 'classic', statistic = 'fisher')
## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 298 nontrivial nodes
##       parameters: 
##           test statistic: fisher

You can also choose one of the topGO decorrelation methods, for example the “weight” method, allowing to avoid redundant results induced by the structure of the ontology.

results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')

Warning: This can be long because of the size of the ontology. Consider running scripts in batch mode if you have multiple analyses to do.

Format the table of results after an enrichment test for anatomical terms

The makeTable function allows to filter and format the test results, and calculate FDR values.

# Display results sigificant at a 10% FDR threshold
makeTable(myTopAnatData, myTopAnatObject, results, cutoff = 0.1)
## 
## Building the results table for the 5 significant terms at FDR threshold of 0.1...
## Ordering results by pValue column in increasing order...
## Done
##                       organId                organName annotated
## CL:0000039         CL:0000039           germ line cell        34
## UBERON:0000079 UBERON:0000079 male reproductive system     16563
## UBERON:0000473 UBERON:0000473                   testis     16563
## UBERON:0003135 UBERON:0003135  male reproductive organ     16563
## UBERON:0003101 UBERON:0003101            male organism     16564
##                significant expected foldEnrichment       pValue        FDR
## CL:0000039               3     0.06      50.000000 2.816424e-05 0.03526163
## UBERON:0000079          37    28.87       1.281607 3.768462e-04 0.09455731
## UBERON:0000473          37    28.87       1.281607 3.768462e-04 0.09455731
## UBERON:0003135          37    28.87       1.281607 3.768462e-04 0.09455731
## UBERON:0003101          37    28.87       1.281607 3.776251e-04 0.09455731

At the time of building this vignette (April 2017), the significant terms were all related to male reproductive system (testes), which makes biological sense: there is an expression bias for testis of genes involved in spermatogenesis.

By default results are sorted by p-value, but this can be changed with the ordering parameter by specifying which column should be used to order the results (preceded by a “-” sign to indicate that ordering should be made in decreasing order). For example, it is often convenient to sort significant structures by decreasing enrichment fold, using ordering = -6. The full table of results can be obtained using cutoff = 1.

Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO package.