Using Bioconductor for Annotation
Bioconductor has extensive facilities for mapping between microarray
probe, gene, pathway, gene ontology, homology and other annotations.
Bioconductor has built-in representations of GO, KEGG, vendor, and
other annotations, and can easily access NCBI, Biomart, UCSC, and
other sources.
The following pseudo-code illustrates a typical R / Bioconductor
session. It continues the differential expression workflow,
taking a 'top table' of differentially expressed probesets and
discovering the genes probed, and the Gene Ontology pathways to which
they belong.
## Affymetrix U133 2.0 array IDs of interest; these might be
## obtained from
## tbl <- topTable(efit, coef=2)
## ids <- tbl[["ID"]]
## as part of a more extensive workflow.
> ids <- c("39730_at", "1635_at", "1674_at", "40504_at", "40202_at")
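## load libraries as sources of annotation
> library(hgu95av2.db)
> library(GO.db)
## To list the kinds of things that can be retrieved, use the columns
## method (named cols in older AnnotationDbi releases).
> columns(hgu95av2.db)
## To list the kinds of things that can be used as keys
## use the keytypes method
> keytypes(hgu95av2.db)
## To extract viable keys of a particular kind, use the keys method.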
> head(keys(hgu95av2.db, keytype="ENTREZID"))
## the select method allows you to map probe ids to ENTREZ gene ids...
> select(hgu95av2.db, ids, "ENTREZID", "PROBEID")
PROBEID ENTREZID
1 39730_at 25
2 1635_at 25
3 1674_at 7525
4 40504_at 5445
5 40202_at 687
## ... and to GENENAME etc.
> select(hgu95av2.db, ids, c("ENTREZID","GENENAME"), "PROBEID")
PROBEID ENTREZID GENENAME
1 39730_at 25 c-abl oncogene 1, non-receptor tyrosine kinase
2 1635_at 25 c-abl oncogene 1, non-receptor tyrosine kinase
3 1674_at 7525 v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1
4 40504_at 5445 paraoxonase 2
5 40202_at 687 Kruppel-like factor 9
## find and extract the GO ids associated with the first id
> res <- select(hgu95av2.db, ids[1], "GO", "PROBEID")
> head(res)
PROBEID GO EVIDENCE ONTOLOGY
1 39730_at GO:0000115 TAS BP
2 39730_at GO:0000287 IDA MF
3 39730_at GO:0003677 NAS MF
4 39730_at GO:0003785 TAS MF
5 39730_at GO:0004515 TAS MF
6 39730_at GO:0004713 IDA MF
## use GO.db to find the Terms associated with those GOIDs
> head(select(GO.db, res$GO, "TERM", "GOID"))
GOID TERM
1 GO:0000115 regulation of transcription involved in S phase of mitotic cell cycle
2 GO:0000287 magnesium ion binding
3 GO:0003677 DNA binding
4 GO:0003785 actin monomer binding
5 GO:0004515 nicotinate-nucleotide adenylyltransferase activity
6 GO:0004713 protein tyrosine kinase activity
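The two lookups in the session (probe to GO id, then GO id to term) can be combined into a single table with base R's merge(). The sketch below is illustrative only: so that it runs without the annotation packages installed, the two data frames are typed in by hand from the output printed in the session. In a real session they would be the values returned by the two select() calls. The names terms and annotated are introduced here for illustration.

```r
## Rows copied by hand from the probe-to-GO table printed in the session
res <- data.frame(
  PROBEID  = rep("39730_at", 6),
  GO       = c("GO:0000115", "GO:0000287", "GO:0003677",
               "GO:0003785", "GO:0004515", "GO:0004713"),
  EVIDENCE = c("TAS", "IDA", "NAS", "TAS", "TAS", "IDA"),
  ONTOLOGY = c("BP", "MF", "MF", "MF", "MF", "MF"),
  stringsAsFactors = FALSE
)

## Rows copied by hand from the GO.db term table printed in the session
terms <- data.frame(
  GOID = res$GO,
  TERM = c("regulation of transcription involved in S phase of mitotic cell cycle",
           "magnesium ion binding",
           "DNA binding",
           "actin monomer binding",
           "nicotinate-nucleotide adenylyltransferase activity",
           "protein tyrosine kinase activity"),
  stringsAsFactors = FALSE
)

## Join on the GO identifier; all.x = TRUE keeps rows of res even when
## no matching term is found
annotated <- merge(res, terms, by.x = "GO", by.y = "GOID", all.x = TRUE)
annotated[, c("PROBEID", "GO", "ONTOLOGY", "TERM")]
```

Joining on the identifier, rather than binding columns by position, stays correct when one GO id appears for several probes or when the two results come back in different orders.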
Installation and Use
Follow installation instructions to start using these
packages. To install the annotations associated with the Affymetrix
Human Genome U95 V 2.0, and with Gene Ontology, use
> biocLite(c("hgu95av2.db", "GO.db"))
Package installation is required only once per R installation. View a
full list of available packages.
To use the GO.db package, evaluate the command
> library(GO.db)
This command is required once in each R session.
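Before loading the packages it can help to check which of them are already installed. The following is a minimal sketch using only base R; the variable names are illustrative. Note also that more recent Bioconductor releases install packages with BiocManager::install() rather than biocLite(), so the installation call may differ depending on your Bioconductor version.

```r
## Packages this page installs; edit to suit your own workflow
pkgs <- c("hgu95av2.db", "GO.db")

## requireNamespace() returns FALSE, rather than signalling an error,
## when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All annotation packages are available")
}
```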
Exploring Package Content
Packages have extensive help pages, and include vignettes highlighting
common use cases. The help pages and vignettes are available from
within R. After loading a package, use syntax like
> help(package="GO.db")
to obtain an overview of help on the GO.db package. The
AnnotationDbi package is used by most annotation packages. View the
vignettes in the AnnotationDbi package, which provide a more
comprehensive introduction to package functionality, with
> browseVignettes(package="AnnotationDbi")
Use
> help.start()
to open a web page containing comprehensive help resources.
The following guides the user through key annotation packages. Users
interested in how to create custom chip packages should see the
vignettes in the AnnotationForge package. There is additional
information in the GenomicFeatures package on how to use some of the
extra tools provided. You can also refer to the complete list of
annotation packages.
Almost all annotations require the AnnotationDbi package. This
package will be automatically installed for you if you install
another ".db" annotation package using biocLite(). It contains the
code that allows annotation mapping objects to be made and
manipulated, as well as the code behind the select methods.
OrganismDbi allows meta packages that enable the user to access
resources from several different packages as if they were coming
from one place. So for example the Homo.sapiens package is enabled
by OrganismDbi and allows the user to get access to GO.db, the
associated organism package IDs and the related transcript data for
the hg19 build of the human genome all as if it were contained in a
single convenient object.
GenomicFeatures provides TranscriptDb objects, which allow
convenient representation of ranges from transcriptomes.
There are accessors for things like exons, transcripts as well as
the select method for retrieving data from packages supported by
AnnotationForge documents and assists in the creation of some kinds
of custom annotation packages.
Category is the base-level package for dealing with annotation questions
that involve categorical data.
This builds on what is found in Category so that you can do
hypergeometric testing using the Gene Ontology found in the GO.db
package.
This package contains many helpful tools for making use of
biomaRt is a great way to pull annotation data directly from
web-based annotation resources. Because such data is extremely
"current" and can change between queries, it is a good idea to save
and locally manage the data that you pull down from biomaRt so that
your code will be reproducible.
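One way to save and locally manage downloaded annotation is to wrap the query in a small caching helper. The sketch below uses only base R (saveRDS() and readRDS()). The helper cached_fetch(), the stand-in fake_query(), and the file name are hypothetical names introduced for illustration; in practice the fetch argument would be a function that performs your own biomaRt query.

```r
## Return the cached copy if one exists; otherwise run `fetch`,
## save its result to `path`, and return it
cached_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

## Stand-in for a web query (hypothetical data, for illustration only)
fake_query <- function() {
  data.frame(id = c("a", "b"), value = c(1, 2), stringsAsFactors = FALSE)
}

cache_file <- file.path(tempdir(), "annotation_snapshot.rds")
first  <- cached_fetch(cache_file, fake_query)  # uses the cache if present, else queries and saves
second <- cached_fetch(cache_file, fake_query)  # read back from disk
all.equal(first, second)
```

Storing the snapshot next to the analysis, ideally with the retrieval date in the file name, lets later runs reproduce earlier results even after the web resource has been updated.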
Types of Annotation Packages
- Organism annotation packages contain all the gene-based data for an entire
organism. All organism packages are named like this:
org."Xx"."yy".db, where "Xx" is the abbreviation for genus and
species and "yy" is the source of the central ID that is used to
tie all the data together. Some examples are
org.Hs.eg.db, which is for Homo sapiens and is based upon Entrez Gene IDs, and
org.At.tair.db, which is for Arabidopsis thaliana and is based on TAIR IDs.
- TranscriptDb packages contain range and chromosome information for
specific transcriptomes. These are based on a particular genome
build and are the place to look for gene, transcript, and exon
coordinates relative to a genome. These are also named in a way
that tells you where the data came from, and can be generated with
functions contained in the GenomicFeatures package.
- OrganismDb packages are named for the species they represent (such
as the Homo.sapiens package). These packages contain references to
other key annotations packages and can thus represent all the
underlying data as if it were coming from one place. So for
example, the Homo.sapiens package can allow you to retrieve data
about the ranges of a gene's transcripts at the same time that you
extract its gene name, because it represents both the
transcriptome and the relevant org package for Homo sapiens. These
can be generated using functions in the OrganismDbi package if you
have specific packages that you want to link together.
- There are also packages for questions about general systems biology
data. Some examples of this are
KEGG.db for accessing data that pertains to the Kyoto Encyclopedia of
Genes and Genomes, GO.db for accessing data that pertains to the Gene
Ontology, and PFAM.db for accessing data that pertains to different
protein family identifiers and how they relate to each other.
- Chip annotation packages are for accessing only the data from one
specific platform at a time. These packages are named like this:
"platformName".db, where "platformName" is the name of the chip
platform that these packages refer to. An example would be
hgu95av2.db, which is for the hgu95av2 platform from Affymetrix.
- Inparanoid homology packages are for accessing inparanoid
data. They are named hom."Xx".inp.db, where "Xx" is the abbreviation
for genus and species. An example is
hom.Hs.inp.db, which contains inparanoid-based mapping data between
genes for Homo sapiens and 35 other organisms.
- .db0 packages are for making custom platform-specific
packages. These packages are named like this: "name".db0, where
"name" is the name of the organism being represented. A list of the
available .db0 packages can be obtained by calling
available.db0Pkgs(). There is one of these for each supported
organism. An example would be human.db0. Users
should not need these installed unless they plan to make custom chip
packages according to the guidelines in the SQLForge vignette that is
included with the
AnnotationDbi package. These packages must be
upgraded before you attempt to update your custom chip packages, as
they contain the source databases needed by the SQLForge code.
- For relevant Affymetrix platforms you may also want the cdf and
probe packages for that platform. These packages are named using
the following convention: "platformName"cdf and
"platformName"probe, where "platformName" is the name of the chip
platform that these packages refer to.
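The naming conventions in this list are regular enough to recognise programmatically. The following is a rough sketch in base R, not part of any Bioconductor package: the function classify_annotation_pkg() and its category labels are hypothetical, and the example names are simply built from the conventions described above. A plain ".db" suffix cannot distinguish chip packages from systems biology packages such as GO.db, so both fall into the same category here.

```r
## Guess the kind of annotation package from its name alone
classify_annotation_pkg <- function(name) {
  if (grepl("^org\\.[A-Za-z]+\\.[A-Za-z]+\\.db$", name)) {
    "organism"
  } else if (grepl("^hom\\.[A-Za-z]+\\.inp\\.db$", name)) {
    "inparanoid homology"
  } else if (grepl("\\.db0$", name)) {
    "db0"
  } else if (grepl("cdf$", name)) {
    "cdf"
  } else if (grepl("probe$", name)) {
    "probe"
  } else if (grepl("\\.db$", name)) {
    "chip or other .db"
  } else {
    "other"
  }
}

## Example names that follow the conventions described in the list
pkgs <- c("org.Hs.eg.db", "hom.Hs.inp.db", "human.db0",
          "hgu95av2.db", "hgu95av2cdf", "hgu95av2probe", "Homo.sapiens")
kinds <- vapply(pkgs, classify_annotation_pkg, character(1))
kinds
```

The more specific patterns are tested first, so that an organism package such as org.Hs.eg.db is not caught by the generic ".db" rule.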