22-26 July | CSAMA 2019

Description

The Bioconductor project provides various annotation packages that can be used to add information to results from high-throughput experiments. This can range from something as simple as mapping Ensembl IDs to the corresponding HUGO gene symbols to much more complex queries involving multiple data sources. We will briefly cover the various classes of annotation packages, what they contain, and how to use them efficiently.

Task

  1. Start with a set of identifiers that are measured

  2. Map to new identifiers

Why:

  • more familiar to collaborators
  • can be used for further analyses.

As an example, RNA-Seq data may only have Entrez Gene IDs for each gene measured, and as part of the output you may want to include the gene symbols, which are more likely to be familiar to a biologist.

What do we mean by annotation?

Map a known ID to other functional or positional information

Annotation sources

Package type     Example
OrgDb            org.Hs.eg.db
TxDb/EnsDb       TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75
OrganismDb       Homo.sapiens
BSgenome         BSgenome.Hsapiens.UCSC.hg19
Others           GO.db
AnnotationHub    Online resource
biomaRt          Online resource
ChipDb           hugene20sttranscriptcluster.db

Interacting with AnnoDb packages

The main function is select:

AnnotationDbi::select(annopkg, keys, columns, keytype)

Where

  • annopkg is the annotation package

  • keys are the IDs that we know

  • columns are the values we want

  • keytype is the type of key used
    • if the keytype is the central key, it can remain unspecified

help: ?AnnotationDbi::select
other useful functions: columns, keytypes, mapIds
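
As a minimal sketch of the argument list above, the query below starts from Entrez Gene IDs, which are the central key of org.Hs.eg.db, so keytype can be left out. The variable names (entrez_ids, res, res_explicit) are illustrative only, and the example assumes org.Hs.eg.db is installed.

```r
library(org.Hs.eg.db)

# Entrez Gene IDs for BRCA1 and BRCA2 (illustrative input)
entrez_ids <- c("672", "675")

# keytype is omitted, so the central key (ENTREZID) is assumed
res <- select(org.Hs.eg.db, keys = entrez_ids, columns = "SYMBOL")

# The same query with the keytype spelled out
res_explicit <- select(org.Hs.eg.db, keys = entrez_ids,
                       columns = "SYMBOL", keytype = "ENTREZID")
res
```

Naming the arguments is optional, but it makes it harder to swap columns and keytype by accident.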

Simple Example

The data in the airway package is a RangedSummarizedExperiment constructed from an RNA-Seq experiment. Let's map the Ensembl gene identifiers to gene symbols.

library(airway)
library(org.Hs.eg.db)
data(airway)
ids = head(rownames(airway))
ids
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
select(org.Hs.eg.db, ids, "SYMBOL", "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
##           ENSEMBL   SYMBOL
## 1 ENSG00000000003   TSPAN6
## 2 ENSG00000000005     TNMD
## 3 ENSG00000000419     DPM1
## 4 ENSG00000000457    SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938      FGR
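
When only one column is needed, mapIds is a possible alternative to select: it returns a named vector with one entry per input key, which is convenient for adding a column to a results table. The sketch below assumes the airway and org.Hs.eg.db packages are installed; the data frame name annotated is illustrative.

```r
library(airway)
library(org.Hs.eg.db)
data(airway)

ids <- head(rownames(airway))

# mapIds returns a character vector named by the input keys;
# with the default multiVals = "first", each key gets at most one value
symbols <- mapIds(org.Hs.eg.db, keys = ids,
                  column = "SYMBOL", keytype = "ENSEMBL")

# Index by the input IDs so the new column lines up with them
annotated <- data.frame(ENSEMBL = ids,
                        SYMBOL = unname(symbols[ids]),
                        stringsAsFactors = FALSE)
annotated
```

Keys without a match come back as NA rather than being dropped, so the result keeps the same length as the input.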

Questions!

How do you know what the central keys are?

  • If it's a ChipDb, the central keys are the manufacturer's probe IDs

  • It's sometimes in the name - org.Hs.eg.db, where 'eg' means Entrez Gene ID

  • You can see examples using e.g., head(keys(annopkg)), and infer from that

  • But note that it's never necessary to know the central key, as long as you specify the keytype
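
The inspection suggested above can be sketched as follows for org.Hs.eg.db. Calling keys without a keytype lists the central keys, and passing a keytype lists the keys of that type instead. The variable names are illustrative.

```r
library(org.Hs.eg.db)

# Without a keytype, keys() returns the central keys
central <- head(keys(org.Hs.eg.db))
central

# With a keytype, keys() returns the identifiers of that type
symbol_keys <- head(keys(org.Hs.eg.db, keytype = "SYMBOL"))
symbol_keys
```

For this package the central keys are strings of digits, which is consistent with Entrez Gene IDs.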

More questions!

What keytypes or columns are available for a given annotation package?

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
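
The available columns can differ between annotation packages and between releases, so it may help to check a requested set against columns() before calling select. This is a small sketch; the wanted vector is hypothetical and deliberately includes a name, NOT_A_COLUMN, that no package provides.

```r
library(org.Hs.eg.db)

# Hypothetical request, including one column that does not exist
wanted <- c("SYMBOL", "GENENAME", "NOT_A_COLUMN")

# Columns the package cannot supply
missing_cols <- setdiff(wanted, columns(org.Hs.eg.db))
missing_cols

# Query only the columns that are actually present
usable <- intersect(wanted, columns(org.Hs.eg.db))
select(org.Hs.eg.db, keys = "672", columns = usable)
```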

Another example

There is one issue with select, however: a key that maps to more than one value is returned as multiple rows.

brca <- c("BRCA1", "BRCA2")
select(org.Hs.eg.db, brca, c("MAP", "ONTOLOGY"), "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
##   SYMBOL      MAP ONTOLOGY
## 1  BRCA1 17q21.31       BP
## 2  BRCA1 17q21.31       CC
## 3  BRCA1 17q21.31       MF
## 4  BRCA2  13q13.1       BP
## 5  BRCA2  13q13.1       CC
## 6  BRCA2  13q13.1       MF
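
One possible way to keep a single entry per key in the 1:many case is mapIds with its multiVals argument. The sketch below asks for a list, so each symbol keeps all of its ontology values, and then collapses each element into one string. The names onto_list and onto_collapsed are illustrative, and the semicolon separator is an arbitrary choice.

```r
library(org.Hs.eg.db)

brca <- c("BRCA1", "BRCA2")

# multiVals = "list" returns a list named by key,
# with every matching value kept for each key
onto_list <- mapIds(org.Hs.eg.db, keys = brca,
                    column = "ONTOLOGY", keytype = "SYMBOL",
                    multiVals = "list")

# Collapse each element to a single string, one per gene
onto_collapsed <- vapply(onto_list, paste, character(1), collapse = ";")
onto_collapsed
```

Other documented choices for multiVals include "first" (the default), "filter", "asNA", and "CharacterList"; which one fits depends on whether the extra values matter for the analysis.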