Introduction to Bioconductor Annotation Resources

22-26 July | CSAMA 2019

Description

There are various annotation packages provided by the Bioconductor project that can be used to incorporate additional information to results from high-throughput experiments. This can be as simple as mapping Ensembl IDs to corresponding HUGO gene symbols, to much more complex queries involving multiple data sources. We will briefly cover the various classes of annotation packages, what they contain, and how to use them efficiently.

Task

Start with set of identifers that are measured
Map to new identifiers.

Why:

more familiar to collaborators
can be used for further analyses.

As an example, RNA-Seq data may only have Entrez Gene IDs for each gene measured, and as part of the output you may want to include the gene symbols, which are more likely to be familiar to a Biologist.

What do we mean by annotation?

Map a known ID to other functional or positional information

Annotation sources

Package type	Example
OrgDb	org.Hs.eg.db
TxDb/EnsDb	TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75
OrganismDb	Homo.sapiens
BSgenome	BSgenome.Hsapiens.UCSC.hg19
Others	GO.db
AnnotationHub	Online resource
biomaRt	Online resource
ChipDb	hugene20sttranscriptcluster.db

Interacting with AnnoDb packages

The main function is select:

AnnotationDbi::select(annopkg, keys, columns, keytype)

Where

annopkg is the annotation package
keys are the IDs that we know
columns are the values we want
keytype is the type of key used
- if the keytype is the central key, it can remain unspecified

help: ?AnnotationDbi::select
other useful functions: columns, keytypes, mapIds

Simple Example

The data in the airway package is a RangedSummarizedExperiment constructed from an RNA-Seq experiment. Let map the ensembl gene identifiers to gene symbol.

library(airway)
library(org.Hs.eg.db)
data(airway)
ids = head(rownames(airway))
ids

## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"

select(org.Hs.eg.db, ids, "SYMBOL", "ENSEMBL")

## 'select()' returned 1:1 mapping between keys and columns

##           ENSEMBL   SYMBOL
## 1 ENSG00000000003   TSPAN6
## 2 ENSG00000000005     TNMD
## 3 ENSG00000000419     DPM1
## 4 ENSG00000000457    SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938      FGR

Questions!

How do you know what the central keys are?

If it's a ChipDb, the central key are the manufacturer's probe IDs
It's sometimes in the name - org.Hs.eg.db, where 'eg' means Entrez Gene ID
You can see examples using e.g., head(keys(annopkg)), and infer from that
But note that it's never necessary to know the central key, as long as you specify the keytype

Another example

There is one issue with select however.

brca <- c("BRCA1", "BRCA2")
select(org.Hs.eg.db, brca, c("MAP", "ONTOLOGY"), "SYMBOL")

## 'select()' returned 1:many mapping between keys and columns

##   SYMBOL      MAP ONTOLOGY
## 1  BRCA1 17q21.31       BP
## 2  BRCA1 17q21.31       CC
## 3  BRCA1 17q21.31       MF
## 4  BRCA2  13q13.1       BP
## 5  BRCA2  13q13.1       CC
## 6  BRCA2  13q13.1       MF

The `mapIds` function

An alternative to select is mapIds, which gives control of duplicates

Same arguments as select with slight differences

The columns argument can only specify one column

The keytype argument must be specified

An additional argument, multiVals used to control duplicates

mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL")

## 'select()' returned 1:many mapping between keys and columns

## BRCA1 BRCA2 
##  "BP"  "BP"

Choices for multiVals

Default is first, where we just choose the first of the duplicates. Other choices are list, CharacterList, filter, asNA or a user-specified function.

mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL", multiVals = "list")

## 'select()' returned 1:many mapping between keys and columns

## $BRCA1
## [1] "BP" "CC" "MF"
## 
## $BRCA2
## [1] "BP" "CC" "MF"

mapIds(org.Hs.eg.db, brca, "ONTOLOGY", "SYMBOL", multiVals = "CharacterList")

## 'select()' returned 1:many mapping between keys and columns

## CharacterList of length 2
## [["BRCA1"]] BP CC MF
## [["BRCA2"]] BP CC MF

What about positional annotation?

TxDb packages

TxDb packages contain positional information; the contents can be inferred by the package name

TxDb.Species.Source.Build.Table

TxDb.Hsapiens.UCSC.hg19.knownGene
- Homo sapiens
- UCSC genome browser
- hg19 (their version of GRCh37)
- knownGene table

TxDb.Dmelanogaster.UCSC.dm3.ensGene TxDb.Athaliana.BioMart.plantsmart22

Transcript packages

As with ChipDb and OrgDb packages, select and mapIds can be used to make queries

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"

select(TxDb.Hsapiens.UCSC.hg19.knownGene, c("1","10"),
       c("TXNAME","TXCHROM","TXSTART","TXEND"), "GENEID")

## 'select()' returned 1:many mapping between keys and columns

##   GENEID     TXNAME TXCHROM  TXSTART    TXEND
## 1      1 uc002qsd.4   chr19 58858172 58864865
## 2      1 uc002qsf.2   chr19 58859832 58874214
## 3     10 uc003wyw.1    chr8 18248755 18258723

But using `select` and `mapIds` are not how one normally uses TxDb objects…

giphy.com

GRanges

The normal use case for transcript packages is to extract positional information into a GRanges or GRangesList object.

Positional information can be extracted for transcripts(), genes(), coding sequences cds(), promoters() and exons().

An example is the genomic position of all genes:

gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
gns

## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames              ranges strand |     gene_id
##            <Rle>           <IRanges>  <Rle> | <character>
##       1    chr19   58858172-58874214      - |           1
##      10     chr8   18248755-18258723      + |          10
##     100    chr20   43248163-43280376      - |         100
##    1000    chr18   25530930-25757445      - |        1000
##   10000     chr1 243651535-244006886      - |       10000
##     ...      ...                 ...    ... .         ...
##    9991     chr9 114979995-115095944      - |        9991
##    9992    chr21   35736323-35743440      + |        9992
##    9993    chr22   19023795-19109967      - |        9993
##    9994     chr6   90539619-90584155      + |        9994
##    9997    chr22   50961997-50964905      - |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

GRangesList

Or the genomic position of all transcripts by gene:

txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
txs

## GRangesList object of length 23459:
## $`1`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58858172-58864865      - |     70455  uc002qsd.4
##   [2]    chr19 58859832-58874214      - |     70456  uc002qsf.2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`10`
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr8 18248755-18258723      + |     31944  uc003wyw.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`100`
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr20 43248163-43280376      - |     72132  uc002xmj.3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## ...
## <23456 more elements>

GRangesList

These by functions group based on another type of genomic features
help: ?GenomicFeatures::cdsBy

cds <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx", use.names=TRUE)
head(names(cds))

## [1] "uc010nxq.1" "uc001aal.1" "uc009vjk.2" "uc001aau.3" "uc021oeh.1"
## [6] "uc021oei.1"

cds["uc009vjk.2"]

## GRangesList object of length 1:
## $uc009vjk.2
## GRanges object with 2 ranges and 3 metadata columns:
##       seqnames        ranges strand |    cds_id    cds_name exon_rank
##          <Rle>     <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 324343-324345      + |         5        <NA>         2
##   [2]     chr1 324439-325605      + |         6        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Why *Ranges objects

The main rationale for *Ranges objects is to allow us to easily select and subset data based on genomic position information. This is really powerful!

GRanges and GRangesLists act like vectors and lists, and can be subsetted using the [ or [[ function. As a really artificial example:

txs[txs %over% gns[1]]

## GRangesList object of length 2:
## $`1`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58858172-58864865      - |     70455  uc002qsd.4
##   [2]    chr19 58859832-58874214      - |     70456  uc002qsf.2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`162968`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58865723-58874214      - |     70457  uc002qsh.2
##   [2]    chr19 58865723-58874214      - |     70458  uc002qsi.2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Another Example

Earlier we got the coding sequence regions by transcripts using cdsBy. This can be used in conjunction with the appropriate BSgenome package to extract the transcript sequences

# Lets do this for the first two transcripts
library(BSgenome.Hsapiens.UCSC.hg19)
seq_cds <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cds[1:2])
seq_cds

##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]   402 ATGAGTGAGAGCATCAACTTC...GACCCAGGCACAGGCATTAG uc010nxq.1
## [2]   918 ATGGTGACTGAATTCATTTTT...ATTCTAGTGTAAAGTTTTAG uc001aal.1

We can then translate the sequences using Biostrings::translate

translate(seq_cds)

##   A AAStringSet instance of length 2
##     width seq                                          names               
## [1]   134 MSESINFSHNLGQLLSPPRCV...GETQESVESRVLPGPRHRH* uc010nxq.1
## [2]   306 MVTEFIFLGLSDSQELQTFLF...MKTAIRQLRKWDAHSSVKF* uc001aal.1

It's so complicated? What if you want both?!

Organism.dplyr package

Combines the data from TxDb, Org.Db, GO.db associated packages into local database.
Allows functions from both org.* and TxDb.*
- keytypes(), select(), …
- exons(), promoters(), …
Allows for filtering and display of combined TxDb and Org.Db information through dplyr functions.

library(Organism.dplyr)

# src = src_organism("TxDb.Hsapiens.UCSC.hg19.knownGene")
# ?src_organism
src <- src_organism(dbpath = hg38light())
src

## src:  sqlite 3.22.0 [/home/lori/R/x86_64-pc-linux-gnu-library/3.6-BioC-3.10/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

Organism.dplyr

Get promoters from a TxDb object (we use a small version)

prm = promoters(src)
head(prm, 3)

## GRanges object with 3 ranges and 2 metadata columns:
##                     seqnames              ranges strand |     tx_id
##                        <Rle>           <IRanges>  <Rle> | <integer>
##   ENST00000336199.9     chr1 243843037-243845236      - |     18165
##   ENST00000366540.5     chr1 243843385-243845584      - |     18166
##   ENST00000263826.9     chr1 243843083-243845282      - |     18167
##                               tx_name
##                           <character>
##   ENST00000336199.9 ENST00000336199.9
##   ENST00000366540.5 ENST00000366540.5
##   ENST00000263826.9 ENST00000263826.9
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

length(prm)

## [1] 77

Organism.dplyr

Extract a table from the underlying database

tbl(src, "id")

## # Source:   table<id> [?? x 6]
## # Database: sqlite 3.22.0 []
##    entrez map      ensembl         symbol genename               alias   
##    <chr>  <chr>    <chr>           <chr>  <chr>                  <chr>   
##  1 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein A1B     
##  2 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein ABG     
##  3 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein GAB     
##  4 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein HYST2477
##  5 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein A1BG    
##  6 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  AAC2    
##  7 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  NAT-2   
##  8 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  PNAT    
##  9 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  NAT2    
## 10 100    20q13.12 ENSG00000196839 ADA    adenosine deaminase    ADA     
## # … with more rows

Organism.dplyr

Make a complex query between tables in the underlying database

inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
            dplyr::filter(symbol %in% c("ADA", "NAT2")) %>%
            dplyr::select(gene_chrom, gene_start, gene_end,
            gene_strand, symbol, alias, map)

## Joining, by = "entrez"

## # Source:   lazy query [?? x 7]
## # Database: sqlite 3.22.0 []
##   gene_chrom gene_start gene_end gene_strand symbol alias map     
##   <chr>           <int>    <int> <chr>       <chr>  <chr> <chr>   
## 1 chr8         18391245 18401218 +           NAT2   AAC2  8p22    
## 2 chr8         18391245 18401218 +           NAT2   NAT-2 8p22    
## 3 chr8         18391245 18401218 +           NAT2   PNAT  8p22    
## 4 chr8         18391245 18401218 +           NAT2   NAT2  8p22    
## 5 chr20        44619522 44652233 -           ADA    ADA   20q13.12

What about sequences?

BSgenome packages

BSgenome packages contain sequence information for a given species/build. There are many such packages - you can get a listing using available.genomes or BiocManager::available

library(BSgenome)
head(BiocManager::available("BSgenome"))

## [1] "BSgenome"                               
## [2] "BSgenome.Alyrata.JGI.v1"                
## [3] "BSgenome.Amellifera.BeeBase.assembly4"  
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"

BSgenome packages

We can load and inspect a BSgenome package

library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens

## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Deja vu?

giphy.com

BSgenome packages

The main accessor is getSeq, and you can get data by sequence (e.g., entire chromosome or unplaced scaffold), or by passing in a GRanges object, to get just a region.

getSeq(Hsapiens, "chr1")

##   249250621-letter "DNAString" instance
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

getSeq(Hsapiens, gns["5467",])

##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1] 85634 GCGGAGCGTGTGACGCTGCGG...TATTTAAGAGCTGACTGGAA 5467

The Biostrings package contains most of the code for dealing with these *StringSet objects - please see the Biostrings vignettes and help pages for more information.

Web based resources

giphy.com

AnnotationHub

AnnotationHub is a package that allows us to query and download many different annotation objects, without having to explicitly install them.

We will talk more about AnnotationHub when we talk about BiocFileCache and Bioconductor Hubs in a different lecture.

biomaRt

The biomaRt package allows queries to an Ensembl Biomart server. We can see the choices of servers that we can use:

library(biomaRt)
listMarts()

##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 97
## 2   ENSEMBL_MART_MOUSE      Mouse strains 97
## 3     ENSEMBL_MART_SNP  Ensembl Variation 97
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 97

biomaRt data sets

And we can then check for the available data sets on a particular server.

mart <- useMart("ENSEMBL_MART_ENSEMBL")
head(listDatasets(mart))

##                        dataset                           description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.2)
## 3   acarolinensis_gene_ensembl        Anole lizard genes (AnoCar2.0)
## 4    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
## 5        ahaastii_gene_ensembl    Great spotted kiwi genes (aptHaa1)
## 6    amelanoleuca_gene_ensembl                 Panda genes (ailMel1)
##       version
## 1 ASM259213v1
## 2  fAstCal1.2
## 3   AnoCar2.0
## 4    Midas_v5
## 5     aptHaa1
## 6     ailMel1

biomaRt queries

After setting up a mart object pointing to the server and data set that we care about, we can make queries. We first set up the mart object.

mart <- useMart("ENSEMBL_MART_ENSEMBL","hsapiens_gene_ensembl")

Queries are of the form

getBM(attributes, filters, values, mart)

where

attributes are the things we want
filters are the types of IDs we have
values are the IDs we have
mart is the mart object we set up

biomaRt attributes and filters

Both attributes and filters have rather inscrutable names, but a listing can be accessed using

atrib <- listAttributes(mart)
filts <- listFilters(mart)
head(atrib)

##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

head(filts)

##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

biomaRt query

A simple example query

afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at")
getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart)

## Cache found

##   affy_hg_u95av2 hgnc_symbol
## 1      1007_s_at        DDR1
## 2        1001_at        TIE1
## 3        1000_at       MAPK3
## 4      1002_f_at     CYP2C19
## 5      1002_f_at

biomaRt

Current maintainer: Mike Smith

Other:

Johannes Rainer
ensembldb/EnsDb

Description

Task

What do we mean by annotation?

Annotation sources

Interacting with AnnoDb packages

Simple Example

Questions!

More questions!

Another example

The mapIds function

Choices for multiVals

What about positional annotation?

TxDb packages

Transcript packages

But using select and mapIds are not how one normally uses TxDb objects…

GRanges

GRangesList

GRangesList

Why *Ranges objects

Another Example

It's so complicated? What if you want both?!

Organism.dplyr package

Organism.dplyr

Organism.dplyr

Organism.dplyr

What about sequences?

BSgenome packages

BSgenome packages

Deja vu?

BSgenome packages

Web based resources

AnnotationHub

biomaRt

biomaRt data sets

biomaRt queries

biomaRt attributes and filters

biomaRt query

biomaRt

Other:

The `mapIds` function

But using `select` and `mapIds` are not how one normally uses TxDb objects…