19-24 June | CSAMA 2022

What is annotation data?

What is annotation data?

What is annotation data?

  • Annotation data: usually external data that is used to make sense of measured entities.

Typical annotation use cases

  • get HUGO gene symbols for gene identifiers (e.g. Entrezgene, Ensembl…).
  • map positions on the genome to transcripts (genes).
  • map genes to biological pathways.
  • assign peptide sequences to proteins.
  • identify metabolites based on fragment spectra.
  • Annotation: map one type of ID to another type of ID.

Annotation resources

Annotation resources in Bioconductor

  • Annotation packages:
    • org.Hs.eg.db
    • BSgenome.Hsapiens.UCSC.hg19
    • TxDb.Hsapiens.UCSC.hg19.knownGene
    • GO.db
  • AnnotationHub: query and retrieve annotations, cache them locally.
  • Online resources: biomaRt, KEGGREST.

Annotation packages in Bioconductor

  • org.Hs.eg.db: package with various annotations for homo sapiens.
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
org.Hs.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2022-Mar17
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2022-03-10
## | GOEGSOURCEDATE: 2022-Mar17
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: 
## | GPSOURCEDATE: 2022-Nov23
## | ENSOURCEDATE: 2021-Dec21
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Fri Apr  1 14:42:16 2022

How can we get access to the data in a Bioconductor annotation resource?

A common annotation interface: AnnotationDbi

The main function is select:

AnnotationDbi::select(anno, keys, columns, keytype)

Where

  • anno is the annotation package (or object)
  • keys are the IDs that we have
  • columns are the values we want
  • keytype is the type of key used

Simple Example

  • Task: annotate gene identifiers to HUGO symbols.
  • The airway package provides a RangedSummarizedExperiment with results from an RNA-Seq experiment.
  • Genes are identified by their Ensembl IDs.
library(airway)
data(airway)
ids <- head(rownames(airway))
ids
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
select(org.Hs.eg.db, ids, "SYMBOL", "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
##           ENSEMBL   SYMBOL
## 1 ENSG00000000003   TSPAN6
## 2 ENSG00000000005     TNMD
## 3 ENSG00000000419     DPM1
## 4 ENSG00000000457    SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938      FGR

What annotations are provided?

  • Use columns to list annotations available in specific annotation object.
columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"
  • Use keytypes to list supported key types.
keytypes(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

Another example

  • Get OMIM IDs for two genes.
brca <- c("BRCA1", "BRCA2")
select(org.Hs.eg.db, brca, c("GENENAME", "OMIM"), "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
##    SYMBOL                    GENENAME   OMIM
## 1   BRCA1 BRCA1 DNA repair associated 113705
## 2   BRCA1 BRCA1 DNA repair associated 114480
## 3   BRCA1 BRCA1 DNA repair associated 604370
## 4   BRCA1 BRCA1 DNA repair associated 614320
## 5   BRCA1 BRCA1 DNA repair associated 617883
## 6   BRCA2 BRCA2 DNA repair associated 114480
## 7   BRCA2 BRCA2 DNA repair associated 155255
## 8   BRCA2 BRCA2 DNA repair associated 176807
## 9   BRCA2 BRCA2 DNA repair associated 194070
## 10  BRCA2 BRCA2 DNA repair associated 600185
## 11  BRCA2 BRCA2 DNA repair associated 605724
## 12  BRCA2 BRCA2 DNA repair associated 612555
## 13  BRCA2 BRCA2 DNA repair associated 613029
## 14  BRCA2 BRCA2 DNA repair associated 613347
  • We no longer have a 1:1 mapping!

Alternative: the mapIds function

  • Same as select (but for single annotations!).
  • Parameter multiVals allows to specify how to deal with 1:many mappings.
mapIds(org.Hs.eg.db, brca, "OMIM", "SYMBOL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
##    BRCA1    BRCA2 
## "113705" "114480"
  • OK, but where’s the rest?

Choices for multiVals

  • multiVals = "first" (default): take the first element.
  • multiVals = "asNA": NA is returned for any 1:many mapping.
  • multiVals = "list": multiple elements are collapsed into a list.
  • multiVals = "CharacterList": results are returned as a CharacterList.
mapIds(org.Hs.eg.db, brca, "OMIM", "SYMBOL", multiVals = "CharacterList")
## 'select()' returned 1:many mapping between keys and columns
## CharacterList of length 2
## [["BRCA1"]] 113705 114480 604370 614320 617883
## [["BRCA2"]] 114480 155255 176807 194070 600185 605724 612555 613029 613347

But…

  • AnnotationDbi is a very powerful framework, but:
  • For some annotations or queries different approaches might be more suitable.
  • Not guaranteed that all annotation packages/objects support the AnnotationDbi framework.

What about positional annotation?

  • Annotation for positions on the genome (exons, transcripts, genes).
  • Annotation for positions along protein sequences (protein domains…).
  • Mostly used for gene quantification in RNA-seq data.

Positional annotations: TxDb and EnsDb objects

  • TxDb (GenomicFeatures package) and EnsDb (ensembldb package) objects contain positional annotations.
  • EnsDb contain additional annotations such as gene symbols, Entrezgene IDs, GC content, gene/transcript biotypes and protein annotations.
  • TxDb and EnsDb resources can be installed as packages (e.g. TxDb.Hsapiens.UCSC.hg19.knownGene, EnsDb.Hsapiens.v86).
  • Preferred way: through AnnotationHub.
  • Why?
    • Provides annotation databases for all species, all Ensembl releases.
    • Guarantees reproducibility.

Query AnnotationHub

  • Search annotation resource for a specific species and release.
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("EnsDb", "hsapiens", "102"))
## AnnotationHub with 1 record
## # snapshotDate(): 2022-05-13
## # names(): AH89180
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: EnsDb
## # $rdatadateadded: 2020-10-27
## # $title: Ensembl 102 EnsDb for Homo sapiens
## # $description: Gene and protein annotations for Homo sapiens based on Ensem...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("102", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
## #   "Protein", "Transcript") 
## # retrieve record with 'object[["AH89180"]]'

Fetch from AnnotationHub

  • Download and cache this annotation.
edb <- ah[["AH89180"]]
edb
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.6
## |Creation time: Sat Dec 19 21:24:06 2020
## |ensembl_version: 102
## |ensembl_host: localhost
## |Organism: Homo sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh38
## |DBSCHEMAVERSION: 2.1
## | No. of genes: 68001.
## | No. of transcripts: 254853.
## |Protein data available.

Extract annotations from an EnsDb or TxDb

  • Using AnnotationDbi:
columns(edb)
##  [1] "CANONICALTRANSCRIPT" "DESCRIPTION"         "ENTREZID"           
##  [4] "EXONID"              "EXONIDX"             "EXONSEQEND"         
##  [7] "EXONSEQSTART"        "GCCONTENT"           "GENEBIOTYPE"        
## [10] "GENEID"              "GENEIDVERSION"       "GENENAME"           
## [13] "GENESEQEND"          "GENESEQSTART"        "INTERPROACCESSION"  
## [16] "ISCIRCULAR"          "PROTDOMEND"          "PROTDOMSTART"       
## [19] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
## [22] "PROTEINSEQUENCE"     "SEQCOORDSYSTEM"      "SEQLENGTH"          
## [25] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
## [28] "TXBIOTYPE"           "TXCDSSEQEND"         "TXCDSSEQSTART"      
## [31] "TXID"                "TXIDVERSION"         "TXNAME"             
## [34] "TXSEQEND"            "TXSEQSTART"          "TXSUPPORTLEVEL"     
## [37] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE"
select(edb, "ENSG00000139618",
       c("SYMBOL", "SEQNAME"), "GENEID")
##            GENEID SYMBOL SEQNAME
## 1 ENSG00000139618  BRCA2      13

Extract positional annototations

  • genes, transcripts,exons to extract their genomic coordinates.
  • Result returned as GRanges object (with additional metadata columns).
genes(edb)
## GRanges object with 68001 ranges and 9 metadata columns:
##                   seqnames            ranges strand |         gene_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENSG00000223972        1       11869-14409      + | ENSG00000223972
##   ENSG00000227232        1       14404-29570      - | ENSG00000227232
##               ...      ...               ...    ... .             ...
##   ENSG00000231514        Y 26626520-26627159      - | ENSG00000231514
##   ENSG00000235857        Y 56855244-56855488      + | ENSG00000235857
##                     gene_name           gene_biotype seq_coord_system
##                   <character>            <character>      <character>
##   ENSG00000223972     DDX11L1 transcribed_unproces..       chromosome
##   ENSG00000227232      WASH7P unprocessed_pseudogene       chromosome
##               ...         ...                    ...              ...
##   ENSG00000231514      CCNQP2   processed_pseudogene       chromosome
##   ENSG00000235857     CTBP2P1   processed_pseudogene       chromosome
##                              description   gene_id_version canonical_transcript
##                              <character>       <character>          <character>
##   ENSG00000223972 DEAD/H-box helicase .. ENSG00000223972.5      ENST00000456328
##   ENSG00000227232 WASP family homolog .. ENSG00000227232.5      ENST00000488147
##               ...                    ...               ...                  ...
##   ENSG00000231514 CCNQ pseudogene 2 [S.. ENSG00000231514.1      ENST00000435741
##   ENSG00000235857 CTBP2 pseudogene 1 [.. ENSG00000235857.1      ENST00000431853
##                        symbol                   entrezid
##                   <character>                     <list>
##   ENSG00000223972     DDX11L1 84771,727856,100287102,...
##   ENSG00000227232      WASH7P                     653635
##               ...         ...                        ...
##   ENSG00000231514      CCNQP2                       <NA>
##   ENSG00000235857     CTBP2P1                       <NA>
##   -------
##   seqinfo: 455 sequences (1 circular) from GRCh38 genome

Query data for specific genes

  • Get all transcripts for a gene with the gene symbol (name) BRCA2.
transcripts(edb, filter = ~ symbol == "BRCA2")
## GRanges object with 12 ranges and 10 metadata columns:
##                   seqnames            ranges strand |           tx_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENST00000671466       13 32315086-32316527      + | ENST00000671466
##   ENST00000544455       13 32315480-32399668      + | ENST00000544455
##               ...      ...               ...    ... .             ...
##   ENST00000666593       13 32380007-32394933      + | ENST00000666593
##   ENST00000533776       13 32396809-32398448      + | ENST00000533776
##                               tx_biotype tx_cds_seq_start tx_cds_seq_end
##                              <character>        <integer>      <integer>
##   ENST00000671466         protein_coding         32316461       32316527
##   ENST00000544455         protein_coding         32316461       32398770
##               ...                    ...              ...            ...
##   ENST00000666593 nonsense_mediated_de..         32380007       32383930
##   ENST00000533776        retained_intron             <NA>           <NA>
##                           gene_id tx_support_level     tx_id_version gc_content
##                       <character>        <integer>       <character>  <numeric>
##   ENST00000671466 ENSG00000139618             <NA> ENST00000671466.1    39.7590
##   ENST00000544455 ENSG00000139618                1 ENST00000544455.5    36.2436
##               ...             ...              ...               ...        ...
##   ENST00000666593 ENSG00000139618             <NA> ENST00000666593.1    39.5793
##   ENST00000533776 ENSG00000139618                3 ENST00000533776.1    39.9618
##                           tx_name      symbol
##                       <character> <character>
##   ENST00000671466 ENST00000671466       BRCA2
##   ENST00000544455 ENST00000544455       BRCA2
##               ...             ...         ...
##   ENST00000666593 ENST00000666593       BRCA2
##   ENST00000533776 ENST00000533776       BRCA2
##   -------
##   seqinfo: 1 sequence from GRCh38 genome

For read counting/gene quantification

  • Extract exons grouped by transcript for all genes encoded on chromosome Y.
exs <- exonsBy(edb, by = "tx", filter = ~ seq_name == "Y")
exs
## GRangesList object of length 816:
## $ENST00000155093
## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames          ranges strand |         exon_id exon_rank
##          <Rle>       <IRanges>  <Rle> |     <character> <integer>
##   [1]        Y 2935381-2935769      + | ENSE00001648585         1
##   [2]        Y 2953909-2953997      + | ENSE00003889480         2
##   ...      ...             ...    ... .             ...       ...
##   [7]        Y 2977940-2978080      + | ENSE00003891660         7
##   [8]        Y 2978810-2982506      + | ENSE00003895708         8
##   -------
##   seqinfo: 1 sequence from GRCh38 genome
## 
## $ENST00000215473
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames          ranges strand |         exon_id exon_rank
##          <Rle>       <IRanges>  <Rle> |     <character> <integer>
##   [1]        Y 5056088-5057459      + | ENSE00001436852         1
##   [2]        Y 5098215-5101437      + | ENSE00003741448         2
##   -------
##   seqinfo: 1 sequence from GRCh38 genome
## 
## $ENST00000215479
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames          ranges strand |         exon_id exon_rank
##          <Rle>       <IRanges>  <Rle> |     <character> <integer>
##   [1]        Y 6873972-6874027      - | ENSE00001348274         1
##   [2]        Y 6872555-6872620      - | ENSE00001671586         2
##   ...      ...             ...    ... .             ...       ...
##   [5]        Y 6868037-6868462      - | ENSE00001667251         5
##   [6]        Y 6865926-6866078      - | ENSE00003843155         6
##   -------
##   seqinfo: 1 sequence from GRCh38 genome
## 
## ...
## <813 more elements>

Genomic sequences

  • BSgenome packages (e.g. BSgenome.Hsapiens.UCSC.hg19).
  • AnnotationHub:
query(ah, c("TwoBitFile", "GRCh38"))
## AnnotationHub with 115 records
## # snapshotDate(): 2022-05-13
## # $dataprovider: Ensembl
## # $species: Homo sapiens, homo sapiens
## # $rdataclass: TwoBitFile
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH49722"]]' 
## 
##              title                                           
##   AH49722  | Homo_sapiens.GRCh38.cdna.all.2bit               
##   AH49723  | Homo_sapiens.GRCh38.dna.primary_assembly.2bit   
##   ...        ...                                             
##   AH103846 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
##   AH103847 | Homo_sapiens.GRCh38.ncrna.2bit

Extracting transcript sequences

  • Task: Get sequences for previously extracted transcripts.
gn <- ah[["AH49723"]]
## loading from cache
extractTranscriptSeqs(gn, exs)
## DNAStringSet object of length 816:
##       width seq                                             names               
##   [1]  5336 AGTGCTCCAGAGAAAGGCCGTC...ATAAATGTTTACTTGTATATGA ENST00000155093
##   [2]  4595 CTGGTGGTCCAGTACCTCCAAA...TGAGCCCTTCAGAAGACATTCT ENST00000215473
##   ...   ... ...
## [815]   615 GGAGGGAGTTGGGAGGAAAGCG...AATAAAAATATTGACATCACAA ENST00000670354
## [816]  2156 AGTTTCAGAAGTTCGTGGGTCA...ACGAGAGAAGAAAAGCAGAAAA ENST00000670518

Extracting genomic sequences

  • Get the genomic sequence for a particular region.
  • getSeq takes the data object and a GRanges defining which (sub)sequence to extract.
rng <- GRanges("X:10000-15000")
getSeq(gn, rng)
## DNAStringSet object of length 1:
##     width seq
## [1]  5001 NCTAACCCTAACCCTAACCCTAACCCTAACCCTA...AGGGGAGAGCAAGAGTGTGCGGGCGAGGGAGGA
  • See Biostrings package for more information on how to handle and process sequences.

Online resources: biomaRt

  • Biomart is an Ensembl service providing a web API to retrieve annotations.
  • Bioconductor’s biomaRt package allows to query Ensembl Biomart servers.
library(biomaRt)
listMarts()
##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 106
## 2   ENSEMBL_MART_MOUSE      Mouse strains 106
## 3     ENSEMBL_MART_SNP  Ensembl Variation 106
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 106
  • Connect to a Biomart server.
mart <- useMart("ENSEMBL_MART_ENSEMBL")

biomaRt data sets

  • And we can then check for the available data sets on a particular server.
listDatasets(mart) |> head()
##                        dataset                           description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.2)
## 3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
## 4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
## 5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
## 6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
##       version
## 1 ASM259213v1
## 2  fAstCal1.2
## 3 AnoCar2.0v2
## 4  bAquChr1.2
## 5    Midas_v5
## 6 ASM200744v2
  • We’ve got annotations for almost all species.

biomaRt queries

  • Connect to a particular Biomart data set.
mart <- useMart("ENSEMBL_MART_ENSEMBL","hsapiens_gene_ensembl")
  • Query the resource using getBM(attributes, filters, values, mart).

where

  • attributes are the things we want (columns)
  • filters are the types of IDs we have (keytypes)
  • values are the IDs we have (keys)
  • mart is the mart object we set up

biomaRt attributes and filters

  • Use listAttributes and listFilters to get available annotations and supported filters.
listAttributes(mart) |> head()
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page
listFilters(mart) |> head()
##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

biomaRt query

  • Example: get Ensembl gene ID and chromosome names for two genes.
sym <- c("BRCA1", "BRCA2")
getBM(c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"),
      "hgnc_symbol", sym, mart)
##   ensembl_gene_id hgnc_symbol chromosome_name
## 1 ENSG00000012048       BRCA1              17
## 2 ENSG00000139618       BRCA2              13

Notes

  • Requires internet connection.
  • By default uses current release, but listEnsemblArchives() would allow to select URLs for previous (Ensembl) releases.

What else?

Last words

  • Stick whenever possible with one annotation resource (Ensembl, NCBI, UCSC).
  • Ensure you’re using the same annotation version through your full analysis (e.g. aligning reads, gene quantification, annotation of genes).
  • Prefer AnnotationHub over annotation packages: better control of the annotation version.

Thank you for your attention