Original Authors: Martin Morgan, Sonali Arora
Presenting Author: Martin Morgan (martin.morgan@roswellpark.org)
Date: 11 July, 2016 Back: Monday labs
Objective: Learn about Bioconductor resources for gene and genome annotation.
Lessons learned:
org.*
packages for mapping between gene symbols.TxDb.*
and ensembldb
packages for working with gene models.AnnotationHub
to easily obtain select consortium-level resourcesbiomaRt
and other internet-based resources for highly flexible annotation.VariantAnnotation
and VariantFiltering
for annotating SNPs.Organism-level (‘org’) packages contain mappings between a central identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form org.<Sp>.<id>.db
(e.g. org.Sc.sgd.db) where <Sp>
is a 2-letter abbreviation of the organism (e.g. Sc
for Saccharomyces cerevisiae) and <id>
is an abbreviation (in lower-case) describing the type of central identifier (e.g. sgd
for gene identifiers assigned by the Saccharomyces Genome Database, or eg
for Entrez gene ids). The “How to use the ‘.db’ annotation packages” vignette in the AnnotationDbi package (org packages are only one type of “.db” annotation packages) is a key reference. The ‘.db’ and most other Bioconductor annotation packages are updated every 6 months.
Annotation packages usually contain an object named after the package itself. These objects are collectively called AnnotationDb
objects, with more specific classes named OrgDb
, ChipDb
or TranscriptDb
objects. Methods that can be applied to these objects include cols()
, keys()
, keytypes()
and select()
. Common operations for retrieving annotations are summarized in the table.
Category | Function | Description |
---|---|---|
Discover | columns() |
List the kinds of columns that can be returned |
keytypes() |
List columns that can be used as keys | |
keys() |
List values that can be expected for a given keytype | |
select() |
Retrieve annotations matching keys , keytype and columns |
|
Manipulate | setdiff() , union() , intersect() |
Operations on sets |
duplicated() , unique() |
Mark or remove duplicates | |
%in% , match() |
Find matches | |
any() , all() |
Are any TRUE ? Are all? |
|
merge() |
Combine two different based on shared keys | |
GRanges* |
transcripts() , exons() , cds() |
Features (transcripts, exons, coding sequence) as GRanges . |
transcriptsBy() , exonsBy() |
Features group by gene, transcript, etc., as GRangesList . |
|
cdsBy() |
A short summary of select Bioconductor packages enabling web-based queries is in following Table.
Package | Description |
---|---|
AnnotationHub | Ensembl, Encode, dbSNP, UCSC data objects |
biomaRt | Ensembl and other annotations |
PSICQUIC | Protein interactions |
uniprot.ws | Protein annotations |
KEGGREST | KEGG pathways |
SRAdb | Sequencing experiments. |
rtracklayer | genome tracks. |
GEOquery | Array and other data |
ArrayExpress | Array and other data |
Exercise: This exercise illustrates basic use of the `select’ interface to annotation packages.
OrgDb
object for the org.Hs.eg.db package. Use the columns()
method to discover which sorts of annotations can be extracted from it.keys()
method to extract ENSEMBL identifiers and then pass those keys in to the select()
method in such a way that you extract the SYMBOL (gene symbol) and GENENAME information for each. Use the following ENSEMBL ids.ensids <- c("ENSG00000130720", "ENSG00000103257", "ENSG00000156414",
"ENSG00000144644", "ENSG00000159307", "ENSG00000144485")
Solution The OrgDb
object is named org.Hs.eg.db
.
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
## [7] "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [19] "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
## [7] "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [19] "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
cols <- c("SYMBOL", "GENENAME")
select(org.Hs.eg.db, keys=ensids, columns=cols, keytype="ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
## ENSEMBL SYMBOL GENENAME
## 1 ENSG00000130720 FIBCD1 fibrinogen C domain containing 1
## 2 ENSG00000103257 SLC7A5 solute carrier family 7 member 5
## 3 ENSG00000156414 TDRD9 tudor domain containing 9
## 4 ENSG00000144644 GADL1 glutamate decarboxylase like 1
## 5 ENSG00000159307 SCUBE1 signal peptide, CUB domain and EGF like domain containing 1
## 6 ENSG00000144485 HES6 hes family bHLH transcription factor 6
Exercise
Internet access required for this exercise
getBM()
. In addition to the mart to be accessed, this function takes filters and attributes as arguments. Use filterOptions()
and listAttributes()
to discover values for these arguments. Call getBM()
using filters and attributes of your choosing.Solution
## NEEDS INTERNET ACCESS !!
library(biomaRt)
head(listMarts(), 3) ## list the marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <- ## fully specified mart
useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listFilters(ensembl), 3) ## filters
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3) ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")
## assemble and query the mart
res <- getBM(attributes = myAttributes, filters = myFilter,
values = myValues, mart = ensembl)
Exercise
As an optional exercise to be completed after Tuesday’s lab, annotate the genes that are differentially expressed in the DESeq2 laboratory, e.g., find the GENENAME associated with the five most differentially expressed genes. Do these make biological sense? Can you merge()
the annotation results with the `top table’ results to provide a statistically and biologically informative summary?
There are a diversity of packages and classes available for representing large genomes. Several include:
available.genomes()
for pre-packaged genomes, and the vignette ‘How to forge a BSgenome data package’ in theFaFile()
(Rsamtools) for accessing indexed FASTA files.Genome-centric packages are very useful for annotations involving genomic coordinates. It is straight-forward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.
The rtracklayer package allows us to query the UCSC genome browser, as well as providing import()
and export()
functions for common annotation file formats like GFF, GTF, and BED. The exercise below illustrates some of the functionality of rtracklayer.
Exercise
This exercise uses annotation resources to go from a gene symbol ‘BRCA1’ through to the genomic coordinates of each transcript associated with the gene, and finally to the DNA sequences of the transcripts.
select
command.TXNAME
) corresponding to the BRCA1 Entrez identifier. (The ‘org*’ packages are based on information from NCBI, where Entrez identifiers are labeled ENTREZID; the ’TxDb*’ package we are using is from UCSC, where Entrez identifiers are labelled GENEID).Use the cdsBy()
function to retrieve the genomic coordinates of all coding sequences grouped by transcript, and select the transcripts corresponding to the identifiers we’re interested in. The coding sequences are returned as an GRangesList
, where each element of the list is a GRanges
object representing the exons in the coding sequence. As a sanity check, ensure that the sum of the widths of the exons in each coding sequence is evenly divisble by 3 (the R ‘modulus’ operator %%
returns the remainder of the division of one number by another, and might be helpful in this case).
Visualize the transcripts in genomic coordinates using the Gviz package to construct a GeneRegionTrack
, and plotting it using plotTracks()
.
Use the Bsgenome.Hsapiens.UCSC.hg19 package and extractTranscriptSeqs()
function to extract the DNA sequence of each transcript.
Solution
Retrieve the Entrez identifier corresponding to the BRCA1 gene symbol
library(org.Hs.eg.db)
eid <- select(org.Hs.eg.db, "BRCA1", "ENTREZID", "SYMBOL")[["ENTREZID"]]
## 'select()' returned 1:1 mapping between keys and columns
Map from Entrez gene identifier to transcript name
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txid <- select(txdb, eid, "TXNAME", "GENEID")[["TXNAME"]]
Retrieve all coding sequences grouped by transcript, and select those matching the transcript ids of interest, verifying that each coding sequence width is a multiple of 3
cds <- cdsBy(txdb, by="tx", use.names=TRUE)
brca1cds <- cds[names(cds) %in% txid]
class(brca1cds)
## [1] "GRangesList"
## attr(,"package")
## [1] "GenomicRanges"
length(brca1cds)
## [1] 20
brca1cds[[1]] # exons in cds
## GRanges object with 22 ranges and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr17 [41276034, 41276113] - | 186246 <NA> 1
## [2] chr17 [41267743, 41267796] - | 186245 <NA> 2
## [3] chr17 [41258473, 41258550] - | 186243 <NA> 3
## [4] chr17 [41256885, 41256973] - | 186241 <NA> 4
## [5] chr17 [41256139, 41256278] - | 186240 <NA> 5
## ... ... ... ... . ... ... ...
## [18] chr17 [41209069, 41209152] - | 186218 <NA> 18
## [19] chr17 [41203080, 41203134] - | 186217 <NA> 19
## [20] chr17 [41201138, 41201211] - | 186215 <NA> 20
## [21] chr17 [41199660, 41199720] - | 186214 <NA> 21
## [22] chr17 [41197695, 41197819] - | 186212 <NA> 22
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
cdswidth <- width(brca1cds) # width of each exon
all((sum(cdswidth) %% 3) == 0) # sum within cds, modulus 3
## [1] TRUE
Visualize the BRCA1 transcirpts using Gviz (this package has an excellent vignette, vignette("Gviz")
)
library(Gviz)
grt <- GeneRegionTrack(txdb)
plotTracks(list(GenomeAxisTrack(), grt), chromosome="chr17",
from=min(start(unlist(brca1cds))),
to=max(end(unlist(brca1cds))))
## Warning in grid.Call.graphics(L_rect, x$x, x$y, x$width, x$height, resolveHJust(x$just, : semi-
## transparency is not supported on this device: reported only once per page
Extract the coding sequences of each transcript
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
tx_seq <- extractTranscriptSeqs(genome, brca1cds)
tx_seq
## A DNAStringSet instance of length 20
## width seq names
## [1] 2280 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGATACCCCAGATCCCCCACAGCCACTACTGA uc010whl.2
## [2] 5379 ATGAGCCTACAAGAAAGTACGAGATTTAGTCAA...TGATACCCCAGATCCCCCACAGCCACTACTGA uc002icp.4
## [3] 522 ATGGATGCTGAGTTTGTGTGTGAACGGACACTG...TGATACCCCAGATCCCCCACAGCCACTACTGA uc010whm.2
## [4] 2100 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...AATGGCTTCCATGCAATTGGGCAGATGTGTGA uc002icu.3
## [5] 5451 ATGCTGAAACTTCTCAACCAGAAGAAAGGGCCT...TGATACCCCAGATCCCCCACAGCCACTACTGA uc010cyx.3
## ... ... ...
## [16] 4095 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...ATCAAGAAGAGCAAAGCATGGATTCAAACTTA uc010cyy.1
## [17] 4095 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...ATCAAGAAGAGCAAAGCATGGATTCAAACTTA uc010whs.1
## [18] 3954 ATGCTGAAACTTCTCAACCAGAAGAAAGGGCCT...ATCAAGAAGAGCAAAGCATGGATTCAAACTTA uc010cyz.2
## [19] 4017 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...ATCAAGAAGAGCAAAGCATGGATTCAAACTTA uc010cza.2
## [20] 3207 ATGAATGTAGAAAAGGCTGAATTCTGTAATAAA...ATCAAGAAGAGCAAAGCATGGATTCAAACTTA uc010wht.1
Intron coordinates can be identified by first calculating the range of the genome (from the start of the first exon to the end of the last exon) covered by each transcript, and then taking the (algebraic) set difference between this and the genomic coordinates covered by each exon
introns <- psetdiff(unlist(range(brca1cds)), brca1cds)
Retrieve the intronic sequences with getSeq()
(these are not assembled, the way that extractTranscriptSeqs()
assembles exon sequences into mature transcripts); note that introns start and end with the appropriate acceptor and donor site sequences.
seq <- getSeq(genome, introns)
names(seq)
## [1] "uc010whl.2" "uc002icp.4" "uc010whm.2" "uc002icu.3" "uc010cyx.3" "uc002icq.3" "uc002ict.3"
## [8] "uc010whn.2" "uc010who.3" "uc010whp.2" "uc010whq.1" "uc002idc.1" "uc010whr.1" "uc002idd.3"
## [15] "uc002ide.1" "uc010cyy.1" "uc010whs.1" "uc010cyz.2" "uc010cza.2" "uc010wht.1"
seq[["uc010whl.2"]] # 21 introns
## A DNAStringSet instance of length 21
## width seq
## [1] 1840 GTAAGGTGCCTGCATGTACCTGTGCTATATGGGGTCCTTTTGC...GAATGAATTGACACTAATCTCTGCTTGTGTTCTCTGTCTCCAG
## [2] 1417 GTAAGTATTGGGTGCCCTGTCAGAGAGGGAGGACACAATATTC...CTACTTTGACACTTTGAATGCTCTTTCCTTCCTGGGGATCCAG
## [3] 1868 GTAAGAGCCTGGGAGAACCCCAGAGTTCCAGCACCAGCCTTTG...TGCAGATTACTGCAGTGATTTTACATCTAAATGTCCATTTTAG
## [4] 5934 GTAAAGCTCCCTCCCTCAAGTTGACAAAAATCTCACCCCACCA...TCTCTCCATTCCCCTGTCCCTCTCTCTTCCTCTCTTCTTCCAG
## [5] 6197 GTAAGTACTTGATGTTACAAACTAACCAGAGATATTCATTCAG...TCTCTTTCTCTTATCCTGATGGGTTGTGTTTGGTTTCTTTCAG
## ... ... ...
## [17] 4241 GTAAAACCATTTGTTTTCTTCTTCTTCTTCTTCTTCTTTTCTT...ACTGGCCAACAATTGCTTGACTGTTCTTTACCATACTGTTTAG
## [18] 606 GTAAGTGTTGAATATCCCAAGAATGACACTCAAGTGCTGTCCA...TTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAG
## [19] 1499 GTATATAATTTGGTAATGATGCTAGGTTGGAAGCAACCACAGT...ATAATCACTTGCTGAGTGTGTTTCTCAAACAATTTAATTTCAG
## [20] 9192 GTAAGTTTGAATGTGTTATGTGGCTCCATTATTAGCTTTTGTT...AGGAAGTAAATTAAATTGTTCTTTCTTTCTTTATAATTTATAG
## [21] 8237 GTAAGTCAGCACAAGAGTGTATTAATTTGGGATTCCTATGATT...TGAGCCTCATTTATTTTCTTTTTCTCCCCCCCTACCCTGCTAG
Exercise
Internet access required for this exercise
Here we use rtracklayer to retrieve estrogen receptor binding sites identified across cell lines in the ENCODE project. We focus on binding sites in the vicinity of a particularly interesting region of interest.
GRanges
instance with appropriate genomic coordinates. Our region corresponds to 10Mb up- and down-stream of a particular gene.Solution
Define the region of interest
library(GenomicRanges)
roi <- GRanges("chr10", IRanges(92106877, 112106876, names="ENSG00000099194"))
Create a session
library(rtracklayer)
session <- browserSession()
Query the UCSC for a particular track, table, and transcription factor, in our region of interest
trackName <- "wgEncodeRegTfbsClusteredV2"
tableName <- "wgEncodeRegTfbsClusteredV2"
trFactor <- "ERalpha_a"
ucscTable <- getTable(ucscTableQuery(session, track=trackName,
range=roi, table=tableName, name=trFactor))
Visualize the result
plot(score ~ chromStart, ucscTable, pch="+")
abline(v=start(roi) + (end(roi) - start(roi) + 1) / 2, col="blue")
AnnotationHub is a data base of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms, and the NHLBI grasp2db data base of GWAS results. There are many interesting ways in which these resources can be used. Examples include
Unfortunately, AnnotationHub makes extensive use of internet resources and so we will not pursue it in this course; see the vignettes that come with the pacakge, for instance AnnotationHub HOW-TOs.
Bioconductor provides facilities for reading VCF files. These work very well with the annotation resources described above, so for instance it is straight-forward to identify variants in coding or other regions of interest.
To develop a sense of the capabilities available, work through the VariantAnnotation vignette ‘Introduction to Variant Annotation’, and the VariantFiltering vignette.