--- title: "Workshop: W4 - Gene, Genome, and Variant Annotation" output: BiocStyle::html_document: toc: false vignette: > % \VignetteIndexEntry{Workshop: W4 - Gene, Genome, and Variant Annotation} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({ library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) library(Homo.sapiens) library(biomaRt) library(AnnotationHub) library(rtracklayer) library(VariantAnnotation) }) ``` Author: Martin Morgan (mtmorgan@fredhutch.org)
Date: 7 September, 2015
Back to [Workshop Outline](Developer-Meeting-Workshop.html)
The material in this document requires _R_ version 3.2 and _Bioconductor_ version 3.1 ```{r configure-test} stopifnot( getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() >= "3.1" ) ``` # Annotation ## Gene model annotation resources -- `TxDb` packages [TxDb.Hsapiens.UCSC.hg19.knownGene][] ```{r gene-model-discovery} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene txdb methods(class=class(txdb)) ``` `TxDb` objects - Curatated annotation resources -- http://bioconductor.org/packages/biocViews - Underlying sqlite database -- `dbfile(txdb)` - Make your own: `GenomicFeatures::makeTxDbFrom*()` Accessing gene models - `exons()`, `transcripts()`, `genes()`, `cds()` (coding sequence) - `promoters()` & friends - `exonsBy()` & friends -- exons by gene, transcript, ... - 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`, `mapIds()` ```{r txdb-exons} exons(txdb) exonsBy(txdb, "tx") ``` ## Identifier mapping -- `OrgDb` ```{r org} library(org.Hs.eg.db) org.Hs.eg.db ``` `OrgDb` objects - Curated resources, underlying sqlite data base, like `TxDb` - make your own: [AnnotationForge][] (but see the AnnotationHub, below!) - 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`, `mapIds()` `select()` - Vector of keys, desired columns - Specification of key type ```{r select} select(org.Hs.eg.db, c("BRCA1", "PTEN"), c("ENTREZID", "GENENAME"), "SYMBOL") keytypes(org.Hs.eg.db) columns(org.Hs.eg.db) ``` Related functionality - `mapIds()` -- special case for mapping from 1 identifier to another - `OrganismDb` objects: combined `org.*`, `TxDb.*`, and other annotation resources for easy access ```{r organismdb} library(Homo.sapiens) select(Homo.sapiens, c("BRCA1", "PTEN"), c("TXNAME", "TXCHROM", "TXSTART", "TXEND"), "SYMBOL") ``` ## Other annotation resources -- `biomaRt`, `AnnotationHub` ### _biomaRt_ & friends http://biomart.org; _Bioconductor_ package [biomaRt][] ```{r biomart, eval=FALSE} ## NEEDS INTERNET ACCESS !! library(biomaRt) head(listMarts(), 3) ## list marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" substr(filterOptions(myFilter, ensembl), 1, 50) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl) ``` Other internet resources - [biomaRt](http://biomart.org) Ensembl and other annotations - [PSICQUIC](https://code.google.com/p/psicquic) Protein interactions - [uniprot.ws](http://uniprot.org) Protein annotations - [KEGGREST](http://www.genome.jp/kegg) KEGG pathways - [SRAdb](http://www.ncbi.nlm.nih.gov/sra) Sequencing experiments - [rtracklayer](http://genome.ucsc.edu) USCS genome tracks - [GEOquery](http://www.ncbi.nlm.nih.gov/geo/) Array and other data - [ArrayExpress](http://www.ebi.ac.uk/arrayexpress/) Array and other data - ... ### _AnnotationHub_ - _Bioconductor_ package [AnnotationHub][] - Meant to ease use of 'consortium' and other genome-scale resources - Simplify discovery, retrieval, local management, and import to standard _Bioconductor_ representations Example: Ensembl 'GTF' files to _R_ / _Bioconductor_ GRanges and TxDb ```{r annotationhub-gtf, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() hub query(hub, c("Ensembl", "80", "gtf")) ## ensgtf = display(hub) # visual choice hub["AH47107"] gtf <- hub[["AH47107"]] gtf txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf) ``` Example: non-model organism `OrgDb` packages ```{r annotationhub-orgdb, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() query(hub, "OrgDb") ``` Example: Map Roadmap epigenomic marks to hg38 - Roadmap BED file as _GRanges_ ```{r annotationhub-roadmap, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() query(hub , c("EpigenomeRoadMap", "E126", "H3K4ME2")) E126 <- hub[["AH29817"]] ``` - UCSC 'liftOver' file to map coordinates ```{r annotationhub-liftover, eval=FALSE} query(hub , c("hg19", "hg38", "chainfile")) chain <- hub[["AH14150"]] ``` - lift over -- possibly one-to-many mapping, so _GRanges_ to _GRangesList_ ```{r liftover, eval=FALSE} library(rtracklayer) E126hg38 <- liftOver(E126, chain) E126hg38 ``` # Annotating Variants Example: read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model library(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` [AnnotationHub]: http://bioconductor.org/packages/AnnotationHub [TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [ChIPseeker]: http://bioconductor.org/packages/ChIPseeker [VariantFiltering]: http://bioconductor.org/packages/VariantFiltering [AnnotationForge]: http://bioconductor.org/packages/AnnotationForge [biomaRt]: http://bioconductor.org/packages/biomaRt