About This Document »

Package: annotation
Built with Bioconductor (R): 2.13 (3.0.1)
Source Package ../annotation_0.99.83504.tar.gz
Windows Binary ../annotation_0.99.83504.zip
Mac OS 10.6 (Snow Leopard) ../annotation_0.99.83504.tgz
R Script
Last Built At: Fri Nov 22 15:34:48 PST 2013
First Committed: Thu May 16 08:53:30 PDT 2013
SVN Revision: 83504
Install with (under BioC 2.13):
source("http://bioconductor.org/workflows.R")
workflowInstall("annotation")
  

Annotating Ranges

Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM, gff, bed, and wig files, among others. Packages support common and advanced sequence manipulation operations such as trimming, transformation, and alignment. Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches. Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb package).

Sample Workflow

This workflow walks through the annotation of a generic set of ranges with Bioconductor packages. The ranges can be any user-defined region of interest or can be from a public file.

Data Preparation

As a first step, data are put into a GRanges object so we can take advantage of overlap operations and store identifiers as metadata columns.

The first set of ranges are variants from a dbSNP Variant Call Format (VCF) file. This file can be downloaded from the ftp site at NCBI ftp://ftp.ncbi.nlm.nih.gov/snp/ and imported with readVcf() from the VariantAnnotation package. Alternatively, the file is available as a pre-parsed VCF object in the AnnotationHub.

library(VariantAnnotation)
library(AnnotationHub)
hub <- AnnotationHub()
vcf <- hub$dbSNP.organisms.human_9606.VCF.ByChromosome.22.12159.GIH.RData
dim(vcf)

## [1] 19698    88

When performing overlap operations the seqlevels and genome of the objects must match. Here were modify the VCF to match the TranscriptDb.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
head(seqlevels(txdb_hg19))

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

seqlevels(vcf)

## [1] "22"

seqlevels(vcf) <- paste0("chr", seqlevels(vcf))
unique(genome(txdb_hg19))

## [1] "hg19"

genome(vcf) <- "hg19"

Sanity check to confirm we have matching seqlevels.

intersect(seqlevels(txdb_hg19), seqlevels(vcf))

## [1] "chr22"

The GRanges in a VCF object is in the 'rowData' slot.

gr_hg19 <- rowData(vcf)

The second set of ranges is a user-defined region of chromosome 4 in mouse. The idea here is that any region, known or unknown, can be annotated with the following steps.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb_mm10 <- TxDb.Mmusculus.UCSC.mm10.ensGene

We are creating the GRanges from scratch and can specify the seqlevels (chromosome names) to match the TranscriptDb.

head(seqlevels(txdb_mm10))

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

gr_mm10 <- GRanges("chr4", IRanges(c(4e+06, 107889000), width = 1000))

Now assign the genome.

unique(genome(txdb_mm10))

## [1] "mm10"

genome(gr_mm10) <- "mm10"

Location in and Around Genes

locateVariants() in the VariantAnnotation package annotates ranges with transcript, exon, cds and gene ID's from a TranscriptDb. Various extractions are performed on the TranscriptDb (exonsBy(), transcripts(), cdsBy(), etc.) and the result is overlapped with the ranges. An appropriate GRangesList can also be supplied as the annotation. Different variants such as 'coding', 'fiveUTR', 'threeUTR', 'spliceSite', 'intron', 'promoter', and 'intergenic' can be searched for by passing the appropriate constructor as the 'region' argument. See ?locateVariants for details.

loc_hg19 <- locateVariants(gr_hg19, txdb_hg19, AllVariants())

## Warning: trimmed start values to be positive
## Warning: trimmed end values to be <= seqlengths

table(loc_hg19$LOCATION)

## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic 
##          6      36416        246       1586       1690       8877 
##   promoter 
##       2425

loc_mm10 <- locateVariants(gr_mm10, txdb_mm10, AllVariants())

## Warning: trimmed start values to be positive
## Warning: trimmed end values to be <= seqlengths

table(loc_mm10$LOCATION)

## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic 
##          6          1          0          0          0          0 
##   promoter 
##         12

Annotate by ID

The ID's returned from locateVariants() can be used in select() to map to ID's in other annotation packages.

library(org.Hs.eg.db)
cols <- c("UNIPROT", "PFAM")
keys <- unique(loc_hg19$GENEID)
head(select(org.Hs.eg.db, keys, cols, keytype = "ENTREZID"))

##   ENTREZID UNIPROT    PFAM
## 1   150160  Q96SF2 PF00118
## 2   387590    <NA>    <NA>
## 3    23783    <NA>    <NA>
## 4   150165  Q5GH77 PF09815
## 5    27437    <NA>    <NA>
## 6   128954  Q2WGN9 PF00169

The 'keytype' argument specifies that the mouse TranscriptDb contains Ensembl instead of Entrez gene id's.

library(org.Mm.eg.db)
keys <- unique(loc_mm10$GENEID)
head(select(org.Mm.eg.db, keys, cols, keytype = "ENSEMBL"))

##              ENSEMBL UNIPROT    PFAM
## 1 ENSMUSG00000028236  Q7TQA3 PF00106
## 2 ENSMUSG00000028608  Q8BHG2 PF05907

Annotate by Position

Files stored in the AnnotationHub have been pre-processed into ranged-based R objects such as a GRanges, GAlignments and VCF. The positions in our GRanges can be overlapped with the ranges in the AnnotationHub files. This allows for easy subsetting of multiple files, resulting in only the ranges of interest.

Create a 'hub' from AnnotationHub and filter the files based on organism.

hub <- AnnotationHub()
filters(hub) <- list(Species = "Homo sapiens")
hg19_files <- names(hub)[grepl("hg19", names(hub))]
length(hg19_files)

## [1] 4609

Extract the matching ranges from the first 3 files.


ov_hg19 <- lapply(hg19_files[1:3], function(x) 
                  subsetByOverlaps(hub$[[x]], gr_hg19))

Take a look at the results.

names(ov_hg19) <- hg19_files[1:3]
lapply(ov_hg19, head, n = 3)

## $goldenpath.hg19.encodeDCC.wgEncodeCshlShortRnaSeq.wgEncodeCshlShortRnaSeqK562ChromatinShorttotalTapContigs.bedRnaElements_0.0.1.RData
## GRanges with 3 ranges and 5 metadata columns:
##       seqnames               ranges strand |
##          <Rle>            <IRanges>  <Rle> |
##   [1]    chr22 [17103696, 17103780]      + |
##   [2]    chr22 [19467446, 19467487]      + |
##   [3]    chr22 [19470231, 19470320]      + |
##                                                                                                       name
##                                                                                                <character>
##   [1]                                                        chr22:17103696-17103780(processed_transcript)
##   [2]                         chr22:19467446-19467487(processed_transcript|protein_coding|retained_intron)
##   [3] chr22:19470231-19470320(nonsense_mediated_decay|processed_transcript|protein_coding|retained_intron)
##           score     level    signif    score2
##       <integer> <numeric> <numeric> <integer>
##   [1]        12     11.52    0.6051        27
##   [2]         8      7.87    0.6953         9
##   [3]         6      5.64    0.6575        14
##   ---
##   seqlengths:
##         chr1     chr10     chr11     chr12 ...      chr9      chrM      chrX
##    249250621 135534747 135006516 133851895 ... 141213431     16571 155270560
## 
## $goldenpath.hg19.encodeDCC.wgEncodeUwDgf.wgEncodeUwDgfSknshraPk.narrowPeak_0.0.1.RData
## GRanges with 3 ranges and 6 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <integer>
##   [1]    chr22 [17680400, 17680550]      * |           .         0
##   [2]    chr22 [18225300, 18225450]      * |           .         0
##   [3]    chr22 [18313440, 18313590]      * |           .         0
##       signalValue    pValue    qValue      peak
##         <numeric> <numeric> <numeric> <integer>
##   [1]         134     77.88        -1        -1
##   [2]          57     16.42        -1        -1
##   [3]          28      5.06        -1        -1
##   ---
##   seqlengths:
##         chr1     chr10     chr11     chr12 ...      chr8      chr9      chrX
##    249250621 135534747 135006516 133851895 ... 146364022 141213431 155270560
## 
## $goldenpath.hg19.encodeDCC.wgEncodeAwgDnaseUniform.wgEncodeAwgDnaseUwAg04449UniPk.narrowPeak_0.0.1.RData
## GRanges with 3 ranges and 6 metadata columns:
##       seqnames               ranges strand |        name     score
##          <Rle>            <IRanges>  <Rle> | <character> <integer>
##   [1]    chr22 [18080140, 18080290]      * |           .         0
##   [2]    chr22 [18121120, 18121270]      * |           .         0
##   [3]    chr22 [18189420, 18189570]      * |           .         0
##       signalValue    pValue    qValue      peak
##         <numeric> <numeric> <numeric> <integer>
##   [1]          44        -1        -1        -1
##   [2]          31        -1        -1        -1
##   [3]         118        -1        -1        -1
##   ---
##   seqlengths:
##         chr1     chr10     chr11     chr12 ...      chr9      chrX      chrY
##    249250621 135534747 135006516 133851895 ... 141213431 155270560  59373566

Annotating the mouse ranges in the same fashion is left as an excercise.

Annotating Variants

Amino acid coding changes

For the set of dbSNP variants that fall in coding regions, amino acid changes can be computed. The output contains one line for each variant-transcript match which can result in multiple lines for each variant.

library(BSgenome.Hsapiens.UCSC.hg19)
head(predictCoding(vcf, txdb_hg19, Hsapiens), 3)

## GRanges with 3 ranges and 17 metadata columns:
##             seqnames               ranges strand | paramRangeID
##                <Rle>            <IRanges>  <Rle> |     <factor>
##   rs2236639    chr22 [17072483, 17072483]      - |         <NA>
##   rs5747988    chr22 [17073066, 17073066]      - |         <NA>
##   rs5748622    chr22 [17264565, 17264565]      - |         <NA>
##                        REF                ALT      QUAL      FILTER
##             <DNAStringSet> <DNAStringSetList> <numeric> <character>
##   rs2236639              A                  G      <NA>           .
##   rs5747988              A                  G      <NA>           .
##   rs5748622              G                  T      <NA>           .
##                  varAllele       CDSLOC    PROTEINLOC   QUERYID
##             <DNAStringSet>    <IRanges> <IntegerList> <integer>
##   rs2236639              C [ 958,  958]           320        34
##   rs5747988              C [ 375,  375]           125        35
##   rs5748622              A [1324, 1324]           442        99
##                    TXID     CDSID      GENEID   CONSEQUENCE       REFCODON
##             <character> <integer> <character>      <factor> <DNAStringSet>
##   rs2236639       74436    216505      150160 nonsynonymous            TGG
##   rs5747988       74436    216505      150160    synonymous            GCT
##   rs5748622       74439    216506      150165 nonsynonymous            CAC
##                   VARCODON         REFAA         VARAA
##             <DNAStringSet> <AAStringSet> <AAStringSet>
##   rs2236639            CGG             W             R
##   rs5747988            GCC             A             A
##   rs5748622            AAC             H             N
##   ---
##   seqlengths:
##    chr22
##       NA

ensemblVEP Package

The ensemblVEP package provides access to the online Ensembl Variant Effect Predictor (VEP tool). The VEP tool ouputs predictions of functional consequences of known and unknown variants as reported by Sequence Ontology or Ensembl. Regulatory region consequences, HGNC, Ensembl protein identifiers, HGVS, co-located variants are optional outputs. ensemblVEP() accepts the name of a VCF file and returns a VCF on disk or GRanges in the R workspace.

library(ensemblVEP)
fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
gr <- ensemblVEP(fl)
head(gr, 3)

## GRanges with 3 ranges and 13 metadata columns:
##                seqnames             ranges strand |   Allele
##                   <Rle>          <IRanges>  <Rle> | <factor>
##      rs6054257       20 [  14370,   14370]      * |        A
##   20:17330_T/A       20 [  17330,   17330]      * |        A
##      rs6040355       20 [1110696, 1110696]      * |        T
##                           Gene         Feature Feature_type
##                       <factor>        <factor>     <factor>
##      rs6054257            <NA>            <NA>         <NA>
##   20:17330_T/A            <NA>            <NA>         <NA>
##      rs6040355 ENSG00000125818 ENST00000381898   Transcript
##                       Consequence cDNA_position CDS_position
##                          <factor>      <factor>     <factor>
##      rs6054257 intergenic_variant          <NA>         <NA>
##   20:17330_T/A intergenic_variant          <NA>         <NA>
##      rs6040355     intron_variant          <NA>         <NA>
##                Protein_position Amino_acids   Codons Existing_variation
##                        <factor>    <factor> <factor>           <factor>
##      rs6054257             <NA>        <NA>     <NA>               <NA>
##   20:17330_T/A             <NA>        <NA>     <NA>               <NA>
##      rs6040355             <NA>        <NA>     <NA>               <NA>
##                DISTANCE CELL_TYPE
##                <factor>  <factor>
##      rs6054257     <NA>      <NA>
##   20:17330_T/A     <NA>      <NA>
##      rs6040355     <NA>      <NA>
##   ---
##   seqlengths:
##    20
##    NA

Exercise 1: VCF header and reading data subsets.

VCF files can be large and it's often the case that only a subset of variables or genomic positions are of interest. The scanVcfHeader() function in the VariantAnnotation package retrieves header information from a VCF file. Based on the information returned from scanVcfHeader() a ScanVcfParam() object can be created to read in a subset of data from a VCF file. * Use scanVcfHeader() to inspect the header information in the 'chr22.vcf.gz' file in VariantAnnotation package. * Select a few 'info' or 'geno' variables and create a ScanVcfParam object. * Use the ScanVcfParam object as the 'param' argument to readVcf() to read in a subset of data. Note that the header() accessor operates on VCF objects in the R workspace. Try header(vcf) on the dbSNP file from AnnotationHub.

Exercise 2: Annotate the mouse ranges in 'gr_mm10' with AnnotationHub files. * Create a new 'hub' and filter on organism. * Isolate the files for the appropriate genome build and perform overlaps.

Exercise 3: Annotate a gene range from Saccharomyces Scerevisiae. * Load TxDb.Scerevisiae.UCSC.sacCer3.sgdGene and extract the gene ranges. (Hint: use transcriptsBy() and range()). * Isolate the range for gene "YBL086C". * Create a new 'hub' from AnnotationHub and filter by organism. (You should see >= 39 files.) * Select the files for 'sacCer3' and perform overlaps.

[ Back to top ]

Fred Hutchinson Cancer Research Center