lab2Annot.Rnw

% NOTE: ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{UCSC03 Lab 2} %\VignetteDepends{Biobase,annotate,tkWidgets,geneplotter,golubEsets,hu6800,GO} %\VignetteKeywords{Annotation} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\bibliographystyle{plainnat}

\title{Lab 2: Introduction to Bioconductor Annotation Packages}

\author{Sandrine Dudoit, Robert Gentleman, and Katherine S. Pollard}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document}

\maketitle

One of the largest challenges in analyzing genomic data is associating the experimental data with the available biological {\em metadata}, e.g., sequence, gene annotation, chromosomal maps, literature. Bioconductor provides two main packages for this purpose: \texttt{annotate} (end-user) and \texttt{AnnBuilder} (developer). These packages can be used for querying databases such as GenBank, GO, LocusLink, and PubMed, from R, and for processing the query results in R.\\

In this lab, we focus on annotation resources for Affymetrix chips. Since Affymetrix chips have a standard layout, Bioconductor can provide annotation data packages for the main types of Affymetrix chips, e.g., hu6800, hgu33, hgu95, mgu74, and rgu34 series. These data packages are built using the \texttt{AnnBuilder} package. By contrast, there is no standard array design for two-color spotted microarray experiments; each lab tends to have its own custom design. In this case, specific annotation data packages will have to be created for each facility using \texttt{AnnBuilder}. Once annotation data packages are constructed to provide mappings between different sets of gene identifiers, the tools in \texttt{annotate} can be used in a similar manner for both platforms.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Mappings between different gene identifiers}

A main task in annotation is to associate manufacturer (e.g., for Affymetrix chips) or in-house (e.g., for custom cDNA spotted arrays) probe identifiers to other available identifiers (e.g., PMID, GenBank accession number).\\

In this lab, we will use the Bioconductor annotation data packages to map between Affymetrix identifiers and identifiers for biological metadata available on the WWW. Because these data sources are very large, are constantly evolving, and are similar across species, we have adopted the strategy of distributing data in regularly updated R packages. Each mapping is contained in an {\em environment}. For our purposes, an environment is simply a {\em hash table}. It provides a very fast way of looking up {\em values} for specific text {\em keys}. This implementation is not the best and we hope to develop real hash tables for R but that will take some time.\\

We will again use the Golub et al. (1999) dataset as a case study and will explore annotation resources for the Affymetrix HU6800 chip used in this study. The Bioconductor annotation data package for the HU6800 chip is \texttt{hu6800} and can be downloaded from the "Data packages" section of the Bioconductor website. To load the packages needed for this lab <>= library(Biobase) library(annotate) library(tkWidgets) library(geneplotter) library(golubEsets) library(hu6800) library(GO) @

Alternately, since all these packages are required by the \texttt{UCSC03} course package, simply use \begin{verbatim} >library(UCSC03) \end{verbatim}

After loading the \texttt{hu6800} package, we have access to a number of different datasets (environments). These provide the many different mappings from Affymetrix identifiers to other probe identifiers, such as chromosomal location (in the \texttt{hu6800CHR} environment) and PMID (in the \texttt{hu6800PMID} environment). Quality control data, counts, etc., for the mappings are available by calling the function \texttt{hu6800()} and by examining the help page, \texttt{?hu6800}.

The main R functions we will use for dealing with environments are \texttt{ls} (to list the names of objects in a specified environment) and \texttt{get} (to search for an R object with a given name and return its value, if found, in the specified environment). We obtain the Affymetrix identifiers using \texttt{ls} and note that there are 7,129 identifiers for the HU6800 chip (this matches the Golub \texttt{exprSets}). <>= affyID <- ls(env=hu6800CHR) length(affyID) @

We now arbitrarily select the probe set with Affymetrix identifier \texttt{"U18237\_at"} and obtain a number of other IDs corresponding to this probe set. Note that there may be several PMIDs, i.e., PubMed abstracts, associated with a given gene.

<>= mygene <- affyID[4001] mygene get(mygene, env = hu6800ACCNUM) get(mygene, env = hu6800LOCUSID) get(mygene, env = hu6800SYMBOL) get(mygene, env = hu6800GENENAME) get(mygene, env = hu6800SUMFUNC) get(mygene, env = hu6800UNIGENE) get(mygene, env = hu6800CHR) get(mygene, env = hu6800CHRLOC) get(mygene, env = hu6800MAP) get(mygene, env = hu6800PMID) get(mygene, env = hu6800GO) @

In some cases, we need to obtain data on several genes at once. We wrote a special function for this purpose: \texttt{multiget}. <>= fivegenes<-affyID[6:10] fivegenes multiget(fivegenes, env = hu6800PMID) @

Instead of relying on the general R functions for environments (e.g., \texttt{get}), the development version of the \texttt{annotate} package also provides new and more user-friendly functions for accessing specific identifiers. <>= getSYMBOL(fivegenes, data="hu6800") getLL(fivegenes, data="hu6800") getPMID(fivegenes, data="hu6800") gg<- getGO(fivegenes, data="hu6800") getGOdesc(gg[[2]], "MF") @

A variety of information about the probes can now easily be obtained by, for example, querying PubMed, GenBank, and LocusLink, as described below. In addition, one can also use R to compute on these data and perform a number of quality control and exploratory analyses. <>= whChrom <- multiget(affyID, env=hu6800CHR) table(unlist(whChrom)) vv<-sapply(whChrom, length) table(vv) whChrom[vv==2] @

This yields the distribution of probe sets by chromosome. Note that there are 10 genes that have been assigned two chromosome locations. Based on OMIM the these genes are localized to the so called \textit{pseudoautosomal region} where the X and Y cromosomes are similar and there is actual recombination going on between them. For now we can just put them all on the X chromosome (they are all X,Y).

<>= vv2 <- sapply(whChrom, function(x) x[1]) vv2 <- factor(vv2) length(vv2) table(unlist(vv2)) @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Querying PubMed}

The National Library of Medicine (NLM) provides a great deal of information on biological data. We have only just started to explore these resources. We have already developed tools for interacting with PubMed, but need to develop further functionality on MeSH and other available resources. A mapping has been done between LocusLink and PubMed. This associates specific articles with specific genes. The articles can be queried interactively using tools available in R and other systems. Full text abstracts are readily available; in some cases full text articles are also available (we don't have the tools to make use of the later). We next demonstrate how to interact with PubMed using functions in \texttt{annotate}.\\

Bioconductor functions for querying PubMed and other WWW databases from R rely on the \texttt{browseURL} function and the R \texttt{XML} package to parse query results. For more details, please consult articles in R News (Gentleman \& Gentry (2002), {\it R News} 2(2); Temple Lang (2001), {\it R News} 1(1)).

<>= browseURL("www.r-project.org") @

A \texttt{pubMedAbst} class structure was defined for handling PubMed abstracts in R. The slots are <>= slotNames("pubMedAbst") @

The basic engine for talking to PubMed is the function \texttt{pubmed}. Given a vector of PMIDs, the function either has a browser display a URL showing the results of the PubMed query for those identifiers (\texttt{disp="browser"}) or creates an \texttt{XMLdoc} object with the same data (\texttt{disp="data"}). <>= pmids<-getPMID(mygene,data="hu6800") pubmed(pmids, disp="browser") absts1<-pubmed(pmids, disp="data") @

The function \texttt{pm.getabst} provides a simpler way to download the specified PubMed abstracts (stored in XML) and create a list of \texttt{pubMedAbst} objects. The following commands can be used to store the PubMed abstracts for 5 genes in a list of objects of class \texttt{pubMedAbst}, to compute the number of abstracts retrieved for each gene, and to print the abstract(s) for the first gene. We can see, for example, that one of the genes has 21 abstracts associated with it. <>= absts2<-pm.getabst(fivegenes,"hu6800") lapply(absts2,length) absts2[[1]] @

The functions \texttt{pm.titles} and \texttt{pm.abstGrep} can then be used to extract the titles from a set of PubMed abstracts and for regular expression matching on these abstracts, respectively. For the genes with Affy IDs \texttt{fivegenes}, the following commands extract the abstract titles and search the abstracts for the word "protein", with lower case "p" or upper case "P". <>= pm.titles(absts2) sapply(absts2, function(x) pm.abstGrep("[Pp]rotein", x)) @

The new function \texttt{pmAbst2html} from the development version of \texttt{annotate} takes a list of \texttt{pubMedAbst} objects and generates an HTML report with the titles of the abstracts and links to their full page on PubMed. The report will be stored in the file \texttt{pm.html} in the working directory.

%%%***<>= \begin{Sinput} > pmAbst2html(absts2[[2]],filename="pm.html") \end{Sinput} %%%@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Querying GenBank}

Another source of data is GenBank. We can interact with GenBank in much the same way as with PubMed; the main function for this purpose is \texttt{genbank}. The argument \texttt{disp="data"} returns an XMLDoc, while \texttt{disp="browser"} displays information in the user's browser.

<>= gbacc<-multiget(fivegenes,hu6800ACCNUM) genbank(gbacc,disp="browser") @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Querying LocusLink}

Finally, we consider interactions with LocusLink to obtain sequence and other biological information on probe sets of interest.

<>= llid<-getLL(fivegenes, data="hu6800") locuslinkByID(llid) @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{HTML reports}

The function \texttt{ll.htlmpage} generates HTML reports, with one row per gene and a clickable entry which opens the LocusLink webpage for that gene. Additional information can be displayed using the \texttt{othernames} argument as shown below. Such HTML reports provide an easy way to communicate results with other researchers working on the same analysis.

<>= tengenes<- affyID[1:10] llid<-getLL(tengenes, data="hu6800") symb<-getSYMBOL(tengenes, data="hu6800") map<-multiget(tengenes,hu6800MAP) res<-data.frame(tengenes,cbind(unlist(symb),unlist(map))) names(res)<-c("Affy ID","Gene symbol", "Chromosomal location") ll.htmlpage(llid, filename="ll.html", title="HTML report for 10 genes", othernames=res, table.head=c("LocusID",names(res)), table.center = TRUE) @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Plotting genomic data}

The classes \texttt{chromLoc} and \texttt{chromLocation} are used to keep track of location information for a single gene and a set of genes (entire genome), respectively. We will use the \texttt{geneplotter} functions to create an instance of the class \texttt{chromLocation} for the Affymetrix HU6800 chip.

<>= slotNames("chromLoc") slotNames("chromLocation") strand<- multiget(affyID, env=hu6800CHRLOC) splits<- split(strand, vv2) length(splits) names(splits) hu6800ChrClass<- buildChromLocation("hu6800") organism(hu6800ChrClass) chromNames(hu6800ChrClass) @

The \texttt{geneplotter} package provides two main functions for plotting genomic data: \texttt{cPlot} and \texttt{alongChrom}. These functions operate on instances of the class \texttt{chromLocation}. With \texttt{cPlot}, we can render all chromosomes the same length or we can scale them by their relative lengths. Note that the \texttt{mva} package has functions for heat maps and dendrograms.

\newpage

<>= cPlot(hu6800ChrClass,scale="relative") @

\newpage

<>= cPlot(hu6800ChrClass,scale="max") @

\newpage

<>= data(golubTrain) cols<-as.numeric(pData(golubTrain)$ALL.AML)+1 alongChrom(golubTrain,chrom="1",specChrom=hu6800ChrClass,col=cols) @

\newpage

<>= alongChrom(golubTrain,chrom="22",specChrom=hu6800ChrClass,plotFormat="image") @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}

bioconductor.org

Sections

lab2Annot.Rnw

Navigation

News