


% ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{UCSC03 Lab 1}
%\VignetteDepends{Biobase,tkWidgets,genefilter,golubEsets}
%\VignetteKeywords{Basics}
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}


\title{Lab 1: R and Bioconductor Basics}

\author{Sandrine Dudoit, Robert Gentleman and Katherine S. Pollard}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{R and Bioconductor WWW resources}

For software and documentation consult:
\begin{itemize}
\item Main R project website: \url{}.
\item Comprehensive R Archive Network (CRAN): \url{}.\\
Base system and contributed packages (Linux, MacOS, Windows), manuals, tutorials, R News.\\
For Windows, there is an installer for the main R system. Contributed packages from CRAN can be installed using the ``Install package from CRAN ...'' item on the ``Packages'' menu. For other packages, you can use the ``Install package from local zip file ...'' item.
\item Bioconductor website: \url{}.\\
Software packages, vignettes, datasets, short course materials.\\
To install Bioconductor packages, use the install script (\texttt{getBioC.R}) provided under the ``HowTo'' link of the ``Software'' section of the website.
\end{itemize}

The labs for this short course are included in the \texttt{UCSC03} R package. They are executable documents that mix text and code, and are created and run using the Sweave system (\url{}). We will use the \texttt{vExplorer} function from the \texttt{tkWidgets} package to step through the labs interactively. After loading the \texttt{UCSC03} package, using the command \texttt{library(UCSC03)}, type \texttt{vExplorer()} and select the \texttt{UCSC03} package using the widget.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Getting started}

Some useful commands for getting help and for running sample scripts that demonstrate software functionality:

<<>>=
help.start()
apropos("mean")
? mean
example("mean")
@

The functions \texttt{getwd} and \texttt{setwd} are used to get or set the working directory. In Windows, you can also use the ``Change dir ...'' item in the ``File'' menu.
<<>>=
getwd()
@
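
As a minimal sketch of how \texttt{getwd} and \texttt{setwd} work together, the following switches to the session's temporary directory and then restores the original directory. The choice of \texttt{tempdir()} as the target and the file name \texttt{wd\_demo.txt} are assumptions made only to keep the example self-contained.

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# setwd() invisibly returns the directory that was current before the change.
previous <- setwd(tempdir())

# Files written with a relative path now land in the temporary directory.
writeLines("hello", "wd_demo.txt")
file_was_created <- file.exists(file.path(tempdir(), "wd_demo.txt"))

# Clean up and switch back to the original directory.
unlink("wd_demo.txt")
setwd(old_dir)
```

Saving the old directory before calling \texttt{setwd} makes it easy to undo the change at the end of a script.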

To load packages, use the function \texttt{library} (or the ``Load package ...'' item on the ``Packages'' menu in Windows):
<<>>=
library(Biobase)
library(tkWidgets)
library(genefilter)
@

The functions \texttt{save} and \texttt{save.image} are used to write external representations of R objects to a specified file: \texttt{save} writes the objects you name, whereas \texttt{save.image} writes the entire workspace. The objects can be read back from the file (\texttt{.RData}) at a later date by using the \texttt{load} function. \texttt{save.image()} is called when you answer ``yes'' to the prompt shown on exiting R with the command \texttt{q()}. In Windows, you can use the ``Load Workspace...'', ``Save Workspace...'', and ``Exit'' items in the ``File'' menu.\\
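
The round trip through \texttt{save} and \texttt{load} can be sketched as follows. The object names and the file name \texttt{lab1\_demo.RData} are hypothetical, and the file is written to the session's temporary directory so that nothing in the working directory is touched.

```r
# Two toy objects to store (hypothetical names and values).
expr_values <- c(geneA = 5.2, geneB = 7.9)
sample_ids <- c("s1", "s2", "s3")

# Write both objects to a single file in the temporary directory.
rdata_file <- file.path(tempdir(), "lab1_demo.RData")
save(expr_values, sample_ids, file = rdata_file)

# Remove the objects from the workspace, then restore them from the file.
rm(expr_values, sample_ids)
restored <- load(rdata_file)  # load() returns the names of the restored objects

print(restored)
print(expr_values)

# Delete the demonstration file.
unlink(rdata_file)
```

Note that \texttt{load} recreates the objects under their original names, so it can silently overwrite objects of the same name in the current workspace.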

Each Bioconductor package contains a {\em vignette}, which is an executable document providing a step-by-step overview of the package functionality. These documents are created using the \texttt{Sweave} function from the \texttt{tools} package. You can access the vignettes (\texttt{.Rnw} source and \texttt{.pdf} output) for a given package under the ``Accompanying documentation'' section of the HTML help page for the package. These files are also available in the \texttt{"doc"} subdirectory of the installed package. To view a list of all available vignettes, use \texttt{openVignette()}.
%%%<<>>=
%%%openVignette()
%%%@
Selecting a vignette ID will display the specified PDF file. The \texttt{vExplorer} function provides a graphical interface for viewing and executing code chunks from vignettes. We will use this function for the labs.
%%%<<>>=
%%%vExplorer()
%%%@
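
For a non-interactive look at the same information, the sketch below uses the \texttt{vignette} function from the \texttt{utils} package rather than \texttt{openVignette}; this is a swapped-in technique, not the method used in the labs. The package \texttt{grid} is chosen only on the assumption that it is present in a standard R installation.

```r
# List the vignettes shipped with one installed package.
# "grid" is an assumption: it is used only because it comes with R itself.
vig <- vignette(package = "grid")
vig_table <- vig$results[, c("Item", "Title"), drop = FALSE]
print(vig_table)

# Locate the "doc" subdirectory of the installed package, where the
# vignette files are kept; the result is "" if the directory is absent.
doc_dir <- system.file("doc", package = "grid")
print(doc_dir)
```

The table may be empty on installations that were built without vignettes.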

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Bioconductor base package: \texttt{Biobase}}

\texttt{Biobase} is the base package of Bioconductor. It contains the main {\em class/method} definitions for handling microarray and other types of genomic data. It also contains any reusable (or non-specific) functions needed by other packages.\\

The main class for (pre-processed) microarray data objects is \texttt{exprSet}. It is an S4 class; the S4 \texttt{methods} package has introduced substantial new capabilities into R. To obtain the manual page for an S4 class, use the syntax \texttt{class?exprSet}. Consult this help file for a description of the {\em slots} of the \texttt{exprSet} class and of the basic methods that operate on objects of this class.

<<>>=
slotNames("exprSet")
@
An \texttt{exprSet} basically consists of the genes $\times$ arrays matrix of expression measures, optionally a set of standard errors for those estimates, the related experimental metadata (who did what, when, and to what), and the phenotypic data. Here, phenotype is interpreted quite broadly -- it represents any characteristics of the target sample (e.g., for tumor mRNA samples, patient survival, age, sex, treatment).\\
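
To make the idea of slots concrete, here is a minimal sketch of a toy S4 class loosely modelled on the description above. The class name \texttt{miniExprSet}, its slot names, and all values are hypothetical; they are not the definitions used in \texttt{Biobase}.

```r
library(methods)

# Toy class; the name "miniExprSet" and its slots are hypothetical
# and are NOT the Biobase definitions.
setClass("miniExprSet",
         representation(exprs = "matrix",        # genes x arrays expression measures
                        se = "matrix",           # optional standard errors
                        pheno = "data.frame",    # one row of covariates per array
                        description = "character"))

# Made-up expression values: 4 genes x 3 arrays.
x <- matrix(c(10, 12, 11, 200, 180, 220, 35, 40, 38, 5, 7, 6),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4), paste0("array", 1:3)))

toy <- new("miniExprSet",
           exprs = x,
           se = matrix(numeric(0), 0, 0),   # no standard errors supplied
           pheno = data.frame(type = c("ALL", "ALL", "AML"),
                              row.names = colnames(x)),
           description = "toy example")

# Slots are listed by class name and accessed with the @ operator.
print(slotNames("miniExprSet"))
print(dim(toy@exprs))
```

Keeping the expression matrix and the sample covariates in one object means they cannot drift out of step when the object is passed between functions.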

The ALL/AML leukemia dataset of Golub et al. (1999) will be used as a case study. For more background on the experiments, see \url{}. The data are available from the authors in an online repository (\url{}) and are included in an R data package, \texttt{golubEsets}, suitable for use in this lab. This dataset comes from a study of gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays (HU6800 chip) containing probes for approximately 6,800 human genes and ESTs. The chip actually contains 7,129 different probe sets; some of these map to the same genes and others are used for quality control purposes. The data comprise 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 cases of AML, for 72 samples in total. Samples are divided into a learning set of 38 observations and a test set of 34 observations. Here, we assume that the data have been suitably pre-processed, i.e., image analysis, normalization, and computation of expression measures were already performed. You are referred to the \texttt{affy} package from Bioconductor for relevant software and documentation for pre-processing Affymetrix data, and to the \texttt{marray} suite for pre-processing two-color spotted cDNA array data.\\

For a description of the Golub et al. (1999) dataset, type \texttt{? golubTrain}. To load these data:
<<>>=
library(golubEsets)
data(golubTrain)
data(golubTest)
data(golubMerge)
@

The microarray expression measures and clinical variables corresponding to each of the 38 training mRNA samples are stored in an object of class \texttt{exprSet}. For information on the training data \texttt{golubTrain}:
<<>>=
class(golubTrain)
slotNames(golubTrain)
golubTrain
@

The phenotypic data are stored in a separate, but linked, object of class \texttt{phenoData}. An object of class \texttt{phenoData} is a combination of a data frame containing a number of variables for each array and a list that explains what each variable represents. This information is usually relegated to a help page, but we felt that it was important to keep it more closely associated with the expression measures. You can obtain and manipulate the \texttt{phenoData} object corresponding to a particular \texttt{exprSet} object using specific methods, as described in the manual page. To extract only the sample-level variables:
<<>>=
phenoTrain <- phenoData(golubTrain)
class(phenoTrain)
slotNames(phenoTrain)
varLabels(phenoTrain)
pData(phenoTrain)
@
The \texttt{\$} operator can be used to extract particular variables from an object of class \texttt{phenoData}. It can also be used directly on the \texttt{exprSet} instance.
<<>>=
table(phenoTrain$ALL.AML)
table(golubTest$ALL.AML)
@
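
The pairing of a data frame with a list of variable descriptions can be sketched in base R as follows. The six samples, the \texttt{Source} variable with its codes, and the label text are made up for illustration and are not taken from the Golub data; only the variable name \texttt{ALL.AML} comes from the text above.

```r
# Made-up sample-level variables for six arrays (not the Golub data).
pheno <- data.frame(ALL.AML = factor(c("ALL", "ALL", "AML", "ALL", "AML", "ALL")),
                    Source = c("BM", "BM", "BM", "PB", "BM", "PB"),
                    row.names = paste0("sample", 1:6))

# A list explaining what each variable represents, kept next to the data.
var_labels <- list(ALL.AML = "Leukemia type (ALL or AML)",
                   Source = "Tissue source (hypothetical codes)")

# Check that every variable has a description.
all_described <- setequal(names(pheno), names(var_labels))
print(all_described)

# The $ operator extracts one variable; table() counts samples per class.
class_counts <- table(pheno$ALL.AML)
print(class_counts)
```

In this toy example \texttt{table} reports 4 ALL samples and 2 AML samples.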

Data on only the first 10 genes in the first 3 chips can be obtained using the subsetting operator \texttt{"["}:
<<>>=
golubTrain[1:10,1:3]
@
Notice that when subsetting, we have arranged it so that the \textit{rows} correspond to genes and the \textit{columns} correspond to samples.
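
The same rows-are-genes, columns-are-samples convention can be tried on a plain matrix. This is a minimal base R sketch with made-up values, not a demonstration of the \texttt{exprSet} method itself.

```r
# Made-up 5 genes x 4 samples matrix of expression measures.
expr <- matrix(seq_len(20), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# First 3 genes (rows) in the first 2 samples (columns).
sub <- expr[1:3, 1:2]
print(sub)

# drop = FALSE keeps the matrix structure even when a single gene is selected.
one_gene <- expr[2, , drop = FALSE]
print(dim(one_gene))
```

Without \texttt{drop = FALSE}, selecting a single row of a matrix returns a plain vector and the gene name is lost from the dimensions.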

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Gene filtering package: \texttt{genefilter}}

The \texttt{genefilter} package provides functions for sequentially applying filters to the rows (genes) of a matrix or of an instance of the \texttt{exprSet} class. For instance, with the Golub et al. (1999) dataset, the following identifies the genes with absolute intensity greater than 10000 in at least 10 of the 38 arrays in the training set.

<<>>=
fg <- kOverA(k=10, A=10000)
flist <- filterfun(fg)
ans <- genefilter(golubTrain, flist)
sum(ans)
@
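
The logic of such a filter can be sketched in base R without the \texttt{genefilter} package. The helper names \texttt{k\_over\_a} and \texttt{combine\_filters} are hypothetical stand-ins written for this illustration, and the intensities are made up; the sketch assumes a gene passes when at least $k$ of its values exceed $A$ and when every filter in the list returns \texttt{TRUE}.

```r
# Build a filter that is TRUE when at least k values in x exceed A.
# "k_over_a" is a hypothetical stand-in, not the genefilter function.
k_over_a <- function(k, A) {
  function(x) sum(x > A, na.rm = TRUE) >= k
}

# Combine several filters: a gene passes only if every filter returns TRUE.
# "combine_filters" is also a hypothetical helper.
combine_filters <- function(...) {
  filters <- list(...)
  function(x) all(vapply(filters, function(f) f(x), logical(1)))
}

# Made-up intensities: 4 genes x 5 arrays.
expr <- matrix(c(12000, 15000,   300, 11000,   900,
                   200,   150,   400,   100,   250,
                 20000, 18000, 25000, 30000, 22000,
                 10500,   800, 10001,   950,   700),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("array", 1:5)))

# Keep genes with intensity above 10000 in at least 3 of the 5 arrays.
keep <- apply(expr, 1, combine_filters(k_over_a(k = 3, A = 10000)))
print(keep)
print(sum(keep))
```

Here \texttt{gene1} and \texttt{gene3} pass the filter, so \texttt{sum(keep)} is 2. Writing each filter as a function that returns a function lets the thresholds be fixed once and the resulting filter applied to every row.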



