\documentclass{article}
\title{Example of a Statistical Analysis}
\author{James W. MacDonald}
\usepackage{hyperref}
\parindent=0.25in
\usepackage[authoryear,round]{natbib}
\usepackage{times}
\usepackage{comment}
\bibliographystyle{plainnat}
\begin{document}
\maketitle
<>=
library(affycoretools)
library(KEGG)
library(xtable)
library(affyPLM)
## make AnnotatedDataFrame
pd <- read.AnnotatedDataFrame(paste(system.file("examples",package="affycoretools"),
"/pdata.txt", sep=""), header = TRUE, row.names = 1)
## no celfiles in package any more, fake this step
#dat <- ReadAffy()
#eset <- rma(dat)
load(paste(system.file("examples", package="affycoretools"),
"/abatch.Rdata", sep=""))
load(paste(system.file("examples", package="affycoretools"),
"/exprSet.Rdata", sep=""))
## load annotation package
options(show.error.messages = FALSE)
a <- try(do.call("library", list(annotation(eset))))
options(show.error.messages = TRUE)
if(inherits(a, "try-error")){
source("http://www.bioconductor.org/biocLite.R")
biocLite(annotation(eset))
do.call("library", list(annotation(eset)))
}
type <- rep(LETTERS[1:4], each = 3)
@
This analysis is based on microarrays that were processed in the microarray facility
in August 2004.
Filenames and samples were as follows:
\SweaveOpts{echo = false}
<>=
## Put a table of filenames and sample types in PDF
type <- rep(LETTERS[1:4], each = 3)
tmp <- data.frame(sampleNames(eset), type)
names(tmp) <- c("Filenames","Samples")
xtable(tmp)
@
The goal of this analysis is to see if there is a difference between the A and B
samples, as well as between the C and D samples.
The first step in my analysis was to make some quality control plots that can
be used to check the overall quality of the raw data.
\begin{figure}
\centering
<>=
plotHist(dat, sampleNames(eset))
plotHist(dat[,1:6])
plotHist(dat[,7:12])
@
\caption{Plot of perfect match (PM) chip densities}
\label{fig:digest}
\end{figure}
Figure~\ref{fig:digest} shows the distribution of the PM probes for each chip.
One of the underlying assumptions for the normalization procedure I use is that
the PM probe data all come from the same distribution, with the only differences
being the location and scale. Basically, this means that we want the shape of the
curves to be very similar, and we want the curves to be fairly
close to each other. There appears to be a large difference between the A/B and
C/D samples that may have an impact on our analysis. Since we are only concerned
with the comparison of A/B and C/D it might make sense to pre-process these samples
separately.
\begin{figure}
\centering
<>=
plotDeg(dat, sampleNames(eset))
plotDeg(dat[,1:6])
plotDeg(dat[,7:12])
@
\caption{RNA degradation plot}
\label{fig:deg}
\end{figure}
Figure~\ref{fig:deg} is designed to show differences between samples due to mRNA
degradation or from the \emph{in vitro} translation step. Any differences between
samples will be indicated by a different slope. Again, the only differences are
between the two sample sets. These two plots indicate that we should probably
process each set of samples separately and then combine later.
\begin{figure}
\centering
<>=
boxplot(data.frame(exprs(eset)))
@
\caption{Boxplot of Expression values from both sets}
\label{fig:box}
\end{figure}
I calculated expression values for each gene using a robust multi-array average (RMA)
\citet{Irizarry2003}.
This is a modeling strategy that converts the PM probe values into an expression
value for each gene. Note that the expression values are $log_2$ transformed data. These
data can be converted to the natural scale by exponentiating (e.g., convert by using
$2^x$, where \emph{x} is the expression value). Figure~\ref{fig:box} shows a boxplot
of the expression values for each set of data. It appears here that the data are
fairly well normalized.
I then fit a principal components analysis (PCA) on the
expression values and then plotted the first two principal components (PCs). PCA can
be used to visualize the overall structure of high dimensional data; in this case, we
are using it to see if the replicated samples are grouping together, which would
indicate that the replicates are relatively similar in their gene expression profiles.
\begin{figure}
\centering
<>=
plotPCA(eset, groups = rep(1:4, each = 3),
groupnames = unique(paste(pData(pd)[,1], pData(pd)[,2], sep = "-")))
plotPCA(eset[,1:6], groups=rep(1:2, each=3),
groupnames=unique(paste(pData(pd)[1:4,1], pData(pd)[1:4,2], sep="-")))
plotPCA(eset[,7:12], groups=rep(1:2, each=3),
groupnames=unique(paste(pData(pd)[7:10,1], pData(pd)[7:10,2], sep="-")))
@
\caption{PCA plot}
\label{fig:pca}
\end{figure}
Figure~\ref{fig:pca} shows the PCA plot. Here again we can see that
there is a very large difference between the A/B and C/D samples.
The PCA plot indicates that the replicated samples are quite similar,
but doesn't tell us much about the quality of the RMA model fit. For
this we can do boxplots of the normalized unscaled standard errors
(NUSE) from the model fit. Any chip that has large standard errors in
comparison to the others is probably of lower quality, as the model
isn't fitting the data from that chip very well.
\begin{figure}
\centering
<>=
library(affyPLM)
pset <- fitPLM(dat[,7:12])
boxplot(pset, xaxt="n", main="NUSE")
axis(1, at=7:12, label=paste("Sample", 1:6), las=2, cex.axis=0.8)
@
\caption{NUSE plot}
\label{fig:nuse}
\end{figure}
Figure~\ref{fig:nuse} shows the NUSE plot for the first six chips. The
standard errors are all quite similar for each chip, indicating that
the model fits the data similarly on each chip.
One last QA plot that might be of interest is a relative log
expression (RLE) plot. For this plot we compute the relative log
expression for each probeset on each chip, relative to the median
value for that probeset. Any chip that is very different from the
others in this plot is typically of lower quality.
\begin{figure}
\centering
<>=
Mbox(pset, xaxt="n", main="RLE")
axis(1, at=1:6, label=paste("Sample", 1:6), las=2, cex.axis=0.8)
@
\caption{NUSE plot}
\label{fig:nuse}
\end{figure}
<>=
## filter data to remove probesets that aren't changing
index <- apply(exprs(eset)[,1:6], 1, var) > 0.01
eset1 <- eset[index,]
## create design matrix and give reasonable column names
design <- model.matrix(~ 0 + factor(rep(1:4,each=3)))
colnames(design) <- LETTERS[1:4]
@
<>=
## set up a contrasts matrix using makeContrasts()
contrast <- makeContrasts(A - B, C - D, levels = design)
## now do the same using matrix()
contrast <- matrix(c(1,-1,0,0,0,0,1,-1), ncol=2,
dimnames=list(unique(type), paste(type[c(1,7)],type[c(4,10)],
sep=" vs ")))
@
<>=
## fit model
fit <- lmFit(eset1, design)
fit2 <- contrasts.fit(fit, contrast)
## empirical Bayes step
fit2 <- eBayes(fit2)
@
<>=
## output text and HTML tables using limma2annaffy()
out <- limma2annaffy(eset1, fit2, design, contrast, annotation(eset),
pfilt = 0.05, fldfilt = 1, save = TRUE, text = TRUE,
interactive = FALSE)
@
<>=
## output text and HTML tables using limma2biomaRt()
## this is usually more applicable to analyses of chips that don't have BioC
## annotation packages
## Change mysql to FALSE if you don't have RMySQL installed
out2 <- limma2biomaRt(eset1, fit2, design, contrast, "hsapiens",
ann.source="affy_hg_focus", pfilt=0.05, fldfilt=1,
addname="biomart", affyid=TRUE, mysql=TRUE, interactive=FALSE)
@
Prior to making comparisons, I filtered the genes to remove any that do not appear to be
differentially expressed in any samples, based on at least samples having an
expression value of or greater. This resulted in a total of \Sexpr{sum(index)}
genes. I then made the requested comparisons using a modeling strategy developed for microarray
analyses (\citet{Smyth:2004}), selecting those probesets with an adjusted
$p$-value less than 0.05 and a two-fold difference (adjusting for multiple comparisons using
false discovery rate (\citet{Benjamini:1995})). This resulted in the following number
of probesets:
<>=
a <- data.frame(paste(unique(type)[c(1,3)], unique(type)[c(2,4)], sep = " vs "),
sapply(out, function(x) dim(x)[1]))
names(a) <- c("Comparisons","Probesets")
xtable(a, digits = rep(0, 3))
@
I output these data in HTML and text tables that include various statistics, as well
as annotation for the different probesets. I also output all the expression values in a
text table that can be opened using a spreadsheet and used for ongoing analyses, or to
look for probesets that might not appear in the HTML tables.
Please note that I used the affy, and limma packages of Bioconductor for this analysis.
If you publish these results, please use the following citations.
\bibliography{BioC2007}
\end{document}