beadarray
is a package for the pre-processing and analysis of Illumina BeadArray. The main advantage is being able to read raw data created by Illumina’s scanning software. Data created in this manner are in the same format regardless of the assay (i.e expression, genotyping, methylation) being performed. Thus, beadarray is able to handle all these types of data. Many functions within beadarray have been written to cope with this flexibility.
The BeadArray technology involves randomly arranged arrays of beads, with beads having the same probe sequence attached colloquially known as a bead-type. BeadArrays are combined in parallel on either a rectangular chip (BeadChip) or a matrix of 8 by 12 hexagonal arrays (Sentrix Array Matrix or SAM). The BeadChip is further divided into strips on the surface known as sections, with each section giving rise to a different image when scanned by BeadScan. These images, and associated text files, comprise the raw data for a beadarray analysis. However, for BeadChips, the number of sections assigned to each biological sample may vary from 1 on HumanHT12 chips, 2 on HumanWG6 chips or sometimes ten or more for SNP chips with large numbers of SNPs being investigated.
This vignette demonstrates the analysis of bead summary data using beadarray. The recommended approach to obtain these data is to start with bead-level data and follow the steps illustrated in the vignette beadlevel.pdf
distributed with beadarray
. If bead-level data are not available, the output of Illumina’s BeadStudio or GenomeStudio can be read by beadarray
. Example code to do this is provided at the end of this vignette. However, the same object types are produced from either of these routes and the same functionality is available.
To make the most use of the code in this vignette, you will need to install the beadarrayExampleData
and illuminaHumanv3.db
packages from Bioconductor
. We use the BiocManager
package to install these:
install.packages("BiocManager")
BiocManager::install(c("beadarrayExampleData", "illuminaHumanv3.db"))
The code used to produce these example data is given in the vignette of beadarrayExampleData
, which follow similar steps to those described in the beadlevel.pdf
vignette of beadarray
. The following commands give a basic description of the data.
library("beadarray")
require(beadarrayExampleData)
data(exampleSummaryData)
exampleSummaryData
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 12 samples
## element names: exprs, se.exprs, nObservations
## protocolData: none
## phenoData
## rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
## varLabels: sampleID SampleFac
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
## total)
## fvarLabels: ArrayAddressID IlluminaID Status
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3
## QC Information
## Available Slots:
## QC Items: Date, Matrix, ..., SampleGroup, numBeads
## sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A
Summarized data are stored in an object of type ExpressionSetIllumina
which
is an extension of the ExpressionSet
class developed by the Bioconductor team as a container for data from high-throughput assays. Objects of this type use a series of slots to store the data. For consistency with the definition of other ExpressionSet
objects, we refer to the expression values as the exprs
matrix (this stores the probe-specific average intensities) which can be accessed using exprs
and subset in the usual manner. The se.exprs
matrix, which stores the probe-specific variability can be accessed using se.exprs
. You may notice that the expression values have already been transformed to the log\(_2\) scale, which is an option in the summarize
function in beadarray
. Data exported from BeadStudio or GenomeStudio will usually be un-transformed and on the scale \(0\) to \(2^{16}\).
exprs(exampleSummaryData)[1:5,1:5]
## G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
## ILMN_1802380 8.454468 8.616796 8.523001 8.420796
## ILMN_1893287 5.388161 5.419345 5.162849 5.133287
## ILMN_1736104 5.268626 5.457679 5.012766 4.988511
## ILMN_1792389 6.767519 7.183788 6.947624 7.168571
## ILMN_1854015 5.556947 5.721614 5.595413 5.520391
## G:4616443093_B
## ILMN_1802380 8.527748
## ILMN_1893287 5.221987
## ILMN_1736104 5.284026
## ILMN_1792389 7.386435
## ILMN_1854015 5.558717
se.exprs(exampleSummaryData)[1:5,1:5]
## G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
## ILMN_1802380 0.2833023 0.3367157 0.2750020 0.4141796
## ILMN_1893287 0.3963681 0.3882834 0.5516421 0.6761106
## ILMN_1736104 0.4704854 0.4951260 0.4031143 0.5276266
## ILMN_1792389 0.4038533 0.4728013 0.5032908 0.3447242
## ILMN_1854015 0.5663066 0.3783570 0.5511991 0.5358812
## G:4616443093_B
## ILMN_1802380 0.3581862
## ILMN_1893287 0.4448673
## ILMN_1736104 0.4864355
## ILMN_1792389 0.3951935
## ILMN_1854015 0.6748219
The fData
and pData
functions are useful shortcuts to find more information about the features (rows) and samples (columns) in the summary object. These annotations are created automatically whenever a bead-level data is summarized (see beadlevel.pdf
) or read from a BeadStudio file. The fData
will be added to later, but initially contains information on whether each probe is a control or not. In this example the phenoData
denotes the sample group for each array; either Brain or UHRR (Universal Human Reference RNA).
head(fData(exampleSummaryData))
## ArrayAddressID IlluminaID Status
## ILMN_1802380 10008 ILMN_1802380 regular
## ILMN_1893287 10010 ILMN_1893287 regular
## ILMN_1736104 10017 ILMN_1736104 regular
## ILMN_1792389 10019 ILMN_1792389 regular
## ILMN_1854015 10020 ILMN_1854015 regular
## ILMN_1904757 10021 ILMN_1904757 regular
table(fData(exampleSummaryData)[,"Status"])
##
## biotin cy3_hyb
## 2 2
## cy3_hyb,low_stringency_hyb housekeeping
## 4 7
## labeling low_stringency_hyb
## 2 4
## negative regular
## 759 48796
pData(exampleSummaryData)
## sampleID SampleFac
## 4613710017_B 4613710017_B UHRR
## 4613710052_B 4613710052_B UHRR
## 4613710054_B 4613710054_B UHRR
## 4616443079_B 4616443079_B UHRR
## 4616443093_B 4616443093_B UHRR
## 4616443115_B 4616443115_B UHRR
## 4616443081_B 4616443081_B Brain
## 4616443081_H 4616443081_H Brain
## 4616443092_B 4616443092_B Brain
## 4616443107_A 4616443107_A Brain
## 4616443136_A 4616443136_A Brain
## 4616494005_A 4616494005_A Brain
There are various way to subset an expressionSetIllumina
object, each of which returns an ExpressionSetIllumina
with the same slots, but different dimensions. When bead-level data are summarized by beadarray
there is an option to apply different transformation options, and save the results as different channels in the resultant object. For instance, if summarizing two-colour data one might be interested in summarizing the red and green channels, or some combination of the two, separately. Both log\(_2\) and un-logged data are stored in the exampleSummaryData
object and can be accessed by using the channel
function. Both the rows and columns in the resultant ExpressionSetIllumina
object are kept in the same order.
channelNames(exampleSummaryData)
## [1] "G" "G.ul"
exampleSummaryData.log2 <- channel(exampleSummaryData, "G")
exampleSummaryData.unlogged <- channel(exampleSummaryData, "G.ul")
sampleNames(exampleSummaryData.log2)
## [1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
## [6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
## [11] "4616443136_A" "4616494005_A"
sampleNames(exampleSummaryData.unlogged)
## [1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
## [6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
## [11] "4616443136_A" "4616494005_A"
exprs(exampleSummaryData.log2)[1:10,1:3]
## 4613710017_B 4613710052_B 4613710054_B
## ILMN_1802380 8.454468 8.616796 8.523001
## ILMN_1893287 5.388161 5.419345 5.162849
## ILMN_1736104 5.268626 5.457679 5.012766
## ILMN_1792389 6.767519 7.183788 6.947624
## ILMN_1854015 5.556947 5.721614 5.595413
## ILMN_1904757 5.421553 5.320500 5.522316
## ILMN_1740305 5.417821 5.623998 5.720007
## ILMN_1665168 5.321087 5.155455 4.967601
## ILMN_2375156 5.894207 6.076418 5.638877
## ILMN_1705423 5.426463 4.806624 5.357688
exprs(exampleSummaryData.unlogged)[1:10,1:3]
## 4613710017_B 4613710052_B 4613710054_B
## ILMN_1802380 356.88235 396.46875 367.81481
## ILMN_1893287 40.85000 44.29167 38.42105
## ILMN_1736104 40.53333 46.50000 33.46154
## ILMN_1792389 112.90909 153.17647 122.65000
## ILMN_1854015 50.47059 53.26087 51.57143
## ILMN_1904757 41.45833 42.10000 49.92593
## ILMN_1740305 38.45455 51.50000 46.21429
## ILMN_1665168 42.38889 37.95000 30.46154
## ILMN_2375156 61.47368 72.73913 52.46154
## ILMN_1705423 42.38889 28.14286 38.62500
As we have seen, the expression matrix of the ExpressionSetIllumina
object can be subset by column or row, In fact, the same subset operations can be performed on the ExpressionSetIllumina
object itself. In the following code, notice how the number of samples and features changes in the output.
exampleSummaryData.log2[,1:4]
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 4 samples
## element names: exprs, se.exprs, nObservations
## protocolData: none
## phenoData
## rowNames: 4613710017_B 4613710052_B 4613710054_B 4616443079_B
## varLabels: sampleID SampleFac
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
## total)
## fvarLabels: ArrayAddressID IlluminaID Status
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3
## QC Information
## Available Slots:
## QC Items: Date, Matrix, ..., SampleGroup, numBeads
## sampleNames: 4613710017_B, 4613710052_B, 4613710054_B, 4616443079_B
exampleSummaryData.log2[1:10,]
## ExpressionSetIllumina (storageMode: list)
## assayData: 10 features, 12 samples
## element names: exprs, se.exprs, nObservations
## protocolData: none
## phenoData
## rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
## varLabels: sampleID SampleFac
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1705423 (10 total)
## fvarLabels: ArrayAddressID IlluminaID Status
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3
## QC Information
## Available Slots:
## QC Items: Date, Matrix, ..., SampleGroup, numBeads
## sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A
The object can also be subset by a vector of characters which must correspond to the names of features (i.e. row names). Currently, no analogous functions is available to subset by sample.
randIDs <- sample(featureNames(exampleSummaryData), 1000)
exampleSummaryData[randIDs,]
## ExpressionSetIllumina (storageMode: list)
## assayData: 1000 features, 12 samples
## element names: exprs, se.exprs, nObservations
## protocolData: none
## phenoData
## rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
## varLabels: sampleID SampleFac
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1761422 ILMN_1693702 ... ILMN_1660426 (1000 total)
## fvarLabels: ArrayAddressID IlluminaID Status
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3
## QC Information
## Available Slots:
## QC Items: Date, Matrix, ..., SampleGroup, numBeads
## sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A
Boxplots of intensity levels and the number of beads are useful for quality assessment purposes. beadarray
includes a modified version of the boxplot
function that can take any valid ExpressionSetIllumina
object and plot the expression matrix by default. For these examples we plot just a subset of the original exampleSummaryData
object using random row IDs.
boxplot(exampleSummaryData.log2[randIDs,])
The function can also plot other assayData
items, such as the number of observations.
boxplot(exampleSummaryData.log2[randIDs,], what="nObservations")
The default boxplot plots a separate box for each array, but often it is beneficial for compare expression levels between different sample groups. If this information is stored in the phenoData
slot it can be incorporated into the plot. The following compares the overall expression level between UHRR and Brain samples.
boxplot(exampleSummaryData.log2[randIDs,], SampleGroup="SampleFac")
In a similar manner, we may wish to visualize the differences between sample groups for particular probe groups. As a simple example, we look at the difference between negative controls and regular probes for each array. You should notice that the negative controls as consistently lower (as expected) with the exception of array 4616443081_B
.
boxplot(exampleSummaryData.log2[randIDs,], probeFactor = "Status")
Extra feature annotation is available from annotation packages in Bioconductor, and beadarray
includes functionality to extract these data from the annotation packages. The annotation
of the object must be set in order that the correct annotation package can be loaded. For example, the exampleSummaryData
object was generated from Humanv3
data so the illuminaHumanv3.db
package must be present. The addFeatureData
function annotates all features of an ExpressionSetIllumina
object using particular mappings from the illuminaHumanv3.db
package. To see which mappings are available you can use
the illuminaHumanv3()
function, or equivalent from other packages.
annotation(exampleSummaryData)
## [1] "Humanv3"
exampleSummaryData.log2 <- addFeatureData(exampleSummaryData.log2,
toAdd = c("SYMBOL", "PROBEQUALITY", "CODINGZONE", "PROBESEQUENCE", "GENOMICLOCATION"))
head(fData(exampleSummaryData.log2))
## Row.names ArrayAddressID IlluminaID Status SYMBOL
## ILMN_1802380 ILMN_1802380 10008 ILMN_1802380 regular RERE
## ILMN_1893287 ILMN_1893287 10010 ILMN_1893287 regular <NA>
## ILMN_1736104 ILMN_1736104 10017 ILMN_1736104 regular <NA>
## ILMN_1792389 ILMN_1792389 10019 ILMN_1792389 regular ARK2C
## ILMN_1854015 ILMN_1854015 10020 ILMN_1854015 regular <NA>
## ILMN_1904757 ILMN_1904757 10021 ILMN_1904757 regular <NA>
## PROBEQUALITY CODINGZONE
## ILMN_1802380 Perfect Transcriptomic
## ILMN_1893287 Bad Transcriptomic?
## ILMN_1736104 Bad Intergenic
## ILMN_1792389 Perfect Transcriptomic
## ILMN_1854015 Bad Intergenic
## ILMN_1904757 Perfect*** Transcriptomic?
## PROBESEQUENCE
## ILMN_1802380 GCCCTGACCTTCATGGTGTCTTTGAAGCCCAACCACTCGGTTTCCTTCGG
## ILMN_1893287 GGATTTCCTACACTCTCCACTTCTGAATGCTTGGAAACACTTGCCATGCT
## ILMN_1736104 TGCCATCTTTGCTCCACTGTGAGAGGCTGCTCACACCACCCCCTACATGC
## ILMN_1792389 CTGTAGCAACGTCTGTCAGGCCCCCTTGTGTTTCATCTCCTGCGCGCGTA
## ILMN_1854015 GCAGAAAACCATGAGCTGAAATCTCTACAGGAACCAGTGCTGGGGTAGGG
## ILMN_1904757 AGCTGTACCGTGGGGAGGCTTGGTCCTCTTGCCCCATTTGTGTGATGTCT
## GENOMICLOCATION
## ILMN_1802380 chr1:8412758:8412807:-
## ILMN_1893287 chr9:42489407:42489456:+
## ILMN_1736104 chr3:134572184:134572223:-
## ILMN_1792389 chr18:44040244:44040293:+
## ILMN_1854015 chr3:160827837:160827885:+
## ILMN_1904757 chr3:197872267:197872316:+
illuminaHumanv3()
## ####Mappings based on RefSeqID####
## Quality control information for illuminaHumanv3:
##
##
## This package has the following mappings:
##
## illuminaHumanv3ACCNUM has 31857 mapped keys (of 49576 keys)
## illuminaHumanv3ALIAS2PROBE has 63289 mapped keys (of 260933 keys)
## illuminaHumanv3CHR has 29550 mapped keys (of 49576 keys)
## illuminaHumanv3CHRLENGTHS has 93 mapped keys (of 711 keys)
## illuminaHumanv3CHRLOC has 29354 mapped keys (of 49576 keys)
## illuminaHumanv3CHRLOCEND has 29354 mapped keys (of 49576 keys)
## illuminaHumanv3ENSEMBL has 29154 mapped keys (of 49576 keys)
## illuminaHumanv3ENSEMBL2PROBE has 20997 mapped keys (of 40839 keys)
## illuminaHumanv3ENTREZID has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3ENZYME has 3526 mapped keys (of 49576 keys)
## illuminaHumanv3ENZYME2PROBE has 967 mapped keys (of 975 keys)
## illuminaHumanv3GENENAME has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3GO has 26989 mapped keys (of 49576 keys)
## illuminaHumanv3GO2ALLPROBES has 19471 mapped keys (of 22446 keys)
## illuminaHumanv3GO2PROBE has 15200 mapped keys (of 18678 keys)
## illuminaHumanv3MAP has 29402 mapped keys (of 49576 keys)
## illuminaHumanv3OMIM has 21943 mapped keys (of 49576 keys)
## illuminaHumanv3PATH has 9180 mapped keys (of 49576 keys)
## illuminaHumanv3PATH2PROBE has 229 mapped keys (of 229 keys)
## illuminaHumanv3PMID has 29329 mapped keys (of 49576 keys)
## illuminaHumanv3PMID2PROBE has 439054 mapped keys (of 800658 keys)
## illuminaHumanv3REFSEQ has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3SYMBOL has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3UNIPROT has 27704 mapped keys (of 49576 keys)
##
##
## Additional Information about this package:
##
## DB schema: HUMANCHIP_DB
## DB schema version: 2.1
## Organism: Homo sapiens
## Date for NCBI data: 2015-Mar17
## Date for GO data: 20150314
## Date for KEGG data: 2011-Mar15
## Date for Golden Path data: 2010-Mar22
## Date for Ensembl data: 2015-Mar13
## ####Custom Mappings based on probe sequence####
## illuminaHumanv3ARRAYADDRESS()
## illuminaHumanv3NUID()
## illuminaHumanv3PROBEQUALITY()
## illuminaHumanv3CODINGZONE()
## illuminaHumanv3PROBESEQUENCE()
## illuminaHumanv3SECONDMATCHES()
## illuminaHumanv3OTHERGENOMICMATCHES()
## illuminaHumanv3REPEATMASK()
## illuminaHumanv3OVERLAPPINGSNP()
## illuminaHumanv3ENTREZREANNOTATED()
## illuminaHumanv3GENOMICLOCATION()
## illuminaHumanv3SYMBOLREANNOTATED()
## illuminaHumanv3REPORTERGROUPNAME()
## illuminaHumanv3REPORTERGROUPID()
## illuminaHumanv3ENSEMBLREANNOTATED()
If we suspect that a particular gene may be differentially expressed between conditions, we can subset the ExpressionSetIllumina
object to just include probes that target the gene, and plot the response of these probes against the sample groups. Furthermore, the different probes can be distinguished using the probeFactor
parameter.
ids <- which(fData(exampleSummaryData.log2)[,"SYMBOL"] == "ALB")
boxplot(exampleSummaryData.log2[ids,],
SampleGroup = "SampleFac", probeFactor = "IlluminaID")