Contents

beadarray is a package for the pre-processing and analysis of Illumina BeadArray. The main advantage is being able to read raw data created by Illumina’s scanning software. Data created in this manner are in the same format regardless of the assay (i.e expression, genotyping, methylation) being performed. Thus, beadarray is able to handle all these types of data. Many functions within beadarray have been written to cope with this flexibility.

The BeadArray technology involves randomly arranged arrays of beads, with beads having the same probe sequence attached colloquially known as a bead-type. BeadArrays are combined in parallel on either a rectangular chip (BeadChip) or a matrix of 8 by 12 hexagonal arrays (Sentrix Array Matrix or SAM). The BeadChip is further divided into strips on the surface known as sections, with each section giving rise to a different image when scanned by BeadScan. These images, and associated text files, comprise the raw data for a beadarray analysis. However, for BeadChips, the number of sections assigned to each biological sample may vary from 1 on HumanHT12 chips, 2 on HumanWG6 chips or sometimes ten or more for SNP chips with large numbers of SNPs being investigated.

This vignette demonstrates the analysis of bead summary data using beadarray. The recommended approach to obtain these data is to start with bead-level data and follow the steps illustrated in the vignette beadlevel.pdf distributed with beadarray. If bead-level data are not available, the output of Illumina’s BeadStudio or GenomeStudio can be read by beadarray. Example code to do this is provided at the end of this vignette. However, the same object types are produced from either of these routes and the same functionality is available.

To make the most use of the code in this vignette, you will need to install the beadarrayExampleData and illuminaHumanv3.db packages from Bioconductor. We use the BiocManager package to install these:

install.packages("BiocManager")
BiocManager::install(c("beadarrayExampleData", "illuminaHumanv3.db"))

The code used to produce these example data is given in the vignette of beadarrayExampleData, which follow similar steps to those described in the beadlevel.pdf vignette of beadarray. The following commands give a basic description of the data.

library("beadarray")

require(beadarrayExampleData)

data(exampleSummaryData)

exampleSummaryData
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
##     total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

Summarized data are stored in an object of type ExpressionSetIllumina which is an extension of the ExpressionSet class developed by the Bioconductor team as a container for data from high-throughput assays. Objects of this type use a series of slots to store the data. For consistency with the definition of other ExpressionSet objects, we refer to the expression values as the exprs matrix (this stores the probe-specific average intensities) which can be accessed using exprs and subset in the usual manner. The se.exprs matrix, which stores the probe-specific variability can be accessed using se.exprs. You may notice that the expression values have already been transformed to the log\(_2\) scale, which is an option in the summarize function in beadarray. Data exported from BeadStudio or GenomeStudio will usually be un-transformed and on the scale \(0\) to \(2^{16}\).

exprs(exampleSummaryData)[1:5,1:5]
##              G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
## ILMN_1802380       8.454468       8.616796       8.523001       8.420796
## ILMN_1893287       5.388161       5.419345       5.162849       5.133287
## ILMN_1736104       5.268626       5.457679       5.012766       4.988511
## ILMN_1792389       6.767519       7.183788       6.947624       7.168571
## ILMN_1854015       5.556947       5.721614       5.595413       5.520391
##              G:4616443093_B
## ILMN_1802380       8.527748
## ILMN_1893287       5.221987
## ILMN_1736104       5.284026
## ILMN_1792389       7.386435
## ILMN_1854015       5.558717
se.exprs(exampleSummaryData)[1:5,1:5]
##              G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
## ILMN_1802380      0.2833023      0.3367157      0.2750020      0.4141796
## ILMN_1893287      0.3963681      0.3882834      0.5516421      0.6761106
## ILMN_1736104      0.4704854      0.4951260      0.4031143      0.5276266
## ILMN_1792389      0.4038533      0.4728013      0.5032908      0.3447242
## ILMN_1854015      0.5663066      0.3783570      0.5511991      0.5358812
##              G:4616443093_B
## ILMN_1802380      0.3581862
## ILMN_1893287      0.4448673
## ILMN_1736104      0.4864355
## ILMN_1792389      0.3951935
## ILMN_1854015      0.6748219

0.1 feature and pheno data

The fData and pData functions are useful shortcuts to find more information about the features (rows) and samples (columns) in the summary object. These annotations are created automatically whenever a bead-level data is summarized (see beadlevel.pdf) or read from a BeadStudio file. The fData will be added to later, but initially contains information on whether each probe is a control or not. In this example the phenoData denotes the sample group for each array; either Brain or UHRR (Universal Human Reference RNA).

head(fData(exampleSummaryData))
##              ArrayAddressID   IlluminaID  Status
## ILMN_1802380          10008 ILMN_1802380 regular
## ILMN_1893287          10010 ILMN_1893287 regular
## ILMN_1736104          10017 ILMN_1736104 regular
## ILMN_1792389          10019 ILMN_1792389 regular
## ILMN_1854015          10020 ILMN_1854015 regular
## ILMN_1904757          10021 ILMN_1904757 regular
table(fData(exampleSummaryData)[,"Status"])
## 
##                     biotin                    cy3_hyb 
##                          2                          2 
## cy3_hyb,low_stringency_hyb               housekeeping 
##                          4                          7 
##                   labeling         low_stringency_hyb 
##                          2                          4 
##                   negative                    regular 
##                        759                      48796
pData(exampleSummaryData)
##                  sampleID SampleFac
## 4613710017_B 4613710017_B      UHRR
## 4613710052_B 4613710052_B      UHRR
## 4613710054_B 4613710054_B      UHRR
## 4616443079_B 4616443079_B      UHRR
## 4616443093_B 4616443093_B      UHRR
## 4616443115_B 4616443115_B      UHRR
## 4616443081_B 4616443081_B     Brain
## 4616443081_H 4616443081_H     Brain
## 4616443092_B 4616443092_B     Brain
## 4616443107_A 4616443107_A     Brain
## 4616443136_A 4616443136_A     Brain
## 4616494005_A 4616494005_A     Brain

0.2 Subsetting the data

There are various way to subset an expressionSetIllumina object, each of which returns an ExpressionSetIllumina with the same slots, but different dimensions. When bead-level data are summarized by beadarray there is an option to apply different transformation options, and save the results as different channels in the resultant object. For instance, if summarizing two-colour data one might be interested in summarizing the red and green channels, or some combination of the two, separately. Both log\(_2\) and un-logged data are stored in the exampleSummaryData object and can be accessed by using the channel function. Both the rows and columns in the resultant ExpressionSetIllumina object are kept in the same order.

channelNames(exampleSummaryData)
## [1] "G"    "G.ul"
exampleSummaryData.log2 <- channel(exampleSummaryData, "G")
exampleSummaryData.unlogged <- channel(exampleSummaryData, "G.ul")


sampleNames(exampleSummaryData.log2)
##  [1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
##  [6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
## [11] "4616443136_A" "4616494005_A"
sampleNames(exampleSummaryData.unlogged)
##  [1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
##  [6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
## [11] "4616443136_A" "4616494005_A"
exprs(exampleSummaryData.log2)[1:10,1:3]
##              4613710017_B 4613710052_B 4613710054_B
## ILMN_1802380     8.454468     8.616796     8.523001
## ILMN_1893287     5.388161     5.419345     5.162849
## ILMN_1736104     5.268626     5.457679     5.012766
## ILMN_1792389     6.767519     7.183788     6.947624
## ILMN_1854015     5.556947     5.721614     5.595413
## ILMN_1904757     5.421553     5.320500     5.522316
## ILMN_1740305     5.417821     5.623998     5.720007
## ILMN_1665168     5.321087     5.155455     4.967601
## ILMN_2375156     5.894207     6.076418     5.638877
## ILMN_1705423     5.426463     4.806624     5.357688
exprs(exampleSummaryData.unlogged)[1:10,1:3]
##              4613710017_B 4613710052_B 4613710054_B
## ILMN_1802380    356.88235    396.46875    367.81481
## ILMN_1893287     40.85000     44.29167     38.42105
## ILMN_1736104     40.53333     46.50000     33.46154
## ILMN_1792389    112.90909    153.17647    122.65000
## ILMN_1854015     50.47059     53.26087     51.57143
## ILMN_1904757     41.45833     42.10000     49.92593
## ILMN_1740305     38.45455     51.50000     46.21429
## ILMN_1665168     42.38889     37.95000     30.46154
## ILMN_2375156     61.47368     72.73913     52.46154
## ILMN_1705423     42.38889     28.14286     38.62500

As we have seen, the expression matrix of the ExpressionSetIllumina object can be subset by column or row, In fact, the same subset operations can be performed on the ExpressionSetIllumina object itself. In the following code, notice how the number of samples and features changes in the output.

exampleSummaryData.log2[,1:4]
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 4 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B 4613710054_B 4616443079_B
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
##     total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, 4613710054_B, 4616443079_B
exampleSummaryData.log2[1:10,]
## ExpressionSetIllumina (storageMode: list)
## assayData: 10 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1705423 (10 total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

The object can also be subset by a vector of characters which must correspond to the names of features (i.e. row names). Currently, no analogous functions is available to subset by sample.

randIDs <- sample(featureNames(exampleSummaryData), 1000)

exampleSummaryData[randIDs,]
## ExpressionSetIllumina (storageMode: list)
## assayData: 1000 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1761422 ILMN_1693702 ... ILMN_1660426 (1000 total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

0.3 Exploratory analysis using boxplots

Boxplots of intensity levels and the number of beads are useful for quality assessment purposes. beadarray includes a modified version of the boxplot function that can take any valid ExpressionSetIllumina object and plot the expression matrix by default. For these examples we plot just a subset of the original exampleSummaryData object using random row IDs.

boxplot(exampleSummaryData.log2[randIDs,])

The function can also plot other assayData items, such as the number of observations.

boxplot(exampleSummaryData.log2[randIDs,], what="nObservations")

The default boxplot plots a separate box for each array, but often it is beneficial for compare expression levels between different sample groups. If this information is stored in the phenoData slot it can be incorporated into the plot. The following compares the overall expression level between UHRR and Brain samples.

boxplot(exampleSummaryData.log2[randIDs,], SampleGroup="SampleFac")

In a similar manner, we may wish to visualize the differences between sample groups for particular probe groups. As a simple example, we look at the difference between negative controls and regular probes for each array. You should notice that the negative controls as consistently lower (as expected) with the exception of array 4616443081_B.

boxplot(exampleSummaryData.log2[randIDs,], probeFactor = "Status")

Extra feature annotation is available from annotation packages in Bioconductor, and beadarray includes functionality to extract these data from the annotation packages. The annotation of the object must be set in order that the correct annotation package can be loaded. For example, the exampleSummaryData object was generated from Humanv3 data so the illuminaHumanv3.db package must be present. The addFeatureData function annotates all features of an ExpressionSetIllumina object using particular mappings from the illuminaHumanv3.db package. To see which mappings are available you can use the illuminaHumanv3() function, or equivalent from other packages.

annotation(exampleSummaryData)
## [1] "Humanv3"
exampleSummaryData.log2 <- addFeatureData(exampleSummaryData.log2, 
toAdd = c("SYMBOL", "PROBEQUALITY", "CODINGZONE", "PROBESEQUENCE", "GENOMICLOCATION"))

head(fData(exampleSummaryData.log2))
##                 Row.names ArrayAddressID   IlluminaID  Status SYMBOL
## ILMN_1802380 ILMN_1802380          10008 ILMN_1802380 regular   RERE
## ILMN_1893287 ILMN_1893287          10010 ILMN_1893287 regular   <NA>
## ILMN_1736104 ILMN_1736104          10017 ILMN_1736104 regular   <NA>
## ILMN_1792389 ILMN_1792389          10019 ILMN_1792389 regular  ARK2C
## ILMN_1854015 ILMN_1854015          10020 ILMN_1854015 regular   <NA>
## ILMN_1904757 ILMN_1904757          10021 ILMN_1904757 regular   <NA>
##              PROBEQUALITY      CODINGZONE
## ILMN_1802380      Perfect  Transcriptomic
## ILMN_1893287          Bad Transcriptomic?
## ILMN_1736104          Bad      Intergenic
## ILMN_1792389      Perfect  Transcriptomic
## ILMN_1854015          Bad      Intergenic
## ILMN_1904757   Perfect*** Transcriptomic?
##                                                   PROBESEQUENCE
## ILMN_1802380 GCCCTGACCTTCATGGTGTCTTTGAAGCCCAACCACTCGGTTTCCTTCGG
## ILMN_1893287 GGATTTCCTACACTCTCCACTTCTGAATGCTTGGAAACACTTGCCATGCT
## ILMN_1736104 TGCCATCTTTGCTCCACTGTGAGAGGCTGCTCACACCACCCCCTACATGC
## ILMN_1792389 CTGTAGCAACGTCTGTCAGGCCCCCTTGTGTTTCATCTCCTGCGCGCGTA
## ILMN_1854015 GCAGAAAACCATGAGCTGAAATCTCTACAGGAACCAGTGCTGGGGTAGGG
## ILMN_1904757 AGCTGTACCGTGGGGAGGCTTGGTCCTCTTGCCCCATTTGTGTGATGTCT
##                         GENOMICLOCATION
## ILMN_1802380     chr1:8412758:8412807:-
## ILMN_1893287   chr9:42489407:42489456:+
## ILMN_1736104 chr3:134572184:134572223:-
## ILMN_1792389  chr18:44040244:44040293:+
## ILMN_1854015 chr3:160827837:160827885:+
## ILMN_1904757 chr3:197872267:197872316:+
illuminaHumanv3()
## ####Mappings based on RefSeqID####
## Quality control information for illuminaHumanv3:
## 
## 
## This package has the following mappings:
## 
## illuminaHumanv3ACCNUM has 31857 mapped keys (of 49576 keys)
## illuminaHumanv3ALIAS2PROBE has 63289 mapped keys (of 260933 keys)
## illuminaHumanv3CHR has 29550 mapped keys (of 49576 keys)
## illuminaHumanv3CHRLENGTHS has 93 mapped keys (of 711 keys)
## illuminaHumanv3CHRLOC has 29354 mapped keys (of 49576 keys)
## illuminaHumanv3CHRLOCEND has 29354 mapped keys (of 49576 keys)
## illuminaHumanv3ENSEMBL has 29154 mapped keys (of 49576 keys)
## illuminaHumanv3ENSEMBL2PROBE has 20997 mapped keys (of 40839 keys)
## illuminaHumanv3ENTREZID has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3ENZYME has 3526 mapped keys (of 49576 keys)
## illuminaHumanv3ENZYME2PROBE has 967 mapped keys (of 975 keys)
## illuminaHumanv3GENENAME has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3GO has 26989 mapped keys (of 49576 keys)
## illuminaHumanv3GO2ALLPROBES has 19471 mapped keys (of 22446 keys)
## illuminaHumanv3GO2PROBE has 15200 mapped keys (of 18678 keys)
## illuminaHumanv3MAP has 29402 mapped keys (of 49576 keys)
## illuminaHumanv3OMIM has 21943 mapped keys (of 49576 keys)
## illuminaHumanv3PATH has 9180 mapped keys (of 49576 keys)
## illuminaHumanv3PATH2PROBE has 229 mapped keys (of 229 keys)
## illuminaHumanv3PMID has 29329 mapped keys (of 49576 keys)
## illuminaHumanv3PMID2PROBE has 439054 mapped keys (of 800658 keys)
## illuminaHumanv3REFSEQ has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3SYMBOL has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3UNIPROT has 27704 mapped keys (of 49576 keys)
## 
## 
## Additional Information about this package:
## 
## DB schema: HUMANCHIP_DB
## DB schema version: 2.1
## Organism: Homo sapiens
## Date for NCBI data: 2015-Mar17
## Date for GO data: 20150314
## Date for KEGG data: 2011-Mar15
## Date for Golden Path data: 2010-Mar22
## Date for Ensembl data: 2015-Mar13
## ####Custom Mappings based on probe sequence####
## illuminaHumanv3ARRAYADDRESS()
## illuminaHumanv3NUID()
## illuminaHumanv3PROBEQUALITY()
## illuminaHumanv3CODINGZONE()
## illuminaHumanv3PROBESEQUENCE()
## illuminaHumanv3SECONDMATCHES()
## illuminaHumanv3OTHERGENOMICMATCHES()
## illuminaHumanv3REPEATMASK()
## illuminaHumanv3OVERLAPPINGSNP()
## illuminaHumanv3ENTREZREANNOTATED()
## illuminaHumanv3GENOMICLOCATION()
## illuminaHumanv3SYMBOLREANNOTATED()
## illuminaHumanv3REPORTERGROUPNAME()
## illuminaHumanv3REPORTERGROUPID()
## illuminaHumanv3ENSEMBLREANNOTATED()

If we suspect that a particular gene may be differentially expressed between conditions, we can subset the ExpressionSetIllumina object to just include probes that target the gene, and plot the response of these probes against the sample groups. Furthermore, the different probes can be distinguished using the probeFactor parameter.

ids <- which(fData(exampleSummaryData.log2)[,"SYMBOL"] == "ALB")

boxplot(exampleSummaryData.log2[ids,], 
  SampleGroup = "SampleFac", probeFactor = "IlluminaID")