This workflow requires R version >= 3.4 and the following packages:
xcms
: for pre-processing of metabolomics data. A package version >= 2.99.0 is required which, if not yet in Bioconductor, can be installed from github using devtools::install_github("sneumann/xcms", ref = "xcms3")
. The package depends on mzR
for data import and uses classes from the MSnbase
package.faahKO
: provides the data set processed here.RColorBrewer
: for color definitions.doParallel
: to set-up parallel processing. xcms
uses parallel processing with the BiocParallel
package in most algorithms. While BiocParallel
’s default parallel processing setup (i.e. using multicore
on Unix systems and snow
for Windows) works on most systems this seems to be problematic/erratic on recent macOS versions. Pre-registering and setting up the parallel processing first using the doParallel
package and using this in BiocParallel
seems to avoid such problems.## Style
library(BiocStyle)
## Data processing
library(xcms)
## Toy data set for data processing
library(faahKO)
## For color definitions
library(RColorBrewer)
## For a working parallel processing setup on Mac systems
library(doParallel)
Parallel processing setup. As detailed above we’re using the doParallel
setup with a pre-defined set of cluster nodes.
## Set up the parallel processing using doParallel.
registerDoParallel(4)
## Register this setup for BiocParallel and make it default.
register(DoparParam(), default = TRUE)
This workflow illustrates the use of Bioconductor packages for the pre-processing of LCMS-based metabolomics data sets. This pre-processing aims to identify the features possibly corresponding to metabolites and to quantify them. While other package exist too (such as yamss
[1]), this workshop focuses on xcms
[2] and uses the functionality of its new user interface.
The topics covered in this document are:
The result from this metabolomics pre-processing is a table of intensities with rows corresponding to features (ions with a unique mass-to-charge ratio and retention time) and columns to samples.
The workflow does not cover (yet) data normalization strategies or identification of metabolites (i.e. mapping of detected features to corresponding metabolites). Also, the workflow focuses on gas/liquid chromatography (GC/LC) mass spectrometry (MS) which combines the physical separation capabilities of the chromatography with the mass analysis capabilities of the MS. Chromatography is usually preferred in mass spectrometry experiments with complex samples because the added dimension (the retention time rt) helps to discriminate between analytes with identical mass-to-charge ratios (mz).
For more details and description of the general metabolomics data workflow see [3][4].
In metabolomics data analysis, peaks are identified along the retention time axis. These peaks are referred to as chromatographic peaks in this document to distinguish them from mass peaks (i.e. peaks within a spectrum along the mz dimension). The term feature refers to individual ions with a unique mass-to-charge ratio (m/z, or mz) and a unique retention time (rt). Such features are defined during the correspondence step in which the detected chromatographic peaks are matched between (and within) samples.
A variety of different, vendor-specific, file formats exist for proteomics/metabolomics. This data files have to be converted first into one of the file formats supported by the mzR
package, i.e. the open mzML or the netCDF file format. For metabolomics experiments it is suggested to load the data using the readMSData2
function (from the MSnbase
package [5]) into an OnDiskMSnExp
object. Such objects hold only general spectra data (such as number of spectra in a file or the retention time of each spectrum) in memory while the full raw data (the mz and intensity values per spectrum) is accessed on demand. This keeps the memory footprint small and enables thus analyses also of large scale metabolomics experiments (see help pages and vignettes of the MSnbase
package for more details on the OnDiskMSnExp
object). The new user interface of the xcms
package, that’s used in this document, makes extensive use of objects from the MSnbase
package and hence reuses all of their functionality.
Below we read the raw data files (in netCDF format) from the faahKO
data package. This package provides a subset of the data from [6] examining the metabolic consequences of the knock-out of the fatty acid amide hydrolase (Faah) gene in mice. The data subset comprises spinal cord samples from 6 knock-out and 6 wild type mice. Each file contains centroided data from an LC-MS experiment acquired in positive ion mode in a mz range from 200-600 and retention time range from 2500-4500 seconds.
## Get the file names
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
full.names = TRUE)
## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that can be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = s_groups, stringsAsFactors = FALSE)
## Read the data.
faahKO_raw <- readMSData2(cdf_files, pdata = new("NAnnotatedDataFrame", pheno))
The data is organized by spectrum, i.e. for each retention time we have a Spectrum1
object containing the mz and intensity duplets measured by the mass spec. Below we extract one of the spectra and evaluate the data it contains. Individual spectra can be extracted using [[
, which causes the MS data for the particular spectrum to be imported from the original data file.
## Access the 3rd spectrum in the data set.
spctr <- faahKO_raw[[3]]
## Get the retention time when the spectrum was measured
rtime(spctr)
## [1] 2501.38
## Access the mz of the spectrum
head(mz(spctr))
## [1] 201.0 202.1 203.1 204.1 205.1 205.8
## And the associated intensities
head(intensity(spctr))
## [1] 1083 672 1211 569 1244 452
## Optionally plot the Spectrum by plotting the mz values on the x- and
## the associated intensities on the y-axis.
We can use various accessor functions to extract information from the OnDiskMSnExp
object, such as rtime
to get the retention time for each spectrum. Many of these methods directly access information stored in the object’s fData
(corresponding to the spectrum headers in the mzML/netCDF files) and are thus very fast. The mz
, intensity
and spectra
methods on the other hand read the original data on the fly and are thus slower.
It is also important to note that the spectrum data within the object is not organized by sample and data is always returned as a one-dimensional vector. The association between a spectrum and the file from which it originates is provided by the fromFile
method which returns an integer vector with the index of the file from which the spectrum was extracted.
Various filter methods allow a fast and simple sub-setting of the full experiment. In the example below we create a total ion chromatogram (TIC) using the filterFile
method to subset the object to data from a certain file. The total ion current per spectrum is extracted with the tic
method. The TIC plots the sum of all measured intensities for a given retention time (i.e. a spectrum) against the retention time.
## Define the sample colors
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")
## Subset the full raw data by file and plot it.
tmp <- filterFile(faahKO_raw, file = 1)
plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
col = paste0(sample_colors[pData(tmp)$sample_group], 80), type = "l")
for (i in 2:length(fileNames(faahKO_raw))) {
tmp <- filterFile(faahKO_raw, file = i)
points(rtime(tmp), tic(tmp), type = "l",
col = paste0(sample_colors[pData(tmp)$sample_group], 80))
}
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)