1 Metabolomics data pre-processing

1.1 Setup

This workflow requires R version >= 3.4 and the following packages:

  • xcms: for pre-processing of metabolomics data. A package version >= 2.99.0 is required which, if not yet in Bioconductor, can be installed from github using devtools::install_github("sneumann/xcms", ref = "xcms3"). The package depends on mzR for data import and uses classes from the MSnbase package.
  • faahKO: provides the data set processed here.
  • RColorBrewer: for color definitions.
  • doParallel: to set-up parallel processing. xcms uses parallel processing with the BiocParallel package in most algorithms. While BiocParallel’s default parallel processing setup (i.e. using multicore on Unix systems and snow for Windows) works on most systems this seems to be problematic/erratic on recent macOS versions. Pre-registering and setting up the parallel processing first using the doParallel package and using this in BiocParallel seems to avoid such problems.
## Style
## Data processing
## Toy data set for data processing
## For color definitions
## For a working parallel processing setup on Mac systems

Parallel processing setup. As detailed above we’re using the doParallel setup with a pre-defined set of cluster nodes.

## Set up the parallel processing using doParallel.
## Register this setup for BiocParallel and make it default.
register(DoparParam(), default = TRUE)

1.2 Introduction

This workflow illustrates the use of Bioconductor packages for the pre-processing of LCMS-based metabolomics data sets. This pre-processing aims to identify the features possibly corresponding to metabolites and to quantify them. While other package exist too (such as yamss [1]), this workshop focuses on xcms [2] and uses the functionality of its new user interface.

The topics covered in this document are:

  • Data import, organization and handling.
  • Chromatographic peak detection.
  • Alignment (retention time correction).
  • Correspondence (matching of chromatographic peaks within and between samples).

The result from this metabolomics pre-processing is a table of intensities with rows corresponding to features (ions with a unique mass-to-charge ratio and retention time) and columns to samples.

The workflow does not cover (yet) data normalization strategies or identification of metabolites (i.e. mapping of detected features to corresponding metabolites). Also, the workflow focuses on gas/liquid chromatography (GC/LC) mass spectrometry (MS) which combines the physical separation capabilities of the chromatography with the mass analysis capabilities of the MS. Chromatography is usually preferred in mass spectrometry experiments with complex samples because the added dimension (the retention time rt) helps to discriminate between analytes with identical mass-to-charge ratios (mz).

For more details and description of the general metabolomics data workflow see [3][4].

1.3 Definitions and naming convention

In metabolomics data analysis, peaks are identified along the retention time axis. These peaks are referred to as chromatographic peaks in this document to distinguish them from mass peaks (i.e. peaks within a spectrum along the mz dimension). The term feature refers to individual ions with a unique mass-to-charge ratio (m/z, or mz) and a unique retention time (rt). Such features are defined during the correspondence step in which the detected chromatographic peaks are matched between (and within) samples.

1.4 Data import, organization and handling

A variety of different, vendor-specific, file formats exist for proteomics/metabolomics. This data files have to be converted first into one of the file formats supported by the mzR package, i.e. the open mzML or the netCDF file format. For metabolomics experiments it is suggested to load the data using the readMSData2 function (from the MSnbase package [5]) into an OnDiskMSnExp object. Such objects hold only general spectra data (such as number of spectra in a file or the retention time of each spectrum) in memory while the full raw data (the mz and intensity values per spectrum) is accessed on demand. This keeps the memory footprint small and enables thus analyses also of large scale metabolomics experiments (see help pages and vignettes of the MSnbase package for more details on the OnDiskMSnExp object). The new user interface of the xcms package, that’s used in this document, makes extensive use of objects from the MSnbase package and hence reuses all of their functionality.

Below we read the raw data files (in netCDF format) from the faahKO data package. This package provides a subset of the data from [6] examining the metabolic consequences of the knock-out of the fatty acid amide hydrolase (Faah) gene in mice. The data subset comprises spinal cord samples from 6 knock-out and 6 wild type mice. Each file contains centroided data from an LC-MS experiment acquired in positive ion mode in a mz range from 200-600 and retention time range from 2500-4500 seconds.

## Get the file names
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
         full.names = TRUE)

## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that can be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
                      replacement = "", fixed = TRUE),
            sample_group = s_groups, stringsAsFactors = FALSE)

## Read the data.
faahKO_raw <- readMSData2(cdf_files, pdata = new("NAnnotatedDataFrame", pheno))

The data is organized by spectrum, i.e. for each retention time we have a Spectrum1 object containing the mz and intensity duplets measured by the mass spec. Below we extract one of the spectra and evaluate the data it contains. Individual spectra can be extracted using [[, which causes the MS data for the particular spectrum to be imported from the original data file.

## Access the 3rd spectrum in the data set.
spctr <- faahKO_raw[[3]]

## Get the retention time when the spectrum was measured
## [1] 2501.38
## Access the mz of the spectrum
## [1] 201.0 202.1 203.1 204.1 205.1 205.8
## And the associated intensities
## [1] 1083  672 1211  569 1244  452
## Optionally plot the Spectrum by plotting the mz values on the x- and
## the associated intensities on the y-axis.

We can use various accessor functions to extract information from the OnDiskMSnExp object, such as rtime to get the retention time for each spectrum. Many of these methods directly access information stored in the object’s fData (corresponding to the spectrum headers in the mzML/netCDF files) and are thus very fast. The mz, intensity and spectra methods on the other hand read the original data on the fly and are thus slower.

It is also important to note that the spectrum data within the object is not organized by sample and data is always returned as a one-dimensional vector. The association between a spectrum and the file from which it originates is provided by the fromFile method which returns an integer vector with the index of the file from which the spectrum was extracted.

Various filter methods allow a fast and simple sub-setting of the full experiment. In the example below we create a total ion chromatogram (TIC) using the filterFile method to subset the object to data from a certain file. The total ion current per spectrum is extracted with the tic method. The TIC plots the sum of all measured intensities for a given retention time (i.e. a spectrum) against the retention time.

## Define the sample colors
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")

## Subset the full raw data by file and plot it.
tmp <- filterFile(faahKO_raw, file = 1)

plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
     col = paste0(sample_colors[pData(tmp)$sample_group], 80), type = "l")
for (i in 2:length(fileNames(faahKO_raw))) {
    tmp <- filterFile(faahKO_raw, file = i)
    points(rtime(tmp), tic(tmp), type = "l",
       col = paste0(sample_colors[pData(tmp)$sample_group], 80))
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)