1 Metabolomics data pre-processing

1.1 Setup

This workflow requires R version >= 3.4 and the following packages:

xcms: for pre-processing of metabolomics data. A package version >= 2.99.0 is required which, if not yet in Bioconductor, can be installed from github using devtools::install_github("sneumann/xcms", ref = "xcms3"). The package depends on mzR for data import and uses classes from the MSnbase package.
faahKO: provides the data set processed here.
RColorBrewer: for color definitions.
doParallel: to set-up parallel processing. xcms uses parallel processing with the BiocParallel package in most algorithms. While BiocParallel’s default parallel processing setup (i.e. using multicore on Unix systems and snow for Windows) works on most systems this seems to be problematic/erratic on recent macOS versions. Pre-registering and setting up the parallel processing first using the doParallel package and using this in BiocParallel seems to avoid such problems.

## Style
library(BiocStyle)
## Data processing
library(xcms)
## Toy data set for data processing
library(faahKO)
## For color definitions
library(RColorBrewer)
## For a working parallel processing setup on Mac systems
library(doParallel)

Parallel processing setup. As detailed above we’re using the doParallel setup with a pre-defined set of cluster nodes.

## Set up the parallel processing using doParallel.
registerDoParallel(4)
## Register this setup for BiocParallel and make it default.
register(DoparParam(), default = TRUE)

1.2 Introduction

This workflow illustrates the use of Bioconductor packages for the pre-processing of LCMS-based metabolomics data sets. This pre-processing aims to identify the features possibly corresponding to metabolites and to quantify them. While other package exist too (such as yamss [1]), this workshop focuses on xcms [2] and uses the functionality of its new user interface.

The topics covered in this document are:

Data import, organization and handling.
Chromatographic peak detection.
Alignment (retention time correction).
Correspondence (matching of chromatographic peaks within and between samples).

The result from this metabolomics pre-processing is a table of intensities with rows corresponding to features (ions with a unique mass-to-charge ratio and retention time) and columns to samples.

The workflow does not cover (yet) data normalization strategies or identification of metabolites (i.e. mapping of detected features to corresponding metabolites). Also, the workflow focuses on gas/liquid chromatography (GC/LC) mass spectrometry (MS) which combines the physical separation capabilities of the chromatography with the mass analysis capabilities of the MS. Chromatography is usually preferred in mass spectrometry experiments with complex samples because the added dimension (the retention time rt) helps to discriminate between analytes with identical mass-to-charge ratios (mz).

For more details and description of the general metabolomics data workflow see [3][4].

1.3 Definitions and naming convention

In metabolomics data analysis, peaks are identified along the retention time axis. These peaks are referred to as chromatographic peaks in this document to distinguish them from mass peaks (i.e. peaks within a spectrum along the mz dimension). The term feature refers to individual ions with a unique mass-to-charge ratio (m/z, or mz) and a unique retention time (rt). Such features are defined during the correspondence step in which the detected chromatographic peaks are matched between (and within) samples.

1.4 Data import, organization and handling

A variety of different, vendor-specific, file formats exist for proteomics/metabolomics. This data files have to be converted first into one of the file formats supported by the mzR package, i.e. the open mzML or the netCDF file format. For metabolomics experiments it is suggested to load the data using the readMSData2 function (from the MSnbase package [5]) into an OnDiskMSnExp object. Such objects hold only general spectra data (such as number of spectra in a file or the retention time of each spectrum) in memory while the full raw data (the mz and intensity values per spectrum) is accessed on demand. This keeps the memory footprint small and enables thus analyses also of large scale metabolomics experiments (see help pages and vignettes of the MSnbase package for more details on the OnDiskMSnExp object). The new user interface of the xcms package, that’s used in this document, makes extensive use of objects from the MSnbase package and hence reuses all of their functionality.

Below we read the raw data files (in netCDF format) from the faahKO data package. This package provides a subset of the data from [6] examining the metabolic consequences of the knock-out of the fatty acid amide hydrolase (Faah) gene in mice. The data subset comprises spinal cord samples from 6 knock-out and 6 wild type mice. Each file contains centroided data from an LC-MS experiment acquired in positive ion mode in a mz range from 200-600 and retention time range from 2500-4500 seconds.

## Get the file names
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
         full.names = TRUE)

## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that can be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
                      replacement = "", fixed = TRUE),
            sample_group = s_groups, stringsAsFactors = FALSE)

## Read the data.
faahKO_raw <- readMSData2(cdf_files, pdata = new("NAnnotatedDataFrame", pheno))

The data is organized by spectrum, i.e. for each retention time we have a Spectrum1 object containing the mz and intensity duplets measured by the mass spec. Below we extract one of the spectra and evaluate the data it contains. Individual spectra can be extracted using [[, which causes the MS data for the particular spectrum to be imported from the original data file.

## Access the 3rd spectrum in the data set.
spctr <- faahKO_raw[[3]]

## Get the retention time when the spectrum was measured
rtime(spctr)

## [1] 2501.38

## Access the mz of the spectrum
head(mz(spctr))

## [1] 201.0 202.1 203.1 204.1 205.1 205.8

## And the associated intensities
head(intensity(spctr))

## [1] 1083  672 1211  569 1244  452

## Optionally plot the Spectrum by plotting the mz values on the x- and
## the associated intensities on the y-axis.

We can use various accessor functions to extract information from the OnDiskMSnExp object, such as rtime to get the retention time for each spectrum. Many of these methods directly access information stored in the object’s fData (corresponding to the spectrum headers in the mzML/netCDF files) and are thus very fast. The mz, intensity and spectra methods on the other hand read the original data on the fly and are thus slower.

It is also important to note that the spectrum data within the object is not organized by sample and data is always returned as a one-dimensional vector. The association between a spectrum and the file from which it originates is provided by the fromFile method which returns an integer vector with the index of the file from which the spectrum was extracted.

Various filter methods allow a fast and simple sub-setting of the full experiment. In the example below we create a total ion chromatogram (TIC) using the filterFile method to subset the object to data from a certain file. The total ion current per spectrum is extracted with the tic method. The TIC plots the sum of all measured intensities for a given retention time (i.e. a spectrum) against the retention time.

## Define the sample colors
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")

## Subset the full raw data by file and plot it.
tmp <- filterFile(faahKO_raw, file = 1)

plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
     col = paste0(sample_colors[pData(tmp)$sample_group], 80), type = "l")
for (i in 2:length(fileNames(faahKO_raw))) {
    tmp <- filterFile(faahKO_raw, file = i)
    points(rtime(tmp), tic(tmp), type = "l",
       col = paste0(sample_colors[pData(tmp)$sample_group], 80))
}
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)

The TIC or the base peak chromatogram (BPC, maximum signal per spectrum against its retention time) are useful plots to get a first general overview of an experiment and can also be used for quality control purposes, e.g. to spot problematic samples. Plotting the distribution of the total ion currents (tic) or the base peak intensities (bpi) per file using boxplots can also be used for quality assessment.

Exercise 1: alternative approach to plot the TIC: use the tic and fromFile methods instead.
Solution:

## The tic returns a vector, one value for each spectrum in the experiment. The
## values are not organized by sample/file
head(tic(faahKO_raw))

## The fromFile method returns the index from the file the spectrum derives
head(fromFile(faahKO_raw))

## Extract the total ion current and retention times and split them by file.
tics <- split(tic(faahKO_raw), f = fromFile(faahKO_raw))
rts <- split(rtime(faahKO_raw), f = fromFile(faahKO_raw))
## Define the color for each sample
cols <- paste0(sample_colors[pData(faahKO_raw)$sample_group], 80)
## initialize plot
plot(3, 3, pch = NA, xlim = range(rts), ylim = range(tics), main = "TIC",
     xlab = "retention time", ylab = "intensity")
tmp <- mapply(rts, tics, cols, FUN = function(x, y, col) {
    points(x = x, y = y, col = col, type = "l")
})

In most mzML and netCDF files the MS data is organized by spectrum (i.e. intensity values by their corresponding mz value) and, as detailed above, also the OnDiskMSnExp object returns data by spectrum. In LC-MS metabolomics, however, peak detection is performed (for small slices along the mz dimension) in the time dimension and hence orthogonally to the spectrum data. To extract intensity data by retention time, xcms defines the extractChromatograms method and the Chromatogram class. Below we create the base peak chromatogram (BPC, maximum signal per spectrum against its retention time). Usually we could use the bpi method similarly to the tic method, but the present netCDF files do not provide the base peak intensities in the spectrum header information. We thus have to create the BPC using the extractChromatogram method that loads the full spectrum data from all files and aggregates the intensities per spectrum. The result is returned as a list of Chromatogram objects, one for each file. This is relatively fast for the present files (also because data is read in parallel) but can be slow with larger, higher resolution, MS experiments.

## Extract chromatograms for the full mz and rt range. By specifying
## aggregationFun = "max" we extract the maximum intensity per spectrum and
## get hence base peak chromatograms
chrs <- extractChromatograms(faahKO_raw, aggregationFun = "max")

## Plot the chromatograms
plotChromatogram(
    chrs,
    col = paste0(sample_colors[pData(faahKO_raw)$sample_group], 80))
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)

The BPC are similar between individual samples, but seem to be shifted in retention time dimension. This shift will be corrected in the alignment/retention time adjustment step.

Next we visualize the chromatogram for specific ions, i.e. for a small mz range and/or retention time window to inspect what type of chromatographic peaks have to be identified in the present LC-MS experiment.

## Extract the chromatogram for one mz value and a given rt range
chrs <- extractChromatograms(faahKO_raw, mz = 335, rt = c(2700, 2900))

plotChromatogram(chrs,
         col = paste0(sample_colors[pData(faahKO_raw)$sample_group], 80))

Chromatographic peak example. Extracted ion chromatogram for mz = 335 and a retention time from 2700 to 2900 seconds. Each line representing the signal measured in one sample.

Figure 1: Chromatographic peak example
Extracted ion chromatogram for mz = 335 and a retention time from 2700 to 2900 seconds. Each line representing the signal measured in one sample.

The chromatographic peaks are about 40-50 seconds wide in this experiment. Note that not in all spectra (for all retention times) a signal was measured for the given mz range. The lines are thus not continuous in the plot above.

For the maximal intensity measured of the chromatographic peak we can also extract the corresponding spectrum in a file. Below we extract such spectrum for the first file and plot it.

## Subsetting the original object to the given retention time range and file,
## this returns an OnDiskMSnExp referencing to a single spectrum.
subs <- filterFile(filterRt(faahKO_raw, rt = c(2779, 2781)), file = 1)

## Extract the Spectrum
spctr <- spectra(subs)[[1]]

plot(mz(spctr), intensity(spctr), type = "h", xlab = "mz", ylab = "intensity")
points(x = 335, y = -10000, pch = 2)

Spectrum for rt of 2780 seconds. Spectrum for the retention time associated with the highest signal of the chromatographic peak in the first file. The triangle indicates the mz corresponding to the chromatographic peak shown above.

Figure 2: Spectrum for rt of 2780 seconds
Spectrum for the retention time associated with the highest signal of the chromatographic peak in the first file. The triangle indicates the mz corresponding to the chromatographic peak shown above.

Apparently there are many mass peaks present at the specific retention time, most of them larger than the one of our example chromatographic peak.

1.5 Chromatographic peak detection

The first task in the pre-processing of LC-MS metabolomics data is the detection of peaks in the retention time dimension (i.e. chromatographic peaks) for MS data slices along the mz dimension. The most commonly used algorithm is centWave [7] that performs a relatively robust peak detection. Peak detection can be performed on OnDiskMSnExp objects using the findChromPeaks method providing in addition an algorithm-specific parameter class, such as an CentWaveParam for centWave based peak detection, or MatchedFilterParam for peak detection using the matched filter algorithm [2].

Below we use the default parameters for the peak detection (which is however never a good idea in LC-MS data pre-processing because peak shape and MS data are highly dependent on the experimental setup). The peak detection is carried out in parallel for each file.

## Create the parameter object for centWave
cwp <- CentWaveParam(noise = 200)
faahKO <- findChromPeaks(faahKO_raw, param = cwp)
faahKO

## MSn experiment data ("XCMSnExp")
## Object size in memory: 3.34 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 15336 
##  MSn retention times: 41:41 - 74:60 minutes
## - - - Processing information - - -
## Data loaded [Wed Jun 14 09:45:26 2017] 
## Filter: select MS level(s) 1 [Wed Jun 14 09:45:37 2017] 
##  MSnbase version: 2.3.4 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1 2 ... 12 (12 total)
##   varLabels: sample_name sample_group
##   varMetadata: labelDescription
## Loaded from:
##   [1] ko15.CDF...  [12] wt22.CDF
##   Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
##   featureNames: X0001.1 X0001.10 ... X1278.9 (15336 total)
##   fvarLabels: fileIdx spIdx ... spectrum (25 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  21704 peaks identified in 12 samples.
##  On average 1809 chromatographic peaks per sample.

The result from the peak detection is an XCMSnExp object, which is an extension to the OnDiskMSnExp object. While being a container for metabolomics pre-processing results, this object inherits the OnDiskMSnExp’s ability to access the raw data files. Below we access the results from this analysis step using the chromPeaks method.

head(chromPeaks(faahKO))

##         mz mzmin mzmax       rt    rtmin    rtmax        into        intb
## [1,] 425.9 425.9 425.9 2520.158 2510.768 2527.982    9999.769    9984.120
## [2,] 464.3 464.3 464.3 2518.593 2504.508 2532.677   32103.270   32082.926
## [3,] 499.1 499.1 499.1 2524.852 2520.158 2527.982    4979.194    4904.709
## [4,] 572.7 572.7 572.7 2524.852 2520.158 2527.982    2727.446    2721.187
## [5,] 579.8 579.8 579.8 2524.852 2520.158 2527.982    2450.477    2444.218
## [6,] 453.2 453.2 453.2 2506.073 2501.378 2527.982 1007408.973 1007380.804
##       maxo    sn sample is_filled
## [1,]   741   740      1         0
## [2,]  1993  1992      1         0
## [3,]   883    13      1         0
## [4,]   559   558      1         0
## [5,]   468   467      1         0
## [6,] 38152 38151      1         0

Each line in the matrix represents a chromatographic peak identified in one sample. The index of the file in which the peak was detected is given in column "sample" while the definition of the peak is provided in columns "mzmin" , "mzmax" , "rtmin" and "rtmax" and the peaks intensities in columns "into" (integrated peak signal) and "maxo" (maximum signal at the peak’s apex).

The XCMSnExp object keeps also track of all performed processing steps storing also the employed parameter classes and guaranteeing hence full reproducibility. This information can be accessed with the processHistory method that returns a list of processing steps. Below we use this method to extract the parameter class used for the chromatographic peak detection.

## Getting the first process history step, in our case the chromatographic
## peak detection.
ph <- processHistory(faahKO)[[1]]

ph

## Object of class "XProcessHistory"
##  type: Peak detection 
##  date: Wed Jun 14 09:45:37 2017 
##  info:  
##  fileIndex: 1,2,3,4,5,6,7,8,9,10,11,12 
##  Parameter class: CentWaveParam

## Extracting the Parameter class employed
processParam(ph)

## Object of class:  CentWaveParam 
## Parameters:
##  ppm: 25 
##  peakwidth: 20, 50 
##  snthresh: 10 
##  prefilter: 3, 100 
##  mzCenterFun: wMean 
##  integrate: 1 
##  mzdiff: -0.001 
##  fitgauss: FALSE 
##  noise: 200 
##  verboseColumns: FALSE 
##  roiList length: 0 
##  firstBaselineCheck TRUE 
##  roiScales length: 0

Whether peak detection was successful is hard to tell. The numbers of detected peaks can provide some first information (Is the number much lower than expected? Are there files with considerably fewer peaks?). Also summaries of the rt and mz widths of identified peaks might be informative. Plotting the raw data and visually inspecting the detected peaks represents however one of the best options to estimate peak detection performance. This is in most cases done on a handful of known compounds or internal control compounds added to each sample. The new user interface facilitates extraction of full, or small slices of the MS data and enables an easy access to the original (or processed) data at any stage. Performance is guaranteed by making use of the indexing capabilities of mzML and netCDF files reading only sub-sets of the data where possible. The getEIC method from the old xcms user interface provided similar functionality but loaded the full data with each call. Also, not the original values were returned, but intensities from the profile matrix which contained intensities binned in equidistant slices along the mz dimension.

Below we plot the chromatogram for a mass-to-charge ratio of mz = 335 (and a retention time window from 2700 to 2900 seconds) and highlight also all identified chromatographic peaks in that region.

## Extract the chromatogram for one mz value and a given rt range
chrs <- extractChromatograms(faahKO, mz = 335, rt = c(2700, 2900))

plotChromatogram(chrs,
         col = paste0(sample_colors[pData(faahKO)$sample_group], 80))
highlightChromPeaks(
    faahKO, rt = c(2700, 2900), mz = 335,
    border = paste0(sample_colors[pData(faahKO)$sample_group], 40))

Figure 3: Chromatographic peak example
Extracted ion chromatogram for mz = 335 and a retention time from 2700 to 2900 seconds. Each line representing the signal measured in one sample. Rectangles indicate the identified chromatographic peaks.

Over and above the peak detection seemed to be OK although in some samples no peaks were identified, mostly due to low (and/or sparse) signal intensities.

The chromPeaks method allows also to retrieve peaks for a specific mz or rt range. This enables to evaluate whether and how many chromatographic peaks have been detected for a certain mz-rt region. Below we extract all peaks identified in the above mz-rt region.

## Extract detected peaks for a mz-rt region. The parameter ppm allows to
## extend the mz range slightly
chromPeaks(faahKO, mz = 335, rt = c(2700, 2900), ppm = 10)

##       mz mzmin mzmax       rt    rtmin    rtmax      into      intb  maxo sn
## [1,] 335   335   335 2781.505 2761.160 2809.674  412134.3  382797.5 16856 22
## [2,] 335   335   335 2786.199 2764.290 2812.803 1496244.2 1462919.0 58736 78
## [3,] 335   335   335 2786.201 2764.291 2812.805  211297.2  195961.1  8158 17
## [4,] 335   335   335 2783.069 2762.725 2812.804  285228.1  271885.1  9154 21
## [5,] 335   335   335 2786.199 2764.290 2812.803  932645.2  920464.6 35856 86
## [6,] 335   335   335 2787.765 2765.855 2820.629 1636669.0 1605742.7 54640 73
## [7,] 335   335   335 2784.635 2759.595 2809.674  643673.2  631323.7 20672 53
## [8,] 335   335   335 2792.461 2768.987 2823.760  876585.5  847887.3 27200 39
##      sample is_filled
## [1,]      1         0
## [2,]      2         0
## [3,]      3         0
## [4,]      4         0
## [5,]      8         0
## [6,]      9         0
## [7,]     10         0
## [8,]     11         0

As we have already seen above, a peak was detected in most samples.

To emphasize the need to adapt the peak detection algorithm setting to each setup/experiment we load an mzML file from a completely different experimental setup and perform a centWave peak detection using default settings.

## Load one file from a different setup.
fl <- paste0("./data/","250516_POOL_N_POS_28.mzML.gz")
raw_data <- readMSData2(fl)

## Run peak detection using default CentWave.
proc_data <- findChromPeaks(raw_data, param = CentWaveParam())

proc_data

## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.36 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 1721 
##  MSn retention times: 0:0 - 8:0 minutes
## - - - Processing information - - -
## Data loaded [Wed Jun 14 09:46:27 2017] 
## Filter: select MS level(s) 1 [Wed Jun 14 09:46:28 2017] 
##  MSnbase version: 2.3.4 
## - - - Meta data  - - -
## phenoData
##   rowNames: 250516_POOL_N_POS_28.mzML.gz
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   250516_POOL_N_POS_28.mzML.gz 
## protocolData: none
## featureData
##   featureNames: X0001.1 X0002.1 ... X1721.1 (1721 total)
##   fvarLabels: fileIdx spIdx ... spectrum (26 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  153 peaks identified in 1 samples.
##  On average 153 chromatographic peaks per sample.

The number of detected peaks is very low, much lower than expected.

From the setup it is known that some compounds should be present/detected in the sample. One of these is glycine with an expected mz of 76.03969968. Allowing a ppm of 20 we extract all identified peaks at about the expected mz.

mz_glyc <- 76.03969968

## Extract chromatographic peaks matching the mz of glycine, allowing
## a 20ppm deviation.
pks <- chromPeaks(proc_data, mz = mz_glyc, ppm = 20)
pks

##      mz mzmin mzmax rt rtmin rtmax into intb maxo sn sample is_filled

Not a single peak was detected in the expected region. Next we extract and plot the corresponding ion chromatogram to evaluate what signal is present in the region.

## Extend the mz range by 10 ppm on both sides.
mzr <- c(mz_glyc - mz_glyc * 10 / 1e6, mz_glyc + mz_glyc * 10 / 1e6)

## Extract the ion chromatogram for glycine
eic_glyc <- extractChromatograms(proc_data, mz = mzr, rt = c(165, 180))

## Plot the chromatogram
plotChromatogram(eic_glyc, rt = c(165, 180))

Figure 4: Extracted ion chromatogram for glycine

There is signal at the expected mz/rt, but why was this peak not detected?

Exercise 2: inspecting the chromatographic peak for glycine, how could you improve the centWave peak detection settings? Run peak detection with the modified settings and evaluate the results.
Solution: the chromatographic peaks are too narrow to be detected using the default settings. Adjust the peakwidth parameter to represent the expected range of peak widths.

## Default centWave settings
CentWaveParam()

## The rt width of the peak is much smaller than the default 20-50 seconds.

## Adapt the peakwidth parameter and re-run the peak detection
cwp <- CentWaveParam(peakwidth = c(2, 10))
proc_data <- findChromPeaks(raw_data, param = cwp)

## Numer of detected peaks:
nrow(chromPeaks(proc_data))

## Average rt width
mean(chromPeaks(proc_data)[, "rtmax"] - chromPeaks(proc_data)[, "rtmin"])

## Do we find a glycine peak?
chromPeaks(proc_data, mz = mz_glyc, ppm = 20)

## Yes we do, and at the expected rt.

## plot the data and highlight the peak.
plotChromatogram(eic_glyc)
highlightChromPeaks(proc_data, mz = mz_glyc, rt = c(165, 180), ppm = 20)

## Peak is eventually even a little too broad.

The IPO Bioconductor package [8] provides functionality for an automatic tuning of xcms peak detection parameters and is thus a good starting point to automatically tune parameters for a specific metabolomics setup/experiment. Visual inspection of identified peaks is however crucial to guarantee proper peak detection.

1.6 Alignment

The time at which analytes elute in the chromatography can vary between samples (and even compounds). Such a difference was already observable in the extracted ion chromatogram plot shown as an example in the previous section. The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.

A plethora of alignment algorithms exist (see [9]), with some of them being implemented also in xcms. The method to perform the alignment/retention time correction in xcms is adjustRtime which uses different alignment algorithms depending on the provided parameter class. In the example below we use the obiwarp method [10] to align the samples. We use a binSize = 0.6 which creates warping functions in mz bins of 0.6. Also here it is advisable to modify the settings for each experiment and evaluate if retention time correction did align internal controls or known compounds properly.

## Define the parameters to the obiwarp method
owp <- ObiwarpParam(binSize = 0.6)

faahKO <- adjustRtime(faahKO, param = owp)

The result from the adjustRtime call is the same XCMSnExp object containing in addition the adjusted retention times. The hasAdjustedRtime method can be used to evaluate if the object contains adjusted retention times that can be extracted using the adjustedRtime method. If an XCMSnExp object contains alignment results, the rtime method does also by default return the adjusted retention times. Raw retention times can then be extracted by passing adjusted = FALSE to the rtime method. Note also that by passing the argument bySample = TRUE, the rtime and adjustedRtime methods allow to extract retention time grouped by samples.

Below we simply test these methods to get a feeling of the type of result objects are returned.

## Do we have adjusted retention times?
hasAdjustedRtime(faahKO)

## [1] TRUE

## Get adjusted retention times for the first 6 spectra
head(rtime(faahKO))

##  X0001.1 X0001.10 X0001.11 X0001.12  X0001.2  X0001.3 
## 2501.378 2501.378 2501.378 2501.378 2501.378 2501.378

## And the raw retention times
head(rtime(faahKO, adjusted = FALSE))

##  X0001.1 X0001.10 X0001.11 X0001.12  X0001.2  X0001.3 
## 2501.378 2501.378 2501.380 2501.378 2501.378 2501.379

## By default the methods return again values per spectra. We can however pass
## bySample = TRUE and the result is returned as a list of numeric vectors,
## each list element representing the retention times for all spectra from one
## file
length(rtime(faahKO, bySample = TRUE))

## [1] 12

To evaluate the impact of the alignment we plot a base peak chromatogram before and after retention time correction as well as the deviation between raw and adjusted retention times.

## Extract BPC for each file; this reads all data from the original files.
chrs <- extractChromatograms(faahKO, aggregationFun = "max")

## To plot the BPC with the raw retention times we have to extract the
## intensities from the Chromatogram objects and extract the raw rt from
## the XCMSnExp with rtime(faahKO, adjusted = FALSE)
## Note that using bySample = TRUE the method returns the retention times split
## by sample.
rt_raw <- rtime(faahKO, adjusted = FALSE, bySample = TRUE)
## Extract the (base peak) intensities of the chromatograms
ints <- lapply(chrs, intensity)

## Preparing the plot
par(mfrow = c(3, 1), mar = c(0.5, 4, 1, 0.5))
## Plot first the base peak chromatogram with the raw retention times.
plot(3, 3, pch = NA, xlab = "", ylab = "base peak intensity", xaxt = "n",
     main = "before adjustment", xlim = range(rt_raw), ylim = range(ints))
cols <- paste0(sample_colors[pData(faahKO)$sample_group], 80)
tmp <- mapply(rt_raw, ints, cols, FUN = function(x, y, col) {
    points(x, y, col = col, type = "l")
})

## Plot the base peak chromatograms with the adjusted retention times.
plotChromatogram(chrs, main = "after adjustment", col = cols, xaxt = "n")

## Plot the difference between adjusted and raw adjustment.
par(mar = c(4, 4, 0.5, 0.5))
plotAdjustedRtime(faahKO, col = cols)

Alignment results. Base peak chromatogram before and after retention time adjustment and difference between raw and adjusted retention times per file.

Figure 5: Alignment results
Base peak chromatogram before and after retention time adjustment and difference between raw and adjusted retention times per file.

The retention time adjustment did align most of the base peaks across samples. Between 3600 and 3800 seconds the alignment was however less optimal showing also the strongest retention time adjustment.

It is also important to note that the alignment step adjusts also the reported retention times for the detected chromatographic peaks. If we were not happy with the results from the alignment step we could also drop these results using the dropAdjustedRtime method in which case the raw retention times are restored (also for the detected chromatographic peaks).

Exercise 3: plot the chromatographic peak for mz = 335 and rt = c(2700, 2900) before and after retention time correction. Hint: for the peaks before alignment, extract the chromatograms either from the raw faahKO_raw object or use the dropAdjustedRtime method to restore raw retention times.
Solution:

## To extract the chromatogram before retention time adjustment we could
## drop the retention time adjustment results:
chrs_raw <- extractChromatograms(dropAdjustedRtime(faahKO),
                 rt = c(2700, 2900), mz = 335)

## Or, more simpler, just pass the adjustedRtime = FALSE parameter to the method.
chrs_raw <- extractChromatograms(faahKO, adjustedRtime = FALSE,
                 rt = c(2700, 2900), mz = 335)

## Define the color for each sample
cols <- paste0(sample_colors[pData(faahKO)$sample_group], 80)
par(mfrow = c(2, 1))
plotChromatogram(chrs_raw, col = cols)

## Extract the chromatograms after adjustment.
chrs_adj <- extractChromatograms(faahKO, rt = c(2700, 2900), mz = 335)
plotChromatogram(chrs_adj, col = cols)

## Alignment is not perfect for this paricular peak.

An alternative to the obiwarp method is the peakGroups method that performs the alignment using so called hook peaks that are present in most samples. The retention times are then adjusted by aligning these peaks and interpolating in the region between them. Such peaks are usually being identified using the peak density correspondence method by enforcing the chromatographic peaks to be present in most samples, could however also be defined manually.

Note that all alignment methods do depend on the assumption that the samples are similar (obiwarp: have similar chromatograms and peak groups: have metabolites present in all samples that can be used as hook peaks).

1.7 Topics not covered in this document

Normalization: within (per feature signal drift adjustment) batch and between batch normalization of intensity values should be performed.
Identification of features with different abundances: here we might simply use e.g. the limma package on the log2 transformed (and normalized) intensities to identify features that are different between groups.
Identification: annotation of features to metabolites/chemical compounds. Bioconductor’s CAMERA package might be a good starting point.

References

1. Myint L, Kleensang A, Zhao L, Hartung T, Hansen KD: Joint bounding of peaks across samples improves differential analysis in mass spectrometry-based metabolomics. Analytical chemistry 2017:acs.analchem.6b04719.

2. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78:779–787.

3. Sugimoto M, Kawakami M, Robert M, Soga T, Tomita M: Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis. Current bioinformatics 2012, 7:96–108.

4. Smith R, Mathis AD, Ventura D, Prince JT: Proteomics, lipidomics, metabolomics: a mass spectrometry tutorial from a computer scientist’s point of view. BMC Bioinformatics 2014, 15 Suppl 7(Suppl 7):S9.

5. Gatto L, Lilley KS: MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 2012, 28:288–289.

6. Saghatelian A, Trauger SA, Want EJ, Hawkins EG, Siuzdak G, Cravatt BF: Assignment of endogenous substrates to enzymes by global metabolite profiling. Biochemistry 2004, 43:14332–14339.

7. Tautenhahn R, Böttcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 2008, 9:504.

8. Libiseller G, Dvorzak M, Kleb U, Gander E, Eisenberg T, Madeo F, Neumann S, Trausinger G, Sinner F, Pieber T, Magnes C: IPO: a tool for automated optimization of XCMS parameters. BMC Bioinformatics 2015, 16:118.

9. Smith R, Ventura D, Prince JT: LC-MS alignment in theory and practice: a comprehensive algorithmic review. Briefings in bioinformatics 2013, 16:bbt080–117.

10. Prince JT, Marcotte EM: Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry 2006, 78:6140–6152.

Metabolomics data pre-processing

14 June 2017