Package: xcms
Authors: Johannes Rainer
Modified: 2024-10-29 13:42:18.529053
Compiled: Tue Oct 29 23:27:38 2024

1 Introduction

The xcms package provides functionality for the preprocessing of LC-MS, GC-MS or LC-MS/MS data, in which raw signals from mzML, mzXML or CDF files are processed into feature abundances. This preprocessing includes chromatographic peak detection, sample alignment and correspondence analysis.

The first version of the package was published in 2006 [1] and has since been updated and modernized in several rounds to better integrate it with other R-based packages for the analysis of untargeted metabolomics data. This includes version 3 of xcms, which used the MSnbase package for MS data representation [2]. The most recent update (xcms version 4) additionally enables preprocessing of MS data represented by the modern MsExperiment and Spectra packages. This provides even better integration with the RforMassSpectrometry R package ecosystem and simplifies, for example, compound annotation [3].

This document describes data import, exploration and preprocessing of a simple test LC-MS data set with xcms version >= 4. The same functions can be applied to the older MSnbase-based workflows (xcms version 3). Additional documents and tutorials covering other topics of untargeted metabolomics analysis are listed at the end of this document. There is also an xcms tutorial available with more examples and details.

2 Preprocessing of LC-MS data

2.1 Data import

xcms supports analysis of any LC-MS(/MS) data that can be imported with the Spectra package. Such data will typically be provided in (AIA/ANDI) NetCDF, mzXML or mzML format but can, through dedicated extensions to the Spectra package, also be imported from other sources, e.g. directly from raw data files in manufacturers' formats.

For demonstration purposes we analyze in this document a small subset of the data from [4], in which the metabolic consequences of the knock-out of the fatty acid amide hydrolase (FAAH) gene in mice were investigated. The raw data files (in NetCDF format) are provided through the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion polarity from 200-600 m/z and 2500-4500 seconds. To speed up processing of this vignette we restrict the analysis to 8 of these files.

Below we load all required packages, locate the raw CDF files within the faahKO package and build a phenodata data.frame describing the experimental setup. Generally, such data frames should contain all relevant experimental variables and sample descriptions (including the names of the raw data files) and will be imported into R using either the read.table() function (if the file is in CSV or tab-delimited text format) or functions from the readxl R package (if it is in Excel format).

library(xcms)
library(faahKO)
library(RColorBrewer)
library(pander)
library(pheatmap)
library(MsExperiment)

## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
## Create a phenodata data.frame
pd <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                   replacement = "", fixed = TRUE),
                 sample_group = c(rep("KO", 4), rep("WT", 4)),
                 stringsAsFactors = FALSE)
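
In practice the sample table usually lives in a file instead of being typed in. The following is a minimal sketch of such an import with read.table(), assuming a hypothetical tab-delimited file with the columns sample_name and sample_group. The file is written to a temporary location here only so that the example runs on its own; the object names pd_example, pd_file and pd_imported are illustrative and not part of the workflow.

```r
## Hypothetical sample table; in a real analysis this file already exists.
pd_example <- data.frame(
    sample_name = c("ko15", "ko16", "wt15", "wt16"),
    sample_group = c("KO", "KO", "WT", "WT"))
pd_file <- tempfile(fileext = ".tsv")
write.table(pd_example, file = pd_file, sep = "\t", quote = FALSE,
            row.names = FALSE)

## Import the tab-delimited file, keeping text columns as character.
pd_imported <- read.table(pd_file, header = TRUE, sep = "\t",
                          stringsAsFactors = FALSE)
pd_imported
```

A table imported this way could be used in place of the manually built data.frame, provided its rows are in the same order as the data files.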

We next load our data using the readMsExperiment function from the MsExperiment package.

faahko <- readMsExperiment(spectraFiles = cdfs, sampleData = pd)
faahko
## Object of class MsExperiment 
##  Spectra: MS1 (10224) 
##  Experiment data: 8 sample(s)
##  Sample data links:
##   - spectra: 8 sample(s) to 10224 element(s).

The MS spectra data from our experiment is now available as a Spectra object within faahko. Note that this MsExperiment container could, in addition to spectra data, also contain other types of data or references to other files. See the vignette of the MsExperiment package for more details.

When loading data from mzML, mzXML or CDF files, by default only general spectra data is loaded into memory, while the actual peaks data, i.e. the m/z and intensity values, is retrieved on-the-fly from the raw files only when needed (this is similar to the MSnbase on-disk mode described in [2]). This keeps the memory footprint low and hence allows large experiments to be analyzed without the need for high performance computing environments. Alternative backends (and hence data representations) could also be used for the Spectra object within faahko, potentially with an even lower memory footprint or higher performance. See the package vignette of the Spectra package or the SpectraTutorials tutorial for more details on Spectra backends and how to change between them.
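
Changing the backend can be sketched on a tiny, self-built Spectra object. The m/z and intensity values below are made up for illustration and are not taken from the faahKO data; the example assumes that the Spectra package is installed and uses its setBackend() function with the MsBackendDataFrame backend.

```r
library(Spectra)

## Toy spectra with made-up m/z and intensity values (not from faahKO).
spd <- DataFrame(msLevel = c(1L, 1L), rtime = c(2501.4, 2502.9))
spd$mz <- list(c(200.1, 300.2, 450.3), c(200.1, 310.5))
spd$intensity <- list(c(120, 3400, 560), c(98, 2700))
sps <- Spectra(spd)

## Switch to another in-memory representation of the same data.
sps_df <- setBackend(sps, MsBackendDataFrame())

## The peaks data is unchanged by the switch.
peaksData(sps_df)[[1]]
```

The same kind of call would presumably be applied to spectra(faahko) to move its data into memory; whether that is worthwhile depends on the size of the experiment and the available memory.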

2.2 Initial data inspection

The MsExperiment object is a simple and flexible container for MS experiments. The raw MS data is stored as a Spectra object that can be accessed through the spectra() function.

spectra(faahko)
## MSn data (Spectra) with 10224 spectra in a MsBackendMzR backend:
##         msLevel     rtime scanIndex
##       <integer> <numeric> <integer>
## 1             1   2501.38         1
## 2             1   2502.94         2
## 3             1   2504.51         3
## 4             1   2506.07         4
## 5             1   2507.64         5
## ...         ...       ...       ...
## 10220         1   4493.56      1274
## 10221         1   4495.13      1275
## 10222         1   4496.69      1276
## 10223         1   4498.26      1277
## 10224         1   4499.82      1278
##  ... 33 more variables/columns.
## 
## file(s):
## ko15.CDF
## ko16.CDF
## ko21.CDF
##  ... 5 more files

All spectra are organized sequentially (i.e., not by file), but the fromFile() function returns for each spectrum the index of the data file it belongs to. Below we simply count the number of spectra per file.

table(fromFile(faahko))
## 
##    1    2    3    4    5    6    7    8 
## 1278 1278 1278 1278 1278 1278 1278 1278
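
A per-spectrum file index like this can also be used to group any other per-spectrum variable by file. The sketch below illustrates the idea in base R with made-up retention times and file indices; with the real data the two vectors would come from rtime(spectra(faahko)) and fromFile(faahko). The names rt, file_idx and rt_range are illustrative only.

```r
## Made-up retention times (seconds) and file indices for six spectra.
rt <- c(2501.4, 2502.9, 2504.5, 2501.4, 2503.0, 2504.6)
file_idx <- c(1L, 1L, 1L, 2L, 2L, 2L)

## Number of spectra per file.
table(file_idx)

## Retention time range covered by each file (one column per file).
rt_range <- vapply(split(rt, file_idx), range, numeric(2))
rownames(rt_range) <- c("min", "max")
rt_range
```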

Information on samples can be retrieved through the sampleData() function.

sampleData(faahko)
## DataFrame with 8 rows and 3 columns
##   sample_name sample_group spectraOrigin
##   <character>  <character>   <character>
## 1        ko15           KO /home/bioc...
## 2        ko16           KO /home/bioc...
## 3        ko21           KO /home/bioc...
## 4        ko22           KO /home/bioc...
## 5        wt15           WT /home/bioc...
## 6        wt16           WT /home/bioc...
## 7        wt21           WT /home/bioc...
## 8        wt22           WT /home/bioc...

Each row in this DataFrame represents one sample (input file). Using [ it is possible to subset an MsExperiment object by sample. Below we subset faahko to the 3rd sample (file) and access its spectra and sample data.

faahko_3 <- faahko[3]
spectra(faahko_3)
## MSn data (Spectra) with 1278 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1   2501.38         1
## 2            1   2502.94         2
## 3            1   2504.51         3
## 4            1   2506.07         4
## 5            1   2507.64         5
## ...        ...       ...       ...
## 1274         1   4493.56      1274
## 1275         1   4495.13      1275
## 1276         1   4496.69      1276
## 1277         1   4498.26      1277
## 1278         1   4499.82      1278
##  ... 33 more variables/columns.
## 
## file(s):
## ko21.CDF
sampleData(faahko_3)
## DataFrame with 1 row and 3 columns
##   sample_name sample_group spectraOrigin
##   <character>  <character>   <character>
## 1        ko21           KO /home/bioc...

As a first evaluation of the data we plot below the base peak chromatogram (BPC) for each file in our experiment. We use the chromatogram() method and set aggregationFun to "max" to return for each spectrum the maximal intensity and hence create the BPC from the raw data. To create a total ion chromatogram (TIC) we could set aggregationFun to "sum".

## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(faahko, aggregationFun = "max")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")

## Plot all chromatograms.
plot(bpis, col = group_colors[sampleData(faahko)$sample_group])
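
The difference between the two aggregation functions can be illustrated in base R. The sketch below uses made-up intensities for three spectra and only mimics the per-spectrum aggregation; it is not how chromatogram() is implemented, and the object names are illustrative.

```r
## Made-up intensities for three consecutive spectra (scans).
intensities <- list(c(120, 3400, 560),
                    c(98, 2700, 410, 35),
                    c(150, 1900))
rt <- c(2501.4, 2502.9, 2504.5)

## Base peak chromatogram: maximal intensity per spectrum.
bpc <- vapply(intensities, max, numeric(1))
## Total ion chromatogram: summed intensity per spectrum.
tic <- vapply(intensities, sum, numeric(1))

data.frame(rtime = rt, bpc = bpc, tic = tic)
```

Each spectrum contributes one data point to the chromatogram. The TIC value is never smaller than the BPC value of the same spectrum, because the sum includes the most intense peak.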