Foreword

MSnbase is under active development; current functionality is evolving and new features will be added. This software is free and open-source. If you use it, please support the project by citing it in publications:

Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/MSnbase/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction

MSnbase (L. Gatto and Lilley 2012) aims to provide a reproducible research framework for proteomics data analysis. It should allow researchers to easily mine mass spectrometry data, explore the data and its statistical properties, and display these visually.

MSnbase also aims to be compatible with the infrastructure implemented in Bioconductor, in particular Biobase. As such, classes developed specifically for proteomics mass spectrometry data are based on the eSet and ExpressionSet classes. The main goal is to ensure seamless compatibility with existing metadata structures, accessor methods and normalisation techniques.

This vignette illustrates MSnbase utility using dummy data sets provided with the package, without describing the underlying data structures. More details can be found in the package, class, method and function documentation. A description of the classes is provided in the MSnbase-development vignette (in R, open it with vignette("MSnbase-development"), or read it online).

1.1 Speed and memory requirements

Raw mass spectrometry files are generally several hundred MB in size, and most of this is taken up by binary raw spectrum data. As such, data containers can easily grow very large and thus require large amounts of RAM. This requirement is being tackled by not loading the raw data into memory and instead using on-disk random access to the content of mzXML/mzML data files on demand. When focusing on reporter ion quantitation, a direct solution is to trim the spectra using the trimMz method to select the area of interest and thus substantially reduce the size of the Spectrum objects. This is illustrated in section 6.2.

The independent handling of spectra is ideally suited for parallel processing. The quantify method now performs reporter peaks quantitation in parallel. More functions are being updated.
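As a minimal sketch of parallel quantitation, the code below quantifies the four iTRAQ reporter peaks for the first ten spectra of the itraqdata test set, passing a BiocParallel backend through the BPPARAM argument of quantify. The choice of two workers, the subset size and the variable names small, bp and qnt are illustrative assumptions, not recommendations from the package authors.

```r
library(MSnbase)
library(BiocParallel)

## Test data shipped with MSnbase; keep a small subset so the example runs quickly
data(itraqdata)
small <- itraqdata[1:10]

## Assumption: forked workers on Unix-alikes, serial evaluation elsewhere
bp <- if (.Platform$OS.type == "unix") {
    MulticoreParam(workers = 2)
} else {
    SerialParam()
}

## Trapezoidation of the iTRAQ 4-plex reporter peaks, one spectrum per task
qnt <- quantify(small, method = "trap", reporters = iTRAQ4,
                BPPARAM = bp, verbose = FALSE)

## One row per spectrum, one column per reporter ion
dim(exprs(qnt))
head(exprs(qnt))
```

Because each spectrum is quantified independently, the backend can be swapped without changing the rest of the call.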

Finally, recent developments in version 2 of the package have solved the memory issue by implementing an on-disk version of the data class storing raw data (MSnExp, see section 2.3), where the spectra are accessed on disk only when required. The benchmarking vignette compares the on-disk and in-memory implementations (in R, open it with vignette("benchmarking"), or read it online).

2 Data structure and content

2.1 Importing experiments

MSnbase is able to import raw MS data stored in one of the XML-based formats as well as peak lists in the mgf format (Mascot Generic Format, see http://www.matrixscience.com/help/data_file_help.html#GEN).

Raw data The XML-based formats, mzXML (Pedrioli et al. 2004), mzData (Orchard et al. 2007) and mzML (Martens et al. 2010) can be imported with the readMSData function, as illustrated below (see ?readMSData for more details). To make use of the new on-disk implementation, set mode = "onDisk" in readMSData rather than using the default mode = "inMemory".

file <- dir(system.file(package = "MSnbase", dir = "extdata"),
            full.names = TRUE, pattern = "mzXML$")
rawdata <- readMSData(file, msLevel = 2, verbose = FALSE)

With in-memory data, only spectra of a given MS level can be loaded at a time, by setting the msLevel parameter accordingly in readMSData. In this document, we will use the itraqdata data set, provided with MSnbase. It includes feature metadata, accessible with the fData accessor. The metadata includes identification data for the 55 MS2 spectra.
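The following short sketch loads itraqdata and inspects its feature metadata with fData. It only assumes that MSnbase is installed; the feature variable names are printed rather than listed here.

```r
library(MSnbase)

## Load the test data set distributed with the package
data(itraqdata)

## Number of spectra in the experiment (55 MS2 spectra)
length(itraqdata)

## Names of the feature metadata variables
fvarLabels(itraqdata)

## First rows of the feature metadata, one row per spectrum
head(fData(itraqdata))
```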

MSnbase 2.0 Version 2.0 and later of MSnbase use a new on-disk data storage model (see the benchmarking vignette for more details). The new data backend is compatible with the original in-memory model. To make use of the new infrastructure, read your raw data by setting the mode argument to "onDisk" (the default is "inMemory"). The new on-disk implementation supports several MS levels in a single raw data object. All existing operations work irrespective of the backend.
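To make the two modes concrete, here is a small sketch that reads the same test file with each backend and compares the resulting objects. The variable names mzxml_file, inmem and ondisk are illustrative; no MS level is requested for the on-disk object, so every level present in the file is retained.

```r
library(MSnbase)

## Test mzXML file shipped with the package
mzxml_file <- dir(system.file(package = "MSnbase", dir = "extdata"),
                  full.names = TRUE, pattern = "mzXML$")

## In-memory backend (default mode): one MS level at a time
inmem <- readMSData(mzxml_file, msLevel = 2,
                    mode = "inMemory", verbose = FALSE)

## On-disk backend: spectra are read from the file only when needed
ondisk <- readMSData(mzxml_file, mode = "onDisk", verbose = FALSE)

class(inmem)
class(ondisk)

## MS levels available in the on-disk object
table(msLevel(ondisk))
```

The same accessors, such as msLevel, length or fData, can be applied to either object.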

Peak lists Peak lists can often be exported after spectrum processing from vendor-specific software and are also used as input to search engines. Peak lists in mgf format can be imported with the function readMgfData (see ?readMgfData for details) to create experiment objects. Experiments or individual spectra can be exported to an mgf file with the writeMgfData methods (see ?writeMgfData for details and examples).
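A possible round trip through the mgf format is sketched below: a few spectra from itraqdata are written to a temporary file with writeMgfData and read back with readMgfData. The temporary file and the five-spectrum subset are assumptions made to keep the example small.

```r
library(MSnbase)

data(itraqdata)

## Temporary output file (illustrative; any writable path would do)
mgf_file <- tempfile(fileext = ".mgf")

## Export the first five spectra as a peak list
writeMgfData(itraqdata[1:5], con = mgf_file)

## Import the peak list again as an experiment object
peaks <- readMgfData(mgf_file, verbose = FALSE)
length(peaks)
head(fData(peaks))

## Clean up the temporary file
unlink(mgf_file)
```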

Experiments with multiple runs Although it is possible to load and process multiple files serially and later merge the resulting quantitation data as shown in section 13, it is also feasible to load several raw data files at once. Here, we report the analysis of an LC-MSMS experiment where 14 liquid chromatography (LC) fractions were loaded using readMSData on a 32-core server with 128 GB of RAM. It took about 90 minutes to read the 14 uncentroided mzXML raw files (4.9 GB on disk in total) and create a 3.3 GB raw data object (an MSnExp instance, see next section). Quantitation of 9 reporter ions (iTRAQ9 object, see 2.5) for 88690 features was performed in parallel on 16 processors4 Parallel support is provided by the