This vignette describes the functionality implemented in the MSnbase package. MSnbase aims at (1) facilitating the import, processing, visualisation and quantification of mass spectrometry data into the R environment (R Development Core Team 2011) by providing specific data classes and methods and (2) enabling the utilisation of throughput-high data analysis pipelines provided by the Bioconductor (Gentleman et al. 2004) project.
MSnbase 2.31.1
This software is free and open-source software. If you use it, please support the project by citing it in publications:
Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.
MSnbase
, efficient and elegant R-based processing and visualisation of raw mass spectrometry data. Laurent Gatto, Sebastian Gibb, Johannes Rainer. bioRxiv 2020.04.29.067868; doi: https://doi.org/10.1101/2020.04.29.067868
For bugs, typos, suggestions or other questions, please file an issue
in our tracking system (https://github.com/lgatto/MSnbase/issues)
providing as much information as possible, a reproducible example and
the output of sessionInfo()
.
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
MSnbase (Gatto and Lilley 2012) aims are providing a reproducible research framework to proteomics data analysis. It should allow researcher to easily mine mass spectrometry data, explore the data and its statistical properties and visually display these.
MSnbase also aims at being compatible with the infrastructure implemented in Bioconductor, in particular Biobase. As such, classes developed specifically for proteomics mass spectrometry data are based on the eSet and ExpressionSet classes. The main goal is to assure seamless compatibility with existing meta data structure, accessor methods and normalisation techniques.
This vignette illustrates MSnbase utility using a dummy
data sets provided with the package without describing the underlying
data structures. More details can be found in the package, classes,
method and function documentations. A description of the classes is
provided in the MSnbase-development
vignette1 in R, open it with vignette("MSnbase-development")
or read it online here.
Raw mass spectrometry file are generally several hundreds of MB large
and most of this is used for binary raw spectrum data. As such, data
containers can easily grow very large and thus require large amounts
of RAM. This requirement is being tackled by avoiding to load the raw
data into memory and using on-disk random access to the content of
mzXML
/mzML
data files on demand. When focusing on reporter ion
quantitation, a direct solution for this is to trim the spectra using
the trimMz
method to select the area of interest and thus
substantially reduce the size of the Spectrum
objects. This is
illustrated in section 6.2.
Parallel processing The independent handling of spectra is ideally
suited for parallel processing. The quantify
method for example
performs reporter peaks quantitation in parallel.
Parallel support is provided by the BiocParallel and
various backends including multicore (forking, default on Linux),
simple networf network of workstations (SNOW, default on Windows)
using sockets, forking or MPI among others. We refer readers to the
documentation in BiocParallel. Automatic parallel
processing of spectra is only established for a certain number of
spectra (per file). This value (default is 1000) can be set with the
setMSnbaseParallelThresh
function.
In sock-based parallel processing, the main worker process has to
start new R instances and connect to them via sock. Sometimes these
connections can not be established and the processes get stuck. To
test this, users can disable parallel processing by disabling parallel
processing with register(SerialParam())
. To avoid these deadlocks,
it is possible to initiate the parallel processing setup explicitly at
the beginning of the script using, for example
library("doParallel")
registerDoParallel(3) ## using 3 slave nodes
register(DoparParam(), default = TRUE)
## rest of script comes below
On-disk access Developmenets in version 2 of the package have
solved the memory issue by implementing and on-disk version the of
data class storing raw data (MSnExp, see section 2.3),
where the spectra a accessed on-disk only when required. The
benchmarking vignette compares the on-disk and in-memory
implemenatations2 in R, open it with vignette("benchmarking")
or
read it online
here. See
details below.
MSnbase is able to import raw MS data stored in one of
the XML
-based formats as well as peak lists in the mfg
format3 Mascot Generic Format, see http://www.matrixscience.com/help/data_file_help.html#GEN.
Raw data The XML
-based formats, mzXML
(Pedrioli et al. 2004),
mzData
(Orchard et al. 2007) and mzML
(Martens et al. 2010) can be imported with
the readMSData
function, as illustrated below (see ?readMSData
for
more details). To make use of the new on-disk implementation, set
mode = "onDisk"
in readMSData
rather than using the default mode = "inMemory"
.
file <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.names = TRUE, pattern = "mzXML$")
rawdata <- readMSData(file, msLevel = 2, verbose = FALSE)
Only spectra of a given MS level can be loaded at a time by setting
the msLevel
parameter accordingly in readMSData
and in-memory
data. In this document, we will use the itraqdata
data set, provided
with MSnbase. It includes feature metadata, accessible
with the fData
accessor. The metadata includes identification data
for the 55 MS2 spectra.
Version 2.0 and later of MSnbase provide a new
on-disk data storage model (see the benchmarking vignette for more
details). The new data backend is compatible with the orignal
in-memory model. To make use of the new infrastructure, read your
raw data by setting the mode
argument to "onDisk"
(the default is
still "inMemory"
but is likely to change in the future). The new
on-disk implementation supports several MS levels in a single raw
data object. All existing operations work irrespective of the backend.
Peak lists can often be exported after spectrum processing from
vendor-specific software and are also used as input to search engines.
Peak lists in mgf
format can be imported with the function
readMgfData
(see ?readMgfData
for details) to create experiment
objects. Experiments or individual spectra can be exported to an
mgf
file with the writeMgfData
methods (see ?writeMgfData
for
details and examples).
Experiments with multiple runs Although it is possible to load and
process multiple files serially and later merge the resulting
quantitation data as show in section 13, it is also
feasible to load several raw data files at once. Here, we report the
analysis of an LC-MSMS experiment were 14 liquid chromatography (LC)
fractions were loaded in memory using readMSData
on a 32-cores
servers with 128 Gb of RAM. It took about 90 minutes to read the 14
uncentroided mzXML
raw files (4.9 Gb on disk in total) and create a
3.3 Gb raw data object (an MSnExp instance, see next section).
Quantitation of 9 reporter ions (iTRAQ9 object, see
2.5) for 88690 features was performed in parallel
on 16 processors and took 76 minutes. The resulting quantitation data
was only 22.1 Mb and could easily be further processed. These number
are based on the older in-memory implementation. As shown in the
benchmarking vignette, using on-disk data greatly reduces memory
requirement and computation time.
See also section 7.2 to import quantitative data stored in
spreadsheets into R for further processing using MSnbase.
The MSnbase-iovignette[in R, open it with vignette("MSnbase-io")
or read it online
here]
gives a general overview of MSnbase’s input/ouput
capabilites.
See section 7.3 for importing chromatographic data of SRM/MRM experiments.
MSnbase
supports also to write MSnExp
or OnDiskMSnExp
objects to mzML
or
mzXML
files using the writeMSData
function. This is specifically useful in
workflows in which the MS data was heavily manipulated. Presently, each
sample/file is exported into one file.
Below we write the data in mzML
format to a temporary file. By setting the
optional parameter copy = TRUE
general metadata (such as instrument info or
all data processing descriptions) are copied over from the originating file.
writeMSData(rawdata, file = paste0(tempfile(), ".mzML"), copy = TRUE)
Raw data is contained in MSnExp objects, that stores all the spectra of an experiment, as defined by one or multiple raw data files.
library("MSnbase")
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 55
## MSn retention times: 19:09 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## MSnbase version: 1.1.22
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames sampleNumbers
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1 X10 ... X9 (55 total)
## fvarLabels: spectrum ProteinAccession ProteinDescription
## PeptideSequence
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
head(fData(itraqdata))
## spectrum ProteinAccession ProteinDescription
## X1 1 BSA bovine serum albumin
## X10 10 ECA1422 glucose-1-phosphate cytidylyltransferase
## X11 11 ECA4030 50S ribosomal subunit protein L4
## X12 12 ECA3882 chaperone protein DnaK
## X13 13 ECA1364 succinyl-CoA synthetase alpha chain
## X14 14 ECA0871 NADP-dependent malic enzyme
## PeptideSequence
## X1 NYQEAK
## X10 VTLVDTGEHSMTGGR
## X11 SPIWR
## X12 TAIDDALK
## X13 SILINK
## X14 DFEVVNNESDPR
As illustrated above, showing the experiment textually displays it’s content:
Information about the raw data, i.e. the spectra.
Specific information about the experiment
processing4 This part will be automatically updated when the object is modified with it’s ad hoc methods, as illustrated later.
and package version. This slot can be accessed with the
processingData
method.
Other meta data, including experimental phenotype, file name(s) used
to import the data, protocol data, information about features
(individual spectra here) and experiment data. Most of these are
implemented as in the eSet class and are described in more details
in their respective manual pages. See ?MSnExp
and references
therein for additional background information.
The experiment meta data associated with an MSnExp experiment is
of class MIAPE. It stores general information about the experiment
as well as MIAPE (Minimum Information About a Proteomics Experiment)
information (Taylor et al. 2007, @Taylor2008). This meta-data can be
accessed with the experimentData
method. When available, a
summary of MIAPE-MS data can be printed with the msInfo
method.
See ?MIAPE
for more details.
The raw data is composed of the 55 MS spectra. The
spectra are named individually
(X1, X10, X11, X12, X13, X14, …)
and stored in a environment
. They can be accessed individually with
itraqdata[["X1"]]
or itraqdata[[1]]
, or as a list with
spectra(itraqdata)
. As we have loaded our experiment specifying
msLevel=2
, the spectra will all be of level 2 (or higher, if
available).
sp <- itraqdata[["X1"]]
sp
## Object of class "Spectrum2"
## Precursor: 520.7833
## Retention time: 19:09
## Charge: 2
## MSn level: 2
## Peaks count: 1922
## Total ion count: 26413754
Attributes of individual spectra or of all spectra of an experiment
can be accessed with their respective methods:
precursorCharge
for the precursor charge,
rtime
for the retention time, mz
for the MZ
values, intensity
for the intensities, … see the
Spectrum, Spectrum1 and Spectrum2
manuals for more details.
peaksCount(sp)
## [1] 1922
head(peaksCount(itraqdata))
## X1 X10 X11 X12 X13 X14
## 1922 1376 1571 2397 2574 1829
rtime(sp)
## [1] 1149.31
head(rtime(itraqdata))
## X1 X10 X11 X12 X13 X14
## 1149.31 1503.03 1663.61 1663.86 1664.08 1664.32
Reporter ions are defined with the ReporterIons class.
Specific peaks of interest are defined by a MZ value, a with around
the expected MZ and a name (and optionally a colour for plotting, see
section 3). ReporterIons instances are
required to quantify reporter peaks in MSnExp
experiments. Instances for the most commonly used isobaric tags like
iTRAQ 4-plex and 8-plex and TMT 6- and 10-plex tags are already
defined in MSnbase. See ?ReporterIons
for
details about how to generate new ReporterIons objects.
iTRAQ4
## Object of class "ReporterIons"
## iTRAQ4: '4-plex iTRAQ' with 4 reporter ions
## - [iTRAQ4.114] 114.1112 +/- 0.05 (red)
## - [iTRAQ4.115] 115.1083 +/- 0.05 (green)
## - [iTRAQ4.116] 116.1116 +/- 0.05 (blue)
## - [iTRAQ4.117] 117.115 +/- 0.05 (yellow)
TMT16
## Object of class "ReporterIons"
## TMT16HCD: '16-plex TMT HCD' with 16 reporter ions
## - [126] 126.1277 +/- 0.002 (#E55400)
## - [127N] 127.1248 +/- 0.002 (#E28100)
## - [127C] 127.1311 +/- 0.002 (#DFAC00)
## - [128N] 128.1281 +/- 0.002 (#DDD700)
## - [128C] 128.1344 +/- 0.002 (#B5DA00)
## - [129N] 129.1315 +/- 0.002 (#87D8000)
## - [129C] 129.1378 +/- 0.002 (#5BD500)
## - [130N] 130.1348 +/- 0.002 (#30D300)
## - [130C] 130.1411 +/- 0.002 (#05D000)
## - [131N] 131.1382 +/- 0.002 (#00CE23)
## - [131C] 131.1445 +/- 0.002 (#00CB4B)
## - [132N] 132.1415 +/- 0.002 (#00C972)
## - [132C] 132.1479 +/- 0.002 (#00C699)
## - [133N] 133.1449 +/- 0.002 (#00C4BE)
## - [133C] 133.1512 +/- 0.002 (#00A0C1)
## - [134N] 134.1482 +/- 0.002 (#0078BF)
Chromatographic data, i.e. intensity values along the retention time dimension
for a given \(m/z\) range/slice, can be extracted with the chromatogram
method. Below we read a file from the msdata
package and extract the (MS level
1) chromatogram. Without providing an \(m/z\) and a retention time range the
function returns the total ion chromatogram (TIC) for each file within the
MSnExp
or OnDiskMSnExp
object. See also section 7.3 for importing
chromatographic data from SRM/MRM experiments.
f <- c(system.file("microtofq/MM14.mzML", package = "msdata"))
mtof <- readMSData(f, mode = "onDisk")
mtof_tic <- chromatogram(mtof)
mtof_tic
## MChromatograms with 1 row and 1 column
## MM14.mzML
## <Chromatogram>
## [1,] length: 112
## phenoData with 1 variables
## featureData with 1 variables
Chromatographic data, represented by the intensity-retention time duplets, is
stored in the Chromatogram
object. The chromatogram
method returns a
Chromatograms
object (note the s) which holds multiple Chromatogram
objects and arranges them in a two-dimensional grid with columns representing
files/samples of the MSnExp
or OnDiskMSnExp
object and rows \(m/z\)-retention
time ranges. In the example above the Chromatograms
object contains only a
single Chromatogram
object. Below we access this chromatogram object. Similar
to the Spectrum
objects, Chromatogram
objects provide the accessor functions
intensity
and rtime
to access the data, as well as the mz
function, that
returns the \(m/z\) range of the chromatogram.
mtof_tic[1, 1]
## Object of class: Chromatogram
## Intensity values aggregated using: sum
## length of object: 112
## from file: 1
## mz range: [94.80679, 1004.962]
## rt range: [270.334, 307.678]
## MS level: 1
head(intensity(mtof_tic[1, 1]))
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006
## 64989 67445 77843 105097 155609 212760
head(rtime(mtof_tic[1, 1]))
## F1.S001 F1.S002 F1.S003 F1.S004 F1.S005 F1.S006
## 270.334 270.671 271.007 271.343 271.680 272.016
mz(mtof_tic[1, 1])
## [1] 94.80679 1004.96155
To extract the base peak chromatogram (the largest peak
along the \(m/z\) dimension for each retention time/spectrum) we set the
aggregationFun
argument to "max"
.
mtof_bpc <- chromatogram(mtof, aggregationFun = "max")
See the Chromatogram
help page and the vignettes from the xcms
package for more details and use cases, also on how to extract
chromatograms for specific ions.
The MSmap class can be used to isolate specific slices of
interest from a complete MS acquisition by specifying \(m/z\) and
retention time ranges. One needs a raw data file in a format supported
by mzR’s openMSfile
(mzML
,
mzXML
, …). Below we first download a raw data file from the
PRIDE repository and create an MSmap containing all the MS1 spectra
between acquired between 30 and 35 minutes and peaks between 521 and
523 \(m/z\). See ?MSmap
for details.
## downloads the data
library("rpx")
px1 <- PXDataset("PXD000001")
mzf <- pxget(px1, 7)
## reads the data
ms <- openMSfile(mzf)
hd <- header(ms)
## a set of spectra of interest: MS1 spectra eluted
## between 30 and 35 minutes retention time
ms1 <- which(hd$msLevel == 1)
rtsel <- hd$retentionTime[ms1] / 60 > 30 &
hd$retentionTime[ms1] / 60 < 35
## the map
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd, zeroIsNA = TRUE)
M
## Object of class "MSmap"
## Map [75, 401]
## [1] Retention time: 30:01 - 34:58
## [2] M/Z: 521 - 523 (res 0.005)
The M
map object can be rendered as a heatmap with plot
, as shown
on figure 1.
plot(M, aspect = 1, allTicks = FALSE)
One can also render the data in 3 dimension with the plot3D
function, as show on figure 2.
plot3D(M)