RMassBank is a two-part computational mass spectrometry workflow:
This vignette describes basic usage with the standard workflow. The package is flexible and allows for different advanced use cases. Examples of specialized applications of RMassBank are available at the RMassBank message board hosted by the Metabolomics-Forum: (http://www.metabolomics-forum.com/viewforum.php?f=29).
The library is available from Bioconductor ((http://www.bioconductor.org)). In addition to the library itself, it is recommended to install the OpenBabel chemical toolkit, available from (http://www.openbabel.org) for various platforms (or via Linux package distribution systems).
The library is loaded as follows
library(RMassBank)
## Loading required package: Rcpp
The data used in the following example is available as a package RMassBankData, which must be installed separately and is loaded using
library(RMassBankData)
RMassBank handles high-resolution LC/MS spectra in mzML format in centroid1 The term “centroid” here refers to any kind of data which are not in profile mode, i.e. don’t have continuous m/z data. It does not refer to the (mathematical) centroid peak, i.e. the area-weighted mass peak. or in profile mode.
Data in the examples was acquired using an LTQ Orbitrap XL instrument in profile mode, and converted from profile-mode RAW into centroid-mode mzML using MSConvertGUI from ProteoWizard. The settings were as shown in the screenshot below (note the “Peak Picking” filter.)
Fig. 1: ProteoWiz settings for conversion to mzML
In the standard workflow, the file names are used to identify a
compound: file names must be in the format xxxxxxxx_1234_xxx.mzXML
,
where the xxx parts denote anything and the 1234 part denotes the compound ID in
the compound list (see below). Advanced and alternative uses can be implemented;
consult the implementation of msms_workflow
and findMsMsHR
for
more information.
A compound list in CSV format is required to identify all compounds unambiguously.
The CSV file is required to have at least the following columns, which are used for
further processing and must be named correctly (but present in any order): ID, Name, SMILES, RT, CAS
. The columns ID
and SMILES
must be filled, the other columns
must be present in the file but do not need to be filled.
ID
specifies an (arbitrary) numeric ID code which must be < 4 digits long; SMILES
specifies
a SMILES code with the chemical structure of the compound (and is used to extract the
molecular formula, for calculation of molecular masses, for database searching in CTS etc.)
Although the columns Name, RT, CAS
have to be present, the
information in the columns is only used if the cells are filled.
RT, if present, specifies the retention time (in minutes; \(\pm\) a window specified in the RMassBank options, see below)
where a LC/MS file is searched for the compound spectra. CAS
and Name
are used as additional information while retrieving annotations from CTS. The
compound list doesn’t have to be ordered in any particular way. It can contain large numbers of compounds,
even compounds which will not be actively used by the script (Note: Unused compounds
don’t require a SMILES code, since they will not be accessed.)
An example list is provided with the RMassBankData package, and can be copied into a local folder, viewed and edited:
file.copy(system.file("list/NarcoticsDataset.csv",
package="RMassBankData"), "./Compoundlist.csv")
## [1] TRUE
A number of different settings influence RMassBank. They are partly parameters for data processing and partly constants used for annotation.
A settings template file, to be edited by hand, can be generated using
RmbSettingsTemplate("mysettings.ini")
where mysettings.ini
is the file that will be generated. This file
should then be edited. Important settings are:
deprofile
: Whether to use a deprofiling algorithm to work
with profile-mode data. Default is NA
for use with centroid-mode
data. Allowed settings for profile-mode data include deprofile.fwhm
(full-width half-maximum
algorithm), deprofile.spline
(cubic spline algorithm),
deprofile.localmax
(local maximum). See the respective help pages for
detailed information.rtMargin
: The deviation allowed for retention times (in minutes) when
extracting spectra from raw data files.rtShift
: The systematic retention time shift (in minutes) in
the LC-MS data compared to the values in the compound list.babeldir
: The directory pointing to the OpenBabel binaries.use_version
: which MassBank data format to use. The default is the newer version 2; alternatively, the (deprecated) version 1 can be specified for MassBank servers running old versions of the server software.use_rean_peaks
: Whether or not peaks from reanalysis should be used (see below for details.)add_annotation
: Whether or not fragments should be annotated
with the (tentative) molecular formula in MassBank records.annotations
: A list of annotation data used in the MassBank records.
authors
, copyright
, publication
, license
, instrument
, instrument_type
, compound_class
: values for the corresponding MassBank fieldsconfidence_comment
: A commentary field about “compound confidence” which is added like “COMMENT: CONFIDENCE standard compound” in the MassBank record.internal_id_fieldname
: The name for an internal ID field in the MassBank record where to store the compound ID (in the compound list). For internal_id_fieldname
= “MY_ID”, the ID will be stored like “COMMENT: MY_ID 1234”.entry_prefix
: The prefix for MassBank accession IDs.ms_type
, ionization
, lc_*
: Annotations for the LC and MS information fields in the MassBank records.ms_dataprocessing
: Tags added to describe the data
processing.
In addition to the tags specified here, MS$DATA_PROCESSING:
WHOLE RMassBank will be added (corresponding to a list(“WHOLE” = “RMassBank”) entry for this option.)annotator
: For advanced users: option to select your own custom annotator. Check ?annotator.default and the source code for details.spectraList
: The list of data-dependent scans triggered by a MS1 scan in their order; used for annotation of MassBank records. See the template file for description.accessionBuilderType
: A string (either “standard”, “simple” or “selfDefined”) to determine how to generate MassBank record accession numbers (optional, default: “standard”). RMassBank generates an accession number for each record. The structure and generation of this number varies based on accessionBuilderType
.
annotations$entry_prefix
, the first four digits are given by the compound ID. The last two digits are generated from the position of the spectrum in spectraList
and the shift defined in accessionNumberShifts
for the selected ion type (Example: the compound with ID 2112, processed in “pNa” mode ([M+Na]+), will have accession numbers XX211233, XX211234 … etc in for the first, second… spectrum in the data-dependent scan, if the “pNa” shift is set to 32.)annotations$entry_prefix
, the 6 digit code is generated from the position of the spectrum in spectraList
and the shift given in accessionNumberStart
. Leading zeros are added if necessary. (Example: accession numbers XX000043, XX000045 … will be generated for the first, second … spectrum in the data-dependent scan if accessionNumberStart
is set to 32.)accessionBuilderFile
. In particular, there is no constraint on the prefix and annotations$entry_prefix
will be ignored, if this option is chosen. The function definition must be in the form accessionBuilder <- function(cpd, spectrm, subscan)
.
Note: This functionality is quite advanced. If you really want to specify your own accessionBuilder
instead of using the “simple” or “standard” option, we highly encourage you to familiarize yourself with the source code of the function .buildRecord.RmbSpectraSet
in buildRecord.R
first.accessionNumberShifts
: A list defining the starting points
for generating MassBank record accession numbers. This will be used if accessionBuilderType
is unspecified or “standard” (see accessionBuilderType
above).accessionBuilderFile
: A file with a user-defined function to generate MassBank record accession numbers. This will be used if accessionBuilderType
is “selfDefined” (see accessionBuilderType
above.)accessionNumberStart
: An integer < 1000000 defining the starting point of MassBank record accession numbers. This will be used if accessionBuilderType
is “simple”. (see accessionBuilderType
above).project
: A string giving the project tag, optional. If present, this will be inclueded in the PROJECT
field of the record.recalibrateBy
: Which parameter to use for recalibration:
dppm
(recalibrate the deviation in ppm) or dmz
(recalibrate the m/z deviation).recalibrateMS1
: Whether to recalibrate MS1 data points
separately from MS2 data points ("separate"
), with the same
recalibration curve as the MS2 data points ("common"
) or not at all
("none"
). Note that the MS1 datapoints points will be used to
generate the MS2 recalibration curve in all cases (since this makes the
recalibration curve in high-m/z regions better-defined) but may be
recalibrated independently themselves, if desired.recalibrator
: Sets the functions to use for recalibration.
Defaults to list(MS1="recalibrate.loess", MS2="recalibrate.loess")
which uses a Loess non-parametric fit to generate a recalibration curve. Any
custom function may be specified. The function is expected to accept a
dataset with variables recalfield
and mzFound
and to return an
object which can be used with predict()
. The input recalfield
is the value to be estimated by recalibration - it will either contain delta
ppm values or absolute mass deviations, depending on the setting for
recalibrateBy
. In addition to recalibrate.loess
,
recalibrate.MS1
is predefined, which uses a GAM model for
recalibration and appears to work well for pure MS1 datapoints. However,
common recalibration for MS1 and MS2 appears to be the best option in
general.multiplicityFilter
: Define the multiplicity filtering level. Default is 2, a value of 1 is off (no filtering) and >2 is harsher filtering.titleFormat
: The title of MassBank records is a mini-summary
of the record, for example “Dinotefuran; LC-ESI-QFT; MS2; CE: 35%; R=35000; [M+H]+”.
By default, the first compound name CH$NAME
, instrument type
AC$INSTRUMENT_TYPE
, MS/MS type AC$MASS_SPECTROMETRY: MS_TYPE
,
collision energy RECORD_TITLE_CE
, resolution AC$MASS_SPECTROMETRY: RESOLUTION
and precursor MS$FOCUSED_ION: PRECURSOR_TYPE
are used. If alternative
information is relevant to differentiate acquired spectra, the title should be adjusted.
For example, many TOFs do not have a resolution setting. See MassBank documentation for more.filterSettings
: A list of settings that affect the MS/MS processing.
ppmHighMass
, ppmLowMass
: values for pre-processing,
prior to recalibration. The default settings (for e.g. Orbitrap) is 10 ppm
for high mass range, 15 ppm for low mass range (defined by massRangeDivision
)massRangeDivision
: The m/z value defining the split between
ppmHighMass
and ppmLowMass
above. The default m/z 120 is
recommended for Orbitraps.ppmFine
: This defines the ppm cut-off post recalibration.
The default value of 5 ppm is recommended for Orbitraps.prelimCut
, prelimCutRatio
: Intensity cut-off and cut-off ratio
(in % of the most intense peak) for pre-processing. Affects peak selection
for the recalibration only. Careful: the default 1e4 for Orbitrap LTQ positive could
remove all peaks for TOF data and will remove too many peaks for Orbitrap LTQ
negative mode spectra!specOkLimit
: MS/MS must have at least one peak above this limit
present to be processed.dbeMinLimit
: The minimum allowable ring and double bond equivalent (DBE)
allowed for assigned formulas. Assumes maximum valences for elements with multiple
possible valences. Default is -0.5 (accounting for fragment peaks being ions).satelliteMzLimit
, satelliteIntLimit
: Cut-off m/z and
intensity values for satellite peak removal. All peaks within the m/z (default 0.5)
and intensity ratio (default 0.05 or 5 %) of the respective peak will be removed.
Applicable to Fourier Transform instruments (e.g. Orbitrap).findMsMsRawSettings
: Parameters for adjusting the raw data retrieval.
ppmFine
: The ppm error to look for the precursor in the MS1 (parent)
spectrum. Default is 10 ppm for Orbitrap.mzCoarse
: The error to search for the precursor specification in the
MS2 spectrum. This is often only saved to 2 decimal places and thus inaccurate and
may also depend on the isolation window.
The default settings (for e.g. Orbitrap) is m/z=0.5 for mzCoarse
.fillPrecursorScan
: The default value (FALSE) assumes all
necessary precursor information was available in the mzML file. A setting of
TRUE tries to fill in the precursor data scan number if it is missing.logging_file
: Set a file logs should be written to. By default, logging_file
is not specified and all logging information is written to STDOUT. Note: This setting will cause a static package variable to contain the logging file. This variable is checked by the logging functions, rather than the setting. Hence, changing the setting manually afterwards will not change the logging file.See also the manpage ?RmbSettings
for a description of all RMassBank
settings.
In the first part of the workflow, spectra are extracted from the files and processed. In the following example, we will process the narcotics spectra from the RMassBankData package.
For the workflow to work correctly, a settings file (generated as above and edited accordingly) before must be loaded first.
loadRmbSettings("mysettings.ini")
(Note: the template file generated by RmbSettingsTemplate()
has no OpenBabel
directory specified.
Correspondingly, RMassBank will use the CACTUS service instead to generate MOL
files. For your actual use, it is strongly recommended to install OpenBabel and
specify its install directory in the settings! The CACTUS structures are
visually less appealing since they have all hydrogen atoms explicit, and CACTUS
is only a backup solution.)
First, create a workspace for the msmsWorkflow
:
w <- newMsmsWorkspace()
The full paths of the files must be loaded into the container in the array
files
:
files <- list.files(system.file("spectra", package="RMassBankData"),
".mzML", full.names = TRUE)
basename(files)
## [1] "1_3_Chlorophenyl_piperazin_2818_pos.mzML"
## [2] "1_3_Trifluoromethylphenyl_piperazin_2819_pos.mzML"
## [3] "1_Benzylpiperazin_2820_pos.mzML"
## [4] "Amitriptylin_2821_pos.mzML"
## [5] "Amphetamin_2822_pos.mzML"
## [6] "Benzoylecgonin_2823_pos.mzML"
## [7] "Cocain_2817_pos.mzML"
## [8] "Dextromethorphan_2824_pos.mzML"
## [9] "EDDP_2_Ethyl_1_5_dimethyl_3_3_diphenylpyrrolinium_2825_pos.mzML"
## [10] "Ephedrin_2758_pos.mzML"
## [11] "Ketamin_2826_pos.mzML"
## [12] "Mephedron_4_Methylmethcathinon_2827_pos.mzML"
## [13] "Methadon_2828_pos.mzML"
## [14] "Methamphetamin_2829_pos.mzML"
## [15] "Naltrexon_2830_pos.mzML"
# To make the workflow faster here, we use only 2 compounds:
w@files <- files[1:2]
Note the position of the compound IDs in the filenames. Historically, the “pos
” at the end was used to denote the polarity; it is obsolete now, but the ID must be terminated with an underscore.
Additionally, the compound list must be loaded using loadList
(here,
using the formerly copied list from RMassBankData):
loadList("./Compoundlist.csv")
This creates a variable compoundList
in the global environment, which stores the compound data.
Now, we can start the complete workflow to extract [M+H]+ spectral data. The
workflow standard workflow consists of 8 steps.
The argument archivename
specifies the prefix under which to store the analyzed result
files. The argument mode
specifies the processing mode: pH
(positive H)
specifies [M+H]+, pNa
specifies [M+Na]+, pM
specifies [M]+, mH
and
mFA
specify [M-H]- and [M+FA]-, respectively. (I apologize for the naming of pH
which has absolutely nothing to do with chemical pH values.)
Basically, this runs through the entire workflow, which is explained in more detail below:
* Step 1: using the function findMsMsHR
, all the files in files
are searched for MS2 spectra of their respective compound. The found spectra are stored in the array specs
.
* Step 2: A molecular formula fit is attempted for every peak, using the molecular formula of the parent compound as limiting formula, using the function analyzeMsMs
. The results are stored in the array analyzedSpecs
.
* Step 3: The analyzed spectra from the array analyzedSpecs
are aggregated into the list aggregatedSpecs
. This uses the function aggregateSpectra
.
* Step 4: Using the function recalibrateSpectra
, a recalibration curve is calculated from the peaks in aggregatedSpecs
, and all spectra from specs
are recalibrated using this curve. The result is stored in recalibratedSpecs
. The recalibration curve is stored in rc
.
* Step 5: The recalibrated spectra (recalibratedSpecs
) are re-analyzed with analyzeMsMs
and the results stored in analyzedRcSpecs
.
* Step 6: The reanalyzed recalibrated spectra are aggregated with aggregateSpectra
into aggregatedRcSpecs
. Unmatched peaks in aggregatedRcSpecs
are cleaned from known electronic noise using cleanElnoise
. A backup copy of all present results is saved as archivename``.RData
.
* Step 7: Using reanalyzeFailpeaks
, all unmatched peaks from spectra in aggregatedRcSpecs
are reanalyzed, allowing \(N_2O\) as additional elements (to account for oxidation products and \(N_2\) adducts). The results are stored in reanalyzedRcSpecs
. A backup copy of all present results is saved as archivename``_RA.RData
* Step 8: The function filterMultiplicity
is applied to the peaks: Peaks which occur only once in all analyzed spectra of a compound are eliminated. The filtered list is stored under refilteredSpecs
, and a final version of all results is saved as archivename``_RF.RData
. Additionally, filterMultiplicity
creates a CSV file with a list of (relatively) high-intensity unassigned peaks with the name archivename``_Failpeaks.csv
, which should be manually checked. Peaks to include must be marked with OK = 1.
The steps can be called individually using the steps
parameter of msms_workflow
.
Using the newRecalibration
parameter, one can specify if RMassBank should do a new
recalibration (default, TRUE
) or use the recalibration curve stored in rc
(FALSE
). This is useful for re-using a recalibration curve in the reanalysis of the same
data in another mode: After the detection and processing of all [M+H]+ spectra, which will be
present for a large number of compounds, one can rerun the workflow with newRecalibration = F, mode="pNa"
and reuse the same calibration curve for Na adduct spectra (which on their own would be too few
for a sufficiently good recalibration curve.) The useRtLimit
parameter activates or
deactivates the usage of retention time constraints when searching for spectra with
findMsMsHR
.
It is useful to perform the workflow in two blocks, the first being step 1-4 and
the second being 5-8. After step 4, a graph is displayed which allows the user
to visually evaluate the performance of the recalibration. The top graphs show
the distribution of the mass deviation of MS/MS fragments from the predicted
mass and the recalibration curve calculated from them; the bottom graphs show
the mass deviation of MS precursor ions. The graph to the left is a complete xy
plot while the graph to the right is a 2D histogram (if the package gplots
is
installed on the user’s computer).
TODO: Workflow execution in Chunk 10 is currently disabled, I execute Chunk 11 instead for steps that are already done.
w <- msmsWorkflow(w, mode="pH", steps=c(1:4), archivename =
"pH_narcotics")