This document describes the purpose and use of the R-package LOBSTAHS: Lipid and Oxylipin Biomarker Screening through Adduct Hierarchy Sequences. The package is described in Collins et al. 2016.1 In the sections below, LOBSTAHS package functions are applied to a model dataset using example code. LOBSTAHS requires the additional packages xcms2 and CAMERA3; the model dataset, consisting of lipid data from cultures of the marine diatom Phaeodactylum tricornutum, is contained in the R data package PtH2O2lipids.
LOBSTAHS contains several functions to help scientists discover and identify lipid and oxidized lipid biomarkers in HPLC-MS data that have been pre-processed with the popular R packages xcms and CAMERA. First, LOBSTAHS uses exact mass to make initial compound assignments from a set of customizable onboard databases. LOBSTAHS allows users to easily generate custom databases containing entries for a wide range of lipids, oxidized lipids, and oxylipins; these are created using an automated, iterative computational approach based on structural criteria specified by the user in simple text tables/spreadsheets. After database identification, a series of orthogonal screening criteria are applied to refine and winnow the list of assignments. A basic workflow based on xcms, CAMERA, and LOBSTAHS is illustrated in the schematic. Each step in the figure is described in detail in the following paragraphs.
Users can the current production version of LOBSTAHS from Bioconductor by following the directions here (under “Installation”). The Bioconductor installation function will prompt you to install the latest versions of some other packages on which LOBSTAHS depends.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install()
Note: At this point, you should verify that you have, in fact, installed the most current version of Bioconductor. You may be required to install the latest version of R itself before you can obtain the latest version of Bioc. If you fail to ensure you are running the latest version of Bioconductor at this point, you won’t necessarily be able to install the most up-to-date version of LOBSTAHS since the Bioc installer will only provide you will the most recent versions of packages attached to that version of Bioconductor. (If you are confused by any of this, check out this help section. Or, just install the current development version directly from GitHub using the directions, below.) Once you are satisfied that you’re running the latest version of Bioconductor, you can then install LOBSTAHS:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("LOBSTAHS")
Users are also encouraged to download the PtH2O2lipids data package, which can be used for familiarization with the software.
Following these directions, you will install the latest version of the software from the files present in this GitHub repository. Some features may be unstable.
For windows: Download and install RTools from http://cran.r-project.org/bin/windows/Rtools/
For Unix: Install the R-development-packages (r-devel or r-base-dev)
The build_vignettes = TRUE
argument is required if
rendering of the full vignette is desired (recommended).
install_github()
does not render vignettes by default.
Data for LOBSTAHS should be acquired using a mass spectrometer with sufficiently high mass accuracy and resolution. Suitable MS acquisition platforms with which LOBSTAHS has been tested include Fourier transform ion cyclotron resonance (FT-ICR) and Orbitrap instruments. A quadrupole time-of-flight (Q-TOF) instrument could also be used if it possessed sufficient mass accuracy (i.e., < 5-7 ppm). While the software will accept any HPLC-MS data as input, insufficient mass accuracy and resolution will produce results with large numbers of database matches for each feature that cannot be distinguished from each other.
After acquisition, each data file should be converted from manufacturer to open source (.mzXML) format in centroid (not profile) mode. If data were acquired on the instrument using ion mode switching, the data from each ionization mode (i.e., polarity) must also be extracted into a separate file. The msConvert tool (part of the ProteoWizard toolbox) can be used to accomplish these tasks. msConvert commands can be executed with either the provided GUI or at the command line; however, conversion of manufacuter file formats can only be accomplished using the Windows installation. An R script (Exactive_full_scan_process_ms1+.r) for batch conversion and extraction of data files acquired on a Thermo Exactive Orbitrap instrument is provided as part of the Van Mooy Lab Lipidomics Toolbox. This vignette is not intended as a comprehensive guide to mass spectrometer file conversion; users are encouraged to digest the msConvert documentation. However, assuming a hypothetical data file “Exactive_data.raw” was acquired on an Thermo Orbitrap instrument with ion mode switching, the following code could be within R used to convert and then extract positive and negative mode scans to separate files. (Code presumes you’ve already installed msConvert.)
After all data files in a particular dataset have been converted and extracted into files of like polarity, the R-package xcms can then be used to perform feature detection, retention time correction, and peak grouping. While the paragraphs below contain basic instructions for preparation of data, this vignette is not intended as ae manual or guide to the complex world of mass spectrometer data processing in xcms and CAMERA; users should acquaint themselves with the manuals (here and here) and very helpful vignettes (here and here) for the two packages.
First, data files should be given intuitive file names (containing, e.g., a sample ID along with information on experimental timepoint and treatment) and placed into a directory structure according to an index variable; the xcms vignette LC/MS Preprocessing and Analysis with xcms contains a detailed explanation. Feature detection, retention time correction, and peak grouping should then be performed. Values of parameters for xcms functions can be obtained from several sources:
The R script prepOrbidata.R (part of the Van Mooy Lab Lipidomics Toolbox) contains code for complete preparation in xcms of the PtH2O2lipids (or similar) dataset. The script allows the user to apply parameter values assembled from the literature or obtain them from IPO optimization of a subset of samples. The final parameter values used in Collins et al. 20161 for analysis of the PtH2O2lipids dataset – obtained using HPLC-ESI-MS on an Exactive Orbitrap instrument – are given in Table S5 of the electronic supplement. The settings and values listed in the table could be used as a starting point for analysis of similar data.
The xcmsSet produced using these settings from the P. tricornutum data is stored in the PtH2O2lipids package.
Whatever settings are used in xcms, the processed data should contain
high-quality features which have been aligned across samples; the
quality of the processed data should be verified by individual
inspection of a subset of features. At the conclusion of processing with
xcms, the user should have a single xcmsSet
object
containing the dataset.
In the final pre-processing step, the R-package CAMERA
should be used to (1) aggregate peak groups in the xcmsSet
object into pseudospectra and (2) identify features in the data
representing likely isotope peaks. Use of CAMERA to aggregate the peak
groups into pseudospectra is critical because LOBSTAHS applies its
various orthogonal screening criteria to only those peak groups within
each pseudospectrum. CAMERA should not be used to eliminate adduct ions
since the adduct ion hierarchy screening function in LOBSTAHS presumes
that all adduct ions for a given analyte will be present in the dataset.
We can create an xsAnnotate
object from the P.
tricornutum data by applying the wrapper function
annotate()
to the xcmsSet
we created in the
previous step:
library(xcms)
library(CAMERA)
library(LOBSTAHS)
# first, a necessary workaround to avoid a import error; see
# https://support.bioconductor.org/p/69414/
imports = parent.env(getNamespace("CAMERA"))
unlockBinding("groups", imports)
imports[["groups"]] = xcms::groups
lockBinding("groups", imports)
# create annotated xset using wrapper annotate(), allowing us to perform all
# CAMERA tasks at once
xsA = annotate(ptH2O2lipids$xsAnnotate@xcmsSet,
quick=FALSE,
sample=NA, # use all samples
nSlaves=1, # set to number of available cores or processors if
# > 1
# group FWHM settings (defaults)
sigma=6,
perfwhm=0.6,
# groupCorr settings (defaults)
cor_eic_th=0.75,
graphMethod="hcs",
pval=0.05,
calcCiS=TRUE,
calcIso=TRUE,
calcCaS=FALSE,
# findIsotopes settings
maxcharge=4,
maxiso=4,
minfrac=0.5,
# adduct annotation settings
psg_list=NULL,
rules=NULL,
polarity="positive", # the PtH2O2lipids xcmsSet contains
# positive-mode data
multiplier=3,
max_peaks=100,
# common to multiple tasks
intval="into",
ppm=2.5,
mzabs=0.0015
)
#> Start grouping after retention time.
#> Created 113 pseudospectra.
#> Generating peak matrix!
#> Run isotope peak annotation
#> % finished: 10 20 30 40 50 60 70 80 90 100
#> Found isotopes: 5692
#> Start grouping after correlation.
#> Generating EIC's ..
#>
#> Calculating peak correlations in 113 Groups...
#> % finished: 10 20 30 40 50 60 70 80 90 100
#>
#> Calculating isotope assignments in 113 Groups...
#> % finished: 10 20 30 40 50 60 70 80 90 100
#> Calculating graph cross linking in 113 Groups...
#> % finished: 10 20 30 40 50 60 70 80 90 100
#> New number of ps-groups: 5080
#> xsAnnotate has now 5080 groups, instead of 113
#> Generating peak matrix for peak annotation!
#>
#> Calculating possible adducts in 5080 Groups...
#> % finished: 10 20 30 40 50 60 70 80 90 100
We now have an xsAnnotate
object “xsA” to which we will
next apply the screening and identification functions of LOBSTAHS. In
the annotate()
call, we set quick = FALSE
because we want to run groupCorr()
. This will also cause
CAMERA to perform internal adduct annotation. While we will perform our
own adduct annotation later with LOBSTAHS, allowing CAMERA to identify
its own adducts doesn’t hurt, particularly if it leads to the creation
of better pseudospectra.
Before screening a dataset with LOBSTAHS, we must first decide
whether to use one of two default databases, or to generate our own from
the templates provided (instructions below). LOBSTAHS databases contain
a mixture of in silico and empirical data for the different
adduct ions of a wide range of intact polar diacylglycerols (IP-DAG),
triacylglycerols (TAG), polyunsaturated aldehydes (PUAs), free fatty
acids (FFA), and common photosynthetic pigments. In addition, the latest
LOBSTAHS release includes support for lyso lipids under an “IP_MAG”
species class. Functionality for other lipid classes is added regularly.
The default databases (as of November 8, 2017) include 25,741 and 21,063
unique compounds that can be identifed in positive and negative
ionization mode data, respectively. The databases can be easily
customized (see below)
if we wish to identify additional lipids in new lipid classes. LOBSTAHS
databases are contained in LOBdbase
objects that are
generated or accessed by various package functions. Each
LOBdbase
is specific to a particular polarity (i.e., ion
mode); when evaluating positive-mode data, we use a positive mode
database (and vice versa).
By virtue of the origin of LOBSTAHS — the study of lipids in marine algae and bacteria — most of the lipids currently in the database are of marine microbial origin. While algae and marine bacteria have many lipids in common with humans and other mammals, there are many classes which are not currently represented in the databases. If you find yourself adding new lipids to the default databases because the existing coverage is insufficient in your research area, consider sharing your additions with others via a pull request or by opening an issue on the package GitHub page. Some new lipid classes (e.g., sphingolipids and most non-glycerolipids) may not be amenable to in silico simulation in the same easy fashion as the existing lipid classes for which functionality already exists in the package code. In these cases, users should open a new issue request for new functionality or — in the spirit of open-source collaboration — modify the code in your own Git fork and then request incorporation of your new feature so others may also discover new lipids!
We can access the default databases to examine their scope:
##
## A positive polarity "LOBdbase" object.
##
## Contains entries for 133273 possible adduct ions of 25741 unique parent compounds.
##
## Parent lipid types ( 22 ): astaxanthin, bile_salt, DNPPE, fungalGSL, hapCER, hapGSL, hGSL, IP_DAG, IP_MAG, PDMS, pigment, plastoquinone_9, plastoquinone_9OH, plastoquinone_9OH2, PUA, reduced_scytonemin, scytonemin, sGSL, sterol, TAG, ubiquinone, vGSL
## IP-DAG classes ( 14 ): WaxEster, DGCC, DGDG, DGTS_DGTA, DAG, PA, PE, PG, PC, SQDG, MGDG, BLL, PDPT, S_DGCC
## IP-MAG classes ( 12 ): CoprostanolEsters, CholesterolEsters, LDGCC, LDGDG, LDGTS_DGTA, MAG, LPA, LPE, LPG, LPC, LSQDG, LMGDG
## Photosynthetic pigments ( 22 ): Chl_a, 19prime_but_fuco, 19prime_hex_fuco, Allox, Alpha_carotene, Beta_carotenes, Chl_b, Chl_c2, Chl_c3, Chlide_a, Croco, Dd_Ddc, Dt, Echin, Fuco, Lut, Neox_Nos, Peri, Pheophytin_a, Pras, Viol, Zeax
## Adducts ( 9 ): [M+2Na-H]+, [M+2Na+Cl]+, [M+C2H3Na2O2]+, [M+C4H10N3]+, [M+H]+, [M+K]+, [M+Na]+, [M+NH4]+, [M+NH4+ACN]+
##
## m/z range: 97.0647914-2016.54119204
##
## Ranges of chemical parameters represented in molecules with acyl moieties:
##
## Total number of acyl carbon atoms: 6-78
## Total number of acyl carbon-carbon double bonds: 0-18
## Number of additional oxygen atoms: 0-4
##
## Memory usage: 11.4 MB
##
## A negative polarity "LOBdbase" object.
##
## Contains entries for 97007 possible adduct ions of 21063 unique parent compounds.
##
## Parent lipid types ( 16 ): bile_salt, DNPPE, FFA, fungalGSL, hapCER, hapGSL, hGSL, IP_DAG, IP_MAG, pigment, plastoquinone_9, PUA, scytonemin, sGSL, ubiquinone, vGSL
## IP-DAG classes ( 13 ): DGCC, DGDG, DGTS_DGTA, DAG, PA, PE, PG, PC, SQDG, MGDG, BLL, PDPT, S_DGCC
## IP-MAG classes ( 10 ): LDGCC, LDGDG, LDGTS_DGTA, MAG, LPA, LPE, LPG, LPC, LSQDG, LMGDG
## Photosynthetic pigments ( 22 ): Chl_a, 19prime_but_fuco, 19prime_hex_fuco, Allox, Alpha_carotene, Beta_carotenes, Chl_b, Chl_c2, Chl_c3, Chlide_a, Croco, Dd_Ddc, Dt, Echin, Fuco, Lut, Neox_Nos, Peri, Pheophytin_a, Pras, Viol, Zeax
## Adducts ( 11 ): [M-H]-, [M+2NaAc+Cl]-, [M+3Ac+2Na]-, [M+Cl]-, [M+HAc-H]-, [M+K-2H]-, [M+Na-2H]-, [M+Na+Cl-H]-, [M+NaAc-H]-, [M+NaAc+Cl]-, [M+NaAc+HAc-H]-
##
## m/z range: 95.05023848-1459.92498456
##
## Ranges of chemical parameters represented in molecules with acyl moieties:
##
## Total number of acyl carbon atoms: 6-52
## Total number of acyl carbon-carbon double bonds: 0-12
## Number of additional oxygen atoms: 0-4
##
## Memory usage: 8.55 MB
Though it might not be evident from examination of these defaults,
LOBSTAHS generates databases based on values of parameters defined in
simple tables. The values in these tables define the molecular
properties of each individual compound or lipid class for which database
entries will be created. Database generation is accomplished with the
function generateLOBdbase()
. Default values for the various
inputs can be viewed by loading the default tables into the R workspace
(directions below); the default input data are also given in Tables 1
and 2 of Collins et al. 2016.1 Package
versions ≥ 1.1.2 contain adduct hierarchy and retention time window data
for lipid classes BLL, PDPT, vGSL, sGSL, hGSL, hapGSL, and hapCER, which
are described in Fulton et al. 2014 and Hunter et al. 2015.6. By modifying the default values in each table
using provided templates, users can generate custom databases that
include additional lipid classes. Customization directions
are provided in the
following section.
Users can load the default versions of the four input tables into the R workspace (and subsequently view them) at any time:
Custom databases can be created in LOBSTAHS by modifying or adding to the default values in the four tables. We can generate custom databases in three steps:
.libPaths()
to find the location of
your R library.) Alternatively, templates for the tables used to
generate databases in the latest release of LOBSTAHS can downloaded from
the package
GitHub repository.generateLOBdbase()
, which will then uses the value to
generate the LOBSTAHS databasesEach of the four tables is introduced below, followed by instructions for adding new lipids and lipid classes to the LOBSTAHS databases.
We’ve also created a flow chart to help you navigate the database customization process; the flow chart is designed to complement — not replace — the detailed instructions below. Download a copy of the flow chart: as .pdf | as .svg
componentCompTable
The first, most important table is the
componentCompTable
, which defines the base elemental
compositions of each molecule or lipid class to be considered when
databases are created. If we wish to add additional molecules or classes
of molecules to the default set of compounds in the database, we’ll
first need to add an additional entry or entries to this table.
For each lipid class or compound specified in the
componentCompTable
, the field
DB_gen_compound_type
contains one of five values:
DB_gen_compound_type
is the field which determines how a
particular compound, lipid class, or molecular fragment will be used and
evaluated by the software. The last three compound types
(“basic_component,” “adduct_pos,” and “adduct_neg”) are reserved for
definition of basic components such as acteonitrile or acetate and for
definition of adduct ion types; new entries of these types should only
be created in the componentCompTable
when a new adduct or
basic component is to be specified.
The other two compound types in the componentCompTable
(“DB_acyl_iteration” and “DB_unique_species”) are used to define the way
generateLOBdbase
creates database entries for different
lipids and lipid classes. generateLOBdbase
can create two
kinds of database entries:
When DB_gen_compound_type
is set to
“DB_unique_species”, generateLOBdbase()
will create
database entries only for adduct ions of the individual compound
specified. (This is the simpler of the two cases.) We use this type for
pigments and other lipids that do not have acyl groups, or in cases when
we do not wish to consider any possible variation in acyl properties. In
the case where DB_gen_compound_type
= “DB_unique_species”,
the exact mass of the complete (neutral) molecule is
specified in the component definitions table. The photosynthetic
pigments currently in the database provide an excellent template for
creation of new compounds of the “DB_unique_species” type.
When DB_gen_compound_type
is set to
“DB_acyl_iteration”, generateLOBdbase
will automatically
create multiple database entries for adduct ions of multiple molecular
species within a lipid class based on the ranges of acyl properties and
oxidation states specified in two other user-editable tables,
acylRanges
and oxyRanges
. In the case of lipid
classes for which DB_gen_compound_type
=
“DB_acyl_iteration,” the compound table is used to define the
exact mass of a “base fragment” for the lipid class.
Proper specification of the base fragment is
explained below. Using the
fragment as a starting point, generateLOBdbase
creates
multiple entries for molecules in the lipid class by iterative addition
of various combinations of fatty acids.
acylRanges
and oxyRanges
tablesacylRanges
specifies the ranges of acyl properties to be
considered for each lipid class when DB_gen_compound_type
is set to “DB_acyl_iteration”. These properties include total acyl
carbon chain length (i.e., the total number of carbon atoms in the fatty
acids that make up TAG, PUA, IP-DAG, and IP-MAG) and degree of acyl
carbon chain unsaturation (i.e., the number of possible carbon-carbon
fatty acid double bonds) considered for each lipid class.
oxyRanges
specifies the range of possible oxidation
states (i.e., the number of additional oxygen atoms to be considered on
each compound) we wish to consider for each of these compounds
adductHierarchies
tableA fourth table, adductHierarchies
, contains empirical
data on the relative abundances of various adduct ions formed by the
compounds that belong to each lipid class. This table is also
user-editable, but any additions or changes should first be confirmed by
empirical analysis. Note that data should be provided
in adductHierarchies
for all lipid classes or compounds,
regardless of the DB_gen_compound_type
, above. In addition,
an adduct hierarchy must be specified in the
adductHierarchies
table for each compound or compound class
specified in the Adduct_hierarchy_lookup_class
field of the
componentCompTable
.
What modifications we make to the table(s) will depend on our goal.
For example, if we wish only to expand (or constrain) the range of
molecular properties for which database entries are to be created within
an existing lipid class (i.e., of DB_gen_compound_type
=
“DB_acyl_iteration”), we’ll only have to modify existing values in the
acylRanges
and/or oxyRanges
tables. However,
if we wish to add new molecule(s) or lipid class(es) to the database(s),
we’ll first need to define the new molecule/class in both the
componentCompTable
( LOBSTAHS_componentCompTable.xlsx)
and adductHierarchies
table ( LOBSTAHS_adductHierarchies.xlsx).
This is relatively trivial; don’t be discouraged! We will consider two
cases:
If the new entry is a single molecule (e.g., a new pigment), then new
entries in the componentCompTable
and
adductHierarchies
table (and, if desired, in the
rt.windows
table; see
below) will be
sufficient.
Specify the DB_gen_compound_type
In this instance, we should set “DB_unique_species” =
DB_gen_compound_type
when defining the species in the
componentCompTable
.
Define neutal mass of molecule
The entire exact mass of the neutral compound should
be defined in the componentCompTable
; we do not define the
mass of a charged adduct ion, since adduct ion masses will be calculated
automatically by the software.
Define adduct hierarchies
We’ll then need to define appropriate adduct hierarchies for the
compound in the adductHierarchies
table, with compound or
class names that match the Adduct_hierarchy_lookup_class
field in the componentCompTable
. The same adduct hierarchy
can be used for multiple compounds independently defined using the
“DB_unique_species” type so long as the same adduct class name is
used.
Specify retention time data, if desired
Retention time window data for the compound can also be added (if
desired) to the rt.windows
table (see
below).
Alternatively, if we wish to add an entire class of lipid for which
multiple database entries will be created based on ranges of acyl
properties and oxidation states specified in acylRanges
(
LOBSTAHS_acylRanges.xlsx)
and oxyRanges
( LOBSTAHS_oxyRanges.xlsx),
we have to take just a bit more time to define the class and
its acyl properties.
Specify the DB_gen_compound_type
First, in the componentCompTable
, we should set
DB_gen_compound_type
= “DB_acyl_iteration”.
Determine and define the base fragment
We then have to define the mass of the “base fragment” upon which the
generateLOBdbase()
function will perform iteration. The
base fragment should include the glycerol backbone, any carboxylic
oxygen atoms in the fatty acid(s), and (if applicable), the entire lipid
headgroup. In the case of IP-DAG and IP-MAG, the base fragment will
include the entire polar headgroup, the glycerol backbone, and both
carboxylic oxygen atoms in the fatty acid(s). In the case of TAG, the
base fragment is defined as the glycerol backbone plus the carboxylic
oxygen atoms on each of the three fatty acids. The base fragments for
any new lipid classes for which the user desires evaluation of a range
of acyl properties should be similarly defined.
An example is given for phosphatidylcholine (PC), one of the eight
basic IP-DAG classes that has been included in LOBSTAHS since its first
release. We begin with an intact molecule, in this case PC 18:1, 18:1
(which would appear in a LOBSTAHS database as PC 36:2). Any intact PC
species could be used as a starting point since our concern here is with
the polar (headgroup) end of the molecule, which does not change under
the simulation performed by generateLOBdbase()
:
This intact neutral molecule has the chemical formula
C44H84NO8P and an
exact mass of 785.59346 amu. The C-1 carbon atom in
both fatty acids is shown explicitly for ease of illustration. We define
the base fragment for PC (and all other IP-DAG) as the polar headgroup,
glycerol backbone, and both the carboxylic oxygen atoms (in red):
In this case, the base fragment (shown in red) has the chemical
formula C8H18NO8P and
a mass of 287.07700 amu. One “easy” way to determine
the formula (and therefore, the exact mass) of the base fragment is to
use the fragmentation feature in an application such as ChemDraw. First,
we homolytically cleave the bond(s) between the C-1 and C-2 carbon
atom(s) in each acyl chain on which iteration is to be performed, giving
one electron to each atom: