This project is a parser, validator and normalizer implementation for shorthand lipid nomenclatures, based on the Grammar of Succinct Lipid Nomenclatures (Goslin) project.
Goslin defines multiple grammars for different sources of shorthand lipid nomenclature. This makes it possible to generate parsers from the defined grammars, which provide immediate feedback on whether a lipid shorthand notation string complies with a particular grammar.
NOTE: Please report any issues you might find to help improve it!
Here, rgoslin 2.0 uses the Goslin grammars and the cppgoslin parser to support the following general tasks:
The package can be installed with the BiocManager package as follows:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rgoslin")
In order to use the provided translation functions of rgoslin, you first need to load the library. If rgoslin is not yet available, please follow the instructions in the previous section on “Installation”.
library(rgoslin)
If you want to check which grammars are supported, use the following command:
listAvailableGrammars()
#>  "Shorthand2020" "Goslin" "FattyAcids" "LipidMaps"
#>  "SwissLipids" "HMDB"
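The list of supported grammars can also be used to catch a misspelled grammar name before parsing. The following is a minimal sketch, not part of rgoslin: `parseWithGrammar` is a hypothetical helper introduced only for illustration, and it relies on base R's `match.arg`, which also accepts unambiguous partial matches.

```r
library(rgoslin)

grammars <- listAvailableGrammars()

# Hypothetical helper (not part of rgoslin): stop early if the requested
# grammar is not among the supported ones, then delegate to parseLipidNames.
parseWithGrammar <- function(lipidNames, grammar) {
  grammar <- match.arg(grammar, choices = grammars)
  parseLipidNames(lipidNames, grammar = grammar)
}

parsed <- parseWithGrammar("PC 32:1", "Goslin")
```

Calling the helper with a name that is not in the list raises an error from `match.arg` instead of passing the name on to the parser.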
To check whether a given lipid name can be parsed by any of the parsers, you can use the isValidLipidName method. It returns TRUE if the given name can be parsed by any of the available parsers and FALSE otherwise.
isValidLipidName("PC 32:1")
#>  TRUE
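To screen several names at once, the single-name check above can be applied element by element. This is a small sketch that assumes isValidLipidName is called with one string at a time, which is why it is wrapped in `vapply`; the candidate names, including the deliberately malformed last entry, are illustrative.

```r
library(rgoslin)

# Illustrative candidates; the last entry is deliberately not a lipid name.
candidates <- c("PC 32:1", "LPC 34:1", "this is not a lipid")

# Call isValidLipidName once per name and collect a named logical vector.
isValid <- vapply(candidates, isValidLipidName, logical(1))

# Names that none of the available parsers accepted.
rejected <- names(isValid)[!isValid]
rejected
```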
Calling parseLipidNames with a single lipid name returns a data frame with one row, whose columns hold the properties of the parsed lipid name.
df <- parseLipidNames("PC 32:1")
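To get a feel for the returned data frame, it helps to look at its dimensions and a handful of columns. The sketch below only selects columns that are used later in this document; the full set of columns is larger and is not listed here.

```r
library(rgoslin)

df <- parseLipidNames("PC 32:1")

# One row per input name; each column is one property of the parsed lipid.
dim(df)

# A few of the columns that are used later in this document.
selectedColumns <- c("Normalized.Name", "Lipid.Maps.Category",
                     "Lipid.Maps.Main.Class", "Species.Name",
                     "Mass", "Sum.Formula")
df[, selectedColumns]
```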
If you want to choose the grammar to parse against manually, pass its name as the second argument:
originalName <- "TG(16:1(5E)/18:0/20:2(3Z,6Z))"
tagDf <- parseLipidNames(originalName, grammar = "LipidMaps")
If you want to parse multiple lipid names, use the parseLipidNames method with a vector of lipid names. This returns a data frame of properties of the parsed lipid names with one row per lipid.
NOTE: If the grammar argument is omitted, all available parsers are tried and the first one to parse the name successfully wins. This consequently takes longer than explicitly setting the grammar to select a single parser.
multipleLipidNamesDf <- parseLipidNames(c("PC 32:1","LPC 34:1","TG(18:1_18:0_16:1)"))
Finally, if you want to parse multiple lipid names and want to use one particular grammar, simply add its name as the “grammar” argument.
originalNames <- c("PC 32:1","LPC 34:1","TAG 18:1_18:0_16:1")
multipleLipidNamesWithGrammar <- parseLipidNames(originalNames, grammar = "Goslin")
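The note above on omitting the grammar argument can be checked with a rough timing comparison. This is only a sketch using base R's `system.time`: with three names the elapsed times may both be close to zero, and any difference is expected to become visible only for much longer name vectors.

```r
library(rgoslin)

originalNames <- c("PC 32:1", "LPC 34:1", "TAG 18:1_18:0_16:1")

# Let rgoslin try the available grammars in turn.
timeAllGrammars <- system.time(parseLipidNames(originalNames))

# Restrict parsing to a single, explicitly chosen grammar.
timeGoslinOnly <- system.time(parseLipidNames(originalNames, grammar = "Goslin"))

# Elapsed wall-clock time in seconds for both variants.
elapsed <- c(allGrammars = timeAllGrammars[["elapsed"]],
             goslinOnly = timeGoslinOnly[["elapsed"]])
elapsed
```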
LIPID MAPS has a number of fatty acids that use names following the IUPAC-IUB Fatty Acids nomenclature. These can now also be parsed and converted to the updated lipid shorthand nomenclature. We are using LMFA01020216 and LMFA08040030 as examples here:
originalNames <- c("LMFA01020216"="5-methyl-octadecanoic acid", "LMFA08040030"="N-((+/-)-8,9-dihydroxy-5Z,11Z,14Z-eicosatrienoyl)-ethanolamine")
normalizedFattyAcidsNames <- parseLipidNames(originalNames, "FattyAcids")
The Goslin parser also supports reading lipid shorthand names with adducts:
originalNames <- c("PC 32:1[M+H]1+", "PC 32:1 [M+H]+","PC 32:1")
lipidNamesWithAdduct <- parseLipidNames(originalNames, "Goslin")
This will populate the columns “Adduct” and “AdductCharge” with the respective values. Please note that we recommend writing the adduct and its charge in the full nomenclature recommended by IUPAC.
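The adduct-related columns can be inspected next to the normalized name. The sketch below looks the columns up by pattern instead of hard-coding their names, because the exact spelling of the column names in the returned data frame is an assumption here.

```r
library(rgoslin)

originalNames <- c("PC 32:1[M+H]1+", "PC 32:1 [M+H]+", "PC 32:1")
lipidNamesWithAdduct <- parseLipidNames(originalNames, "Goslin")

# Find the adduct-related columns by pattern rather than by exact name.
adductColumns <- grep("Adduct", colnames(lipidNamesWithAdduct), value = TRUE)

# Show the normalized name together with the adduct information.
lipidNamesWithAdduct[, c("Normalized.Name", adductColumns)]
```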
lipidr is a Bioconductor package with specific support for QC checking, uni- and multivariate analysis and visualization of lipidomics data acquired with Skyline or obtained from Metabolomics Workbench. It uses a custom implementation for lipid name handling that does not yet support the updated shorthand nomenclature.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("lipidr")
After installation of lipidr, we need to load the following libraries for this example:
library(rgoslin)
library(lipidr)
library(stringr)
library(ggplot2)
We will use lipidr’s workflow example to illustrate how to apply rgoslin in such a use case. We read in an example dataset exported from Skyline based on a targeted lipidomics experiment:
datadir = system.file("extdata", package="lipidr")
filelist = list.files(datadir, "data.csv", full.names = TRUE) # all csv files
d = read_skyline(filelist)
#> Successfully read 3 methods.
#> Your data contain 58 samples, 10 lipid classes, 277 lipid molecules.
clinical_file = system.file("extdata", "clin.csv", package="lipidr")
d = add_sample_annotation(d, clinical_file)
This dataset contains the lipid names in the Molecule column and the names preprocessed with lipidr in the clean_name column. In this excerpt, you can see PE 34:1 NEG in row 5, which indicates measurement of this lipid in negative mode. Please note that this is specific to this example and not a generally applied naming convention.
Now, let’s try to parse the clean lipid names to enrich the data table.
Note: In this case, we expect to see error messages, since some lipid names use a) unsupported head group names or b) unsupported suffixes to indicate isotopically labeled lipids.
lipidNames <- parseLipidNames(rowData(d)$clean_name)
#> Encountered an error while parsing 'PE 15:0-18:1(d7)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'PG 15:0-18:1(d7)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'PI 15:0-18:1(d7)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'So1P 17:1': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'So1P 18:1': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'SM 18:1(d9)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'PC 15:0-18:1(d7)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'LPC 18:1(d7)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'LPE 18:1(d7)': Expecting a single string value: [type=character; extent=4].
#> Encountered an error while parsing 'PC 15:0-18:1(d7)': Expecting a single string value: [type=character; extent=4].
We see that lipidr’s example dataset uses suffixes such as (d7) and (d9) to indicate isotopically labeled lipids (d=Deuterium) that are used as internal standards for quantification.
We need to convert Sa1P, sphinganine-1-phosphate, and So1P, sphingosine-1-phosphate, to SPBP to align with recent nomenclature updates.
# lipidr stores original lipid names in the Molecule column
old_names <- rowData(d)$Molecule
# split lipid prefix from potential (d7) suffix for labeled internal standards
new_names <- rowData(d)$clean_name %>% str_match(pattern="([\\w :-]+)(\\(\\w+\\))?")
# extract the first match group (the original word is at column index 1)
normalized_new_names <- new_names[,2] %>% str_replace_all(c("Sa1P"="SPBP","So1P"="SPBP")) %>% parseLipidNames(.)
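To make the splitting and renaming step easier to follow, here is the same regular expression applied to a few standalone names, without lipidr or rgoslin. The input vector is illustrative: three names are taken from the error messages above and "PC 32:1" is added as an unlabeled case.

```r
library(stringr)

# Three names from the error messages above, plus one unlabeled name.
cleanNames <- c("PE 15:0-18:1(d7)", "So1P 17:1", "SM 18:1(d9)", "PC 32:1")

# Column 1: full match, column 2: lipid name, column 3: optional label suffix
# (NA where the name carries no suffix).
parts <- str_match(cleanNames, pattern = "([\\w :-]+)(\\(\\w+\\))?")

# Keep the name without the label and apply the class renaming.
renamed <- str_replace_all(parts[, 2], c("Sa1P" = "SPBP", "So1P" = "SPBP"))
renamed
#> [1] "PE 15:0-18:1" "SPBP 17:1"    "SM 18:1"      "PC 32:1"
```

The label suffix stays available in the third column of `parts` in case it is needed to tell internal standards apart later.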
We will receive a number of warnings in the next step, since lipidr currently checks lipid class names against a predefined, internal list that does not contain the updated class names according to the new shorthand nomenclature. The updated Molecule names will, however, appear in downstream visualizations.
updated <- update_molecule_names(d, old_names, normalized_new_names$Normalized.Name)
#> Joining with `by = join_by(Molecule)`
#> Warning in annotate_lipids(updated_names$Molecule): Some lipid names couldn't be parsed because they don't follow the pattern 'CLS xx:x/yy:y'
#> PE O-34:0, PE O-34:1, PE O-34:2, PE O-36:0, PE O-36:1, PE O-36:2, PE O-36:3, PE O-36:4, PE O-36:5, PE O-38:0, PE O-38:1, PE O-38:2, PE O-38:3, PE O-38:4, PE O-38:5, PE O-40:5, PE O-40:6, PE O-40:7, PE O-32:1, PE O-34:3, PE O-38:6, PE O-38:7, PI O-34:1, SPBP(1) 18:0;OH, SPBP 17:1;O2, SPBP 18:1;O2, SM 18:1;O2, SM 18:0;O2, Cer 18:0;O2, Cer 18:1;O2, Cer 18:2;O2, PC O-32:0, PC O-34:1, PC O-34:2, PC O-34:3, PC O-36:2, PC O-36:3, PC O-36:4, PC O-36:5, PC O-38:0, PC O-38:2, PC O-38:3, PC O-38:4, PC O-38:5, PC O-40:0, PC O-40:1, PC O-40:2, PC O-40:3, PC O-40:4, PC O-40:5, PC O-40:6, PC O-40:7, PC O-32:1, PC O-34:4, PC O-36:6, PC O-38:1, PC O-38:6, PC O-38:7
#> Joining with `by = join_by(Molecule)`
We can augment the class column using rgoslin’s LipidMaps main class, since row order in rgoslin’s output is the same as in its input. Additionally, further information may be interesting to include, such as the mass and sum formula of the uncharged lipids. We will select the same lipid classes as in the lipidr targeted lipidomics vignette: Ceramides (Cer), Lyso-Phosphatidylcholines (LPC) and Phosphatidylcholines (PC).
rowData(updated)$Class <- normalized_new_names$Lipid.Maps.Main.Class
rowData(updated)$Category <- normalized_new_names$Lipid.Maps.Category
rowData(updated)$Molecule <- normalized_new_names$Normalized.Name
rowData(updated)$LipidSpecies <- normalized_new_names$Species.Name
rowData(updated)$Mass <- normalized_new_names$Mass
rowData(updated)$SumFormula <- normalized_new_names$Sum.Formula
# select Ceramides, Lyso-Phosphatidylcholines and Phosphatidylcholines (includes plasmanyls and plasmenyls)
lipid_classes <- rowData(updated)$Class %in% c("Cer","LPC", "PC")
d <- updated[lipid_classes,]
In the next step, we use a non-exported method provided by lipidr to convert the row data into a format more suitable for plotting with ggplot. We will plot the area distribution of lipid species as boxplots, colored by lipid class and facetted by filename, similar to some plot examples in lipidr’s vignette.
ddf <- lipidr:::to_long_format(d)
ggplot(data=ddf, mapping=aes(x=Molecule, y=Area, fill=Class)) +
  geom_boxplot() +
  facet_wrap(~filename, scales = "free_y") +
  scale_y_log10() +
  coord_flip()
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 15 rows containing non-finite values (`stat_boxplot()`).