In this tutorial, we demonstrate the use of CluMSID with a publicly available LC-MS/MS data set deposited on MetaboLights. We chose data set MTBLS433, which can be accessed on the MetaboLights web page (https://www.ebi.ac.uk/metabolights/MTBLS433) and which has been published in the following article:
Kalogiouri, N. P., Alygizakis, N. A., Aalizadeh, R., & Thomaidis, N. S. (2016). Olive oil authenticity studies by target and nontarget LC–QTOF-MS combined with advanced chemometric techniques. Analytical and bioanalytical chemistry, 408(28), 7955-7970.
The authors analysed olive oil of various provenance using reversed-phase ultra-high-performance liquid chromatography–electrospray ionisation quadrupole time-of-flight tandem mass spectrometry in negative mode with auto-MS/MS fragmentation.
As a representative pooled sample is not provided, we will combine MS2 data from several runs and use the peak picking done by the authors of the study for the merging of MS2 spectra. Some metabolite annotations are also included in the MTBLS433 data set which we will integrate into our analysis.
For demonstration purposes, not all files from the data set will be included in the analysis. Four data files have been chosen that represent olive oil samples from different regions in Greece:
YH1_GA7_01_10463.mzML: YH1, from Komi
AX1_GB5_01_10470.mzML: AX1, from Megaloxori
LP1_GB3_01_10467.mzML: LP1, from Moria
BR1_GB6_01_10471.mzML: BR1, from Agia Paraskevi
Note that these are mzML files, which can be processed in exactly the same way as mzXML files.
Furthermore, we would like to use the peak picking and annotation data from the original authors, which we can read from the file m_mtbls433_metabolite_profiling_mass_spectrometry_v2_maf.tsv.
First, we extract MS2 spectra from the respective files separately using extractMS2spectra(). Then, we combine the resulting lists into one list using base R functionality:
library(CluMSID)  # provides extractMS2spectra()

# Locate the four example files shipped with the CluMSIDdata package
YH1 <- system.file("extdata", "YH1_GA7_01_10463.mzML", package = "CluMSIDdata")
AX1 <- system.file("extdata", "AX1_GB5_01_10470.mzML", package = "CluMSIDdata")
LP1 <- system.file("extdata", "LP1_GB3_01_10467.mzML", package = "CluMSIDdata")
BR1 <- system.file("extdata", "BR1_GB6_01_10471.mzML", package = "CluMSIDdata")

# Extract the MS2 spectra from each file separately
YH1list <- extractMS2spectra(YH1)
AX1list <- extractMS2spectra(AX1)
LP1list <- extractMS2spectra(LP1)
BR1list <- extractMS2spectra(BR1)

# Combine the four lists into one
raw_oillist <- c(YH1list, AX1list, LP1list, BR1list)
First, we import the peak list by reading the respective table and filtering for the relevant information. We only need the columns mass_to_charge, retention_time and metabolite_identification, and we would like to replace "unknown" with an empty field in the metabolite_identification column. In addition, the features do not have a unique identifier in the table, but we can easily generate one from m/z and RT. Note that the retention time is given in seconds in the raw data and in minutes in the data table, so we have to convert. For the sake of consistency, we also change the column names. We use tidyverse syntax, but users can do as they prefer.
raw_mtbls_df <- system.file("extdata",
    "m_mtbls433_metabolite_profiling_mass_spectrometry_v2_maf.tsv",
    package = "CluMSIDdata")

library(magrittr)  # provides the %>% pipe

mtbls_df <- readr::read_delim(raw_mtbls_df, "\t") %>%
    # replace the placeholder "unknown" with an empty string
    dplyr::mutate(metabolite_identification =
        stringr::str_replace(metabolite_identification, "unknown", "")) %>%
    # build the feature id while the retention time is still in minutes
    dplyr::mutate(id = paste0("M", mass_to_charge, "T", retention_time)) %>%
    # convert the retention time from minutes to seconds
    dplyr::mutate(retention_time = retention_time * 60) %>%
    dplyr::select(id, mass_to_charge, retention_time, metabolite_identification) %>%
    dplyr::rename(mz = mass_to_charge,
                  rt = retention_time,
                  annotation = metabolite_identification)
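The order of the steps matters here: the identifier is built while the retention time is still in minutes, and the conversion to seconds happens afterwards. The following base R sketch replays the same steps on a made-up two-row table. The m/z and retention time values are taken from feature identifiers quoted later in this tutorial, but the "unknown" annotations and the object name `toy` are assumptions for illustration only.

```r
# Toy peak table; retention time is in minutes, annotations are assumed
toy <- data.frame(
  mass_to_charge = c(339.2125, 563.5254),
  retention_time = c(15.32, 13.93),
  metabolite_identification = c("unknown", "unknown"),
  stringsAsFactors = FALSE
)

# Replace the placeholder "unknown" with an empty string
toy$metabolite_identification[toy$metabolite_identification == "unknown"] <- ""

# Build the feature id from m/z and RT while RT is still in minutes
toy$id <- paste0("M", toy$mass_to_charge, "T", toy$retention_time)

# Only afterwards convert RT from minutes to seconds
toy$retention_time <- toy$retention_time * 60

# Reorder and rename the columns to id, mz, rt, annotation
toy <- toy[, c("id", "mass_to_charge", "retention_time",
               "metabolite_identification")]
names(toy) <- c("id", "mz", "rt", "annotation")
toy
```

Because the identifier is created before the unit conversion, it keeps the minute-based retention time, while the `rt` column is in seconds like the raw data.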
This peak list, or rather its first three columns, can now be used to merge spectra. We exclude spectra that do not match any of the peaks in the peak list. As we are not very familiar with the instrumental setup, we set the limits for retention time and m/z deviation a little wider. To make an educated guess about mass accuracy, we take a look at an identified metabolite, its measured m/z and its theoretical m/z. We use arachidic acid [M-H]-, whose theoretical m/z is 311.2956:
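The comparison boils down to a relative m/z difference expressed in ppm. A minimal sketch of that calculation follows; the helper `ppm_error()` and the measured value are hypothetical stand-ins, since in the actual workflow the measured m/z would be looked up in the arachidic acid row of mtbls_df.

```r
# Relative deviation of a measured m/z from its theoretical value, in ppm
ppm_error <- function(measured, theoretical) {
  abs(measured - theoretical) / theoretical * 1e6
}

theoretical_mz <- 311.2956  # arachidic acid [M-H]-, as stated in the text
measured_mz <- 311.3040     # hypothetical measured value, for illustration

ppm_error(measured_mz, theoretical_mz)
```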
So, we will work with an m/z tolerance of 30 ppm (which seems rather high for a high-resolution mass spectrometer).
A mass accuracy of 30 ppm necessitates an m/z tolerance of 60 ppm, because deviations can go in both directions:
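To illustrate why deviations in both directions widen the required window, the sketch below compares two hypothetical measurements of the same ion, one 25 ppm above and one 25 ppm below the theoretical m/z. The helper `ppm_diff()` and the chosen deviations are assumptions for illustration and not part of CluMSID.

```r
# Relative difference between two m/z values in ppm, relative to their mean
ppm_diff <- function(mz1, mz2) {
  abs(mz1 - mz2) / mean(c(mz1, mz2)) * 1e6
}

theoretical <- 311.2956
measured_high <- theoretical * (1 + 25e-6)  # 25 ppm too high
measured_low  <- theoretical * (1 - 25e-6)  # 25 ppm too low

# Each value is within 30 ppm of the theoretical m/z, but the two
# measurements are about 50 ppm apart from each other
ppm_diff(measured_high, measured_low)
ppm_diff(measured_high, measured_low) <= 30  # would not be matched
ppm_diff(measured_high, measured_low) <= 60  # would be matched
```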
To add annotations, we use mtbls_df as well, as described in the General Tutorial:
For the generation of the distance matrix, we likewise use an m/z tolerance of 30 ppm:
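To show where an m/z tolerance enters a spectral distance, here is a simplified base R sketch of a cosine similarity between two centroided spectra in which fragment peaks are only paired if their m/z values agree within a relative tolerance. This is an illustration under stated assumptions, not CluMSID's implementation: the function `cosine_similarity()`, the greedy matching strategy and the toy spectra are all made up for this example.

```r
# Cosine similarity between two spectra given as two-column matrices
# (column 1: m/z, column 2: intensity). Each peak of spec1 is paired with
# the closest unused peak of spec2 if their relative m/z deviation is at
# most mz_tolerance (3e-5 corresponds to 30 ppm).
cosine_similarity <- function(spec1, spec2, mz_tolerance = 3e-5) {
  matched <- numeric(nrow(spec1))  # intensity of the partner peak, 0 if none
  used <- rep(FALSE, nrow(spec2))
  for (i in seq_len(nrow(spec1))) {
    rel_dev <- abs(spec2[, 1] - spec1[i, 1]) / spec1[i, 1]
    rel_dev[used] <- Inf
    j <- which.min(rel_dev)
    if (length(j) == 1 && rel_dev[j] <= mz_tolerance) {
      matched[i] <- spec2[j, 2]
      used[j] <- TRUE
    }
  }
  sum(spec1[, 2] * matched) /
    (sqrt(sum(spec1[, 2]^2)) * sqrt(sum(spec2[, 2]^2)))
}

# Toy spectra: spec_b is spec_a shifted by 10 ppm, spec_c shares no peaks
spec_a <- cbind(mz = c(59.0139, 183.0120, 311.2956), int = c(20, 50, 100))
spec_b <- cbind(mz = spec_a[, "mz"] * (1 + 10e-6), int = spec_a[, "int"])
spec_c <- cbind(mz = c(75.0200, 150.0400), int = c(80, 100))

cosine_similarity(spec_a, spec_b)      # peaks match within 30 ppm
cosine_similarity(spec_a, spec_c)      # no peaks in common
1 - cosine_similarity(spec_a, spec_b)  # a distance derived from the similarity
```

With a tolerance that is too narrow, the shifted peaks in `spec_b` would no longer be paired and the two spectra would wrongly appear dissimilar, which is why the tolerance should reflect the mass accuracy of the data.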
To explore the data, we have a look at a cluster dendrogram:
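For readers who want to see the principle behind such a dendrogram, the following sketch runs hierarchical clustering on a small, made-up distance matrix using only base R. The feature names, the distance values, the average linkage and the cut height are assumptions for illustration; they do not come from the olive oil data and may differ from what CluMSID uses internally.

```r
# Hypothetical distance matrix (1 - similarity) for five made-up features
feature_names <- paste0("feature", 1:5)
d <- matrix(c(
  0.00, 0.10, 0.15, 0.90, 0.95,
  0.10, 0.00, 0.20, 0.85, 0.92,
  0.15, 0.20, 0.00, 0.88, 0.97,
  0.90, 0.85, 0.88, 0.00, 0.05,
  0.95, 0.92, 0.97, 0.05, 0.00
), nrow = 5, byrow = TRUE, dimnames = list(feature_names, feature_names))

# Agglomerative clustering with average linkage (an assumed choice)
hc <- hclust(as.dist(d), method = "average")

# Cut the tree at an arbitrary height to obtain cluster memberships
clusters <- cutree(hc, h = 0.5)
clusters

# Draw the dendrogram when working interactively
if (interactive()) plot(hc, main = "Toy dendrogram", xlab = "", sub = "")
```

Features with similar spectra (small distances) end up on neighbouring branches, which is the property exploited when inspecting the neighbours of annotated metabolites.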
Since it was not the focus of their study, the authors identified only a few metabolites. If we look at the positions of these metabolites in the cluster dendrogram, we see that the poly-unsaturated fatty acids, such as alpha-linolenic acid, are nicely separated from the saturated fatty acids stearic acid and arachidic acid. We would expect the latter two to cluster together, but a look at the spectra reveals that the stearic acid spectrum shows barely any fragment ions and mainly contains the unfragmented [M-H]- parent ion:
In contrast, arachidic acid produces a much richer spectrum:
Inspecting the features that cluster close to arachidic acid shows that many of them have an exact m/z that is consistent with other fatty acids of different chain length or saturation (within the m/z tolerance), e.g. the neighbouring feature M339.2125T15.32, which could be arachidonic acid [M+Cl]-.
Looking at oleic acid [M-H]-, we see that it clusters very closely to M563.5254T13.93, whose m/z is consistent with oleic acid [2M-H]- and some other possible adducts.
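Adduct hypotheses like these can be checked with a short calculation. The sketch below computes theoretical m/z values from monoisotopic atomic masses and compares them with the feature m/z values quoted in the text. The helper `mono_mass()` and the variable names are hypothetical; the molecular formulas used are C18H34O2 for oleic acid and C20H32O2 for arachidonic acid.

```r
# Monoisotopic masses (in u) of the atoms involved, plus the electron mass
atom_mass <- c(C = 12.000000, H = 1.00782503, O = 15.99491462,
               Cl = 34.96885268)
electron_mass <- 0.00054858
proton_mass <- atom_mass[["H"]] - electron_mass

# Monoisotopic mass of a molecule from a named vector of atom counts
mono_mass <- function(formula) sum(atom_mass[names(formula)] * formula)

oleic_acid <- mono_mass(c(C = 18, H = 34, O = 2))
arachidonic_acid <- mono_mass(c(C = 20, H = 32, O = 2))

# Theoretical m/z of the singly charged negative ions discussed in the text
adducts <- c(
  "oleic acid [M-H]-" = oleic_acid - proton_mass,
  "oleic acid [2M-H]-" = 2 * oleic_acid - proton_mass,
  "arachidonic acid [M+Cl]-" =
    arachidonic_acid + atom_mass[["Cl"]] + electron_mass
)

# Feature m/z values quoted in the text
observed <- c("oleic acid [2M-H]-" = 563.5254,
              "arachidonic acid [M+Cl]-" = 339.2125)

# Deviation of the observed from the theoretical m/z in ppm
theoretical <- adducts[names(observed)]
ppm <- abs(observed - theoretical) / theoretical * 1e6

round(adducts, 4)
round(ppm, 1)
```

The resulting ppm deviations can then be judged against the mass accuracy estimated earlier from arachidic acid.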
As a last example, the only identified metabolite that does not belong to the class of fatty acids is acetosyringone, a phenolic secondary plant metabolite. It forms part of a rather dense cluster in the dendrogram, suggesting high spectral similarity to the other members of the cluster. It would be interesting to try to annotate more of these metabolites to find out whether they are also phenolic compounds.
In conclusion, we demonstrated how to use CluMSID with a publicly available data set from the MetaboLights repository and how to include external information such as peak lists or feature annotations in a CluMSID workflow. In doing so, we looked at a few example findings that could help to annotate more of the features in the data set and thereby showed the usefulness of CluMSID for samples very different from the ones in the other tutorials.
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#>  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#>  LC_TIME=en_GB LC_COLLATE=C
#>  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#>  LC_PAPER=en_US.UTF-8 LC_NAME=C
#>  LC_ADDRESS=C LC_TELEPHONE=C
#>  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#>  stats4 parallel stats graphics grDevices utils datasets
#>  methods base
#>
#> other attached packages:
#>  magrittr_2.0.1 metaMSdata_1.27.0 metaMS_1.28.0
#>  CAMERA_1.48.0 xcms_3.14.0 MSnbase_2.18.0
#>  ProtGenerics_1.24.0 S4Vectors_0.30.0 mzR_2.26.0
#>  Rcpp_1.0.6 BiocParallel_1.26.0 Biobase_2.52.0
#>  BiocGenerics_0.38.0 CluMSIDdata_1.7.0 CluMSID_1.8.0
#>
#> loaded via a namespace (and not attached):
#>  backports_1.2.1 Hmisc_4.5-0
#>  plyr_1.8.6 igraph_1.2.6
#>  lazyeval_0.2.2 splines_4.1.0
#>  GenomeInfoDb_1.28.0 ggplot2_3.3.3
#>  digest_0.6.27 foreach_1.5.1
#>  htmltools_0.5.1.1 fansi_0.4.2
#>  checkmate_2.0.0 rle_0.9.2
#>  cluster_2.1.2 doParallel_1.0.16
#>  limma_3.48.0 readr_1.4.0
#>  sna_2.6 matrixStats_0.58.0
#>  jpeg_0.1-8.1 colorspace_2.0-1
#>  xfun_0.23 dplyr_1.0.6
#>  crayon_1.4.1 RCurl_1.98-1.3
#>  jsonlite_1.7.2 graph_1.70.0
#>  impute_1.66.0 survival_3.2-11
#>  iterators_1.0.13 ape_5.5
#>  glue_1.4.2 gtable_0.3.0
#>  zlibbioc_1.38.0 XVector_0.32.0
#>  DelayedArray_0.18.0 DEoptimR_1.0-8
#>  scales_1.1.1 vsn_3.60.0
#>  DBI_1.1.1 GGally_2.1.1
#>  viridisLite_0.4.0 htmlTable_2.2.1
#>  clue_0.3-59 foreign_0.8-81
#>  preprocessCore_1.54.0 Formula_1.2-4
#>  MsCoreUtils_1.4.0 htmlwidgets_1.5.3
#>  httr_1.4.2 gplots_3.1.1
#>  RColorBrewer_1.1-2 ellipsis_0.3.2
#>  farver_2.1.0 pkgconfig_2.0.3
#>  reshape_0.8.8 XML_3.99-0.6
#>  nnet_7.3-16 sass_0.4.0
#>  utf8_1.2.1 tidyselect_1.1.1
#>  rlang_0.4.11 munsell_0.5.0
#>  tools_4.1.0 cli_2.5.0
#>  dbscan_1.1-8 generics_0.1.0
#>  statnet.common_4.4.1 evaluate_0.14
#>  stringr_1.4.0 mzID_1.30.0
#>  yaml_2.2.1 knitr_1.33
#>  robustbase_0.93-7 caTools_1.18.2
#>  purrr_0.3.4 RANN_2.6.1
#>  ncdf4_1.17 RBGL_1.68.0
#>  nlme_3.1-152 compiler_4.1.0
#>  rstudioapi_0.13 plotly_4.9.3
#>  png_0.1-7 affyio_1.62.0
#>  MassSpecWavelet_1.58.0 tibble_3.1.2
#>  bslib_0.2.5.1 stringi_1.6.2
#>  ps_1.6.0 highr_0.9
#>  lattice_0.20-44 Matrix_1.3-3
#>  vctrs_0.3.8 pillar_1.6.1
#>  lifecycle_1.0.0 BiocManager_1.30.15
#>  jquerylib_0.1.4 MALDIquant_1.19.3
#>  data.table_1.14.0 bitops_1.0-7
#>  GenomicRanges_1.44.0 R6_2.5.0
#>  latticeExtra_0.6-29 pcaMethods_1.84.0
#>  affy_1.70.0 network_1.16.1
#>  KernSmooth_2.23-20 gridExtra_2.3
#>  IRanges_2.26.0 codetools_0.2-18
#>  MASS_7.3-54 gtools_3.8.2
#>  assertthat_0.2.1 SummarizedExperiment_1.22.0
#>  GenomeInfoDbData_1.2.6 hms_1.1.0
#>  grid_4.1.0 rpart_4.1-15
#>  tidyr_1.1.3 coda_0.19-4
#>  rmarkdown_2.8 MatrixGenerics_1.4.0
#>  base64enc_0.1-3