1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("pmp")
library(pmp)
library(SummarizedExperiment)
library(S4Vectors)

2 Introduction

Metabolomics data processing workflows consist of multiple steps including peak picking or raw data processing, quality assurance, missing value imputation, normalisation and scaling. Several tools (commercial, R and non-R based) are commonly used for raw data processing which generate outputs in the form of a two dimensional data matrix and meta data.

These outputs contain hundreds or thousands of so called uninformative or unreproducible features. Such features could strongly hinder outputs of subsequent statistical analysis, biomarker discovery or metabolic pathway inference. Common practice is to apply peak matrix validation and filtering procedures as described in Di Guida et al. (2016), Broadhurst et al. (2018) and Schiffman et al. (2019).

Functions within pmp package are designed to help users to prepare data for further statistical data analysis in fast, easy to use and reproducible manner.

This document showcases the commonly used peak matrix processing steps of metabolomics data sets.

3 Data formats

Recent review for R packages in metabolomics (Stanstrup et al. 2019) covers a broad range of heterogenous tools is availiable as part of Bioconductor sofware collection or on CRAN, Github and similar public repositories. pmp package utilises SummarizedExperiment class from Bioconductor for data input and output.

For example, outputs from widely used xcms package can be relatively easy converted to SummarizedExperiment object using functions featureDefinitions, featureValues and pData on xcms output object.

Additioanlly pmp supports to input data to be any matrix-like R data structure (e.g. and ordinary matrix, a data frame). If input if a matrix-like structure tools from pmp package will perform several checks for data integrity as well. Please see section 10 for more details.

4 Example data set, MTBLS79

In this tutorial we will be using an direct infusion mass spectrometry (DIMS) data set consisting of 172 samples measured across 8 batches and is included in pmp package as SummarizedExperiemnt class object MTBLS79. More detailed description of the data set is available from Kirwan et al. (2014), MTBLS79 and R man page.

help ("MTBLS79")
data(MTBLS79)
MTBLS79
#> class: SummarizedExperiment 
#> dim: 2488 172 
#> metadata(0):
#> assays(1): ''
#> rownames(2488): 70.03364 70.03375 ... 569.36369 572.36537
#> rowData names(0):
#> colnames(172): batch01_QC01 batch01_QC02 ... Batch08_C09 Batch08_QC39
#> colData names(4): Batch Sample_Rep Class Class2

This data set before peak matrix filtering contains 172 samples, 2488 features and 18222 missing values across all samples what is roughly around 4.2%.

sum(is.na(assay(MTBLS79)))
#> [1] 18222
sum(is.na(assay(MTBLS79))) / length(assay(MTBLS79)) * 100
#> [1] 4.258113

5 Filtering a data set

Missing values in the data set can be filtered across samples or features. The command below will remove all samples with more than 10 % missing values.

MTBLS79_filtered <- filter_samples_by_mv(df=MTBLS79, max_perc_mv=0.1)

MTBLS79_filtered
#> class: SummarizedExperiment 
#> dim: 2488 170 
#> metadata(2): original_data_structure processing_history
#> assays(1): ''
#> rownames(2488): 70.03364 70.03375 ... 569.36369 572.36537
#> rowData names(0):
#> colnames(170): batch01_QC01 batch01_QC02 ... Batch08_C09 Batch08_QC39
#> colData names(6): Batch Sample_Rep ... filter_samples_by_mv_perc
#>   filter_samples_by_mv_flags

sum(is.na(assay(MTBLS79_filtered)))
#> [1] 17588

Missing values sample filter has removed two samples from the initial data set. Outputs from any pmp function can be used as inputs for another function. For example we can apply missing value filter across features on the output of the previous command. Command below will filter only within quality control (QC) sample group.

MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9, 
    classes=MTBLS79_filtered$Class, method="QC", qc_label="QC")

MTBLS79_filtered
#> class: SummarizedExperiment 
#> dim: 2228 170 
#> metadata(2): original_data_structure processing_history
#> assays(1): ''
#> rownames(2228): 70.03364 70.03375 ... 559.15128 569.36369
#> rowData names(2): fraction_QC fraction_flag_QC
#> colnames(170): batch01_QC01 batch01_QC02 ... Batch08_C09 Batch08_QC39
#> colData names(6): Batch Sample_Rep ... filter_samples_by_mv_perc
#>   filter_samples_by_mv_flags

sum(is.na(assay(MTBLS79_filtered)))
#> [1] 11518

Similarly as we did before, we can add another filter on previous result. At this we will use the same filter, but now missing values wil be calculated across all samples and not only within “QC” group.

MTBLS79_filtered <- filter_peaks_by_fraction(df=MTBLS79_filtered, min_frac=0.9, 
    classes=MTBLS79_filtered$Class, method="across")

MTBLS79_filtered
#> class: SummarizedExperiment 
#> dim: 1957 170 
#> metadata(2): original_data_structure processing_history
#> assays(1): ''
#> rownames(1957): 70.03364 70.03375 ... 559.15128 569.36369
#> rowData names(4): fraction_QC fraction_flag_QC fraction fraction_flags
#> colnames(170): batch01_QC01 batch01_QC02 ... Batch08_C09 Batch08_QC39
#> colData names(6): Batch Sample_Rep ... filter_samples_by_mv_perc
#>   filter_samples_by_mv_flags

sum(is.na(assay(MTBLS79_filtered)))
#> [1] 4779

Applying these 3 filters has reduced number of missing values from 18222 to 4779.

Commonly used approach in metabolomics studies is to filter features by the by coefficient of variation (CV) or RSD% of QC samples.Example below will use 30% threshold.

MTBLS79_filtered <- filter_peaks_by_rsd(df=MTBLS79_filtered, max_rsd=30, 
    classes=MTBLS79_filtered$Class, qc_label="QC")

MTBLS79_filtered
#> class: SummarizedExperiment 
#> dim: 1386 170 
#> metadata(2): original_data_structure processing_history
#> assays(1): ''
#> rownames(1386): 70.03375 70.03413 ... 527.18345 559.15128
#> rowData names(6): fraction_QC fraction_flag_QC ... rsd_QC rsd_flags
#> colnames(170): batch01_QC01 batch01_QC02 ... Batch08_C09 Batch08_QC39
#> colData names(6): Batch Sample_Rep ... filter_samples_by_mv_perc
#>   filter_samples_by_mv_flags

sum(is.na(assay(MTBLS79_filtered)))
#> [1] 3040

6 Processing history

Every funcition in pmp provides history of applied parameter values. If user has saved outputs from R sessesion, it’s easy to check what commands were executed.

processing_history(MTBLS79_filtered)
#> $filter_samples_by_mv
#> $filter_samples_by_mv$max_perc_mv
#> [1] 0.1
#> 
#> $filter_samples_by_mv$remove_samples
#> [1] TRUE
#> 
#> 
#> $filter_peaks_by_fraction_QC
#> $filter_peaks_by_fraction_QC$min_frac
#> [1] 0.9
#> 
#> $filter_peaks_by_fraction_QC$method
#> [1] "QC"
#> 
#> $filter_peaks_by_fraction_QC$qc_label
#> [1] "QC"
#> 
#> $filter_peaks_by_fraction_QC$remove_peaks
#> [1] TRUE
#> 
#> 
#> $filter_peaks_by_fraction_across
#> $filter_peaks_by_fraction_across$min_frac
#> [1] 0.9
#> 
#> $filter_peaks_by_fraction_across$method
#> [1] "across"
#> 
#> $filter_peaks_by_fraction_across$qc_label
#> [1] "QC"
#> 
#> $filter_peaks_by_fraction_across$remove_peaks
#> [1] TRUE
#> 
#> 
#> $filter_peaks_by_rsd
#> $filter_peaks_by_rsd$max_rsd
#> [1] 30
#> 
#> $filter_peaks_by_rsd$qc_label
#> [1] "QC"
#> 
#> $filter_peaks_by_rsd$remove_peaks
#> [1] TRUE

7 Data normalisation

Probabilistic quotient normalisation (PQN) and normalisation the the total signal intensity methods are implemented for normalisation of biological variability across measured samples. Example below demonstrates how to apply PQN method.

MTBLS79_pqn_normalised <- pqn_normalisation(df=MTBLS79_filtered, 
    classes=MTBLS79_filtered$Class, qc_label="QC")

8 Missing value imputation

Several commonly used missing value imputation algorithms. Supported methods are k-nearest neighbours (knn), random forests (rf), Bayesian PCA missing value estimator (bpca), mean or median value of the given feature and constant small value. Within mv_imputaion interface user can easily apply different mehtod without worrying about input data type or tranposing data set.

MTBLS79_mv_imputed <- mv_imputation(df=MTBLS79_pqn_normalised,
    method="knn")

9 Data scaling

Variance stabilising generalised logarithm transformation (glog) algorithm is implimented to help to minimise contributions from unwanted technical variaton of sample collection.

MTBLS79_glog <- glog_transformation(df=MTBLS79_mv_imputed,
    classes=MTBLS79_filtered$Class, qc_label="QC")

glog_transformation function uses QC samples to optimse scaling factor lambda. Using function glog_plot_plot_optimised_lambda it’s possibe to visualise if optimsation of the given parameter has converged at the minima.

opt_lambda <- 
    processing_history(MTBLS79_glog)$glog_transformation$lambda_opt
glog_plot_optimised_lambda(df=MTBLS79_mv_imputed,
    optimised_lambda=opt_lambda,
    classes=MTBLS79_filtered$Class, qc_label="QC")

10 Data integrity check and endomorphisms

Function in pmp package are designed to validate input data if user chose not to use SummarizedExperiment class objet. For example, if input is matrix with features stored in columns and sample in rows, any function of pmp package will be able to handle this object.

peak_matrix <- t(assay(MTBLS79))
sample_classes <- MTBLS79$Class

class(peak_matrix)
#> [1] "matrix" "array"
dim(peak_matrix)
#> [1]  172 2488
class(sample_classes)
#> [1] "character"

Let’s try to use these objects as input for mv_imputation and filter_peaks_by_rsd.

mv_imputed <- mv_imputation(df=peak_matrix, method="mn")
#> Warning in check_peak_matrix(df = df, classes = classes): Peak table was transposed to have features as rows and samples
#>     in columns. 
#> 
#>     There were no class labels available please check that peak table is 
#> 
#>     still properly rotated. 
#> 
#>     Use 'check_df=FALSE' to keep original peak matrix orientation.
rsd_filtered <- filter_peaks_by_rsd(df=mv_imputed, max_rsd=30, 
    classes=sample_classes, qc_label="QC")

class (mv_imputed)
#> [1] "matrix" "array"
dim (mv_imputed)
#> [1] 2488  172

class (rsd_filtered)
#> [1] "matrix" "array"
dim (rsd_filtered)
#> [1] 1630  172

Note that pmp has automatically transposed input object to use largest dimension as features, while original R data type matrix has been retained also for function output.

11 Session information

sessionInfo()
#> R Under development (unstable) (2020-03-12 r77934)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.17.3 DelayedArray_0.13.7        
#>  [3] BiocParallel_1.21.2         matrixStats_0.56.0         
#>  [5] Biobase_2.47.3              GenomicRanges_1.39.2       
#>  [7] GenomeInfoDb_1.23.13        IRanges_2.21.4             
#>  [9] S4Vectors_0.25.13           BiocGenerics_0.33.0        
#> [11] pmp_0.99.3                  BiocStyle_2.15.6           
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3             lattice_0.20-40        assertthat_0.2.1      
#>  [4] digest_0.6.25          foreach_1.4.8          R6_2.4.1              
#>  [7] plyr_1.8.6             evaluate_0.14          ggplot2_3.3.0         
#> [10] pillar_1.4.3           itertools_0.1-3        zlibbioc_1.33.1       
#> [13] rlang_0.4.5            magick_2.3             Matrix_1.2-18         
#> [16] missForest_1.4         rmarkdown_2.1          labeling_0.3          
#> [19] stringr_1.4.0          RCurl_1.98-1.1         munsell_0.5.0         
#> [22] compiler_4.0.0         xfun_0.12              pkgconfig_2.0.3       
#> [25] pcaMethods_1.79.1      htmltools_0.4.0        tidyselect_1.0.0      
#> [28] tibble_2.1.3           GenomeInfoDbData_1.2.2 bookdown_0.18         
#> [31] codetools_0.2-16       randomForest_4.6-14    crayon_1.3.4          
#> [34] dplyr_0.8.5            bitops_1.0-6           grid_4.0.0            
#> [37] gtable_0.3.0           lifecycle_0.2.0        magrittr_1.5          
#> [40] scales_1.1.0           stringi_1.4.6          impute_1.61.0         
#> [43] farver_2.0.3           XVector_0.27.1         reshape2_1.4.3        
#> [46] iterators_1.0.12       tools_4.0.0            glue_1.3.2            
#> [49] purrr_0.3.3            yaml_2.2.1             colorspace_1.4-1      
#> [52] BiocManager_1.30.10    knitr_1.28

References

Broadhurst, David, Royston Goodacre, Stacey N Reinke, Julia Kuligowski, Ian D Wilson, Matthew R Lewis, and Warwick B Dunn. 2018. “Guidelines and Considerations for the Use of System Suitability and Quality Control Samples in Mass Spectrometry Assays Applied in Untargeted Clinical Metabolomic Studies.” Metabolomics 14 (6). Springer:72.

Di Guida, Riccardo, Jasper Engel, J William Allwood, Ralf JM Weber, Martin R Jones, Ulf Sommer, Mark R Viant, and Warwick B Dunn. 2016. “Non-Targeted Uhplc-Ms Metabolomic Data Processing Methods: A Comparative Investigation of Normalisation, Missing Value Imputation, Transformation and Scaling.” Metabolomics 12 (5). Springer:93.

Kirwan, Jennifer A, Ralf JM Weber, David I Broadhurst, and Mark R Viant. 2014. “Direct Infusion Mass Spectrometry Metabolomics Dataset: A Benchmark for Data Processing and Quality Control.” Scientific Data 1. Nature Publishing Group:140012.

Schiffman, Courtney, Lauren Petrick, Kelsi Perttula, Yukiko Yano, Henrik Carlsson, Todd Whitehead, Catherine Metayer, Josie Hayes, Stephen Rappaport, and Sandrine Dudoit. 2019. “Filtering Procedures for Untargeted LC-MS Metabolomics Data.” BMC Bioinformatics 20 (1):334. https://doi.org/10.1186/s12859-019-2871-9.

Stanstrup, Jan, Corey D. Broeckling, Rick Helmus, Nils Hoffmann, Ewy Mathé, Thomas Naake, Luca Nicolotti, et al. 2019. “The metaRbolomics Toolbox in Bioconductor and Beyond.” Metabolites 9 (10). https://doi.org/10.3390/metabo9100200.