1 Introduction

The qmtools package provides basic tools for imputation, normalization, and dimension-reduction of metabolomics data with the standard SummarizedExperiment class. It also offers several helper functions to assist visualization of data. This vignette gives brief descriptions of these tools with toy examples.

2 Installation

The package can be installed using BiocManager. In R session, please type BiocManager::install("qmtools").

3 Data preparation

To demonstrate the use of the qmtools functions, we will use the FAAH knockout LC/MS data, containing quantified LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice.

library(qmtools)
library(SummarizedExperiment)
library(vsn)
library(pls)
library(ggplot2)
library(patchwork)
set.seed(1e8)

data(faahko_se)

## Only keep the first assay for the vignette
assays(faahko_se)[2:4] <- NULL
faahko_se
#> class: SummarizedExperiment 
#> dim: 206 12 
#> metadata(0):
#> assays(1): raw
#> rownames(206): FT001 FT002 ... FT205 FT206
#> rowData names(10): mzmed mzmin ... WT peakidx
#> colnames(12): ko15.CDF ko16.CDF ... wt21.CDF wt22.CDF
#> colData names(2): sample_name sample_group

4 Feature filtering

Metabolomics data often contains a large number of uninformative features that can hinder downstream analysis. The removeFeatures function attempts to identify such features and remove them from the data based on missing values, quality control (QC) replicates, and blank samples with the following methods:

  • Proportions of missing values: retain features if there is at least one group with a proportion of non-missing values above a cut-off.

  • Relative standard deviation: remove features if QC replicates show low reproducibility.

  • Intraclass correlation coefficient (ICC): retain features if a feature has relatively high variability across biological samples compared to QC replicates.

  • QC/blank ratio: remove features with low abundance that may have non-biological origin.

The FAAH knockout data does not include QC and blank samples. Here, we just illustrate missing value-based filtering.

dim(faahko_se) # 206 features
#> [1] 206  12
table(colData(faahko_se)$sample_group)
#> 
#> KO WT 
#>  6  6

## Missing value filter based on 2 groups.
dim(removeFeatures(faahko_se, i = "raw",
                   group = colData(faahko_se)$sample_group,
                   cut = 0.80)) # nothing removed
#> [1] 206  12

dim(removeFeatures(faahko_se, i = "raw",
                   group = colData(faahko_se)$sample_group,
                   cut = 0.85)) # removed 65 features
#> [1] 141  12

## based on "WT" only
dim(removeFeatures(faahko_se, i = "raw",
                   group = colData(faahko_se)$sample_group,
                   levels = "WT", cut = 0.85))
#> [1] 104  12

In this vignette, we kept all features based on the cut-off: at least one group contains >= 80% of non-missing values.

5 Imputation

Missing values are common in metabolomics data. For example, ions may have a low abundance that does not reach the limit of detection of the instrument. Unexpected stochastic fluctuations and technical error may also cause missing values even though ions present at detectable levels. We could use the plotMiss function to explore the mechanisms generating the missing values. The bar plot on the left panel shows the amount of missing values in each samples and the right panel helps to identify the structure of missing values with a hierarchically-clustered heatmap.

## Sample group information
g <- factor(colData(faahko_se)$sample_group, levels = c("WT", "KO"))

## Visualization of missing values
plotMiss(faahko_se, i = "raw", group = g)

Overall, the knockout mice have a higher percentage of missing values. The features on top of the heatmap in general only present at the knockout mice, suggesting that some of missing values are at least not random (perhaps due to altered metabolisms by the experimental condition). In almost all cases, visualization and inspection of missing values are a time-intensive step, but greatly improve the ability to uncover the nature of missing values in data and help to choose an appropriate imputation method.

The imputation of missing values can be done with the imputeIntensity function. Several imputation methods are available such as k-Nearest Neighbor (kNN), Random Forest (RF), Bayesian PCA, and other methods available in MsCoreUtils. By default, the kNN is used to impute missing values using the Gower distance. The kNN is a distance-based algorithm that typically requires to scale the data to avoid variance-based weighing. Since the Gower distance used, the imputation can be performed with the original scales, which may be helpful to non-technical users.

se <- imputeIntensity(faahko_se, i = "raw", name = "knn", method = "knn")
se # The result was stored in assays slot: "knn"
#> class: SummarizedExperiment 
#> dim: 206 12 
#> metadata(0):
#> assays(2): raw knn
#> rownames(206): FT001 FT002 ... FT205 FT206
#> rowData names(10): mzmed mzmin ... WT peakidx
#> colnames(12): ko15.CDF ko16.CDF ... wt21.CDF wt22.CDF
#> colData names(2): sample_name sample_group

## Standardization of input does not influence the result
m <- assay(faahko_se, "raw")
knn_scaled <- as.data.frame(
    imputeIntensity(scale(m), method = "knn") # Can accept matrix as an input
)

knn_unscaled <- as.data.frame(assay(se, "knn"))

idx <- which(is.na(m[, 1]) | is.na(m[, 2])) # indices for missing values
p1 <- ggplot(knn_unscaled[idx, ], aes(x = ko15.CDF, y = ko16.CDF)) +
    geom_point() + theme_bw()
p2 <- ggplot(knn_scaled[idx, ], aes(x = ko15.CDF, y = ko16.CDF)) +
    geom_point() + theme_bw()
p1 + p2 + plot_annotation(title = "Imputed values: unscaled vs scaled")