From the perspective of metabolites as the continuation of the central dogma of biology, metabolomics provides the closest link to many phenotypes of interest. This makes untargeted LC-MS metabolomics data promising in teasing apart the complexities of living systems. However, due to experimental reasons, the data includes non-wanted variation which limits quality and reproducibility, especially if the data is obtained from several batches.
The batchCorr package reduces unwanted variation by way of between-batch alignment, within-batch drift correction and between-batch normalization using batch-specific quality control (QC) samples and long-term reference QC samples. Please see the associated article (Brunius, Shi, and Landberg 2016) for more thorough descriptions of algorithms.
To install batchCorr
, install BiocManager first, if it is not installed.
Afterwards use the install function from BiocManager.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("batchCorr")
The example data allows for demonstration of all core batchCorr
functionality: between-batch alignment, within-batch drift correction and
between-batch normalization. The example data consisting of three batches from
a single analytical mode consists of three objects: PTnofill
(non-imputed/filled abundances, matrix), PTfilled (imputed/filled abundances)
and meta (sample and feature metadata, data.frame).
library(batchCorr)
data("ThreeBatchData")
batchCorr
was originally designed to work with basic data structures but the
three main methods, alignBatches
, correctDrift
and
normalizeBatches
also support SummarizedExperiment. This improves
interoperability with other Bioconductor packages. This includes xcms for
preprocessing, the qmtools, phenomis and/or pmp packages to complement
normalization and quality control as well as statistical tests, machine
learning and annotation.
Abundances are included in a matrix, while sample and feature data is included
in a data.frame. batchCorr
works best as per the chronology presented below,
where both the original functionality and SummarizedExperiment-functionality is
demonstrated.
Utility functions for the original functionality include:
peakInfo
to extract m/z and rt from peak table based on a separator in
rownamesgetBatch
to extract specific batch from a list with a peak table and
metadatamergeBatches
to merge batches after drift correctionImportant analytical background includes batch-specific QC samples and long-term reference QC samples, which are regularly interspersed in the injection sequence. Batch-specific QC samples are typically pooled aliquots of study samples, and are used for within-batch drift correction. Long-term reference QC samples are not of the same biological origin as the batch-specific QC samples, and are therefore not directly representative of the sample population. Long-term reference QC samples are used for between-batch alignment, within-batch drift correction and between-batch normalization.
Let’s create a SummarizedExperiment object in order to demonstrate the original
functionality and the new SummarizedExperiment methods in parallel.
SummarizedExperiment may also be output from xcms-based preprocessing using
xcms::quantify()
.
peaks <- SimpleList(t(PTnofill), t(PTfill))
sampleData <- meta
featureData <- peakInfo(PT = PTnofill, sep = "@", start = 3)
rownames(featureData) <- rownames(peaks[[1]])
se <- SummarizedExperiment(assays = peaks, colData = sampleData,
rowData = featureData)
names(assays(se)) <- c("nofill", "fill")
Below, we focus on basic usage. The list output for basic data structures includes processing metadata which can be used for troubleshooting. The SummarizedExperiment methods return objects with modified peak tables. Please see the documentation for more information (for example, ?alignBatches).
Shifts in retention time (RT) and mass-to-charge ratio (m/z) across batches
results in some metabolites being redundantly represented in the dataset.
To rectify this, between-batch alignment of features is performed using
alignBatches
, which encompasses the following steps:
NA
s to
all samples > 80% based on long-term reference QC samples# Extract peakinfo (i.e. m/z and rt of features),
# These column names have 2 leading characters describing LC-MS mode
# -> start at 3
peakIn <- peakInfo(PT = PTnofill, sep = "@", start = 3)
# Perform multi-batch alignment
alignBat <- alignBatches(
peakInfo = peakIn, PeakTabNoFill = PTnofill,
PeakTabFilled = PTfill, batches = meta$batch,
sampleGroups = meta$grp, selectGroup = "QC",
report = FALSE
)
# Extract new peak table
PT <- alignBat$PTalign
Below, we use SummarizedExperiment with multiple peak tables. When using
SummarizedExperiment sequentially such that a single peak table is replaced by
a new one, one doesn’t need to specify the assay.type
or name
parameters.
The assay is added to the SummarizedExperiment supplied to PeakTabFilled
.
se <- alignBatches(PeakTabNoFill = se, PeakTabFilled = se, batches = "batch",
sampleGroups = "grp", report = FALSE, assay.type1 = "nofill",
assay.type2 = "fill", name = "aligned", rt_col = "rt", mz_col = "mz")
Drift in abundance within a batch gives rise to unwanted variation which can be
modelled in terms of injection order. Many methods fail to take into account
different drift patterns in features or are prone to overfitting. Herein,
within-batch drift correction is performed using the wrapper correctDrift
,
which involves:
Fitting a cubic spline and calculation of correction factor
Correction of the abundances using correction factor
The mixture models used to cluster the drift patterns in correctDrift()
can
fail to converge for some combinations of geometry and cluster number. This
results in missing Bayesian Information Criterion (BIC) measures for some
models. You can check from which converged models the final model was selected
using the “BIC” element in the output for basic data structures. To learn more
about the model-based clustering used herein, refer to the mclust (e)book
(Scrucca et al. 2023).
# Batch B
batchB <- getBatch(
peakTable = PT, meta = meta,
batch = meta$batch, select = "B"
)
BCorr <- correctDrift(
peakTable = batchB$peakTable,
injections = batchB$meta$inj,
sampleGroups = batchB$meta$grp, QCID = "QC",
G = seq(5, 35, by = 3), modelNames = c("VVE", "VEE"),
report = FALSE
)
# Batch F
batchF <- getBatch(
peakTable = PT, meta = meta,
batch = meta$batch, select = "F"
)
FCorr <- correctDrift(
peakTable = batchF$peakTable,
injections = batchF$meta$inj,
sampleGroups = batchF$meta$grp,
QCID = "QC", G = seq(5, 35, by = 3),
modelNames = c("VVE", "VEE"),
report = FALSE
)
# Batch H
batchH <- getBatch(
peakTable = PT, meta = meta,
batch = meta$batch, select = "H"
)
HCorr <- correctDrift(
peakTable = batchH$peakTable,
injections = batchH$meta$inj,
sampleGroups = batchH$meta$grp,
QCID = "QC", G = seq(5, 35, by = 3),
modelNames = c("VVE", "VEE"),
report = FALSE
)
HCorr$BIC
## Bayesian Information Criterion (BIC):
## VVE VEE
## 5 7471.557 7028.343
## 8 7608.446 7460.302
## 11 7471.978 7456.081
##
## Top 3 models based on the BIC criterion:
## VVE,8 VVE,11 VVE,5
## 7608.446 7471.978 7471.557
Similarly, we subset the SummarizedExperiment by batch and perform drift correction, but for this example using long-term reference samples for quality indicators.
batch_labels <- unique(colData(se)$batch)
batches <- lapply(batch_labels, function(batch_label) {
se[, colData(se)$batch == batch_label]
})
batches <- lapply(batches, correctDrift, injections = "inj",
sampleGroups = "grp", RefID = "Ref", G = seq(5, 35, by = 3),
modelNames = c("VVE", "VEE"), report = FALSE,
assay.type = "aligned", name = "corrected")
normalizeBatches
performs between-batch normalization either based on
long-term reference QC samples or median batch intensity depending on the
following dual criterion:
If the long-term QC samples are not considered reliable according to the above
dual criterion for a specific feature, batches were normalized by sample
population median, where a sample population can be specified explicitly to
the population
argument. Features not present in all batches are also
excluded from the dataset.
mergedData <- mergeBatches(list(BCorr, FCorr, HCorr), qualRatio = 0.5)
normData <- normalizeBatches(
peakTableCorr = mergedData$peakTableCorr,
batches = meta$batch, sampleGroup = meta$grp,
refGroup = "Ref", population = "all"
)
PTnorm <- normData$peakTable
Merging with mergeBatches
, as above, includes a quality control step, where
features with CV > the limit (supplied to correctDrift()) in a specified
proportion of batches (default = 0.5) are excluded. For SummarizedExperiment,
we join the batches, keeping features shared across all batches but without
filtering features by proportion of quality batches.
se <- do.call(cbind, batches)
se <- se[which(apply(assay(se), 1, function(x) any(is.na(x)))), ]
se <- normalizeBatches(peakTableCorr = se,
batches = "batch", sampleGroup = "grp", refGroup = "Ref",
population = "all", assay.type = "corrected", name = "normalized")
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] batchCorr_0.99.8 SummarizedExperiment_1.39.1
## [3] Biobase_2.69.0 GenomicRanges_1.61.1
## [5] Seqinfo_0.99.2 IRanges_2.43.0
## [7] S4Vectors_0.47.0 BiocGenerics_0.55.0
## [9] generics_0.1.4 MatrixGenerics_1.21.0
## [11] matrixStats_1.5.0 BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-3 jsonlite_2.0.0 compiler_4.5.1
## [4] BiocManager_1.30.26 crayon_1.5.3 Rcpp_1.1.0
## [7] parallel_4.5.1 jquerylib_0.1.4 BiocParallel_1.43.4
## [10] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-7
## [13] plyr_1.8.9 R6_2.6.1 XVector_0.49.0
## [16] S4Arrays_1.9.1 knitr_1.50 DelayedArray_0.35.2
## [19] bookdown_0.43 bslib_0.9.0 rlang_1.1.6
## [22] reshape_0.8.10 cachem_1.1.0 xfun_0.52
## [25] sass_0.4.10 SparseArray_1.9.1 cli_3.6.5
## [28] digest_0.6.37 grid_4.5.1 mclust_6.1.1
## [31] lifecycle_1.0.4 evaluate_1.0.4 codetools_0.2-20
## [34] abind_1.4-8 rmarkdown_2.29 tools_4.5.1
## [37] htmltools_0.5.8.1
Brunius, Carl, Lin Shi, and Rikard Landberg. 2016. “Large-Scale Untargeted Lc-Ms Metabolomics Data Correction Using Between-Batch Feature Alignment and Cluster-Based Within-Batch Signal Intensity Drift Correction.” Metabolomics 12: 1–13.
Scrucca, Luca, Chris Fraley, T. Brendan Murphy, and Adrian E. Raftery. 2023. Model-Based Clustering, Classification, and Density Estimation Using mclust in R. Chapman; Hall/CRC. https://doi.org/10.1201/9781003277965.