The statistics functionality in notameStats
aims to identify interesting
features across study groups. See the project example vignette in the notame
package and notame website reference index for listing of functions. Similar functionality is
available in several packages.
Unless otherwise stated, all functions return separate data frames or other
objects with the results. These can be then added to the object feature data
using join_rowData(object, results)
. The reason for not adding these to
the objects automatically is that most of the functions return excess
information that is not always worth saving. We encourage you to choose which
information is important to you.
To install notameStats
, install BiocManager
first, if it is not installed.
Afterwards use the install
function from BiocManager
and load notameStats
.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("notameStats")
library(notame)
library(notameViz)
library(notameStats)
ppath <- tempdir()
init_log(log_file = file.path(ppath, "log.txt"))
## INFO [2025-10-10 19:14:29] Starting logging
data(hilic_neg_sample, package = "notame")
data(toy_notame_set, package = "notame")
It is straightforward to provide summary statistics and effect sizes for all features:
toy_notame_set <- mark_nas(toy_notame_set, value = 0)
# Impute missing values, required especially for multivariate methods
toy_notame_set <- notame::impute_rf(toy_notame_set)
## INFO [2025-10-10 19:14:30]
## Starting random forest imputation at 2025-10-10 19:14:30.14515
## INFO [2025-10-10 19:14:34] Out-of-bag error in random forest imputation: 0.466
## INFO [2025-10-10 19:14:34] Random forest imputation finished at 2025-10-10 19:14:34.609549
sum_stats <- summary_statistics(toy_notame_set, grouping_cols = "Group")
toy_notame_set <- notame::join_rowData(toy_notame_set, sum_stats)
d_results <- cohens_d(toy_notame_set, group = "Group")
## INFO [2025-10-10 19:14:37] Starting to compute Cohen's D between groups B & A
## INFO [2025-10-10 19:14:37] Cohen's D computed.
## INFO [2025-10-10 19:14:37] Starting to compute Cohen's D between groups QC & A
## INFO [2025-10-10 19:14:37] Cohen's D computed.
## INFO [2025-10-10 19:14:37] Starting to compute Cohen's D between groups QC & B
## INFO [2025-10-10 19:14:37] Cohen's D computed.
toy_notame_set <- notame::join_rowData(toy_notame_set, d_results)
fc <- fold_change(toy_notame_set, group = "Group")
## INFO [2025-10-10 19:14:37] Starting to compute fold changes.
## INFO [2025-10-10 19:14:38] Fold changes computed.
toy_notame_set <- notame::join_rowData(toy_notame_set, fc)
colnames(rowData(toy_notame_set))
## [1] "Feature_ID" "Split" "Alignment" "Average_Mz"
## [5] "Average_Rt_min" "Column" "Ion_mode" "Flag"
## [9] "A_mean" "A_sd" "A_median" "A_mad"
## [13] "A_min" "A_Q25" "A_Q75" "A_max"
## [17] "B_mean" "B_sd" "B_median" "B_mad"
## [21] "B_min" "B_Q25" "B_Q75" "B_max"
## [25] "QC_mean" "QC_sd" "QC_median" "QC_mad"
## [29] "QC_min" "QC_Q25" "QC_Q75" "QC_max"
## [33] "B_vs_A_Cohen_d" "QC_vs_A_Cohen_d" "QC_vs_B_Cohen_d" "B_vs_A_FC"
## [37] "QC_vs_A_FC" "QC_vs_B_FC"
These functions perform univariate hypothesis tests for each feature, report
relevant statistics and correct the p-values using FDR correction. For
features, where the model fails for some reason, all statistics are recorded as
NA. NOTE setting all_features = FALSE
does not prevent the tests on
the flagged compounds, but only affects p-value correction, where flagged
features are not included in the correction and thus do not have an FDR-
corrected p-value. To prevent the testing of flagged features altogether, use
notame::drop_flagged
before the tests.
Most of the univariate statistical test functions in this package use the formula interface, where the formula is provided as a character, with one special condition: the word “Feature” will get replaced at each iteration by the corresponding feature name. So for example, when testing if any of the features predict the difference between study groups, the formula would be: “Group ~ Feature”. Or, when testing if group and time point affect metabolite levels, the formula could be “Feature ~ Group + Time + Group:Time”, with the last term being an interaction term (“Feature ~ Group * Time” is equivalent).
toy_notame_set <- notame::flag_quality(toy_notame_set)
## INFO [2025-10-10 19:14:38]
## 96% of features flagged for low quality
toy_notame_set <- notame::drop_qcs(toy_notame_set)
lm_results <- perform_lm(toy_notame_set,
formula_char = "Feature ~ Group + Time")
## INFO [2025-10-10 19:14:38] Starting linear regression.
## INFO [2025-10-10 19:14:39] Linear regression performed.
Most of the functions allow you to pass extra arguments to the underlying functions performing the actual tests, so you can set custom contrasts etc.
Functions not using the formula interface include correlation tests between
molecular features and/or sample information variable
(perform_correlation_tests()
) and area under curve computation
(perform_auc()
).
notame provides a wrapper for the MUVR analysis (Multivariate methods with
Unbiased Variable selection in R, [shi2019variable] using the MUVR2 package.
MUVR2 allows fitting both RF and PLS models with clever variable selection for
both finding a minimal subset of features that achieves a good performance AND
for finding all relevant features. There is also a set of useful visualizations
in MUVR2
.
# nRep = 2 for quick example
pls_model <- muvr_analysis(toy_notame_set,
y = "Injection_order", nRep = 2, method = "PLS")
## Warning in (function (X, Y, ID, scale = TRUE, nRep = 5, nOuter = 6, nInner, :
## Missing ID -> Assume all unique (i.e. sample independence)
## Warning: executing %dopar% sequentially: no parallel backend registered
class(pls_model)
## [1] "MUVR" "Regression" "PLS"
For random forest models, we also use the randomForest
package. We also
include a wrapper for getting feature importance.
rf <- fit_rf(toy_notame_set, y = "Group")
class(rf)
## [1] "randomForest"
head(importance_rf(rf))
## Feature_ID A B
## HILIC_neg_118_9111a4_1865 HILIC_neg_118_9111a4_1865 0.01552720 0.009375397
## HILIC_pos_255_0094a7_9288 HILIC_pos_255_0094a7_9288 -0.03925466 -0.002480875
## RP_neg_139_4456a4_1251 RP_neg_139_4456a4_1251 0.02227511 -0.014353918
## MeanDecreaseAccuracy MeanDecreaseGini
## HILIC_neg_118_9111a4_1865 0.012278411 6.673016
## HILIC_pos_255_0094a7_9288 -0.020885661 6.102022
## RP_neg_139_4456a4_1251 0.003800981 6.743762
There are also wrappers for PLS(-DA) functions from the mixOmics package.
pls_res <- mixomics_pls(toy_notame_set, y = "Injection_order", ncomp = 3)
## INFO [2025-10-10 19:14:41] Fitting PLS
class(pls_res)
## [1] "mixo_pls"
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pROC_1.19.0.1 notameStats_0.99.1
## [3] notameViz_0.99.5 notame_0.99.6
## [5] SummarizedExperiment_1.39.2 Biobase_2.69.1
## [7] GenomicRanges_1.61.5 Seqinfo_0.99.2
## [9] IRanges_2.43.5 S4Vectors_0.47.4
## [11] BiocGenerics_0.55.3 generics_0.1.4
## [13] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [15] ggplot2_4.0.0 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] mnormt_2.1.1 gridExtra_2.3 formatR_1.14
## [4] rlang_1.1.6 magrittr_2.0.4 compiler_4.5.1
## [7] mgcv_1.9-3 reshape2_1.4.4 vctrs_0.6.5
## [10] MUVR2_0.1.0 stringr_1.5.2 pkgconfig_2.0.3
## [13] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [16] XVector_0.49.1 rmarkdown_2.30 prodlim_2025.04.28
## [19] itertools_0.1-3 purrr_1.1.0 xfun_0.53
## [22] glmnet_4.1-10 randomForest_4.7-1.2 cachem_1.1.0
## [25] jsonlite_2.0.0 recipes_1.3.1 DelayedArray_0.35.3
## [28] BiocParallel_1.43.4 psych_2.5.6 parallel_4.5.1
## [31] R6_2.6.1 stringi_1.8.7 bslib_0.9.0
## [34] RColorBrewer_1.1-3 ranger_0.17.0 parallelly_1.45.1
## [37] rpart_4.1.24 lubridate_1.9.4 jquerylib_0.1.4
## [40] Rcpp_1.1.0 bookdown_0.45 iterators_1.0.14
## [43] knitr_1.50 future.apply_1.20.0 igraph_2.1.4
## [46] timechange_0.3.0 Matrix_1.7-4 splines_4.5.1
## [49] nnet_7.3-20 tidyselect_1.2.1 dichromat_2.0-0.1
## [52] abind_1.4-8 yaml_2.3.10 timeDate_4041.110
## [55] doParallel_1.0.17 codetools_0.2-20 listenv_0.9.1
## [58] doRNG_1.8.6.2 lattice_0.22-7 tibble_3.3.0
## [61] plyr_1.8.9 rARPACK_0.11-0 withr_3.0.2
## [64] S7_0.2.0 evaluate_1.0.5 future_1.67.0
## [67] lambda.r_1.2.4 survival_3.8-3 futile.logger_1.4.3
## [70] pillar_1.11.1 BiocManager_1.30.26 rngtools_1.5.2
## [73] foreach_1.5.2 ellipse_0.5.0 scales_1.4.0
## [76] globals_0.18.0 class_7.3-23 glue_1.8.0
## [79] tools_4.5.1 data.table_1.17.8 RSpectra_0.16-2
## [82] ModelMetrics_1.2.2.2 gower_1.0.2 grid_4.5.1
## [85] missForest_1.5 tidyr_1.3.1 ipred_0.9-15
## [88] nlme_3.1-168 cli_3.6.5 futile.options_1.0.1
## [91] mixOmics_6.33.0 S4Arrays_1.9.1 viridisLite_0.4.2
## [94] lava_1.8.1 dplyr_1.1.4 corpcor_1.6.10
## [97] gtable_0.3.6 sass_0.4.10 digest_0.6.37
## [100] caret_7.0-1 ggrepel_0.9.6 SparseArray_1.9.1
## [103] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
## [106] hardhat_1.4.2 MASS_7.3-65