This vignette aims to be a short tutorial for the main functionalities of
SIAMCAT. Examples of additional workflows and more detailed tutorials can
be found in the other vignettes of the package.
Associations between microbiome and host phenotypes are ideally described by
quantitative models able to predict host status from microbiome composition.
SIAMCAT can do so for data from hundreds of thousands of microbial taxa, gene
families, or metabolic pathways over hundreds of samples.
SIAMCAT produces graphical output for convenient assessment of the quality of
the input data and statistical associations, for model diagnostics, and for
inference revealing the most predictive microbial biomarkers.
For this vignette, we use an example dataset included in the SIAMCAT package:
the data from the publication of Zeller et al., which demonstrated the
potential of microbial species in fecal samples to distinguish patients with
colorectal cancer (CRC) from healthy controls.
library("SIAMCAT")
data("feat_crc_zeller", package="SIAMCAT")
data("meta_crc_zeller", package="SIAMCAT")
SIAMCAT needs a feature matrix (either a matrix, a
data.frame, or a phyloseq otu_table), which contains values of different
features (in rows) for different samples (in columns). For example, the
feature matrix included here contains relative abundances for bacterial
species calculated with the mOTU profiler for 141 samples:
##                                  CCIS27304052ST-3-0 CCIS15794887ST-4-0
## UNMAPPED                                   0.589839          0.7142157
## Methanoculleus marisnigri [h:1]            0.000000          0.0000000
## Methanococcoides burtonii [h:10]           0.000000          0.0000000
##                                  CCIS74726977ST-3-0
## UNMAPPED                                  0.7818674
## Methanoculleus marisnigri [h:1]           0.0000000
## Methanococcoides burtonii [h:10]          0.0000000
## [1] 1754  141
Please note that SIAMCAT is designed to work with relative abundances. Other
types of data (e.g. counts) will also work, but not all functions of the
package will produce meaningful output.
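If your data come as raw counts, they can be converted to relative abundances before being passed to SIAMCAT. A minimal base-R sketch (with a small hypothetical count matrix, features in rows and samples in columns):

```r
# Hypothetical count table: 3 species (rows) x 2 samples (columns)
counts <- matrix(c(10, 0, 40, 20, 40, 140), nrow = 3,
                 dimnames = list(paste0("sp", 1:3), c("s1", "s2")))

# Scale each sample (column) so that it sums to 1
rel.ab <- prop.table(counts, margin = 2)

colSums(rel.ab)  # each column now sums to 1
```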
Secondly, we also have metadata about the samples in another data.frame:
##                    Age BMI Gender AJCC_stage     FOBT Group
## CCIS27304052ST-3-0  52  20      F         -1 Negative   CTR
## CCIS15794887ST-4-0  37  18      F         -1 Negative   CTR
## CCIS74726977ST-3-0  66  24      M         -1 Negative   CTR
## CCIS16561622ST-4-0  54  26      M         -1 Negative   CTR
## CCIS79210440ST-3-0  65  30      M         -1 Positive   CTR
## CCIS82507866ST-3-0  57  24      M         -1 Negative   CTR
In order to tell SIAMCAT which samples are cancer cases and which are
healthy controls, we can construct a label object from the Group column in
the metadata:
label.crc.zeller <- create.label(meta=meta.crc.zeller, label='Group', case='CRC')
## Label used as case:
##    CRC
## Label used as control:
##    CTR
## + finished create.label.from.metadata in 0.003 s
Now we have all the ingredients to create a SIAMCAT object. Please have a
look at the vignette about input formats for more information about supported
formats and other ways to create a SIAMCAT object.
sc.obj <- siamcat(feat=feat.crc.zeller, label=label.crc.zeller, meta=meta.crc.zeller)
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 141 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.056 s
Some information about the SIAMCAT object can be displayed by printing the
object (SIAMCAT builds on the phyloseq data structure):
## siamcat-class object
## label()                Label object:    88 CTR and 53 CRC samples
##
## contains phyloseq-class experiment-level object @phyloseq:
## phyloseq@otu_table()   OTU Table:       [ 1754 taxa and 141 samples ]
## phyloseq@sam_data()    Sample Data:     [ 141 samples by 6 sample variables ]
Since we have quite a lot of microbial species in the dataset at the moment,
we can perform unsupervised feature selection using the filter.features
function:
sc.obj <- filter.features(sc.obj, filter.method = 'abundance', cutoff = 0.001)
## Features successfully filtered
Associations between microbial species and the label can be tested with the
check.associations function. For each species, the function computes the
significance using a non-parametric Wilcoxon test as well as different effect
sizes for the association (e.g. AUC or fold change).
sc.obj <- check.associations(
    sc.obj,
    sort.by = 'fc',
    alpha = 0.05,
    mult.corr = "fdr",
    detect.lim = 10^-6,
    plot.type = "quantile.box",
    panels = c("fc", "prevalence", "auroc"))
The function produces a pdf file as output, since the plot is optimized for a
landscape DIN-A4 layout, but it can also be used to plot on an active graphics
device, e.g. in RStudio. The resulting plot then looks like this:
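The per-species statistics described above can be illustrated in base R. This is a toy sketch of the idea, not SIAMCAT's actual implementation: the Wilcoxon test supplies the p-value, the AUC follows from the Mann-Whitney U statistic, and the fold change is computed on log10 abundances with the detection limit as pseudocount.

```r
# Toy abundance vectors for one species in cases (crc) and controls (ctr)
set.seed(1)
crc <- rbeta(30, 2, 50)
ctr <- rbeta(30, 1, 80)

wt    <- wilcox.test(crc, ctr)
p.val <- wt$p.value                                        # significance
auc   <- as.numeric(wt$statistic) / (length(crc) * length(ctr))  # AUC from U
fc    <- median(log10(crc + 1e-06)) - median(log10(ctr + 1e-06)) # fold change
```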
One strength of
SIAMCAT is the versatile but easy-to-use interface for the
construction of machine learning models on the basis of microbial species.
SIAMCAT contains functions for data normalization, splitting the data into
cross-validation folds, training the model, and making predictions based on
cross-validation instances and the trained models.
Data normalization is performed with the
normalize.features function. Here,
we use the
log.unit method, but several other methods and customization
options are available (please check the documentation).
sc.obj <- normalize.features(
    sc.obj,
    norm.method = "log.unit",
    norm.param = list(
        log.n0 = 1e-06,
        n.p = 2,
        norm.margin = 1
    )
)
## Features normalized successfully.
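Conceptually, the log.unit method log-transforms the abundances (with log.n0 as pseudocount) and then rescales by a p-norm. A rough base-R sketch of this idea, assuming log10 and feature-wise L2 scaling (n.p = 2, norm.margin = 1); the exact SIAMCAT implementation may differ in details:

```r
# Toy feature matrix: 4 species (rows) x 5 samples (columns)
set.seed(7)
feat <- matrix(runif(20), nrow = 4,
               dimnames = list(paste0("sp", 1:4), paste0("s", 1:5)))

# Log-transform with pseudocount (log.n0 = 1e-06)
log.feat <- log10(feat + 1e-06)

# Divide each feature (row) by its L2 norm, so every row has unit norm
feat.norm <- log.feat / apply(log.feat, 1, function(r) sqrt(sum(r^2)))
```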
Preparation of the cross-validation folds is a crucial step in machine learning.
SIAMCAT greatly simplifies the set-up of cross-validation schemes, including
stratification of samples or keeping samples inseparable based on metadata.
For this small example, we choose a twice-repeated 5-fold cross-validation
scheme. The data split will be saved in the data_split slot of the SIAMCAT
object.
sc.obj <- create.data.split( sc.obj, num.folds = 5, num.resample = 2 )
## Features splitted for cross-validation successfully.
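The stratification that create.data.split performs can be illustrated in base R: each class is distributed across the folds separately, so every fold keeps roughly the same class balance. This is a toy sketch, not SIAMCAT's implementation:

```r
# Stratified assignment of 141 samples (88 CTR, 53 CRC) to 5 folds
set.seed(42)
label <- rep(c("CTR", "CRC"), c(88, 53))
folds <- integer(length(label))
for (cls in unique(label)) {
    idx <- which(label == cls)
    # spread this class evenly over folds 1..5, in random order
    folds[idx] <- sample(rep(1:5, length.out = length(idx)))
}
table(label, folds)  # class counts per fold differ by at most one
```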
The actual model training is performed using the function train.model.
Again, multiple options for customization are available, ranging from the
machine learning method to the measure for model selection or customizable
parameter set for hyperparameter tuning.
sc.obj <- train.model( sc.obj, method = "lasso" )
The models are saved in the model_list slot of the SIAMCAT object. The
model building is performed using the mlr R package. All models can easily
be accessed:
# get information about the model type
model_type(sc.obj)
## [1] "lasso"
# access the models
models <- models(sc.obj)
models[[1]]
## Model for learner.id=classif.cvglmnet; learner.class=classif.cvglmnet ## Trained on: task.id = data; obs = 112; features = 207 ## Hyperparameters: nlambda=100,alpha=1
Using the data split and the models trained in the previous step, we can use
the function make.predictions in order to apply the models to the test
instances in the data split. The predictions will be saved in the pred_matrix
slot of the SIAMCAT object.
sc.obj <- make.predictions(sc.obj)
pred_matrix <- pred_matrix(sc.obj)
##                       CV_rep1    CV_rep2
## CCIS27304052ST-3-0 0.10011768 0.16309522
## CCIS15794887ST-4-0 0.18626022 0.27379882
## CCIS74726977ST-3-0 0.47634142 0.39204057
## CCIS16561622ST-4-0 0.31825180 0.20415432
## CCIS79210440ST-3-0 0.54648880 0.07606544
## CCIS82507866ST-3-0 0.04439485 0.09244968
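Each cross-validation repeat contributes one column of predictions, so a single consensus score per sample can be obtained by averaging across repeats. A minimal sketch with a toy matrix standing in for pred_matrix:

```r
# Toy prediction matrix: 3 samples x 2 cross-validation repeats
pm <- matrix(c(0.10, 0.19, 0.48, 0.16, 0.27, 0.39), ncol = 2,
             dimnames = list(paste0("sample", 1:3), c("CV_rep1", "CV_rep2")))

# Average the repeat columns to get one score per sample
mean.pred <- rowMeans(pm)
```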
In the final part, we want to find out how well the model performed and which
microbial species were selected by the model. In order to do so, we first
calculate how well the predictions fit the real data using the function
evaluate.predictions. This function calculates the Area Under the Receiver
Operating Characteristic (ROC) Curve (AU-ROC) and the Precision Recall (PR)
Curve for each resampled cross-validation run.
sc.obj <- evaluate.predictions(sc.obj)
## Evaluated predictions successfully.
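The AU-ROC that evaluate.predictions reports can be illustrated with the Mann-Whitney rank-sum identity in base R. This is a toy example with hypothetical scores, not SIAMCAT's implementation:

```r
# AU-ROC via the rank-sum identity: (sum of positive ranks - n1*(n1+1)/2) / (n1*n0)
scores <- c(0.9, 0.8, 0.3, 0.6, 0.2, 0.1)  # predicted scores
truth  <- c(1, 1, 0, 1, 0, 0)              # 1 = CRC, 0 = CTR
n1 <- sum(truth == 1)
n0 <- sum(truth == 0)
auroc <- (sum(rank(scores)[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auroc  # 1: every case is ranked above every control
```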
To plot the results of the evaluation, we can use the function
model.evaluation.plot, which produces a pdf file showing the ROC and PR
curves for the individual resampled runs as well as the mean ROC and PR curves.