This vignette aims to be a short tutorial for the main functionalities of
SIAMCAT. Examples of additional workflows or more detailed tutorials can
be found in other vignettes (see the BioConductor page).
SIAMCAT is part of the suite of computational microbiome analysis tools
hosted at EMBL by the groups of Peer Bork and Georg Zeller. Find
out more at EMBL-microbiome tools.
Associations between microbiome and host phenotypes are ideally described by
quantitative models able to predict host status from microbiome composition.
SIAMCAT can do so for data from hundreds of thousands of microbial taxa, gene
families, or metabolic pathways over hundreds of samples. SIAMCAT produces
graphical output for convenient assessment of the quality of the input data and
statistical associations, for model diagnostics and inference revealing the
most predictive microbial biomarkers.
For this vignette, we use an example dataset included in the SIAMCAT package:
the data from the publication of Zeller et al., which demonstrated
the potential of microbial species in fecal samples to distinguish patients
with colorectal cancer (CRC) from healthy controls.
library("SIAMCAT")
data("feat_crc_zeller", package="SIAMCAT")
data("meta_crc_zeller", package="SIAMCAT")
First, SIAMCAT needs a feature matrix (can be either a matrix, a data.frame,
or a phyloseq-otu_table), which contains values of different
features (in rows) for different samples (in columns). For example, the
feature matrix included here contains relative abundances for bacterial
species calculated with the mOTU profiler for 141 samples:
feat.crc.zeller[1:3, 1:3]
## CCIS27304052ST-3-0 CCIS15794887ST-4-0
## UNMAPPED 0.589839 0.7142157
## Methanoculleus marisnigri [h:1] 0.000000 0.0000000
## Methanococcoides burtonii [h:10] 0.000000 0.0000000
## CCIS74726977ST-3-0
## UNMAPPED 0.7818674
## Methanoculleus marisnigri [h:1] 0.0000000
## Methanococcoides burtonii [h:10] 0.0000000
dim(feat.crc.zeller)
## [1] 1754 141
Please note that SIAMCAT is designed to work with relative abundances. Other types of data (e.g. counts) will also work, but not all functions of the package will produce meaningful output.
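If the starting point is a count table rather than relative abundances, the conversion can be done in base R before the data are passed on. The following is a minimal sketch on a hypothetical toy matrix (the object names counts and rel.abund are illustrative and not part of SIAMCAT), with features in rows and samples in columns as described above:

```r
# Hypothetical count matrix: features in rows, samples in columns
counts <- matrix(c(10,  0, 30,
                   90, 50, 70,
                    0, 50,  0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("taxon_A", "taxon_B", "taxon_C"),
                                 c("sample_1", "sample_2", "sample_3")))

# Divide every column by its total so that each sample sums to 1
rel.abund <- prop.table(counts, margin = 2)

# Each sample should now sum to 1
colSums(rel.abund)
```

Whether total-sum scaling is appropriate depends on how the counts were generated, so this is one possible preprocessing step rather than a recommendation.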
Secondly, we also have metadata about the samples in another data.frame:
head(meta.crc.zeller)
## Age BMI Gender AJCC_stage FOBT Group
## CCIS27304052ST-3-0 52 20 F -1 Negative CTR
## CCIS15794887ST-4-0 37 18 F -1 Negative CTR
## CCIS74726977ST-3-0 66 24 M -1 Negative CTR
## CCIS16561622ST-4-0 54 26 M -1 Negative CTR
## CCIS79210440ST-3-0 65 30 M -1 Positive CTR
## CCIS82507866ST-3-0 57 24 M -1 Negative CTR
To tell SIAMCAT which samples are cancer cases and which are
healthy controls, we can construct a label object from the Group column in
the metadata.
label.crc.zeller <- create.label(meta=meta.crc.zeller,
label='Group', case='CRC')
## Label used as case:
## CRC
## Label used as control:
## CTR
## + finished create.label.from.metadata in 0.001 s
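Before building a label from a metadata column, it can be useful to check how many samples fall into each class. The sketch below uses a hypothetical toy data.frame (meta.toy, class.counts, and is.case are illustrative names, not SIAMCAT objects) that mimics the layout of the metadata shown above. It only illustrates the idea of a case/control split and does not reproduce how SIAMCAT stores labels internally:

```r
# Hypothetical metadata, mimicking the layout of meta.crc.zeller
meta.toy <- data.frame(Age = c(52, 37, 66, 71),
                       Group = c("CTR", "CTR", "CRC", "CRC"),
                       row.names = c("s1", "s2", "s3", "s4"))

# Number of samples per class in the column that will become the label
class.counts <- table(meta.toy$Group)
print(class.counts)

# Named logical vector: TRUE for cases ('CRC'), FALSE for controls
is.case <- setNames(meta.toy$Group == "CRC", rownames(meta.toy))
print(is.case)
```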
Now we have all the ingredients to create a SIAMCAT object. Please have a
look at the vignette about input formats for more
information about supported formats and other ways to create a SIAMCAT object.
sc.obj <- siamcat(feat=feat.crc.zeller,
label=label.crc.zeller,
meta=meta.crc.zeller)
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 141 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.031 s
Some information about the SIAMCAT object can be accessed with the show
function from phyloseq (SIAMCAT builds on the phyloseq data structure):
show(sc.obj)
## siamcat-class object
## label() Label object: 88 CTR and 53 CRC samples
##
## contains phyloseq-class experiment-level object @phyloseq:
## phyloseq@otu_table() OTU Table: [ 1754 taxa and 141 samples ]
## phyloseq@sam_data() Sample Data: [ 141 samples by 6 sample variables ]
Since we have quite a lot of microbial species in the dataset at the moment, we
can perform unsupervised feature selection using the function filter.features.
sc.obj <- filter.features(sc.obj,
filter.method = 'abundance',
cutoff = 0.001)
## Features successfully filtered
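To give an intuition for what an abundance-based filter does, here is a base R sketch on a hypothetical toy matrix. It assumes that a feature is kept when its maximum relative abundance reaches the cutoff in at least one sample. This is an illustration of the general idea, not SIAMCAT's implementation; see the documentation of filter.features for the exact rule.

```r
# Hypothetical relative abundances: features in rows, samples in columns
feat.toy <- matrix(c(0.9000, 0.8000, 0.9500,
                     0.0995, 0.1996, 0.0492,
                     0.0005, 0.0004, 0.0008),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("taxon_A", "taxon_B", "taxon_C"),
                                   c("s1", "s2", "s3")))

cutoff <- 0.001

# Maximum abundance of each feature across all samples
max.abund <- apply(feat.toy, 1, max)

# Assumed rule: keep features that reach the cutoff in at least one sample
keep <- max.abund >= cutoff
feat.filtered <- feat.toy[keep, , drop = FALSE]

# taxon_C never reaches 0.001 and is therefore dropped
rownames(feat.filtered)
```

The filter is unsupervised in the sense that it never looks at the label, only at the abundance values themselves.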
Associations between microbial species and the label can be tested
with the check.associations function. The function computes for each species
the significance using a non-parametric Wilcoxon test and different effect
sizes for the association (e.g. AUC or fold change).
sc.obj <- check.associations(sc.obj, log.n0 = 1e-06, alpha = 0.05)
association.plot(sc.obj, sort.by = 'fc',
panels = c('fc', 'prevalence', 'auroc'))
The function produces a pdf file as output, since the plot is optimized for a landscape DIN-A4 layout, but it can also be used to plot on an active graphics device, e.g. in RStudio. The resulting plot contains the panels requested above (fold change, prevalence, and AUROC).
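The quantities involved in such an association test can be illustrated for a single species in base R. The sketch below uses hypothetical abundance values and makes two simplifying assumptions: the AUC is derived from the Wilcoxon rank-sum statistic, and the fold change is a plain difference of log10 medians after adding a pseudocount (here called log.n0 to mirror the parameter above). It does not reproduce the effect-size definitions used inside check.associations.

```r
# Hypothetical relative abundances of one species in the two groups
ctr <- c(0.000, 0.001, 0.002, 0.000, 0.003)   # controls
crc <- c(0.004, 0.010, 0.005, 0.020, 0.008)   # cases

# Non-parametric Wilcoxon rank-sum test (normal approximation because of ties)
test <- wilcox.test(crc, ctr, exact = FALSE)
p.val <- test$p.value

# AUC from the rank-sum statistic: W / (n_cases * n_controls)
auroc <- unname(test$statistic) / (length(crc) * length(ctr))

# Simple log fold change: difference of log10 medians with a pseudocount
log.n0 <- 1e-06
fc <- median(log10(crc + log.n0)) - median(log10(ctr + log.n0))

c(p.value = p.val, auroc = auroc, fold.change = fc)
```

In this toy example every case value exceeds every control value, so the AUC is 1 and the fold change is positive.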
As many biological and technical factors beyond the primary phenotype of
interest can influence microbiome composition, simple association studies may
suffer from confounding by other variables, which can lead to spurious results.
The check.confounders function provides the option to test the associated
metadata variables for potential confounding influence. No information is stored
in the SIAMCAT object, but the different analyses are visualized and saved to
a combined pdf file for qualitative interpretation.
check.confounders(sc.obj, fn.plot = 'confounder_plots.pdf',
meta.in = NULL, feature.type = 'filtered')
The conditional entropy check primarily serves to remove nonsensical variables from subsequent checks. Conditional entropy quantifies the unique information contained in one variable (row) relative to another (column). Identical variables, and derived variables that share exactly the same information, will have a value of zero. In this example, the label was derived from the Group variable, which was in turn determined from the AJCC stage, so both are excluded.
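The behaviour described above can be checked with a small base R sketch of conditional entropy, computed as H(X | Y) = H(X, Y) - H(Y). The helper functions entropy and cond.entropy and the toy vectors are hypothetical and written for illustration only; they are not the code used by check.confounders.

```r
# Shannon entropy (in bits) of a discrete variable
entropy <- function(x) {
    p <- table(x) / length(x)
    -sum(p * log2(p))
}

# Conditional entropy H(X | Y) = H(X, Y) - H(Y)
cond.entropy <- function(x, y) {
    entropy(paste(x, y, sep = "|")) - entropy(y)
}

# Hypothetical metadata for six samples
group     <- c("CTR", "CTR", "CRC", "CRC", "CRC", "CTR")
gender    <- c("F", "M", "F", "M", "M", "F")
label.toy <- ifelse(group == "CRC", "case", "control")  # derived from group

# A variable derived from another carries no unique information: value is 0
h.derived <- cond.entropy(label.toy, group)

# An unrelated variable leaves uncertainty about the label: value is > 0
h.gender <- cond.entropy(label.toy, gender)

c(label.given.group = h.derived, label.given.gender = h.gender)
```

Because label.toy is a one-to-one recoding of group, knowing group leaves no uncertainty about the label, which is why such variables are uninformative for confounder checks.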