Here, we demonstrate the standard workflow of the
SIAMCAT package using as
an example the dataset from Nielsen et al. Nat Biotechnol
This dataset contains samples from patients with inflammatory bowel disease and
More importantly, these samples have been collected in two different countries, Spain and Denmark. Together with technical differences between these samples, this introduces a potent confounding factor into the data. Here we are going to explore how
SIAMCAT would identify the confounding factor and what the
results would be if you account for the confounder or not.
First, we load the packages needed to perform the analyses.
library("tidyverse") library("SIAMCAT") library("ggpubr")
There are two different ways to access the data for our example dataset. On
the one hand, it is available through the
curatedMetagenomicsData R package.
However, using it here would create many more dependencies for the
Therefore, we here use data available through the EMBL cluster.
SIAMCAT paper, we performed the presented analyses on the datasets
curatedMetagenomicsData. If you want to reproduce the
analysis from the
SIAMCAT paper, you can execute the code chunks in
curatedMetageomicsData section, otherwise execute the code in the
First, we load the package:
The data are part of the
meta.nielsen.full <- combined_metadata %>% filter(dataset_name=='NielsenHB_2014')
One thing we have to keep in mind are repeated samples per subject (see also the Machine learning pitfalls vignette).
Some subjects (but not all) had been sampled multiple times. Therefore, we want to remove repeated samplings for the same subject, since the samples would otherwise not be indepdenent from another.
The visit number is encoded in the
sampleID. Therefore, we can use this
information to extract when the samples have been taken and use only the
first visit for each subject.
meta.nielsen <- meta.nielsen.full %>% select(sampleID, subjectID, study_condition, disease_subtype, disease, age, country, number_reads, median_read_length, BMI) %>% mutate(visit=str_extract(sampleID, '_[0-9]+$')) %>% mutate(visit=str_remove(visit, '_')) %>% mutate(visit=as.numeric(visit)) %>% mutate(visit=case_when(is.na(visit)~0, TRUE~visit)) %>% group_by(subjectID) %>% filter(visit==min(visit)) %>% ungroup() %>% mutate(Sample_ID=sampleID) %>% mutate(Group=case_when(disease=='healthy'~'CTR', TRUE~disease_subtype))
Now, we can restrict our metadata to samples with
UC and healthy control
meta.nielsen <- meta.nielsen %>% filter(Group %in% c('UC', 'CTR'))
As a last step, we can adjust the column names for the metadata so that they
agree with the data available from the EMBL cluster. Also, we add rownames to
the dataframe since
SIAMCAT needs rownames to match samples across metadata
meta.nielsen <- meta.nielsen %>% mutate(Country=country) meta.nielsen <- as.data.frame(meta.nielsen) rownames(meta.nielsen) <- meta.nielsen$sampleID
We can load the taxonomic profiles generated with MetaPhlAn2 via the curatedMetagenomicsData R package.
x <- 'NielsenHB_2014.metaphlan_bugs_list.stool' feat <- curatedMetagenomicData(x=x, dryrun=FALSE) feat <- feat[[x]]@assayData$exprs
The MetaPhlAn2 profiles contain information on different taxonomic levels. Therefore, we want to restrict them to species-level profiles. In a second step, we convert them into relative abundances (summing up to 1) instead of using the percentages (summing up to 100) that MetaPhlAn2 outputs.
feat <- feat[grep(x=rownames(feat), pattern='s__'),] feat <- feat[grep(x=rownames(feat),pattern='t__', invert = TRUE),] feat <- t(t(feat)/100)
The feature names are very long and may be a bit un-wieldy for plotting later on, so we shorten them to only the species name:
rownames(feat) <- str_extract(rownames(feat), 's__.*$')
Both metadata and features are available through the EMBL cluster:
# base url for data download data.location <- 'https://www.embl.de/download/zeller/metaHIT/' ## metadata meta.nielsen <- read_tsv(paste0(data.location, 'meta_Nielsen.tsv')) ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## Sample_ID = col_character(), ## Individual_ID = col_character(), ## Country = col_character(), ## Sampling_day = col_double(), ## Gender = col_character(), ## Age = col_double(), ## BMI = col_double(), ## Group = col_character(), ## Library_Size = col_double() ## ) # also here, we have to remove repeated samplings and CD samples meta.nielsen <- meta.nielsen %>% filter(Group %in% c('CTR', 'UC')) %>% group_by(Individual_ID) %>% filter(Sampling_day==min(Sampling_day)) %>% ungroup() %>% as.data.frame() rownames(meta.nielsen) <- meta.nielsen$Sample_ID ## features feat <- read.table(paste0(data.location, 'metaHIT_motus.tsv'), stringsAsFactors = FALSE, sep='\t', check.names = FALSE, quote = '', comment.char = '') feat <- feat[,colSums(feat) > 0] feat <- prop.table(as.matrix(feat), 2)
Now, we have everything ready to create a
SIAMCAT object which stores
the feature matrix, the meta-variables, and the label. Here, the label is
created using the information in the metadata.
To demonstrate the normal
SIAMCAT workflow, we remove the confounding factor
by only looking at samples from Spain. Below, we have a look what would have
happened if we had not removed them.
# remove Danish samples meta.nielsen.esp <- meta.nielsen[meta.nielsen$Country == 'ESP',] sc.obj <- siamcat(feat=feat, meta=meta.nielsen.esp, label='Group', case='UC') ## + starting create.label ## Label used as case: ## UC ## Label used as control: ## CTR ## + finished create.label.from.metadata in 0.037 s ## + starting validate.data ## +++ checking overlap between labels and features ## + Keeping labels of 128 sample(s). ## +++ checking sample number per class ## +++ checking overlap between samples and metadata ## + finished validate.data in 0.537 s
Now, we can filter feature with low overall abundance and prevalence.
sc.obj <- filter.features(sc.obj, cutoff=1e-04, filter.method = 'abundance') ## Features successfully filtered sc.obj <- filter.features(sc.obj, cutoff=0.05, filter.method='prevalence', feature.type = 'filtered') ## Features successfully filtered
check.assocation function calculates the significance of enrichment and
metrics of association (such as generalized fold change and single-feautre
sc.obj <- check.associations(sc.obj, detect.lim = 1e-06, alpha=0.1, max.show = 20, plot.type = 'quantile.rect', panels = c('fc'), fn.plot = './association_plot_nielsen.pdf')