Microbiome time-series data presents several distinct challenges, including complex covariate dependencies and a variety of longitudinal study designs. Furthermore, while several individual tools have been created to aid in longitudinal microbiome data analysis, an all-inclusive R-based toolkit has remained elusive.
To address these concerns, we have created LegATo (Longitudinal mEtaGenomic Analysis TOolkit). LegATo is a suite of open-source software tools for longitudinal microbiome analysis that is extendable to several different study forms, all while promising optimal ease-of-use for researchers. This toolkit will allow researchers to determine which microbial taxa are affected over time by perturbations such as onset of disease or lifestyle choices, and to predict the effects of these perturbations over time, including changes in composition or stability of commensal bacteria.
Currently, LegATo entertains a number of data cleaning, aggregation, modeling and testing procedures. As we develop or learn of new methods, we will add to the toolkit accordingly. We will soon add hierarchical clustering tools and multivariate generalized estimating equations (JGEEs) to adjust for the compositional nature of microbiome data.
MultiAssayExperiment
and SummarizedExperiment
objectsThe convenience of LegATo is in part credited to the integration of MultiAssayExperiment
and SummarizedExperiment
objects. These are reliable and clean data structures developed by the as part of the MultiAssayExperiment and SummarizedExperiment packages.
The MultiAssayExperiment
container concisely stores a SummarizedExperiment
object, which aggregates data matrices along with annotation information, metadata, and reduced dimensionality data (PCA, t-SNE, etc.). To learn more about proper usage and context of these objects, you may want to take a look at the MultiAssayExperiment package vignette and SummarizedExperiment package vignette.
To install these two packages, run the following code:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MultiAssayExperiment")
BiocManager::install("SummarizedExperiment")
To install the development version of LegATo from Github, run the following code:
BiocManager::install("wejlab/LegATo")
Now we’ll load the LegATo package into our library:
library(LegATo)
For the purposes of this vignette, we’ll also load the ggeffects
and emmeans
library:
library(ggeffects)
library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
LegATo functions expect data properly formatted in a MultiAssayExperiment
container, with naming conventions that are compatible with animalcules. Obtaining properly formatted data can be achieved easily via the create_formatted_MAE()
function from the LegATo package, as detailed below.
If you’ve analyzed your metagenomic or 16S data via the MetaScope package, you can aggregate your sample outputs with user-input metadata using the MetaScope::convert_animalcules()
or MetaScope::convert_animalcules_silva()
functions prior to loading your data into LegATo. These functions create a taxonomy table and collate all data into a LegATo- and animalcules-compatible MultiAssayExperiment
object.
You can also format your data analyzed elsewhere (e.g., PathoScope 2.0 or Kraken) with the animalcules package in the upload step with the R Shiny app, and then select “Download Animalcules File” to obtain a MAE
object that can be used with LegATo.
create_formatted_MAE()
For the purposes of this example, we’ll look at some example inputs unrelated to the main HIV-E infant data:
counts
table is composed of raw microbial counts. There is one row for each taxon, and one column for each sample.tax
table is composed of taxonomy lineages. There is one row for each taxon, and any number of columns each taxonomic level (e.g., superkingdom, genus, species). The lowest level of taxonomy serves as the rownames for the table, and is also stored in its own column. Taxonomic levels increase in granularity from left to right.sample
table contains entries for each sample, with unique sample IDs on the rows. The columns are metadata for the samples, such as subjects or units on which repeated measures were taken, groupings, pairings, and covariates.rownames(counts) == rownames(tax)
colnames(counts) == colnames(sample)
counts <- system.file("extdata", "counts.csv", package = "LegATo") |>
read.csv(row.names = 1) |>
dplyr::rename_with(function(x) stringr::str_replace(x, "\\.", "-"))
tax <- system.file("extdata", "tax.csv", package = "LegATo") |> read.csv(row.names = 1)
sample <- system.file("extdata", "sample.csv", package = "LegATo") |> read.csv(row.names = 1)
Now we’ll look at the formatting of each:
ndim <- 5
counts[seq_len(ndim), seq_len(ndim)] |>
knitr::kable(caption = "Counts Table Preview",
label = NA)
X-1 | X-2 | X-3 | X-4 | X-5 | |
---|---|---|---|---|---|
Acinetobacter_beijerinckii | 430 | 18103 | 810 | 103 | 74 |
Acinetobacter_bouvetii | 0 | 0 | 0 | 0 | 0 |
Acinetobacter_guillouiae | 0 | 5 | 0 | 0 | 0 |
Acinetobacter_gyllenbergii | 6 | 1 | 1 | 1 | 11 |
Acinetobacter_indicus | 0 | 0 | 9 | 0 | 0 |
tax[seq_len(ndim), ] |>
knitr::kable(caption = "Taxonomy Table Preview",
label = NA)
superkingdom | phylum | class | order | family | genus | species | |
---|---|---|---|---|---|---|---|
Acinetobacter_beijerinckii | Bacteria | Proteobacteria | Gammaproteobacteria | Moraxellales | Moraxellaceae | Acinetobacter | Acinetobacter_beijerinckii |
Acinetobacter_bouvetii | Bacteria | Proteobacteria | Gammaproteobacteria | Moraxellales | Moraxellaceae | Acinetobacter | Acinetobacter_bouvetii |
Acinetobacter_guillouiae | Bacteria | Proteobacteria | Gammaproteobacteria | Moraxellales | Moraxellaceae | Acinetobacter | Acinetobacter_guillouiae |
Acinetobacter_gyllenbergii | Bacteria | Proteobacteria | Gammaproteobacteria | Moraxellales | Moraxellaceae | Acinetobacter | Acinetobacter_gyllenbergii |
Acinetobacter_indicus | Bacteria | Proteobacteria | Gammaproteobacteria | Moraxellales | Moraxellaceae | Acinetobacter | Acinetobacter_indicus |
sample[seq_len(ndim), ] |>
knitr::kable(caption = "Sample Table Preview",
label = NA)
Sample | Subject | Sex | Month | Group | Pairing | HairLength | Age | |
---|---|---|---|---|---|---|---|---|
X-1 | X-1 | S1 | Male | 1 | A | 1 | 48.62753 | 21.9 |
X-2 | X-2 | S2 | Male | 1 | B | 1 | 53.53195 | 67.7 |
X-3 | X-3 | S3 | Female | 1 | A | 2 | 47.61090 | 78.9 |
X-4 | X-4 | S4 | Female | 1 | B | 2 | 49.72870 | 48.6 |
X-5 | X-5 | S5 | Female | 1 | A | 3 | 59.04509 | 48.4 |
create_formatted_MAE()
Once your data are formatted correctly, you can easily use create_formatted_MAE()
like so:
output <- create_formatted_MAE(counts_dat = counts,
tax_dat = tax,
metadata_dat = sample)
class(output)
## [1] "MultiAssayExperiment"
## attr(,"package")
## [1] "MultiAssayExperiment"
MultiAssayExperiment::assays(output)
## List of length 1
## names(1): MicrobeGenetics
SummarizedExperiment::assays(output[["MicrobeGenetics"]])
## List of length 1
## names(1): MGX
If your data is in the format of a TreeSummarizedExperiment, you can call create_formatted_MAE(tree_SE)
to create a MAE
output that is compatible with the package.
If information needs to be added to your data object at some later point in the analysis, it is easiest to manipulate the raw data objects (potentially via parse_MAE_SE()
) and then recreate the MAE
object with create_formatted_MAE()
.
To illustrate the capabilities of LegATo, we will turn to a published dataset from the following paper:
Odom AR, Gill CJ, Pieciak R et al. Characterization of longitudinal nasopharyngeal microbiome patterns in maternally HIV-exposed Zambian infants [version 2; peer review: 2 approved with reservations]. Gates Open Res 2024, 6:143 (https://doi.org/10.12688/gatesopenres.14041.2)
The raw dataset is archived in Zenodo:
Zenodo: Underlying data for ‘Characterization of longitudinal nasopharyngeal microbiome patterns in maternally HIV-exposed Zambian infants’. https://doi.org/10.5281/zenodo.725531324
Further details on how the dataset was altered for inclusion in this package are provided here.
The example data consists of 167 NP swabs of healthy HIV-exposed, uninfected (HEU; n=10) infants and their HIV(+) mothers and HIV-unexposed, uninfected (HUU; n=10) infants and their HIV(-) mothers. A total of 7 samples were collected per infant, with some missingness in the data. These swabs were identified from a sample library collected in Lusaka, Zambia between 2015 and 2016.
The analysis objective is to parse the association between the NP resident bacteria and infant HIV exposure during the first 3.5 months (14 weeks) of life, a critical time in microbiome maturation.
All data was processed with PathoScope 2.0
, and the sample outputs were aggregated with the animalcules
R package.
For this analysis, we will work with the infant data only and subset the data accordingly.
dat <- system.file("extdata", "MAE.RDS", package = "LegATo") |>
readRDS()
dat_subsetted <- MultiAssayExperiment::subsetByColData(dat,
dat$MothChild == "Infant")
This leaves us with 129 samples in our analysis.
MultiAssayExperiment (MAE)
manipulationLegATo
has several functions for manipulating MAE
objects, delineated below:
MAE
sIf your data were aggregated via the animalcules
package, as was done for the example data, the samples were aggregated at the strain taxonomic level and have residual taxonomy IDs present. The clean_MAE()
function cleans up data suffering from these issues.
dat_cleaned <- clean_MAE(dat_subsetted)
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
Many metagenomic pipelines identify taxon abundances at extremely small levels, which can be noisy to deal with in an analysis. The filter_MAE
function smoothly transforms reads belonging to taxa with an overall genera threshold under the filter_prop
(filter proportion) argument, which we will set as 0.005.
dat_filt_1 <- filter_MAE(dat_cleaned, relabu_threshold = 0.05, occur_pct_cutoff = 10)
## The overall range of relative abundance counts between samples is (13218, 3016276)
## Number of OTUs that exhibit a relative abundance >0.05% in at least 10% of the total samples: 68/1136
The filter_animalcules_MAE
function utilizes the same filtering mechanism as in the animalcules package:
dat_filt <- filter_animalcules_MAE(dat_cleaned, 0.05)
If you want to take a closer look at your data, you can easily extract it into the counts, taxonomy, and sample metadata tables using the parse_MAE_SE()
function.
parsed <- parse_MAE_SE(dat_filt, which_assay = "MicrobeGenetics", type = "MAE")
parsed$counts[seq_len(5), seq_len(5)] |>
knitr::kable(caption = "Counts Table")
X1755 | X2216 | X2431 | X2561 | X2699 | |
---|---|---|---|---|---|
Corynebacterium accolens | 0 | 9643 | 0 | 0 | 43435 |
Corynebacterium ammoniagenes | 0 | 0 | 0 | 0 | 0 |
Corynebacterium appendicis | 0 | 0 | 0 | 0 | 0 |
Corynebacterium argentoratense | 0 | 0 | 0 | 4 | 0 |
Corynebacterium aurimucosum | 0 | 0 | 0 | 58 | 0 |
parsed$sam[seq_len(5), ] |>
knitr::kable(caption = "Sample Metadata")
Sample | Subject | HIVStatus | MothChild | timepoint | Age | pairing | |
---|---|---|---|---|---|---|---|
X1755 | X1755 | 0469-1 | Control | Infant | 0 | 7 | 1 |
X2216 | X2216 | 0507-1 | Control | Infant | 0 | 6 | 2 |
X2431 | X2431 | 0539-1 | Control | Infant | 0 | 9 | 3 |
X2561 | X2561 | 0554-1 | HIV | Infant | 0 | 6 | 1 |
X2699 | X2699 | 0620-1 | HIV | Infant | 0 | 4 | 2 |
parsed$tax[seq_len(5), ] |>
knitr::kable(caption = "Taxonomy Table")
superkingdom | phylum | class | order | family | genus | species | |
---|---|---|---|---|---|---|---|
Corynebacterium accolens | Bacteria | Actinobacteria | Actinomycetia | Corynebacteriales | Corynebacteriaceae | Corynebacterium | Corynebacterium accolens |
Corynebacterium ammoniagenes | Bacteria | Actinobacteria | Actinomycetia | Corynebacteriales | Corynebacteriaceae | Corynebacterium | Corynebacterium ammoniagenes |
Corynebacterium appendicis | Bacteria | Actinobacteria | Actinomycetia | Corynebacteriales | Corynebacteriaceae | Corynebacterium | Corynebacterium appendicis |
Corynebacterium argentoratense | Bacteria | Actinobacteria | Actinomycetia | Corynebacteriales | Corynebacteriaceae | Corynebacterium | Corynebacterium argentoratense |
Corynebacterium aurimucosum | Bacteria | Actinobacteria | Actinomycetia | Corynebacteriales | Corynebacteriaceae | Corynebacterium | Corynebacterium aurimucosum |
It is fairly convenient to summarize the average number of reads for a report with get_summary_table()
. The table groups by user-provided discrete covariates.
group_vars <- c("HIVStatus", "MothChild")
get_summary_table(dat_filt, group_vars) |>
knitr::kable(caption = "Summary Table", label = NA)
HIVStatus | MothChild | mean_reads | sd_reads | min_reads | max_reads | num_total |
---|---|---|---|---|---|---|
Control | Infant | 156209.2 | 408000.52 | 24176 | 3016276 | 68 |
HIV | Infant | 90008.2 | 51395.08 | 13218 | 202079 | 61 |
The get_top_taxa
function outputs a data.frame
that lists taxa in order of relative abundance.
best_genus <- get_top_taxa(dat_filt, "genus")
best_genus |> knitr::kable(caption = "Table of genera, ranked by abundance")
taxon | allmeans |
---|---|
Dolosigranulum | 0.1892754 |
Streptococcus | 0.1730879 |
Moraxella | 0.1713469 |
Staphylococcus | 0.1554891 |
Corynebacterium | 0.1290279 |
Other | 0.0979779 |
Haemophilus | 0.0837949 |
If you want to conduct your own analyses, the get_long_data()
function will prove convenient. The get_stacked_data()
function can be used for certain visualizations that utilize a relative abundance aggregation approach.
longdat <- get_long_data(dat_filt, "genus", log = TRUE, counts_to_CPM = TRUE)
stackeddat <- get_stacked_data(dat_filt, "genus", covariate_1 = "HIVStatus",
covariate_time = "timepoint")
There are several plots by which we can visualize changes in relative abundance over time, accounting for a given covariate. In this case, we are interested in HIV exposure.
We’ll select a palette using paletteer
.
this_palette <- c("#709AE1", "#8A9197", "#D2AF81", "#FD7446", "#D5E4A2", "#197EC0", "#F05C3B", "#46732E",
"#71D0F5", "#370335", "#075149", "#C80813", "#91331F", "#1A9993", "#FD8CC1") |>
magrittr::extract(seq_len(nrow(best_genus)))
Alluvial diagrams illustrate individual taxa as stream fields that change position at different time points. The height of a stream field represents the relative abundance of that taxon. At a given time point, stream fields are ranked from the highest to lowest abundance (top to bottom). These can be constructed with plot_alluvial()
.
plot_alluvial(dat = dat_filt,
taxon_level = "genus",
covariate_1 = "HIVStatus",
covariate_time = "timepoint",
palette_input = this_palette,
subtitle = "Alluvial plot")
In context, this plot indicates a high abundance of Staphylococcus in both the control and HIV-E infants in the first two weeks of life, but other genera quickly overtake Staphylococcus in dominance as shown by the changing rank of the streams for each time point. Overall, Streptococcus seems to have a far greater presence in the HIV-E infants’ nasopharyngeal microbiota
We can create spaghetti or volatility plots to elucidate changes over time on a sample level for a given taxon. This is advantageous as other visualization methods are often aggregates of multiple samples and lack granularity. These plots can be created with plot_spaghetti()
.
plot_spaghetti(dat = dat_filt,
covariate_time = "timepoint",
covariate_1 = "HIVStatus",
unit_var = "Subject",
taxon_level = "genus",
which_taxon = "Staphylococcus",
palette_input = this_palette,
title = "Spaghetti Plot",
subtitle = NULL) +
ggplot2::xlab("Infant Age (timepoint)") +
ggplot2::ylab("Relative Abundance (log CPM)")