1 Introduction

MMUPHin is an R package implementing meta-analysis methods for microbial community profiles. It has interfaces for: a) covariate-controlled batch and study effect adjustment, b) meta-analytic differential abundance testing, and meta-analytic discovery of c) discrete (cluster-based) or d) continuous unsupervised population structure.

Overall, MMUPHin enables the normalization and combination of multiple microbial community studies. It can then help in identifying microbes, genes, or pathways that are differential with respect to combined phenotypes. Finally, it can find clusters or gradients of sample types that reproduce consistently among studies.

2 Installation

MMUPHin is a Bioconductor package and can be installed via the following command.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MMUPHin")

This vignette is intended to provide working examples for all four functionalities of MMUPHin.

MMUPHin can be loaded into R with the first library() command shown below. We also load some utility packages for data manipulation and visualization.

library(MMUPHin)
library(magrittr)
library(dplyr)
library(ggplot2)

3 Input data

As input, MMUPHin requires a properly formatted collection of microbial community studies, with both feature abundances and accompanying metadata. Here we use the five published colorectal cancer (CRC) stool metagenomic studies, incorporated in Thomas et al. (2019). Data for the studies are already conveniently packaged and accessible through the Biocondcutor package curatedMetagenomicData, though additional wranglings are needed to format input for MMUPHin.

Importantly, MMUPHin asks that feature abundances be provided as a feature-by-sample matrix, and the metadata be provided as a data frame. The two objects shoud agree on sample IDs, that is, rowname of the feature abundance matrix and colname of the metadata data frame must agree. Many popular ’omic data classes in R already enforce this standard, such as ExpressionSet from Biobase, or phyloseq from phyloseq.

To minimize users’ efforts in loading data to run the examples, we have properly formatted the five studies for easy access. The feature abundances and metadata can be loaded with the following code chunk. For the interested user, the commented out scripts were used for accessing data directly from curatedMetagenomicData and formatting. It might be worthwhile to read through these as they perform many of the common tasks for preprocessing microbial feature abundance data in R, including sample/feature subsetting, normalization, filtering, etc.

data("CRC_abd", "CRC_meta")
CRC_abd[1:5, 1, drop = FALSE]
##                                 FengQ_2015.metaphlan_bugs_list.stool:SID31004
## s__Faecalibacterium_prausnitzii                                    0.11110668
## s__Streptococcus_salivarius                                        0.09660736
## s__Ruminococcus_sp_5_1_39BFAA                                      0.09115385
## s__Subdoligranulum_unclassified                                    0.05806767
## s__Bacteroides_stercoris                                           0.05685503
CRC_meta[1, 1:5]
##                                               subjectID body_site
## FengQ_2015.metaphlan_bugs_list.stool:SID31004  SID31004     stool
##                                               study_condition
## FengQ_2015.metaphlan_bugs_list.stool:SID31004             CRC
##                                                                    disease
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 CRC;fatty_liver;hypertension
##                                               age_category
## FengQ_2015.metaphlan_bugs_list.stool:SID31004        adult
table(CRC_meta$studyID)
## 
##      FengQ_2015.metaphlan_bugs_list.stool 
##                                       107 
## HanniganGD_2017.metaphlan_bugs_list.stool 
##                                        55 
##  VogtmannE_2016.metaphlan_bugs_list.stool 
##                                       104 
##        YuJ_2015.metaphlan_bugs_list.stool 
##                                       128 
##    ZellerG_2014.metaphlan_bugs_list.stool 
##                                       157

4 Performing batch (study) effect adjustment with adjust_batch

adjust_batch aims to correct for technical study/batch effects in microbial feature abundances. It takes as input the feature-by-sample abundance matrix, and performs batch effect adjustment given provided batch and optional covariate variables. It returns the batch-adjusted abundance matrix. Check help(adjust_batch) for additional details and options. This figure from the paper (figure 2b) shows the desired effect of the batch adjustment: batch effects are removed, making the effect of the biological variable of interest easier to detect.

Here we use adjust_batch to correct for differences in the five studies, while controlling for the effect of CRC versus control (variable study_condition in CRC_meta).

adjust_batch returns a list of multiple outputs. The part we are most interested in now is the adjusted abundance table, which we assign to CRC_abd_adj.

fit_adjust_batch <- adjust_batch(feature_abd = CRC_abd,
                                 batch = "studyID",
                                 covariates = "study_condition",
                                 data = CRC_meta,
                                 control = list(verbose = FALSE))

CRC_abd_adj <- fit_adjust_batch$feature_abd_adj

One way to evaluate the effect of batch adjustment is to assess the total variability in microbial profiles attributable to study differences, before and after adjustment. This is commonly known as a PERMANOVA test (Tang, Chen, and Alekseyenko 2016), and can be performed with the adonis function in vegan. The inputs to adonis() are indices of microbial compositional dissimilarity.

library(vegan, quietly = TRUE)
## This is vegan 2.5-7
D_before <- vegdist(t(CRC_abd))
D_after <- vegdist(t(CRC_abd_adj))

set.seed(1)
fit_adonis_before <- adonis(D_before ~ studyID, data = CRC_meta)
fit_adonis_after <- adonis(D_after ~ studyID, data = CRC_meta)
print(fit_adonis_before)
## 
## Call:
## adonis(formula = D_before ~ studyID, data = CRC_meta) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## studyID     4     13.36  3.3400  11.855 0.07991  0.001 ***
## Residuals 546    153.82  0.2817         0.92009           
## Total     550    167.18                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(fit_adonis_after)
## 
## Call:
## adonis(formula = D_after ~ studyID, data = CRC_meta) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## studyID     4     4.939 1.23463  4.2855 0.03044  0.001 ***
## Residuals 546   157.298 0.28809         0.96956           
## Total     550   162.237                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that, before study effect adjustment, study differences can expalin a total of 7.99% of the variability in microbial abundance profiles, whereas after adjustment this was reduced to 3.04%, though the effect of study is significant in both cases.

5 Meta-analytical differential abundance testing with lm_meta

One of the most common meta-analysis goals is to combine association effects across batches/studies to identify consistent overall effects. lm_meta provides a straightforward interface to this task, by first performing regression analysis in individual batches/studies using the well-validated Maaslin2 package, and then aggregating results with established fixed/mixed effect models, realized via the vegan package. Here, we use lm_meta to test for consistently differential abundant species between CRC and control samples across the five studies, while controlling for demographic covariates (gender, age, BMI). Again, lm_meta returns a list of more than one components. meta_fits provides the final meta-analytical testing results. See help(lm_meta) for the meaning of other components.

fit_lm_meta <- lm_meta(feature_abd = CRC_abd,
                       batch = "studyID",
                       exposure = "study_condition",
                       covariates = c("gender", "age", "BMI"),
                       data = CRC_meta,
                       control = list(verbose = FALSE))
## Warning in lm_meta(feature_abd = CRC_abd, batch = "studyID", exposure = "study_condition", : Covariate gender is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
## Warning in lm_meta(feature_abd = CRC_abd, batch = "studyID", exposure = "study_condition", : Covariate age is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
## Warning in lm_meta(feature_abd = CRC_abd, batch = "studyID", exposure = "study_condition", : Covariate BMI is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
meta_fits <- fit_lm_meta$meta_fits

You’ll note that this command produces a few error messages. This is because for one of the studies the specified covariates were not measured. As the message states, these covariates are not included when analyzing data from that study.

We can visualize the significant (FDR q < 0.05) species associated with CRC in these studies/samples. Comparing them with Figure 1b of Thomas et al. (2019), we can see that many of the significant species do agree.

meta_fits %>% 
  filter(qval.fdr < 0.05) %>% 
  arrange(coef) %>% 
  mutate(feature = factor(feature, levels = feature)) %>% 
  ggplot(aes(y = coef, x = feature)) +
  geom_bar(stat = "identity") +
  coord_flip()

6 Identifying discrete population structures with discrete_discover

Clustering analysis of microbial profiles can help identify meaningful discrete population subgroups (Ravel et al. 2011), but must be carried out carefully with validations to ensure that the identified structures are consistent (Koren et al. 2013). discrete_discover provides the functionality to use prediction strength (Tibshirani and Walther 2005) to evaluate discrete clustering structures within individual microbial studies, as well as a “generalized prediction strength” to evaluate their reproducibility in other studies. These jointly provide meta-analytical evidence for (or against) identifying discrete population structures in microbial profiles. Check help(discrete_discover) to see more details on the method and additional options. We can contrast discrete structure (clusters) with continuous structure (which will be overviewed in the next section):