1. Biomarker discovery

Biomarker discovery has proven to be capacity to convert genomic data into clinical practice(Tothill et al. 2008; Banerjee et al. 2015). And many reports have shown that the microbial communities can be used as biomarkers for human disease(Kostic et al. 2012; Zhang et al. 2019; J. Yu et al. 2017; Ren et al. 2019). MicrobiotaProcess presents diff_analysis for the biomarker discovery. And It also provided the ggdiffclade, based on the ggtree(G. Yu et al. 2017; Yu et al. 2018), to visualize the results of diff_analysis. The rule of diff_analysis is similar with the LEfSe(Nicola Segata and Huttenhower 2011). First, all features are tested whether values in different classes are differentially distributed. Second, the significantly different features are tested whether all pairwise comparisons between subclass in different classes distinctly consistent with the class trend. Finally, the significantly discriminative features are assessed by LDA (linear discriminant analysis) or rf(randomForest). However, diff_analysis is more flexible. The test method of two step can be set by user, and we used the general fold change(Wirbel et al. 2019) and wilcox.test(default) to test whether all pairwise comparisons between subclass in different classes distinctly consistent with the class trend. Moreover, MicrobiotaProcess implements more flexible and convenient tools, (ggdiffclade, ggdiffbox, ggeffectsize and ggdifftaxbar) to produce publication-quality figures. Here, we present several examples to demonstrate how to perform different analysis with MicrobiotaProcess.

1.1 colorectal cancer dataset.

data(kostic2012crc)
kostic2012crc
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2505 taxa and 177 samples ]
## sample_data() Sample Data:       [ 177 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]
#datatable(sample_data(kostic2012crc), options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
kostic2012crc <- phyloseq::rarefy_even_depth(kostic2012crc,rngseed=1024)
table(sample_data(kostic2012crc)$DIAGNOSIS)
## 
## Healthy   Tumor 
##      91      86

This datasets contained 86 Colorectal Carcinoma samples and 91 Control samples(remove the none sample information and low depth sample).In the research, they found the Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA, while the Firmicutes and Bacteroidetes phyla were depleted in tumors(Kostic et al. 2012).

set.seed(1024)
diffres <- diff_analysis(obj=kostic2012crc, classgroup="DIAGNOSIS",
                         mlfun="lda",
                         filtermod="fdr",
                         firstcomfun = "kruskal.test",
                         firstalpha=0.05,
                         strictmod=TRUE,
                         secondcomfun = "wilcox.test",
                         subclmin=3,
                         subclwilc=TRUE,
                         secondalpha=0.01, 
                         lda=3)
diffres
## The original data: 693 features and 177 samples
## The sample data: 1 variables and 177 samples
## The taxda contained 1351 by 7 rank
## after first test (kruskal.test) number of feature (fdr<=0.05):103
## after second test (wilcox.test) number of significantly discriminative feature:40
## after lda, Number of discriminative features: 36 (certain taxonomy classification:26; uncertain taxonomy classication: 10)

The results of diff_analysis is a S4 class, contained the original feature datasets, results of first test, results of second test, results of LDA or rf assessed and the record of some arguments. It can be visualized by ggeffectsize. The horizontal ordinate represents the effect size (LDA or MeanDecreaseAccuracy), the vertical ordinate represents the feature of significantly discriminative. And the colors represent the classgroup that the relevant feature is positive.

plotes <- ggeffectsize(obj=diffres) + scale_color_manual(values=c("#00AED7", "#FD9347"))
plotes

The results also can be visualized using ggdiffbox.

plotes_ab <- ggdiffbox(obj=diffres, box_notch=FALSE, colorlist=c("#00AED7", "#FD9347"), l_xlabtext="relative abundance")
plotes_ab

If the taxda was provided, it also can be visualized by ggdiffclade. The colors represent the relevant features enriched in the relevant classgroup. The size of point colored represent the -log10(pvalue).

diffcladeplot <- ggdiffclade(obj=diffres,
                             alpha=0.3, size=0.2, 
                             skpointsize=0.6,
                             taxlevel=3,
                             settheme=FALSE, 
                             setColors=FALSE) +
                 scale_fill_manual(values=c("#00AED7", "#FD9347"))+
                 guides(color = guide_legend(keywidth = 0.1,
                                             keyheight = 0.6,
                                             order = 3, 
                                             ncol=1)) + 
                 theme(panel.background=element_rect(fill=NA),
                       legend.position="right",
                       plot.margin=margin(0,0,0,0),
                       legend.spacing.y = unit(0.02, "cm"),
                       legend.title=element_text(size=7),
                       legend.text=element_text(size=6),
                       legend.box.spacing=unit(0.02,"cm"))
diffcladeplot

Moreover, the abundance of the features can be visualized by ggdifftaxbar. This will generate the figures in specific directory. And the horizontal ordinate of figures represent the sample of different classgroup, the vertical ordinate represent relative abundance of relevant features (sum is 1).

ggdifftaxbar(obj=diffres, xtextsize=1.5, output="./kostic2012crc_biomarkder_barplot")

And we also provided as.data.frame to produce the table of results of diff_analysis.

crcdiffTab <- as.data.frame(diffres)
#datatable(crcdiffTab, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))

As show in the results of diff_analysis, we also found Fusobacterium sequences were enriched in carcinomas, and Firmicutes, Bacteroides, Clostridiales were depleted in tumors. These results were consistent with the original article(Kostic et al. 2012). In addition, we also found Campylobacter were enriched in tumors, but the relative abundance of it is lower than Fusobacterium. And the species of Campylobacter has been proven to associated with the colorectal cancer(He et al. 2019; Wu et al. 2013; Amer et al. 2017).

1.2 a small subset of HMP dataset.

data(hmp_aerobiosis_small)
# contained "featureda" "sampleda"  "taxda" datasets.
#datatable(featureda, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
#datatable(sampleda, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
#datatable(taxda, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))
dim(featureda)
## [1]   55 1091
dim(sampleda)
## [1] 55  3
dim(taxda)
## [1] 699   6

This dataset is from a small subset of the HMP 16S dataset(Nicola Segata and Huttenhower 2011), contained 55 samples from 6 body sites. The dataset isn’t phyloseq class, because diff_analysis also supported the matrix datasets as input. The featureda contained the relative abundance of different levels features. The sampleda contained the information of the samples. And the taxda contained the information of the hierarchical relationship of taxonomy. We set the oxygen_availability in sampleda as classgroup, and body_site also in sampleda as subclass.

set.seed(1024)
hmpdiffres <- diff_analysis(obj=featureda, 
                            sampleda=sampleda, 
                            taxda=taxda, 
                            alltax=FALSE, 
                            classgroup="oxygen_availability",
                            subclass="body_site",
                            filtermod="fdr",
                            firstalpha=0.01,
                            strictmod=TRUE,
                            subclmin=3,
                            subclwilc=TRUE,
                            secondalpha=0.05,
                            ldascore=2)
hmpdiffres
## The original data: 1091 features and 55 samples
## The sample data: 2 variables and 55 samples
## The taxda contained 699 by 6 rank
## after first test (kruskal.test) number of feature (fdr<=0.01):131
## after second test (wilcox.test) number of significantly discriminative feature:40
## after lda, Number of discriminative features: 40 (certain taxonomy classification:40; uncertain taxonomy classication: 0)
hmpeffetsieze <- ggeffectsize(obj=hmpdiffres, 
                              setColors=FALSE,
                              settheme=FALSE) + 
                 scale_color_manual(values=c('#00AED7', '#FD9347', '#C1E168'))+
                 theme_bw()+
                 theme(strip.background=element_rect(fill=NA),
                       panel.grid=element_blank(),
                       strip.text.y=element_blank())
hmpeffetsieze

hmpes_ab <- ggdiffbox(obj=hmpdiffres, colorlist=c("#00AED7", "#FD9347", '#C1E168'), 
                      box_notch=FALSE, l_xlabtext="relative abundance(%)")
hmpes_ab

The explanation of figures refer to the previous section.

hmpdiffclade <- ggdiffclade(obj=hmpdiffres, alpha=0.3, size=0.2, 
                            skpointsize=0.4, taxlevel=3,
                            settheme=TRUE,
                            setColors=FALSE) +
                scale_fill_manual(values=c('#00AED7', '#FD9347', '#C1E168'))
hmpdiffclade

The explanation of figures refer to the previous section.

ggdifftaxbar(obj=hmpdiffres, output="./hmp_biomarker_barplot")

Finally, we found the Staphylococcus, Propionibacterium and some species of Actinobacteria was enriched in High_O2, these species mainly live in high oxygen environment. Some species of Bacteroides, species of Clostridia and species of Erysipelotrichi was enriched in Low_O2, these species mainly inhabit in the gut of human. These results were consistent with the reality.

hmpdiffTab <- as.data.frame(hmpdiffres)
#datatable(hmpdiffTab, options=list(scrollX=TRUE, scrollY="400px", scrollCollapse=TRUE))

2. Need helps?

If you have questions/issues, please visit github issue tracker.

3. Session information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] MicrobiotaProcess_1.2.1 tidytree_0.3.3          treeio_1.14.3          
##  [4] ggtree_2.4.1            phyloseq_1.34.0         forcats_0.5.1          
##  [7] stringr_1.4.0           dplyr_1.0.5             purrr_0.3.4            
## [10] readr_1.4.0             tidyr_1.1.3             tibble_3.1.0           
## [13] tidyverse_1.3.0         DT_0.17                 ggplot2_3.3.3          
## 
## loaded via a namespace (and not attached):
##   [1] TH.data_1.0-10      colorspace_2.0-0    ggsignif_0.6.1     
##   [4] modeltools_0.2-23   ellipsis_0.3.1      XVector_0.30.0     
##   [7] fs_1.5.0            aplot_0.0.6         rstudioapi_0.13    
##  [10] farver_2.1.0        ggrepel_0.9.1       mvtnorm_1.1-1      
##  [13] fansi_0.4.2         lubridate_1.7.10    coin_1.4-1         
##  [16] xml2_1.3.2          codetools_0.2-18    splines_4.0.4      
##  [19] libcoin_1.0-8       knitr_1.31          ade4_1.7-16        
##  [22] jsonlite_1.7.2      broom_0.7.5         cluster_2.1.1      
##  [25] dbplyr_2.1.0        BiocManager_1.30.10 compiler_4.0.4     
##  [28] httr_1.4.2          rvcheck_0.1.8       backports_1.2.1    
##  [31] assertthat_0.2.1    Matrix_1.3-2        lazyeval_0.2.2     
##  [34] cli_2.3.1           htmltools_0.5.1.1   prettyunits_1.1.1  
##  [37] tools_4.0.4         igraph_1.2.6        gtable_0.3.0       
##  [40] glue_1.4.2          reshape2_1.4.4      Rcpp_1.0.6         
##  [43] Biobase_2.50.0      cellranger_1.1.0    jquerylib_0.1.3    
##  [46] vctrs_0.3.6         Biostrings_2.58.0   rhdf5filters_1.2.0 
##  [49] multtest_2.46.0     debugme_1.1.0       ape_5.4-1          
##  [52] nlme_3.1-152        iterators_1.0.13    xfun_0.22          
##  [55] ps_1.6.0            Rmisc_1.5           rvest_1.0.0        
##  [58] lifecycle_1.0.0     gtools_3.8.2        zoo_1.8-9          
##  [61] zlibbioc_1.36.0     MASS_7.3-53.1       scales_1.1.1       
##  [64] hms_1.0.0           sandwich_3.0-0      parallel_4.0.4     
##  [67] biomformat_1.18.0   rhdf5_2.34.0        yaml_2.2.1         
##  [70] gridExtra_2.3       sass_0.3.1          reshape_0.8.8      
##  [73] stringi_1.5.3       highr_0.8           S4Vectors_0.28.1   
##  [76] foreach_1.5.1       permute_0.9-5       ggstar_1.0.1       
##  [79] BiocGenerics_0.36.0 matrixStats_0.58.0  rlang_0.4.10       
##  [82] pkgconfig_2.0.3     evaluate_0.14       lattice_0.20-41    
##  [85] Rhdf5lib_1.12.1     labeling_0.4.2      patchwork_1.1.1    
##  [88] htmlwidgets_1.5.3   tidyselect_1.1.0    plyr_1.8.6         
##  [91] magrittr_2.0.1      R6_2.5.0            IRanges_2.24.1     
##  [94] generics_0.1.0      multcomp_1.4-16     DBI_1.1.1          
##  [97] pillar_1.5.1        haven_2.3.1         withr_2.4.1        
## [100] mgcv_1.8-34         prettydoc_0.4.1     survival_3.2-10    
## [103] modelr_0.1.8        crayon_1.4.1        utf8_1.2.1         
## [106] rmarkdown_2.7       progress_1.2.2      grid_4.0.4         
## [109] readxl_1.3.1        data.table_1.14.0   vegan_2.5-7        
## [112] reprex_1.0.0        digest_0.6.27       stats4_4.0.4       
## [115] munsell_0.5.0       bslib_0.2.4

4. References

Amer, Abdrazak, Sheila Galvin, Claire M Healy, and Gary P Moran. 2017. “The Microbiome of Potentially Malignant Oral Leukoplakia Exhibits Enrichment for Fusobacterium, Leptotrichia, Campylobacter, and Rothia Species.” Frontiers in Microbiology 8: 2391. https://doi.org/10.3389/fmicb.2017.02391.

Banerjee, Sudeep, David S Wang, Hyun J Kim, Claude B Sirlin, Michael G Chan, Ronald L Korn, Aaron M Rutman, et al. 2015. “A Computed Tomography Radiogenomic Biomarker Predicts Microvascular Invasion and Clinical Outcomes in Hepatocellular Carcinoma.” Hepatology 62 (3): 792–800. https://doi.org/10.1002/hep.27877.

He, Zhen, Raad Z Gharaibeh, Rachel C Newsome, Jllian L Pope, Michael W Dougherty, Sarah Tomkovich, Benoit Pons, et al. 2019. “Campylobacter Jejuni Promotes Colorectal Tumorigenesis Through the Action of Cytolethal Distending Toxin.” Gut 68 (2): 289–300. https://doi.org/10.1136/gutjnl-2018-317200.

Kostic, Aleksandar D, Dirk Gevers, Chandra Sekhar Pedamallu, Monia Michaud, Fujiko Duke, Ashlee M Earl, Akinyemi I Ojesina, et al. 2012. “Genomic Analysis Identifies Association of Fusobacterium with Colorectal Carcinoma.” Genome Research 22 (2): 292–98. https://doi.org/10.1101/gr.126573.111.

Nicola Segata, Levi Waldron, Jacques Izard, and Curtis Huttenhower. 2011. “Metagenomic Biomarker Discovery and Explanation.” Genome Biology 12 (6): R60. https://doi.org/10.1186/gb-2011-12-6-r60.

Ren, Zhigang, Ang Li, Jianwen Jiang, Lin Zhou, Zujiang Yu, Haifeng Lu, Haiyang Xie, et al. 2019. “Gut Microbiome Analysis as a Tool Towards Targeted Non-Invasive Biomarkers for Early Hepatocellular Carcinoma.” Gut 68 (6): 1014–23. https://doi.org/10.1136/gutjnl-2017-315084.

Tothill, Richard W, Anna V Tinker, Joshy George, Robert Brown, Stephen B Fox, Stephen Lade, Daryl S Johnson, et al. 2008. “Novel Molecular Subtypes of Serous and Endometrioid Ovarian Cancer Linked to Clinical Outcome.” Clinical Cancer Research 14 (16): 5198–5208. https://doi.org/10.1158/1078-0432.CCR-08-0196.

Wirbel, Jakob, Paul Theodor Pyl, Ece Kartal, Konrad Zych, Alireza Kashani, Alessio Milanese, Jonas S Fleck, et al. 2019. “Meta-Analysis of Fecal Metagenomes Reveals Global Microbial Signatures That Are Specific for Colorectal Cancer.” Nature Medicine 25 (4): 679. https://doi.org/10.1038/s41591-019-0406-6.

Wu, Na, Xi Yang, Ruifen Zhang, Jun Li, Xue Xiao, Yongfei Hu, Yanfei Chen, et al. 2013. “Dysbiosis Signature of Fecal Microbiota in Colorectal Cancer Patients.” Microbial Ecology 66 (2): 462–70. https://doi.org/10.1007/s00248-013-0245-9.

Yu, Guangchuang, Tommy Tsan-Yuk Lam, Huachen Zhu, and Yi Guan. 2018. “Two Methods for Mapping and Visualizing Associated Data on Phylogeny Using Ggtree.” Molecular Biology and Evolution 35 (2): 3041–3. https://doi.org/10.1093/molbev/msy194.

Yu, Guangchuang, David Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1): 28–36. https://doi.org/10.1111/2041-210X.12628.

Yu, Jun, Qiang Feng, Sunny Hei Wong, Dongya Zhang, Qiao yi Liang, Youwen Qin, Longqing Tang, et al. 2017. “Metagenomic Analysis of Faecal Microbiome as a Tool Towards Targeted Non-Invasive Biomarkers for Colorectal Cancer.” Gut 66 (1): 70–78. https://doi.org/10.1136/gutjnl-2015-309800.

Zhang, Bangzhou, Shuangbin Xu, Wei Xu, Qiongyun Chen, Zhangran Chen, Yanyun Fan, Huangkai Zhang, et al. 2019. “Leveraging Fecal Bacterial Survey Data to Predict Colorectal Tumors.” Frontiers in Genetics 10: 447. https://doi.org/10.3389/fgene.2019.00447.