High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or OTU composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, whether parametric or nonparametric. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2, structSSI and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests and nonparametric testing using community networks and the ggnetwork package.


The microbiome is formed of the ecological communities of microorganisms that dominate the living world. Bacteria can now be identified through the use of next generation sequencing applied at several levels. Shotgun sequencing of all bacteria in a sample delivers knowledge of all the genes present. Here we will only be interested in the identification and quantification of individual taxa (or species) through a ‘fingerprint gene’ called 16s rRNA which is present in all bacteria. This gene presents several variable regions which can be used to identify the different taxa.

Previous standard workflows depended on clustering all 16s rRNA sequences (generated by next generation amplicon sequencing) that occur within a 97% radius of similarity and then assigning these to ‘OTUs’ from reference trees (Caporaso et al. 2010; Schloss et al. 2009). These approaches do not incorporate all the data, in particular sequence quality information and statistical information available on the reads were not incorporated into the assignments.

In contrast, the de novo read counts used here will be constructed through the incorporation of both the quality scores and sequence frequencies in a probabilistic noise model for nucleotide transitions. For more details on the algorithmic implementation of this step see (Benjamin J Callahan et al. 2016).

After filtering the sequences and removing the chimeræ, the data are compared to a standard database of bacteria and labeled. In this workflow, we have used the labeled sequences to build a de novo phylogenetic with the .

The key step in the sequence analysis is the manner in which reads are denoised and assembled into groups we have chosen to call ASVs (Amplicon Sequence Variants)(Callahan, McMurdie, and Holmes 2017) instead of the traditional OTUs(Operational Taxonomic Units).

A published (but essentially similar) version of this workflow, including reviewer reports and comments is available (Ben J Callahan et al. 2016), see F1000Research From Raw reads.

There are extensive documentation and tutorial pages available for dada2 and phyloseq. If you have questions about this workflow, please start by consulting the relevant github issues sites for dada2, phyloseq, if the answers are not available, please post to the issues pages or Bioconductor forum. Note the posting guide for crafting an optimal question for the support site.

This is a workflow for denoising, filtering, performing data transformations, visualization, supervised learning analyses, community network tests, hierarchical testing and linear models. We provide all the code and give several examples of different types of analyses and use-cases. There are often many different objectives in experiments involving microbiome data and we will only give a flavor for what could be possible once the data has been imported into R.

In addition, the code can be easily adapted to accommodate batch effects, covariates and multiple experimental factors.

This workflow is based on software packages from the open-source Bioconductor project (Huber et al. 2015) and some CRAN packages.

We provide all steps necessary from the denoising and identification of the reads input as raw sequences as fastq files to the comparative testing and multivariate analyses of the samples and analyses of the abundances according to multiple available covariates.


Amplicon bioinformatics: from raw reads to tables

This section demonstrates the “full stack” of amplicon bioinformatics: construction of the sample-by-sequence feature table from the raw reads, assignment of taxonomy, and creation of a phylogenetic tree relating the sample sequences.

First we load the necessary packages.

.cran_packages <- c("ggplot2", "gridExtra")
.bioc_packages <- c("dada2", "phyloseq", "DECIPHER", "phangorn")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
   biocLite(.bioc_packages[!.inst], ask = F)
# Load packages into session, and print package version
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE)
##   ggplot2 gridExtra     dada2  phyloseq  DECIPHER  phangorn 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE

The data we will analyze here are highly-overlapping Illumina Miseq 2x250 amplicon sequences from the V4 region of the 16S gene (Kozich et al. 2013). These 360 fecal samples were collected from 12 mice longitudinally over the first year of life one mock community control. These were collected to investigate the development and stabilization of the murine microbiome (Schloss et al. 2012). These data are downloaded from the following data location and unzip. For now just consider them paired-end fastq files to be processed. Define the following path variable so that it points to the extracted directory on your machine:

miseq_path <- "./MiSeq_SOP" # CHANGE to the directory containing the fastq files after unzipping.
##  [1] "F3D0_S188_L001_R1_001.fastq"   "F3D0_S188_L001_R2_001.fastq"  
##  [3] "F3D1_S189_L001_R1_001.fastq"   "F3D1_S189_L001_R2_001.fastq"  
##  [5] "F3D141_S207_L001_R1_001.fastq" "F3D141_S207_L001_R2_001.fastq"
##  [7] "F3D142_S208_L001_R1_001.fastq" "F3D142_S208_L001_R2_001.fastq"
##  [9] "F3D143_S209_L001_R1_001.fastq" "F3D143_S209_L001_R2_001.fastq"
## [11] "F3D144_S210_L001_R1_001.fastq" "F3D144_S210_L001_R2_001.fastq"
## [13] "F3D145_S211_L001_R1_001.fastq" "F3D145_S211_L001_R2_001.fastq"
## [15] "F3D146_S212_L001_R1_001.fastq" "F3D146_S212_L001_R2_001.fastq"
## [17] "F3D147_S213_L001_R1_001.fastq" "F3D147_S213_L001_R2_001.fastq"
## [19] "F3D148_S214_L001_R1_001.fastq" "F3D148_S214_L001_R2_001.fastq"
## [21] "F3D149_S215_L001_R1_001.fastq" "F3D149_S215_L001_R2_001.fastq"
## [23] "F3D150_S216_L001_R1_001.fastq" "F3D150_S216_L001_R2_001.fastq"
## [25] "F3D2_S190_L001_R1_001.fastq"   "F3D2_S190_L001_R2_001.fastq"  
## [27] "F3D3_S191_L001_R1_001.fastq"   "F3D3_S191_L001_R2_001.fastq"  
## [29] "F3D5_S193_L001_R1_001.fastq"   "F3D5_S193_L001_R2_001.fastq"  
## [31] "F3D6_S194_L001_R1_001.fastq"   "F3D6_S194_L001_R2_001.fastq"  
## [33] "F3D7_S195_L001_R1_001.fastq"   "F3D7_S195_L001_R2_001.fastq"  
## [35] "F3D8_S196_L001_R1_001.fastq"   "F3D8_S196_L001_R2_001.fastq"  
## [37] "F3D9_S197_L001_R1_001.fastq"   "F3D9_S197_L001_R2_001.fastq"  
## [39] "filtered"                      "HMP_MOCK.v35.fasta"           
## [41] "Mock_S280_L001_R1_001.fastq"   "Mock_S280_L001_R2_001.fastq"  
## [43] "mouse.dpw.metadata"            "mouse.time.design"            
## [45] "stability.batch"               "stability.files"

Filter and Trim

We begin by filtering out low-quality sequencing reads and trimming the reads to a consistent length. While generally recommended filtering and trimming parameters serve as a starting point, no two datasets are identical and therefore it is always worth inspecting the quality of the data before proceeding.

First we read in the names of the fastq files, and perform some string manipulation to get lists of the forward and reverse fastq files in matched order:

# Sort ensures forward/reverse reads are in same order
fnFs <- sort(list.files(miseq_path, pattern="_R1_001.fastq"))
fnRs <- sort(list.files(miseq_path, pattern="_R2_001.fastq"))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sampleNames <- sapply(strsplit(fnFs, "_"), `[`, 1)
# Specify the full path to the fnFs and fnRs
fnFs <- file.path(miseq_path, fnFs)
fnRs <- file.path(miseq_path, fnRs)
## [1] "./MiSeq_SOP/F3D0_S188_L001_R1_001.fastq"   "./MiSeq_SOP/F3D1_S189_L001_R1_001.fastq"  
## [3] "./MiSeq_SOP/F3D141_S207_L001_R1_001.fastq"
## [1] "./MiSeq_SOP/F3D0_S188_L001_R2_001.fastq"   "./MiSeq_SOP/F3D1_S189_L001_R2_001.fastq"  
## [3] "./MiSeq_SOP/F3D141_S207_L001_R2_001.fastq"

Most Illumina sequencing data shows a trend of decreasing average quality towards the end of sequencing reads.

The first two forward reads: