More detailed documentation is available at the DADA2 Home Page. In particular, the online tutorial workflow is the most detailed and up-to-date demonstration of applying DADA2 to multi-sample amplicon datasets.
The investigation of environmental microbial communities and microbiomes has been revolutionized by the development of high-throughput amplicon sequencing. In amplicon sequencing a particular genetic locus, for example the 16S rRNA gene in bacteria, is amplified from DNA extracted from the community of interest, and then sequenced on a next-generation sequencing platform. This technique removes the need to culture microbes in order to detect their presence, and cost-effectively provides a deep census of a microbial community.
However, the process of amplicon sequencing introduces errors into the sequencing data, and these errors severely complicate the interpretation of the results. DADA2 implements a novel algorithm that models the errors introduced during amplicon sequencing, and uses that error model to infer the true sample composition. DADA2 replaces the traditional “OTU-picking” step in amplicon sequencing workflows, producing instead higher-resolution tables of amplicon sequence variants (ASVs).
As seen in the paper introducing DADA2 and in further benchmarking, the DADA2 method is more sensitive and specific than traditional OTU methods: DADA2 detects real biological variation missed by OTU methods while outputting fewer spurious sequences. Another recent paper describes how replacing OTUs with ASVs improves the precision, comprehensiveness and reproducibility of marker-gene data analysys.
The starting point for the dada2 pipeline is a set of demultiplexed fastq files corresponding to the samples in your amplicon sequencing study. That is, dada2 expects there to be an individual fastq file for each sample (or two fastq files, one forward and one reverse, for each sample). Demultiplexing is often performed at the sequencing center, but if that has not been done there are a variety of tools do accomplish this, including the popular QIIME python scripts split_libraries_fastq.py followed by split_sequence_file_on_sample_ids.py, and the utility idemp, among others.
In addition, dada2 expects that there are no non-biological bases in the sequencing data. This may require pre-processing with outside software if, as is relatively common, your PCR primers were included in the sequenced amplicon region.
Once demultiplexed fastq files without non-biological nucleotides are in hand, the dada2 pipeline proceeds as follows:
filterAndTrim()
derepFastq()
learnErrors()
dada()
mergePairs()
makeSequenceTable()
removeBimeraDenovo()
The output of the dada2 pipeline is a feature table of amplicon sequence variants (an ASV table): A matrix with rows corresponding to samples and columns to ASVs, in which the value of each entry is the number of times that ASV was observed in that sample. This table is analogous to the traditional OTU table, except at higher resolution, i.e exact amplicon sequence variants rather than (usually 97%) clusters of sequencing reads.
We now go through the pipeline on a highly simplified dataset of just one paired-end sample (we’ll add a second later).
We’ll start by getting the filenames of our example paired-end fastq files. Usually you will define these filenames directly, or read them out of a directory, but for this tutorial we’re using files included with the package, which we can identify via a particular function call:
library(dada2); packageVersion("dada2")
## [1] '1.35.0'
fnF1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
fnR1 <- system.file("extdata", "sam1R.fastq.gz", package="dada2")
filtF1 <- tempfile(fileext=".fastq.gz")
filtR1 <- tempfile(fileext=".fastq.gz")
Note that the dada2 package “speaks” the gzip format natively, therefore all fastq files can remain in the space-saving gzip format throughout.
Now that we have the filenames, we’re going to inspect the quality profile of our data:
plotQualityProfile(fnF1) # Forward
plotQualityProfile(fnR1) # Reverse