1 Prerequisites

The following knitr options will be used in this vignette to provide the most useful and concise output.

knitr::opts_chunk$set(message = FALSE)

The following packages will be used in this vignette to provide demonstrative examples of what a user might do with HMP16SData.


Pipe operators from the magrittr package are used in this vignette to provide the most elegant and concise syntax. See the magrittr vignette if the syntax is unclear.

2 Introduction

HMP16SData is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data for variable regions 1–3 and 3–5. Raw data files are provided in the package as downloaded from the HMP Data Analysis and Coordination Center. Processed data is provided as SummarizedExperiment class objects via ExperimentHub.

HMP16SData can be installed using BiocManager as follows.


Once installed, HMP16SData provides two functions to access data – one for variable region 1–3 and another for variable region 3–5. When called, as follows, the functions will download data from an ExperimentHub Amazon S3 (Simple Storage Service) bucket over https or load data from a local cache.

## class: SummarizedExperiment 
## dim: 43140 2898 
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
## colnames(2898): 700013549 700014386 ... 700114963 700114965
## class: SummarizedExperiment 
## dim: 45383 4743 
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(45383): OTU_97.1 OTU_97.10 ... OTU_97.9998 OTU_97.9999
## colnames(4743): 700013549 700014386 ... 700114717 700114750

The two data sets are represented as SummarizedExperiment objects, a standard Bioconductor class that is amenable to subsetting and analysis. To maintain brevity, details of the SummarizedExperiment class are not outlined here but the SummarizedExperiment package provides an excellent vignette.

3 Features

3.1 Frequency Table Generation

Sometimes it is desirable to provide a quick summary of key demographic variables and to make the process easier HMP16SData provides a function, table_one, to do so. It returns a data.frame or a list of data.frame objects that have been transformed to make a publication ready table.

V13() %>%
    table_one() %>%
##   Visit Number    Sex                                  Run Center
## 1          One Female                  Baylor College of Medicine
## 2          One   Male Baylor College of Medicine, Broad Institute
## 3          One   Male Baylor College of Medicine, Broad Institute
## 4          One   Male Baylor College of Medicine, Broad Institute
## 5          One   Male Baylor College of Medicine, Broad Institute
## 6          One   Male Baylor College of Medicine, Broad Institute
##            HMP Body Site HMP Body Subsite
## 1 Gastrointestinal Tract            Stool
## 2 Gastrointestinal Tract            Stool
## 3                   Oral           Saliva
## 4                   Oral    Tongue Dorsum
## 5                   Oral      Hard Palate
## 6                   Oral    Buccal Mucosa

If a list is passed to table_one, its elements must be named so that the named elements can be used by the kable_one function. The kable_one function will produce an HTML table for vignettes such as the one shown below.

list(V13 = V13(), V35 = V35()) %>%
    table_one() %>%
N % N %
Visit Number
One 1,642 56.66 2,822 59.50
Two 1,244 42.93 1,897 40.00
Three 12 0.41 24 0.51
Female 1,521 52.48 2,188 46.13
Male 1,377 47.52 2,555 53.87
Run Center
Genome Sequencing Center at Washington University 1,717 59.25 1,539 32.45
J. Craig Venter Institute 506 17.46 1,009 21.27
Broad Institute 0 0.00 1,078 22.73
Baylor College of Medicine 289 9.97 696 14.67
J. Craig Venter Institute, Genome Sequencing Center at Washington University 97 3.35 93 1.96
Baylor College of Medicine, Genome Sequencing Center at Washington University 91 3.14 90 1.90
Baylor College of Medicine, Broad Institute 76 2.62 103 2.17
J. Craig Venter Institute, Broad Institute 86 2.97 75 1.58
Baylor College of Medicine, J. Craig Venter Institute 15 0.52 40 0.84
Broad Institute, Baylor College of Medicine 13 0.45 13 0.27
Genome Sequencing Center at Washington University, Baylor College of Medicine 6 0.21 6 0.13
Genome Sequencing Center at Washington University, J. Craig Venter Institute 1 0.03 1 0.02
Missing 1 0.03 0 0.00
HMP Body Site
Oral 1,622 55.97 2,774 58.49
Skin 664 22.91 990 20.87
Urogenital Tract 264 9.11 391 8.24
Gastrointestinal Tract 187 6.45 319 6.73
Airways 161 5.56 269 5.67
HMP Body Subsite
Tongue Dorsum 190 6.56 316 6.66
Stool 187 6.45 319 6.73
Supragingival Plaque 189 6.52 313 6.60
Right Retroauricular Crease 187 6.45 297 6.26
Attached Keratinized Gingiva 181 6.25 313 6.60
Left Retroauricular Crease 186 6.42 285 6.01
Buccal Mucosa 183 6.31 312 6.58
Palatine Tonsils 186 6.42 312 6.58
Subgingival Plaque 183 6.31 309 6.51
Throat 170 5.87 307 6.47
Hard Palate 178 6.14 302 6.37
Saliva 162 5.59 290 6.11
Anterior Nares 161 5.56 269 5.67
Right Antecubital Fossa 146 5.04 207 4.36
Left Antecubital Fossa 145 5.00 201 4.24
Mid Vagina 89 3.07 133 2.80
Posterior Fornix 88 3.04 133 2.80
Vaginal Introitus 87 3.00 125 2.64

3.2 Straightforward Subsetting

The SummarizedExperiment container provides for straightforward subsetting by data or metadata variables using either the subset function or [ methods – see the SummarizedExperiment vignette for additional details. Shown below, the variable region 3–5 data set is subset to include only stool samples.

V35_stool <-
    V35() %>%
    subset(select = HMP_BODY_SUBSITE == "Stool")

## class: SummarizedExperiment 
## dim: 45383 319 
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(45383): OTU_97.1 OTU_97.10 ... OTU_97.9998 OTU_97.9999
## colnames(319): 700013549 700014386 ... 700114717 700114750

3.3 HMP Controlled-Access Participant Data

Most participant data from the HMP study is controlled through the National Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGaP). HMP16SData provides a data dictionary translated from dbGaP XML files for the seven different controlled-access data tables related to the HMP. See ?HMP16SData::dictionary for details of these source data tables, and View(dictionary) to view the complete data dictionary. Several steps are required to access the data tables, but the process is greatly simplified by HMP16SData.

3.3.1 Apply for dbGaP Access

You must make a controlled-access application through for:

HMP Core Microbiome Sampling Protocol A (HMP-A) (phs000228.v4.p1)

Once approved, browse to, sign in, and select the option “get dbGaP repository key” to download your *.ngc repository key. This is all you need from the dbGaP website.

3.3.2 Install the SRA Toolkit

You must also install the NCBI SRA Toolkit, which will be used in the background for downloading and decrypting controlled-access data.

There are shortcuts for common platforms:

  • Debian/Ubuntu: apt install sra-toolkit
  • macOS: brew install sratoolkit

For macOS, the brew command does not come installed by default and requires installation of the homebrew package manager. Instructions are available at

For Windows, binary installation is necessary and instructions are available at

3.3.3 Merge with HMP16SData

The attach_dbGap() function takes a HMP16SData SummarizedExperiment object as its first argument and the path to a dbGaP repository key as its second argument. It performs download, decryption, and merging of all available controlled-access participant data with a single function call.

V35_stool_protected <-
    V35_stool %>%

The returned V35_stool_protected object contains controlled-access participant data as additional columns in its colData slot.


3.4 Analysis Using the phyloseq Package

The phyloseq package provides an extensive suite of methods to analyze microbiome data.

For those familiar with both the HMP and phyloseq, you may recall that an alternative phyloseq class object containing the HMP variable region 3–5 data has been made available by Joey McMurdie at However, this object is not compatible with the methods documented here for integration with dbGaP controlled-access participant data, shotgun metagenomic data, or variable region 1–3 data. For that reason, we would encourage the use of the HMP16SData SummarizedExperiment class objects with the phyloseq package.

To demonstrate how HMP16SData could be used as a control or comparison cohort in microbime data analyses, we will demonstrate basic comparisons of the palatine tonsils and stool body subsites using the phyloseq package. We first create and subset two SummarizedExperiment objects from HMP16SData to include only the relevant body subsites.

V13_tonsils <-
    V13() %>%
    subset(select = HMP_BODY_SUBSITE == "Palatine Tonsils")

V13_stool <-
    V13() %>%
    subset(select = HMP_BODY_SUBSITE == "Stool")

While these objects are both from the HMP16SData package, a user would potentially be comparing to their own data and only need a single object from the package.

3.4.1 Coercion to phyloseq Objects

The SummarizedExperiment class objects can then be coerced to phyloseq class objects containing count data, sample (participant) data, taxonomy, and phylogenetic trees using the as_phyloseq function.

V13_tonsils_phyloseq <-

V13_stool_phyloseq <-

The analysis of all the samples in these two phyloseq objects would be rather computationally intensive. So to preform the analysis in a more timely manner, a function, sample_samples, is written here to take a sample of the samples available in each phyloseq object.

sample_samples <- function(x, size) {
    sampled_names <-
        sample_names(x) %>%

    prune_samples(sampled_names, x)

Each phyloseq object is then sampled to contain only twenty-five samples.

V13_tonsils_phyloseq %<>%

V13_stool_phyloseq %<>%

A “Study” identifier is then added to the sample_data of each phyloseq object to be used for stratification in analysis. In the case that a user were comparing the HMP samples to their own data, an identifier would be added in the same manner.

sample_data(V13_tonsils_phyloseq)$Study <- "Tonsils"

sample_data(V13_stool_phyloseq)$Study <- "Stool"

Once the two phyloseq objects have been sampled and their sample_data has been augmented, they can be merged into a single phyloseq object using the merge_phyloseq command.

V13_phyloseq <-
    merge_phyloseq(V13_tonsils_phyloseq, V13_stool_phyloseq)

Finally, because the V13 data were subset and sampled, taxa with no relative abundance are present in the merged object. These are removed using the prune_taxa command to avoid warnings during analysis.

V13_phyloseq %<>%
    taxa_sums() %>%
    is_greater_than(0) %>%

The resulting V13_phyloseq object can then be analyzed quickly and easily.

3.4.2 Alpha Diversity Analysis

Alpha diversity measures the taxonomic variation within a sample and phyloseq provides a method, plot_richness, to plot various alpha diversity measures.

First a vector of richness (i.e. alpha diversity) measures is created to be passed to the plot_richness method.

richness_measures <-
    c("Observed", "Shannon", "Simpson")

The V13_phyloseq object and the richness_measures vector are then passed to the plot_richness method to construct a box plot of the three alpha diversity measures. Additional ggplot2 syntax is used to control the presentational aspects of the plot.

V13_phyloseq %>%
    plot_richness(x = "Study", color = "Study", measures = richness_measures) +
    stat_boxplot(geom ="errorbar") +
    geom_boxplot() +
    theme_bw() +
    theme(axis.title.x = element_blank(), legend.position = "none")

3.4.3 Beta Diversity Analysis

Beta diversity measures the taxonomic variation between samples by calculating the dissimilarity of clade relative abundances. The phyloseq package provides a method, distance, to calculate various dissimilarity measures, such as Bray–Curtis dissimilarity. Once dissimilarity has been calculated, samples can then be clustered and represented as a dendrogram.

V13_dendrogram <-
    distance(V13_phyloseq, method = "bray") %>%
    hclust() %>%

However, coercion to a dendrogram object results in the lost of sample_data present in the phyloseq object which is needed for plotting. A data.frame of this sample_data can be extracted from the phyloseq object as follows.

V13_sample_data <-
    sample_data(V13_phyloseq) %>%

Samples in the the plots will be identified by “PSN”" (Primary Sample Number) and “Study”. So, additional columns to denote the colors and shapes of leaves and labels are added to the data.frame using dplyr syntax.

V13_sample_data %<>%
    rownames_to_column(var = "PSN") %>%
    mutate(labels_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>%
    mutate(leaves_col = if_else(Study == "Stool", "#F8766D", "#00BFC4")) %>%
    mutate(leaves_pch = if_else(Study == "Stool", 16, 17))

Additionally, the order of samples in the dendrogram and data.frame objects is different and a vector to sort samples is constructed as follows.

V13_sample_order <-
    labels(V13_dendrogram) %>%

The label and leaf color and shape columns of the data.frame object can then be coerced to vectors and sorted according to the sample order of the dendrogram object.

labels_col <- V13_sample_data$labels_col[V13_sample_order]
leaves_col <- V13_sample_data$leaves_col[V13_sample_order]
leaves_pch <- V13_sample_data$leaves_pch[V13_sample_order]

The dendextend package is then used to add these vectors to the dendrogram object as metadata which will be used for plotting.

V13_dendrogram %<>%
    set("labels_col", labels_col) %>%
    set("leaves_col", leaves_col) %>%
    set("leaves_pch", leaves_pch)

Finally, the dendextend package provides a method, circlize_dendrogram, to produce a circular dendrogram, using the circlize package, with a single line of code.

V13_dendrogram %>%
    circlize_dendrogram(labels_track_height = 0.2)