Contents

1 Introduction

The MicrobiomeBenchamrkData package provides access to a collection of datasets with biological ground truth for benchmarking differential abundance methods. The datasets are deposited on Zenodo: https://doi.org/10.5281/zenodo.6911026

2 Installation

## Install BioConductor if not installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Release version (not yet in Bioc, so it doesn't work yet)
BiocManager::install("MicrobiomeBenchmarkData")

## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData") 
library(MicrobiomeBenchmarkData)
library(purrr)

3 Sample metadata

All sample metadata is merged into a single data frame and provided as a data object:

data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |> 
    discard(~any(is.na(.x))) |> 
    head()
knitr::kable(sample_metadata)
dataset sample_id body_site library_size pmid study_condition sequencing_method
HMP_2012_16S_gingival_V13 700103497 oral_cavity 5356 22699609 control 16S
HMP_2012_16S_gingival_V13 700106940 oral_cavity 4489 22699609 control 16S
HMP_2012_16S_gingival_V13 700097304 oral_cavity 3043 22699609 control 16S
HMP_2012_16S_gingival_V13 700099015 oral_cavity 2832 22699609 control 16S
HMP_2012_16S_gingival_V13 700097644 oral_cavity 2815 22699609 control 16S
HMP_2012_16S_gingival_V13 700097247 oral_cavity 6333 22699609 control 16S

4 Accessing datasets

Currently, there are 6 datasets available through the MicrobiomeBenchmarkData. These datasets are accessed through the getBenchmarkData function.

4.2 Access a single dataset

In order to import a dataset, the getBenchmarkData function must be used with the name of the dataset as the first argument (x) and the dryrun argument set to FALSE. The output is a list vector with the dataset imported as a TreeSummarizedExperiment object.

tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment 
#> dim: 892 76 
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#>   variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULL

4.3 Access a few datasets

Several datasets can be imported simultaneously by giving the names of the different datasets in a character vector:

list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#>  $ HMP_2012_16S_gingival_V35       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_WMS_gingival           :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots

4.4 Access all of the datasets

If all of the datasets must to be imported, this can be done by providing the dryrun = FALSE argument alone.

mbd <- getBenchmarkData(dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#>  $ HMP_2012_16S_gingival_V13       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_WMS_gingival           :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ Ravel_2011_16S_BV               :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ Stammler_2016_16S_spikein       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots

5 Annotations for each taxa are included in rowData

The biological annotations of each taxa are provided as a column in the rowData slot of the TreeSummarizedExperiment.

## In the case, the column is named as taxon_annotation 
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#>                  kingdom      phylum       class           order
#>              <character> <character> <character>     <character>
#> OTU_97.31247    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.44487    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.34979    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.34572    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.42259    Bacteria  Firmicutes     Bacilli Lactobacillales
#> ...                  ...         ...         ...             ...
#> OTU_97.44294    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45429    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.44375    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45365    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45307    Bacteria  Firmicutes     Bacilli Lactobacillales
#>                        family         genus      taxon_annotation
#>                   <character>   <character>           <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ...                       ...           ...                   ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobic

6 Cache

The datasets are cached so they’re only downloaded once. The cache and all of the files contained in it can be removed with the removeCache function.

removeCache()

7 Session information

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] purrr_1.0.2                     MicrobiomeBenchmarkData_1.4.0  
#>  [3] TreeSummarizedExperiment_2.10.0 Biostrings_2.70.1              
#>  [5] XVector_0.42.0                  SingleCellExperiment_1.24.0    
#>  [7] SummarizedExperiment_1.32.0     Biobase_2.62.0                 
#>  [9] GenomicRanges_1.54.0            GenomeInfoDb_1.38.0            
#> [11] IRanges_2.36.0                  S4Vectors_0.40.0               
#> [13] BiocGenerics_0.48.0             MatrixGenerics_1.14.0          
#> [15] matrixStats_1.0.0               BiocStyle_2.30.0               
#> 
#> loaded via a namespace (and not attached):
#>  [1] xfun_0.40               bslib_0.5.1             lattice_0.22-5         
#>  [4] yulab.utils_0.1.0       vctrs_0.6.4             tools_4.3.1            
#>  [7] bitops_1.0-7            generics_0.1.3          curl_5.1.0             
#> [10] parallel_4.3.1          RSQLite_2.3.1           tibble_3.2.1           
#> [13] fansi_1.0.5             blob_1.2.4              pkgconfig_2.0.3        
#> [16] Matrix_1.6-1.1          dbplyr_2.3.4            lifecycle_1.0.3        
#> [19] GenomeInfoDbData_1.2.11 compiler_4.3.1          treeio_1.26.0          
#> [22] codetools_0.2-19        htmltools_0.5.6.1       sass_0.4.7             
#> [25] RCurl_1.98-1.12         yaml_2.3.7              lazyeval_0.2.2         
#> [28] tidyr_1.3.0             pillar_1.9.0            crayon_1.5.2           
#> [31] jquerylib_0.1.4         BiocParallel_1.36.0     DelayedArray_0.28.0    
#> [34] cachem_1.0.8            abind_1.4-5             nlme_3.1-163           
#> [37] tidyselect_1.2.0        digest_0.6.33           dplyr_1.1.3            
#> [40] bookdown_0.36           fastmap_1.1.1           grid_4.3.1             
#> [43] cli_3.6.1               SparseArray_1.2.0       magrittr_2.0.3         
#> [46] S4Arrays_1.2.0          utf8_1.2.4              ape_5.7-1              
#> [49] withr_2.5.1             filelock_1.0.2          bit64_4.0.5            
#> [52] httr_1.4.7              rmarkdown_2.25          bit_4.0.5              
#> [55] memoise_2.0.1           evaluate_0.22           knitr_1.44             
#> [58] BiocFileCache_2.10.0    rlang_1.1.1             Rcpp_1.0.11            
#> [61] DBI_1.1.3               glue_1.6.2              tidytree_0.4.5         
#> [64] BiocManager_1.30.22     jsonlite_1.8.7          R6_2.5.1               
#> [67] fs_1.6.3                zlibbioc_1.48.0