---
title: "Workshop on microbiome data package in Bioconductor"
abstract: >
  This component of the workshop introduces `curatedMetagenomicData`, a resource providing uniformly 
  processed taxonomic and metabolic functional profiles for more than 5,000 whole metagenome shotgun sequencing samples from 26 publicly available studies, including the Human Microbiome Project, along with curated participant data. The `curatedMetagenomicData` package includes functions for converting to objects for use with the `phyloseq` package for taxonomy-aware analysis (see MicrobiomeWorkshopII.Rmd) and the `metagenomeSeq` package. This vignette also demonstrates the `treelapse` and `metavizr` packages for browsing and interactive visualization of microbiome profiles. 
output:
  BiocStyle::html_document:
    number_sections: no
    toc: yes
    toc_depth: 4
vignette: >
    %\VignetteIndexEntry{MicrobiomeWorkshopI}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
isdevel <- Biobase::package.version("Biobase") > 2.37
```


# Install packages

Installing all necessary packages can be accomplished by the following command:

```{r, eval=FALSE}
library(BiocInstaller)
biocLite(c("waldronlab/MicrobiomeWorkshop", "epiviz/metavizr@bioc-workshop"),
         dependencies=TRUE)
```

Double-check that "metavizr" is installed from GitHub:
```{r NeededPackagespac}
if(!("metavizr" %in% installed.packages()) || "1.1.3" != installed.packages()["metavizr","Version"])
  devtools::install_github("epiviz/metavizr@bioc-workshop", build_vignettes = TRUE)
```

Load all needed packages:
```{r}
cranpkgs=c("ggplot2","devtools","vegan","httr")
BioCpkgs=c( "curatedMetagenomicData", "phyloseq")
ghpkgs= "metavizr"
allpkg <- c(cranpkgs, BioCpkgs, ghpkgs)
suppressPackageStartupMessages(sapply(allpkg, require, character.only = TRUE))
```

# `curatedMetagenomicData`

`curatedMetagenomicData` provides 6 types of data for each dataset:

1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
2. Presence of unique, clade-specific markers
3. Abundance of unique, clade-specific markers
4. Abundance of gene families
5. Metabolic pathway coverage
6. Metabolic pathway abundance

Types 1-3 are generated by 
[MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2); 4-6 are generated by [HUMAnN2](http://huttenhower.sph.harvard.edu/humann2).

Currently, `curatedMetagenomicData` provides:

* 5,716 samples from 26 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
* Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor 
ExpressionSet objects
* ~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
* Processing of data through the [MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2) pipeline for taxonomic abundance, and [HUMAnN2](http://huttenhower.sph.harvard.edu/humann2) pipeline for metabolic analysis
* This effort required analyzing ~100T of raw sequencing data

These datasets are documented in the reference manuals of the [release](http://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html) or [devel](http://bioconductor.org/packages/devel/data/experiment/html/curatedMetagenomicData.html) versions of Bioconductor. The development version has approximately twice as many datasets and significantly more functionality. The AMI provided at the Bioc2017 workshop runs the development version of Bioconductor; see http://bioconductor.org/developers/how-to/useDevel/ for instructions on upgrading your own machine to the development version and switching between devel and release.

## Using `curatedMetagenomicData` Resources

Use of the resources in `curatedMetagenomicData` is simplified with the use of Bioconductor's ExperimentHub platform, which allows for the accessing of data through an intuitive interface. First, `curatedMetagenomicData` is installed using *BiocInstaller* and then called as a library - the process allows for the user to simply call datasets as functions because the package is aware of the resources present in ExperimentHub S3 buckets. 

```{r, message=FALSE}
# BiocInstaller::biocLite("curatedMetagenomicData")  #Bioconductor version
# BiocInstaller::biocLite("waldronlab/curatedMetagenomicData")  #bleeding edge version
suppressPackageStartupMessages(library(curatedMetagenomicData))
```

## Available samples and metadata

The manually curated metadata for all available samples are provided in a single table `combined_metadata`:

```{r eval=FALSE}
?combined_metadata
View(combined_metadata)
```

```{r}
table(combined_metadata$antibiotics_current_use)
table(combined_metadata$disease)
```


## Accessing datasets

Individual data projects can be fetched via per-dataset functions or the `curatedMetagenomicData()` function. A function call to a dataset name returns a Bioconductor ExpressionSet object:

```{r, message=FALSE}
suppressPackageStartupMessages(library(curatedMetagenomicData))
zeller.eset = ZellerG_2014.metaphlan_bugs_list.stool()
```

The following creates a list of two `ExpressionSet` objects providing the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles (this method works in release or devel, but these datasets are available in devel only). See a complete list of available datasets in the PDF manual.

```{r, message=FALSE, results='hide', eval = isdevel}
oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
esl
esl[[1]]
esl[[2]]
```

And the following would provide all stool metaphlan datasets if `dryrun = FALSE` were set:

```{r, eval=TRUE}
curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)
```

### Merging multiple datasets

The following merges the two oral cavity datasets downloaded above into a single ExpressionSet (devel only). 

```{r, eval=isdevel}
eset <- mergeData(esl)
eset
```

This works for any number of datasets. The function will not stop you from merging different data types (e.g. metaphlan bugs lists with gene families), but you probably don't want to do that.

## Using `ExpressionSet` Objects

All datasets are represented as `ExpressionSet` objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the `Biobase` package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the `experimentData()` function is used to return a `MIAME` (Minimum Information About a Microarray Experiment) object.

```{r}
experimentData( zeller.eset )
```

To access the subject-level metadata the `pData()` function is used to return a `data.frame` containing subject-level variables of study.

```{r}
head( pData( zeller.eset ) )
```

To access the data itself (in this case relative abundance), the `exprs()` function returns a variables by samples (rows by columns) numeric matrix. Note the presence of "synthetic" clades at all levels of the taxonomy, starting with kingdom, e.g. k__bacteria here:

```{r}
exprs( zeller.eset )[1:6, 1:5]  #first 6 rows and 5 columns
```

Bioconductor provides further documentation of the ExpressionSet class and has published an excellent [introduction](https://tinyurl.com/ExpressionSetIntro).

## Creating `phyloseq` and `metagenomeSeq` objects

`curatedMetagenomicData` provides convenience functions for converting its *ExpressionSet* objects to *phyloseq::phyloseq* and *metagenomeSeq:MRExperiment* class objects for downstream analysis. For example, to convert the **zeller.eset** object to these two classes:

```{r}
ExpressionSet2phyloseq(zeller.eset)
ExpressionSet2MRexperiment(zeller.eset)
```

This also works on the merged **eset** object from above:

```{r, eval=isdevel}
ExpressionSet2phyloseq(eset)
ExpressionSet2MRexperiment(eset)
```

Examples of `phyloseq` analyses using curatedMetagenomicData datasets are provided in the [curatedMetagenomicData vignette](http://bioconductor.org/packages/devel/data/experiment/vignettes/curatedMetagenomicData/inst/doc/curatedMetagenomicData.html). 

## phylogenetic trees and UniFrac distances

Set **phylogenetictree = TRUE** to include a phylogenetic tree in the `phyloseq` object (bioc-devel only):

```{r, eval=isdevel}
zeller.tree <- ExpressionSet2phyloseq( zeller.eset, phylogenetictree = TRUE)
```

```{r, eval=isdevel}
wt = UniFrac(zeller.tree, weighted=TRUE, normalized=FALSE, 
             parallel=FALSE, fast=TRUE)
plot(hclust(wt), main="Weighted UniFrac distances")
```

## Exploratory analysis of E. coli prevalence

Here's a direct, exploratory analysis of *E. coli* prevalence in the zeller dataset using the `ExpressionSet` object. More elegant solutions will be provided later using subsetting methods provided by the `r BiocStyle::Biocpkg("phyloseq")` package, but for users familiar with `grep()` and the `ExpressionSet` object, such manual methods may suffice.

First, which *E. coli*-related taxa are available?

```{r}
grep("coli", rownames(zeller.eset), value=TRUE)
```

Create a vector of *E. coli* relative abundances. This `grep` call with a "$" at the end selects the only row that ends with "s__Escherichia_coli":
```{r}
x = exprs( zeller.eset )[grep("s__Escherichia_coli$", rownames( zeller.eset)), ]
summary( x )
```

This could be plotted as a histogram:
```{r}
hist( x, xlab = "Relative Abundance", main="Prevalence of E. Coli",
      breaks="FD")
```

## Estimating Absolute Raw Count Data

Absolute raw count data can be estimated from the relative count data by multiplying the columns of the `ExpressionSet` data by the number of reads for each sample, as found in the `pData` column "number_reads". For demo purposes you could (but don't have to!) do this manually by dividing by 100 and multiplying by the number of reads:

```{r}
zeller.counts = sweep(exprs( zeller.eset ), 2, zeller.eset$number_reads / 100, "*")
zeller.counts = round(zeller.counts)
zeller.counts[1:6, 1:5]
```

or just set the `counts` argument in `curatedMetagenomicData()` to `TRUE`:

```{r}
zeller.eset2 = curatedMetagenomicData("ZellerG_2014.metaphlan_bugs_list.stool",
                                     counts = TRUE, 
                                     dryrun = FALSE)[[1]]
all.equal(exprs(zeller.eset2), zeller.counts)
```


# Taxonomy-Aware Analysis using `phyloseq`

For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of phylogeny-aware, ecological analysis and plotting by conversion to a `r BiocStyle::Biocpkg("phyloseq")` class object. `r BiocStyle::Biocpkg("curatedMetagenomicData")` provides the `ExpressionSet2phyloseq()` function to make this easy:

```{r, warning=FALSE}
suppressPackageStartupMessages(library(phyloseq))
zeller.pseq = ExpressionSet2phyloseq( zeller.eset )
```

Note the following useful arguments to ExpressionSet2phyloseq:

* `simplify`: use the most detailed level of the taxonomy for names (default=TRUE). Otherwise use the full taxonomy for names
* `phylogenetictree`: Add the MetaPhlAn2 phylogenetic tree, so UniFrac distances can be calculated (default=TRUE). Bioc-devel only.

# Visualization with Metavizr

We can now examine the zeller dataset with Metavizr. This visualization tool allows interactive exploration of the taxonomic hierarchy and statistically-guided visual analysis. Specifically, metavizr enables feature selection through a navigation mechanism called FacetZoom. We first create an MRexperiment for the zeller dataset using `ExpressionSet2MRexperiment`. For demonstration purposes in this workshop, we will subset the analysis to samples under 60 years old and either control or CRC study_condition.

```{r, eval=FALSE}
zeller_MR_expr = ExpressionSet2MRexperiment(zeller.eset)
zeller_MR_expr <- zeller_MR_expr[, which(pData(zeller_MR_expr)$age < 60)]
zeller_MR_expr <- zeller_MR_expr[,-which(pData(zeller_MR_expr)$study_condition == "adenoma")]

public_ipv4 <- try(httr::content(httr::GET("http://169.254.169.254/latest/meta-data/public-ipv4")), silent = TRUE)

if(grepl("Timeout was reached", simpleMessage(public_ipv4))){
  app <- startMetaviz(host="http://metaviz-dev.cbcb.umd.edu")
} else{
  app <- startMetaviz(host="http://metaviz-dev.cbcb.umd.edu", ws_host = public_ipv4, daemonized = FALSE, try_ports = TRUE)
}
```

From the metavizr vignette: Once the browser is open we can visualize the metagenomic features from the zeller_MR_expr object. We use the `plot` method to do so. The `plot` function can take as input any of the classes registered with epivirData.

*Bioconductor AMI note:* On rstudio-server version < 1.1.273, the user need to run the app$service() command after each user input on the UI. There is a bug with Rstudio-server interferring with the R event loop. metavizr should run without issue on rstudio desktop.

```{r, eval = FALSE}
facetZoom <- app$plot(zeller_MR_expr, type = "InnerNodeCounts", datasource_name = "zeller", feature_order = colnames(fData(zeller_MR_expr)))
app$service()
```

From the metavizr Vignette: You should now see a FacetZoom to explore the hierarchy of the metagenomic features from the MRexperiment object. To navigate the complex, hierarchical structure of the feature space, we made use of the FacetZoom visualization design and extended it to metagenomic data. Because of the limitations in the screen size and performance rendering big trees, this technique is an efficient visualization for navigating trees. The FacetZoom helps zoom in and out of trees and traverse subtrees. Every node in the tree has a state associated with it and are three possible states for a node 1) expand - use all subtree nodes during analysis 2) collapse - aggregate all nodes under the subtree to the selected node 3) remove - discard all nodes in the subtree for the analysis. The state of a node is also propagated to all its children and can be identified by the opacity of the node. Row level controls are available on the left side - to set the state of all nodes at a selected depth/taxonomic class of the hierarchy. The right hand side of the FacetZoom shows the location of the current subtree. Users can set states on the nodes to define a cut over the feature space. The cut defines how the count data is visualized in linked plots and charts. In addition to defining the cut, we also have a navigation bar to limit the range of features when querying count data. Navigation controls move the bar left/right and extend over the entire range of the current tree in the icicle. Each of these actions queries the underlying data structure and automatically propagates the changes to other visualizations in the workspace. When navigating outside the scope of the navigation bar, chevrons (left/right) appear on the navigation bar to help identify the current position. Hovering over a node highlights the entire lineage in the FacetZoom along with all other linked charts and plots.

We can now add a heatmap showing the count data from the MRexperiment. After the `app$service()` call and while 'Serving Epiviz, escape to continue interactive session...' is shown on the R command prompt, we can click on the center icon of the node label `P` on the left hand side of the FacetZoom. This will update the heatmap from showing counts at the Class level to show counts at the Phylum level. The message 'zeller_1  datasourceGroup caches cleared' should appear in the R session. We can also make a selection in the UI and run `app$service()` to process the request.

```{r, eval = FALSE}
heatmap_plot <- app$chart_mgr$revisualize(chart_type = "HeatmapPlot", chart = facetZoom)
app$service()
```

We can remove lineages and show counts at any level of the taxonomy by setting nodes states in the FacetZoom. Once we are done exploring the data, we can remove these charts by calling the chart manager.

```{r, eval=FALSE}
app$chart_mgr$rm_all_charts()
```

# Compute differential abundance with metagenomeSeq and use features selections in metaviz

Now we will use metagenomeSeq to compute differential abundance for the zeller dataset at different levels of the taxonomy. For the metagenomeSeq data model, we will keep only the leaf counts and then use the metagenomeSeq function aggregateByTaxonomy to compute counts at different levels of the hierarchy. 

```{r, eval=FALSE}
# Removes counts and features all levels of the taxonomy except for the lowest level
zeller_MR_expr <- zeller_MR_expr[-unname(which(is.na(fData(zeller_MR_expr)$Strain))),]

# Remove the t__ from each leaf level feature in the MRexperiment counts data
rownames(zeller_MR_expr) <- sapply(strsplit(rownames(zeller_MR_expr), "__"), function(i){unname(i)[2]})
```

Now we can use the fitFeatureModel function to estimate log fold change for the features at the lowest level of the taxonomy, in this case Strain, between samples in the `control` and `crc` study conditions. We first normalize the count data using the cumulative sum scaling method.

```{r, eval=FALSE}
# Normalize counts using cummulative sum scaling method.
zeller_MR_expr <- filterData(zeller_MR_expr, present = 5)
zeller_MR_expr <- cumNorm(zeller_MR_expr, p = 0.75)

# Set model to study_condition and use the fitFeatureModel to test
zeller_sample_data <-  pData(zeller_MR_expr)
mod <-  model.matrix(~1+study_condition, data = zeller_sample_data)
results_zeller <-  fitFeatureModel(zeller_MR_expr, mod)
logFC_zeller <- MRcoefs(results_zeller, number = nrow(results_zeller))
```

metagenomeSeq provides a variety of testing methods. We could also use the fitZig method to test for study_condition controlling for gender. fitZig adds posterior probabilities to the MRexperiment and so we create a separate MRexperiment to pass to that function. 

```{r, eval=FALSE}
# Creating a separate MRexperiment to pass to fitZig 
zig_zeller_MR_expr = ExpressionSet2MRexperiment(zeller.eset)
zig_zeller_MR_expr <- zig_zeller_MR_expr[, which(pData(zig_zeller_MR_expr)$age < 60)]
zig_zeller_MR_expr <- zig_zeller_MR_expr[,-which(pData(zig_zeller_MR_expr)$study_condition == "adenoma")]
zig_zeller_MR_expr <- zig_zeller_MR_expr[-unname(which(is.na(fData(zig_zeller_MR_expr)$Strain))),]
rownames(zig_zeller_MR_expr) <- sapply(strsplit(rownames(zig_zeller_MR_expr), "__"), function(i){unname(i)[2]})
zig_zeller_MR_expr <- filterData(zig_zeller_MR_expr, present = 5)
zig_zeller_MR_expr <- cumNorm(zig_zeller_MR_expr, p = 0.75)

zig_zeller_sample_data <-  pData(zig_zeller_MR_expr)

mod_zig <-  model.matrix(~1+study_condition+gender, data = zig_zeller_sample_data)
results_zeller_zig <-  fitZig(zig_zeller_MR_expr, mod_zig)
coefs_zeller_zig <- MRtable(results_zeller_zig, number = nrow(results_zeller_zig))
```

The metagenomeSeq package vignette has further details on the available tests implemented as well as other functions useful for analysis.

We will use the result from fitFeatureModel as it provides logFC estimates for the features at this level between samples in the control and crc groups. We can examine them in tabular form. We can also use these results to select feature names to remove from consideration in a Metaviz. After removing features using a logFC threshold and adjusted p-value cutoff, we can interactively explore the results for the remaining features at different levels of the hierarchy.

```{r, eval=FALSE}
features <- rownames(logFC_zeller)

featuresToKeep_names <- features[which(logFC_zeller[which(abs(logFC_zeller$logFC) > 2),]$adjPvalues <.1)]
featuresToKeep <- rep(2, length(featuresToKeep_names))
names(featuresToKeep) <-  featuresToKeep_names

featuresToRemove_names <- features[!(features %in% featuresToKeep_names)]
featuresToRemove <- rep(0, length(featuresToRemove_names))
names(featuresToRemove) <- featuresToRemove_names
featureSelection = c(featuresToKeep, featuresToRemove)
```

We can add a new FacetZoom, heatmap, and stacked plot and then select features of interest. Once the heatmap and stacked plot are added, we can navigate to the lowest level of the taxonomy and see the features removed from consideration.

```{r, eval=FALSE}
facetZoom <- app$plot(zeller_MR_expr, type = "LeafCounts", datasource_name="zeller_leaf_level", feature_order = colnames(fData(zeller_MR_expr)))
app$service()
heatmap <- app$chart_mgr$revisualize(chart_type = "HeatmapPlot", chart = facetZoom)
app$service()
stackedPlot <- app$chart_mgr$revisualize(chart_type ="StackedLinePlot", chart = facetZoom)
app$service()
```

Now we can call `propagateHierarchyChanges` with the features of interest. The Strain features should be removed that do not show differential abundance.

```{r, eval = FALSE}
app$get_ms_object(facetZoom)$propagateHierarchyChanges(featureSelection, request_with_labels = TRUE)
```

We can also perform the same fitFeatureModel calculation with counts aggregated to another level of the taxonomy and propagate those changes to the existing visualizations. In this case we will aggregate counts to the Class level of the taxonomy and then keep only features that show a logFC greater than 1 between the two groups.

```{r, eval=FALSE}
#Aggregate to the Class level
aggregation_level = "Class"
agg_zeller_MR_expr <- aggregateByTaxonomy(zeller_MR_expr, lvl=aggregation_level)

agg_zeller_MR_expr <-  cumNorm(agg_zeller_MR_expr, p = 0.75)
agg_mod <-  model.matrix(~1+study_condition, data = pData(agg_zeller_MR_expr))
agg_results_zeller <-  fitFeatureModel(agg_zeller_MR_expr, agg_mod)
agg_logFC_zeller <- MRcoefs(agg_results_zeller, number = nrow(agg_results_zeller))

agg_features <- rownames(agg_logFC_zeller)
agg_featuresToKeep_names <- agg_features[which(agg_logFC_zeller[which(abs(agg_logFC_zeller$logFC) > 1),]$adjPvalues <.1)]
agg_featuresToKeep <- rep(2, length(agg_featuresToKeep_names))
names(agg_featuresToKeep) <- agg_featuresToKeep_names

agg_featuresToRemove_names <- agg_features[!(agg_features %in% agg_featuresToKeep_names)]
agg_featuresToRemove <- rep(0, length(agg_featuresToRemove_names))
names(agg_featuresToRemove) <- agg_featuresToRemove_names

agg_selectionUpdate <- c(agg_featuresToKeep, agg_featuresToRemove)
app$get_ms_object(facetZoom)$propagateHierarchyChanges(agg_selectionUpdate, request_with_labels = TRUE)
```