Instructions on installing and using probabilisitic modelling and compositional data analysis using ALDEx2.

ALDEx2 1.33.0

This guide provides an overview of the R package ALDEx version 2 (*ALDEx2*) for differential (relative) abundance analysis of high throughput sequencing count-compositional data1 all high throughput sequencing data are compositional (Gloor et al. 2017) because of constraints imposed by the instruments. The package was developed and used initially for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-RNA-Seq)2 Macklaim et al. (2013), but testing showed that it performed very well with traditional RNA-Seq datasets3 Quinn, Crowley, and Richardson (2018), 16S rRNA gene variable region sequencing4 Bian et al. (n.d.) and selective growth-type (SELEX) experiments5 McMurrough et al. (2014);Wolfs et al. (2016). In principle, the analysis method should be applicable to nearly any type of data that is generated by high-throughput sequencing that generates tables of per-feature counts for each sample(Fernandes et al. 2014): in addition to the examples outlined above, this would include ChIP-Seq or metagenome sequencing.

The *ALDEx2* package estimates per-feature technical variation within each sample using Monte-Carlo instances drawn from the Dirichlet distribution. Sampling from this distribution returns the posterior probability distribution of the observed data under a repeated sampling model. All outputs from *ALDEx2* are outputs from the posterior distributions, either expected values or confidence intervals.

*ALDEx2* uses the centred log-ratio (clr) transformation (or closely related log-ratio transforms) which ensures the data are scale invariant and sub-compositionally coherent6 Aitchison (1986). The scale invariance property removes the need for a between sample data normalization step since the data are all placed on a consistent numerical co-ordinate. The sub-compositional coherence property ensures that the answers obtained are consistent when parts of the dataset are removed (e.g., removal of rRNA reads from RNA-seq studies or rare OTU species from 16S rRNA gene amplicon studies). All feature abundance values are expressed relative to the geometric mean abundance of other features in a sample.This is conceptually similar to a quantitative PCR where abundances are expressed relative to a standard: in the case of the clr transformation, the standard is the per-sample geometric mean abundance. See Aitchison (1986) for a complete description.

In contrast, most commonly used tools to analyze differential abundance of high throughput sequencing (HTS) data make the assumption that throughput sequencing data are delivered as counts7 Anders et al. (2013);Anders and Huber (2010);Gierliński et al. (2015). Much work has been done to normalize the datasets such that they approximate this assumption8 Bullard et al. (2010);Dillies et al. (2013). Even so, that the data are counts can be simply disproven by running the same library on instruments with different capacities and attempting to normalize. The relative relationships between the features (genes, OTUs, functions) are preserved but the actual count values are not.

With this simple realization, a number of groups have realized that HTS datasets are actually count-compositions9 Lovell et al. (2011);Friedman and Alm (2012);Fernandes et al. (2013);Fernandes et al. (2014);Gloor et al. (2017);Thomas P. Quinn et al. (2017) which have dramatically different statistical properties than do counts10 Aitchison (1986);Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015);Pawlowsky-Glahn and Buccianti (2011). Thus, *ALDEx2* is an `R`

package for differential relative abundance that takes into account the count-compositional nature of these data.

There are two ways to download and install the most current of *ALDEx2*. The most recent version of the package will be found at github.com/ggloor/ALDEx_bioc. The package will run with only the base R packages and a minimal Bioconductor install, and is capable of running several functions with the `parallel’ package if installed. It has been tested with version R version 3, but should run on version 2.12 onward as long as dependencies are met. It is recommended that the package be run on the most up-to-date R and Bioconductor versions. *ALDEx2* will make use of the BiocParallel package if possible, otherwise, *ALDEx2* will run in serial mode.

- Install development version from github:

```
install.packages("devtools")
devtools::install_github("ggloor/ALDEx_bioc")
```

- Install stable version from Bioconductor

```
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ALDEx2")
```

`aldex`

with 2 groups:The *ALDEx2* package in Bioconductor is modular and is suitable for the comparison of many different experimental designs. This is achieved by exposing the underlying centre log-ratio transformed Dirichlet Monte-Carlo replicate values to make it possible for anyone to add the specific R code for their experimental design — a guide to these values is outlined below.

However, *ALDEx2* contains an `aldex`

wrapper function that can perform many simple analyses. This wrapper will link the modular elements together to emulate *ALDEx2* prior to the modular approach. In the simplest incarnation, which we will use below, `aldex`

does a two-sample t-test and calculates effect sizes. If the test is ‘t’, then effect should be set to `TRUE`

. The ‘t’ option evaluates the data as a two-factor experiment using both the Welch’s t and the Wilcoxon rank tests. In more complex incarnations for multi-sample tests using ANOVA modules the effect size should not be calculated and then the `effect`

parameter should be FALSE. The ‘kw’ option evaluates the data as a one-way ANOVA using the glm and Kruskal-Wallace tests. All tests include a Benjamini-Hochberg correction of the raw P values. The data can be plotted onto Bland-Altmann11 Altman and Bland (1983) (MA) or effect (MW) plots12 Gloor, Macklaim, and Fernandes (2016) for for two-way tests using the `aldex.plot`

function. See the end of the modular section for examples of the plots.

This section contains an analysis of a dataset collected where a single gene library was made that contained 1600 sequence variants at 4 codons in the sequence13 McMurrough et al. (2014). These variants were cloned into an expression vector at equimolar amounts. The wild-type version of the gene conferred resistance to a topoisomerase toxin. Seven independent growths of the gene library were conducted under selective and non-selective conditions and the resulting abundances of each variant was read out by sequencing a pooled, barcoded library on an Illumina MiSeq. The data table is included as selex_table.txt in the package. In this data table, there are 1600 features and 14 samples. The analysis takes approximately 2 minutes and memory usage tops out at less than 1Gb of RAM on a mobile i7 class processor when we use 128 Dirichlet Monte-Carlo Instances (DMC). For speed concerns we use only the first 400 features and perform only 16 DMC. The command used for *ALDEx2* is presented below:

First we load the library and the included selex dataset. Then we set the comparison groups. This must be a vector of conditions in the same order as the samples in the input counts table. The `aldex`

command is calling several other functions in the background, and each of them returns diagnostics. These diagnostics are suppressed for this vignette.

```
library(ALDEx2)
data(selex)
#subset only the last 400 features for efficiency
selex.sub <- selex[1:400,]
conds <- c(rep("NS", 7), rep("S", 7))
x.all <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE,
include.sample.summary=FALSE, denom="all", verbose=FALSE, paired.test=FALSE)
par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch", xlab="Log-ratio abundance",
ylab="Difference")
aldex.plot(x.all, type="MW", test="welch", xlab="Dispersion",
ylab="Difference")
```

The modular approach exposes the underlying intermediate data so that users can generate their own tests. The simple approach outlined above just calls `aldex.clr, aldex.ttest, aldex.effect`

in turn and then merges the data into one dataframe called `x.all`

. We will show these modules in turn, and then examine additional modules.

`aldex.clr`

moduleThe workflow for the modular approach first generates random instances of the centred log-ratio transformed values. There are three inputs: counts table, a vector of conditions, and the number of Monte-Carlo (DMC) instances; and several parameters: a string indicating if iqlr, zero or all feature are used as the denominator is required, and level of verbosity (TRUE or FALSE). We recommend 128 or more mc.samples for the t-test, 1000 for a rigorous effect size calculation, and at least 16 for ANOVA.14 in fact we recommend that the number of samples in the smallest group multiplied by the number of DMC be equal at least 1000 in order to generate a reasonably stable estimate of the posterior distribution

This operation is fast.

```
# the output is in the S3 object 'x'
x <- aldex.clr(selex.sub, conds, mc.samples=16, denom="all", verbose=F)
```

`aldex.ttest`

moduleThe next operation performs the Welch’s t and Wilcoxon rank test for the situation when there are only two conditions. There are only two inputs: the aldex object from `aldex.clr`

and whether a paired test should be conducted or not (TRUE or FALSE).

This operation is reasonably fast.

`x.tt <- aldex.ttest(x, paired.test=FALSE, verbose=FALSE)`

`aldex.kw`

and the `aldex.glm`

modulesAlternatively to the t-test, the user can perform the glm and Kruskal Wallace tests for one-way ANOVA of two or more conditions. Here there are only two inputs: the aldex object from aldex.clr, and the vector of conditions. Note that this is slow! and is not evaluated for this documentation.

The `aldex.glm`

module is the preferred method for testing of more than two conditons or for complex study designs. See the `Complex study designs’ section below.

`x.kw <- aldex.kw(x)`

`aldex.effect`

moduleFinally, we estimate effect size and the within and between condition values in the case of two conditions. This step is required for plotting, in our lab we base our conclusions primarily on the output of this function15 Macklaim et al. (2013);McMurrough et al. (2014);Bian et al. (n.d.). There is one input: the aldex object from aldex.clr; and several useful parameters. It is possible to include the 95% confidence interval information for the effect size estimate with the boolean flag `CI=TRUE`

. This can be helpful when deciding whether to include or exclude specific features from consideration. We find that a large effect but that is an outlier for the extremes of the effect distribution can be false positives. New is the option to calculate effect sizes for paired t-tests or repeated samples with the boolean `paired.test=TRUE`

option. In this case the difference between and difference within values are calculated as the median paired difference and the median absolute deviation of that difference. The standardized effect size is their ratio and is usually slightly larger than the unpaired effect size. The supplied selex dataset is actually a paired dataset and can be used to explore this option.

`x.effect <- aldex.effect(x, CI=T, verbose=FALSE, paired.test=FALSE)`

`aldex.plot`

moduleFinally, the t-test and effect data are merged into one object.

`x.all <- data.frame(x.tt,x.effect)`

And the data are plotted. We see that the plotted data in Figure 1 and 2 are essentially the same.

```
par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch")
aldex.plot(x.all, type="MW", test="welch")
```