# 1 Introduction to Scale Simulation using ALDEx2

ALDEx2 is software for fitting linear models on sequence count data developed by Fernandes et al. (2013). It attempts to model how the abundance (or expression) of entities changes between different biological conditions. As a special case, ALDEx2 is a method for differential expression or differential abundance when the covariate of interest is binary.

In short, the method works by testing Centered log-ratio (CLR)-transformed Dirichlet samples. Succinctly, for each entity (e.g., gene or taxa) $$j$$, it proceeds by:

1. Adding a small prior count to the observed counts for taxa $$j$$ across all samples.
2. Drawing Monte Carlo samples using the Dirichlet distribution.
3. Transforming the samples using the Centered log ratio (CLR) transform.
4. Fitting a model using the covariates of interest and testing a hypothesis (e.g., t-tests in the case of differential expression/abundance).
5. Averaging across test statistics and p-values to test significance for each entity.

The average test statistic and p-value1 Corrected p-values using either the Benjamini-Hochberg or Holm procedure are also returned. for each entity is returned and typically used by the end user to judge the significance of certain covariates on each taxa.

Step 3 of ALDEx2 relies on transforming the Dirichlet samples, which necessarily sum to one, to CLR coordinates. As noted by Nixon et al. (2023), the CLR can introduce a mismatch between the research question and the actual analysis that is carried out. To see this concretely, consider the case where we have access to a sequence count data set $$Y$$, which is a $$D \times N$$ matrix where $$D$$ is the number of entities and $$N$$ is the number of samples. $$Y$$ is a sample of some underlying system $$W$$. For example, in a study of the human gut, $$Y$$ denotes the sequenced microbes obtained from a fecal sample whereas $$W$$ represents the actual microbes that occupy the gut.

The scale of $$Y$$ does not match the scale of $$W$$ but rather is more reflective of the measurement process (Props et al. (2017), Vandeputte et al. (2017)). This is problematic when the research question depends on the scale of $$W$$ but is inferred from $$Y$$ alone. For example, consider the case where the analyst wants to estimate log-fold changes between two conditions:

$\theta_d = \text{mean}_{x_i\text{ in group 1}} \log W_{di} - \text{mean}_{x_i \text{ in group 2}} \log W_{di}.$ However, when using CLR coordinates, something different is estimated:

$\tilde{\theta}_d = \text{mean}_{x_i\text{ in group 1}} \text{CLR}( Y_{di}) - \text{mean}_{x_i \text{ in group 2}} \text{CLR}( Y_{di}).$

Note that $$\tilde{\theta}_d$$ is not necessarily an estimate of $$\theta_d$$ but something different: it is an estimate of the change in CLR coordinates of entity $$d$$ between conditions not an estimate in the change of the absolute abundances of entity $$d$$ between conditions. In fact, if $$\tilde{\theta_d}$$ is used as an estimate of $$\theta_d$$, a very specific assumption is being made equating the scale of the system to a function of the geometric mean of the observed data. Furthermore, this is being assumed with complete certainty which may skew analysis results. Importantly, all normalizations make assumptions of scale, see Nixon et al. (2023) for complete details.

While other normalizations are available within the ALDEx2 software, they similarly impose certain assumptions about the scales between conditions. In this vignette, we show how to step away from normalizations and instead rely on scale models for analysis.

## 1.1 From Normalizations to Scale Models

Introduced and developed in Nixon et al. (2023), scale simulation argues that many different types of analyses (including differential expression) are scale reliant, meaning that they inherently rely on the scale (e.g., total) of the system under study. Ignoring the scale can lead to incorrect inference or results that might not answer the right question.

To circumvent this, Nixon et al. (2023) develops scale simulation random variables (SSRVs) which, in essence, provides the framework to directly incorporate scale uncertainty in the modeling process and, in doing so, can greatly improve inference (i.e., control Type-I error). In addition, they developed the methodology to incorporate scale simulation within the context of ALDEx2.

In brief, SSRVs replace normalizations with draws from a scale model. To see this concretely, lets consider the case of log fold changes. That is, we want to estimate $$\theta_d$$ from the previous section. As discussed, this quantity depends on the scale of the system $$W$$. We can decompose $$W$$ into both a compositional (denoted by the superscript $$\parallel$$) and total (denoted by the superscript $$\perp$$) component:

$W_{dn} = W_{dn}^{\parallel}W^{\perp}_{n}$

where we can define each component as:

\begin{align} W^{\perp}_{n} &= \sum_{d=1}^{D}W_{dn} \nonumber\\ W^{\parallel}_{dn} &= \frac{W_{dn}}{W^{\perp}_{n}}. \end{align}

SSRVs rely on two models: a (which models $$W^\parallel$$) and a (which models $$W^\perp$$). That is, we take draws from the measurement model based on the available data $$Y$$ to get samples of $$W^\parallel$$ and draws from the scale model $$W^\perp$$ and combine them together to get draws for $$W$$.

Scale simulation subtly (but importantly) changes ALDEx2 in Step 3 above. After drawing Monte Carlo samples from the Dirichlet distribution (which denotes the measurement model for $$W^\parallel$$), scale simulation augments these samples with samples drawn from the scale model (as opposed to apply the CLR). This results in samples that better represent the scaled samples. See Nixon et al. (2023) for complete details.