Instructions on installing and using probabilisitic modelling and compositional data analysis using ALDEx2.
This provides an overview of the R package ALDEx version 2 (ALDEx2) for differential (relative) abundance analysis of high throughput sequencing count compositional data1 all high throughput sequencing data are compositional (Gloor et al. 2017; Gloor 2023) because of constraints imposed by the instruments. However, Nixon et al.2 not accounting for scale leads to both false positive and false negative inference and is also a principled method of accounting for asymmetry in the underlying environment Nixon et al. (2023) have found that the scale of the data can be modelled and ALDEx2 now includes scale modelling by default.
The package was developed and used initially for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-RNA-Seq)3 Macklaim et al. (2013), but testing showed that it performed very well with traditional RNA-Seq datasets4 Quinn, Crowley, and Richardson (2018), 16S rRNA gene variable region sequencing5 Bian et al. (n.d.) and selective growth-type (SELEX) experiments6 McMurrough et al. (2014);Wolfs et al. (2016), and that the underlying principles generalize to single-cell sequencing7 Gloor (2023). The analysis method should be applicable to any type of data that is generated by high-throughput8 Fernandes et al. (2014).
Most recently, Nixon et al.9 Nixon et al. (2023) showed that the use of the CLR10 and all other normalizations which are based on ratios — which is to say all of them made strong assumptions about the scale of the underlying system. ALDEx2 now includes methods to include scale uncertainty and so incorporate both compositional and scale uncertainty together in the analysis. See the scaleSim vignette for more information and a complete walk-through.
There are two ways to download and install the most current of ALDEx2. The most recent version of the package will be found at github.com/ggloor/ALDEx_bioc. The package will run with only the base R packages, the standard tidyverse packages and a minimal Bioconductor install, and is capable of running several functions with the
parallel package if installed. It is recommended that the package be run on the most up-to-date R and Bioconductor versions.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ALDEx2")
Here all dependencies should be installed automatically.
In this case, you may need to install additional packages manually.
ALDEx2 is a different tool than others used for differential abundance testing, but that does not mean it is complicated. Instead, it is incredibly simple once you understand four concepts.
First, the data you collect is a single point-estimate of the data. That means, if you sequenced the same samples, or even the same library, you would get a slightly different answer. We know this because there are a number of early technical replication papers that show exactly this11 Marioni et al. (2008).
Second, that the data collected are probabilistic; that the count data is a representation of the probability of sampling the gene fragment from the mixture of gene fragments in the library. Many other tools assume the data are counts from a multivariate distribution modelled by a Negative Binomial distribution.
Third, that we can pretty accurately estimate this probabilisitic technical variation. For this, we assume the data collected are from a multivariate Poisson distribution constrained to have the total number of counts that the instrument can deliver12 Fernandes et al. (2014);Gregory B Gloor et al. (2016). This can be estimated by sampling from the appropriate distribution, in this case a multinomial Dirichlet distribution.
Fourth, that the use of the CLR (or any other normalization) makes strong assumptions about the scale of the sampled environment. Nixon et al.13 Nixon et al. (2023) demonstrated how to incorporate scale uncertainty so that the effect of scale on the analysis can be modelled and accounted for.
More formally, the ALDEx2 package estimates per-feature technical variation within each sample using Monte-Carlo instances drawn from the Dirichlet distribution. Sampling from this distribution returns a posterior probability distribution of the observed data under a repeated sampling model. This distribution is converted to a log-ratio that linearizes the differences between features. This log-ratio can be modified to include uncertainty regarding the scale of the data when the log-ratio is calculated. Thus, the values returned by ALDEx2 are posterior estimates of test statistics calculated on log-ratio transformed distributions. ALDEx2 has been altered to ensure that the elements of the posterior statistical tests agree on their sign with the outcome that some edge cases with low predicted p-values are no longer counted as significant at high false discovery rates. Combining two-sample tests into a posterior estimate is somewhat intricate14 Van Zwet and Oosterhoff (1967), but in practice the outcomes should be very similar to those seen before, but edge cases with expected false discovery rates greater than 0.05 have become more rigorous.
ALDEx2 uses by default the centred log-ratio (clr) transformation which is scale invariant15 Aitchison (1986). While this is a convenient assumption for many cases, this can lead to Type 1 and Type 2 errors16 Nixon et al. (2023) and we now suggest that modest amounts of scale be built into the analysis by default.
The sub-compositional coherence property ensures that the answers obtained are consistent when parts of the dataset are removed (e.g., removal of rRNA reads from RNA-seq studies or rare OTU species from 16S rRNA gene amplicon studies). In the absence of scale, all feature abundance values are expressed relative to the geometric mean abundance of other features in a sample. This is conceptually similar to a quantitative PCR where abundances are expressed relative to a standard: in the case of the clr transformation, the standard is the per-sample geometric mean abundance. See Aitchison (1986) for a complete description. If the question at hand is relative change, then ALDEx2 can do this. If the question at hand is scale dependent, then ALDEx2 can do that too!
aldexwith 2 groups:
The ALDEx2 package in Bioconductor is suitable for the comparison of many different experimental designs.
The test dataset is a single gene library with 1600 sequence variants17 McMurrough et al. (2014) cloned into an expression vector at approximately equimolar amounts. The wild-type version of the gene conferred resistance to a topoisomerase toxin. Seven independent growths of the gene library were conducted under selective and non-selective conditions and the resulting abundances of each variant was read out by sequencing a pooled, barcoded library on an Illumina MiSeq. The data table is accessed by
data(selex). In this data table, there are 1600 features and 14 samples. The analysis takes approximately 2 minutes and memory usage tops out at less than 1Gb of RAM on a mobile i7 class processor when we use 128 Dirichlet Monte-Carlo Instances (DMC). On an M2 Pro chip, this takes approximately 6 seconds. For speed concerns we use only the last 400 features and perform only 16 DMC. The command used for ALDEx2 is presented below:
library(ALDEx2) #subset only the last 400 features for efficiency data(selex) selex.sub <- selex[1200:1600,] conds <- c(rep("NS", 7), rep("S", 7)) x.aldex <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE, include.sample.summary=FALSE, denom="all", verbose=FALSE, paired.test=FALSE, gamma=NULL)
# note default is FDR of 0.05 par(mfrow=c(1,3)) aldex.plot(x.aldex, type="MA", test="welch", xlab="Log-ratio abundance", ylab="Difference", main='Bland-Altman plot') aldex.plot(x.aldex, type="MW", test="welch", xlab="Dispersion", ylab="Difference", main='Effect plot') aldex.plot(x.aldex, type="volcano", test="welch", xlab="Difference", ylab="-1(log10(q))", main='Volcano plot')