Michael Love
June 8, 2015
(from PH525x notes and Barry, Nobel and Wright, 2008)
Consider the average t-statistic from \( N \) genes in a set \( G \):
\[ \bar{t}=\frac{1}{N} \sum_{i \in G} t_i \]
This statistic \( \bar{t} \) combines the evidence for differential expression (DE) across the set and might be a useful test statistic.
Under the null hypothesis, the \( t_i \) have mean 0. If the \( t_i \) are also independent with variance approximately 1, then \( \sqrt{N} \bar{t} \) has standard deviation 1 and is approximately normal:
\[ \sqrt{N} \bar{t} \sim N(0,1) \]
This comes from the following decomposition of the variance:
\[ \begin{eqnarray} \text{Var}(\bar{t}) &=& \frac{1}{N^2} \text{Var}(t_1 + \dots + t_N) \\ &=& \frac{1}{N^2}( \text{Var}(t_1) + \dots + \text{Var}(t_N) ) \\ &=& \frac{1}{N} \end{eqnarray} \]
Now consider the case in which the test statistics \( t_i \) in a gene set are not independent but share a common pairwise correlation \( \rho \) under the null hypothesis:
\[ \bar{t}=\frac{1}{N} \sum_{i \in G} t_i \]
\[ \text{corr}(t_i, t_{i'}) = \rho, \quad i \neq i', \quad i, i' \in G \]
The variance of the average t-statistic is then:
\[ \begin{eqnarray} \text{Var}(\bar{t}) &=& \frac{1}{N^2} \text{Var}( (1 \dots 1) (t_1 \dots t_N)' ) \\ &=& \frac{1}{N^2}(1 \dots 1) \begin{pmatrix} 1 & \rho & \dots & \rho & \rho \\ \rho & 1 & \rho & \dots & \rho \\ \dots & \dots & \dots & \dots & \dots \\ \rho & \rho & \dots & \rho & 1 \\ \end{pmatrix} (1 \dots 1) ' \\ &=& \frac{1}{N^2}\{N + (N-1) N \rho \} \\ &=& \frac{1}{N}\{1 + (N-1) \rho \} \end{eqnarray} \]
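The step from the matrix expression to \( N + (N-1) N \rho \) can be spelled out as follows. This is a short worked sketch in which \( \Sigma \) is notation introduced here for the correlation matrix above; it is not used elsewhere in these notes. Multiplying by the vectors of ones on both sides sums every entry of \( \Sigma \): the \( N \) diagonal entries equal 1 and the \( N(N-1) \) off-diagonal entries equal \( \rho \).

\[ (1 \dots 1) \, \Sigma \, (1 \dots 1)' = \sum_{i \in G} \sum_{i' \in G} \Sigma_{i,i'} = N \cdot 1 + N(N-1) \rho \]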
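So the variance inflation factor (VIF), the ratio of the variance of \( \bar{t} \) under correlation to its variance under independence, is given by the following, where \( \bar{\rho} \) denotes the average pairwise correlation in the set (equal to \( \rho \) when all pairs share the same correlation):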
\[ VIF = 1 + (N-1) \bar{\rho} \]
For example, for a gene set with 20 genes and average correlation 0.1, the standard deviation (width) of the null distribution is inflated by a factor of \( \sqrt{VIF} \):
sqrt(1 + 19 * 0.1)
[1] 1.702939
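To get a feel for how quickly the inflation grows with set size, the same calculation can be wrapped in a small helper. This is only an illustrative sketch: the function name `vif` and the set sizes chosen are assumptions made here, not part of the original notes.

```r
# VIF for a set of N genes with average pairwise correlation rho_bar
# (vif is a hypothetical helper name, not from the original notes)
vif <- function(N, rho_bar) 1 + (N - 1) * rho_bar

# Inflation of the null standard deviation for several set sizes at rho_bar = 0.1
set_sizes <- c(10, 20, 50, 100)
sd_inflation <- sqrt(vif(set_sizes, 0.1))
names(sd_inflation) <- set_sizes
sd_inflation
```

Even a modest average correlation leads to a larger inflation for larger sets, because the factor grows with \( N - 1 \).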
This VIF also holds approximately when the set statistic is tested against its complement, the genes not in the set (see Barry, Nobel and Wright, 2008).
As an example, consider an experiment comparing 5 samples to 5 samples, in which there is no true difference between the groups but gene expression is correlated across genes.
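A 5 vs 5 null experiment with correlated genes can be simulated along the following lines. This is a minimal sketch, not the code behind the original notes: the shared-factor model for inducing correlation, the helper names `row_t` and `sim_tbar`, and the settings (20 genes, correlation 0.1, 2000 replicates) are all assumptions chosen for illustration.

```r
# Minimal simulation sketch; all names and settings are illustrative assumptions
set.seed(1)
N <- 20        # genes in the set
n <- 5         # samples per group
rho <- 0.1     # common pairwise correlation between genes
nsim <- 2000   # number of simulated null experiments

# Equal-variance two-sample t-statistic for each row (gene) of x;
# columns 1..n are group 1 and columns (n+1)..2n are group 2
row_t <- function(x, n) {
  a <- x[, 1:n]
  b <- x[, (n + 1):(2 * n)]
  ma <- rowMeans(a)
  mb <- rowMeans(b)
  va <- rowSums((a - ma)^2) / (n - 1)
  vb <- rowSums((b - mb)^2) / (n - 1)
  sp <- sqrt((va + vb) / 2)          # pooled SD for equal group sizes
  (ma - mb) / (sp * sqrt(2 / n))
}

# Simulate one null experiment and return the average t-statistic of the set.
# A per-sample factor z shared by all genes gives pairwise correlation rho.
sim_tbar <- function(rho) {
  z <- matrix(rnorm(2 * n), nrow = N, ncol = 2 * n, byrow = TRUE)
  e <- matrix(rnorm(N * 2 * n), nrow = N)
  x <- sqrt(rho) * z + sqrt(1 - rho) * e
  mean(row_t(x, n))
}

tbar_indep <- replicate(nsim, sim_tbar(0))
tbar_corr <- replicate(nsim, sim_tbar(rho))

# Empirical inflation of the null standard deviation vs. the VIF prediction
sd(tbar_corr) / sd(tbar_indep)
sqrt(1 + (N - 1) * rho)
```

The two printed values should be of similar size, though not identical: the simulation has Monte Carlo error, and for t-statistics the VIF formula is only approximate.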
If the test statistic \( T \) is a linear function of the data \( X \) (e.g. the log fold change), then the correlation between test statistics equals the correlation between the data:
\[ \rho^T_{i,i'} = \rho^X_{i,i'} \]
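One way to see this, sketched under assumptions that are not stated in the notes: suppose \( T_i = \sum_j a_j X_{ij} \) with the same weights \( a_j \) for every gene, the samples \( j \) are independent, and within each sample genes \( i \) and \( i' \) have standard deviations \( \sigma_i, \sigma_{i'} \) and correlation \( \rho^X_{i,i'} \). Then the weights cancel in the ratio:

\[ \text{corr}(T_i, T_{i'}) = \frac{\sum_j a_j^2 \, \rho^X_{i,i'} \sigma_i \sigma_{i'}}{\sqrt{\sum_j a_j^2 \sigma_i^2} \sqrt{\sum_j a_j^2 \sigma_{i'}^2}} = \rho^X_{i,i'} \]

For the log fold change between two groups of \( n \) samples, the weights would be \( a_j = \pm 1/n \).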
For the t-statistic, the relationship between \( \rho^T \) and \( \rho^X \) is monotone and approximately linear, so:
\[ \rho^T_{i,i'} \approx \rho^X_{i,i'} \]
(Barry, Nobel and Wright, 2008)