0.1 Introduction

Global normalization methods such as quantile normalization [@Bolstad2003], [@Irizarry2003] have become a standard part of the analysis pipeline for high-throughput data to remove unwanted technical variation. These methods and others that rely solely on observed data without external information (e.g. spike-ins) are based on the assumption that only a minority of genes are expected to be differentially expressed (or that an equivalent number of genes increase and decrease across biological conditions [@aanes2014normalization]). This assumption can be interpreted in different ways leading to different global normalization procedures. For example, in one normalization procedure, the method assumes the mean expression level across genes should be the same across samples [@Robinson2010]. In contrast, quantile normalization assumes the only difference between the statistical distribution of each sample is technical variation. Normalization is achieved by forcing the observed distributions to be the same and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference [@Bolstad2003].

While these assumptions may be reasonable in certain experiments, they may not always be appropriate [@Loven2012], [@Hicks2015]. For example, mRNA content has been shown to fluctuate significantly during zebrafish early developmental stages [@aanes2014normalization]. Similarly, cells expressing high levels of c-Myc undergo transcriptional amplification causing a 2 to 3 fold change in global gene expression compared to cells expressing low c-Myc [@Loven2012].

Recently, an R/Bioconductor package (quantro) [@Hicks2015] has been developed to test for global differences between groups of distributions to evaluate whether global normalization methods such as quantile normalization should be applied. If global differences are found between groups of distributions, these changes may be of technical or biological of interest. If these changes are of technical interest (e.g. batch effects), then global normalization methods should be applied. If these changes are related to a biological covariate (e.g. normal/tumor or two tissues), then global normalization methods should not be applied because the methods will remove the interesting biological variation (i.e. differentially expressed genes) and artifically induce differences between genes that were not differentially expressed. In the cases with global differences between groups of distributions between biological conditions, quantile normalization is not an appropriate normalization method. In these cases, we can consider a more relaxed assumption about the data, namely that the statistical distribution of each sample should be the same within biological conditions or groups (compared to the more stringent assumption of quantile normalization, which states the statistical distribution is the same across all samples).

In this vignette we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is a weighted average of the two types of assumptions about the data. The qsmooth R-package contains the qsmooth() function, which computes a weight at every quantile that compares the variability between groups relative to within groups. In one extreme, quantile normalization is applied and in the other extreme quantile normalization within each biological condition is applied. The weight shrinks the group-level quantile normalized data towards the overall reference quantiles if variability between groups is sufficiently smaller than the variability within groups. The algorithm is described in Figure 1 below.

Let gene(g) denote the \({g}^{th}\) row after sorting each column in the data. For each row, gene(g), we compute the weight \(w_{(g)} \in [0, 1]\), where a weight of 0 implies quantile normalization within groups is applied and a weight of 1 indicates quantile normalization is applied. The weight at each row depends on the between group sum of squares \(\hbox{SSB}_{(g)}\) and total sum of squares \(\hbox{SST}_{(g)}\), as follows:

\[ w_{(g)} = \hbox{median} \{1 - \hbox{SSB}_{(i)} / \hbox{SST}_{(i)} | ~i = g-k, \dots, g, \dots, g+k \} \] where \(k=\) floor(Total number of genes * 0.05). The number 0.05 is a flexible parameter that can be altered to change the window of the number of genes considered. In this way, we can use a rolling median to borrow information from neighboring genes in the weight.

knitr::include_graphics("qsmooth_algo.jpg")
Figure 1

Figure 1: Figure 1

1 Getting Started

Load the package in R

library(qsmooth)

2 Data

The bodymapRat package contains an SummarizedExperiment of 652 RNA-Seq samples from a comprehensive rat transcriptomic BodyMap study. This data was derived from the raw FASTQ files obtained from Yu et al. (2014) . It contains expression levels from 11 organs in male and female rats at 4 developmental stages. We will use a subset of this data in this vignette.

2.1 bodymapRat example - Comparing two tissues

This example is based a dataset which contains brain and liver tissue samples from 21 week old male and female rats. eight samples are from males and eight samples are from females.

library(SummarizedExperiment)
library(bodymapRat)
bm_dat <- bodymapRat()

# select brain and liver samples, stage 21 weeks, and only bio reps
keepColumns = (colData(bm_dat)$organ %in% c("Brain", "Liver")) &
         (colData(bm_dat)$stage == 21) & (colData(bm_dat)$techRep == 1)
keepRows = rowMeans(assay(bm_dat)) > 10 # Filter out low counts
bm_dat_e1 <- bm_dat[keepRows,keepColumns]
bm_dat_e1
## class: SummarizedExperiment 
## dim: 18629 16 
## metadata(0):
## assays(1): counts
## rownames(18629): ENSRNOG00000000007 ENSRNOG00000000010 ... ERCC-00170
##   ERCC-00171
## rowData names(0):
## colnames(16): SRR1169975 SRR1169977 ... SRR1170275 SRR1170277
## colData names(22): sraExperiment sraRun ... xtile ytile

3 Using the qsmooth() function

3.1 Input for qsmooth()

The qsmooth() function must have two objects as input:

  1. object: a data frame or matrix with observations (e.g. probes or genes) on the rows and samples as the columns. qsmooth() accepts objects which are a data frame or matrix with observations (e.g. probes or genes) on the rows and samples as the columns.
  2. group_factor: a continuous or categorial covariate that represents the group level biological variation about each sample. For example if the samples represent two different tissues, provide qsmooth() with a covariate representing which columns in the object are different tissue samples.
  3. batch: optional batch covariate (multiple batches are not allowed). If batch covariate is provided, ComBat() from sva is used prior to qsmooth normalization to remove batch effects. See ComBat() for more details.
  4. norm_factors: optional scaling normalization factors. Default is NULL. If norm_factors is not equal to NULL, the user can provide a vector of scaling factors that will be used to modify the expression data set prior to applying the qsmooth algorithm.
  5. window: window size for running median (defined as a fraction of the number of rows of the data object. Default is 0.05.

3.2 Running qsmooth()

3.2.1 bodymapRat example - Comparing two tissues

Here, the groups we are interested in comparing are the two types of tissues in the 21 week old male and female rats. The groups we are interested in comparing is contained in the organ column in the colData(bm_dat_e1) dataset. To run the qsmooth() function, input the data object and the object containing the phenotypic data.

The first row shows the boxplots and density plot of the raw data that has been transformed on the log2() scale and added a pseudo-count of 1 (i.e.  log2(counts+1)).

The second row shows the boxplots and density plot of the qsmooth normalized data.

library(quantro)

par(mfrow=c(2,2))
pd1 <- colData(bm_dat_e1)
counts1 <- assay(bm_dat_e1)[!grepl("^ERCC", 
                      rownames( assay(bm_dat_e1))), ]
pd1$group <- paste(pd1$organ, pd1$sex, sep="_")

matboxplot(log2(counts1+1), groupFactor = factor(pd1$organ),
           main = "Raw data", xaxt="n", 
           ylab = "Expression (log2 scale)")
axis(1, at=seq_len(length(as.character(pd1$organ))),
     labels=FALSE)
text(seq_len(length(pd1$organ)), par("usr")[3] -2, 
     labels = pd1$organ, srt = 90, pos = 1, xpd = TRUE)

matdensity(log2(counts1+1), groupFactor = pd1$organ, 
           main = "Raw data", ylab= "density",
           xlab = "Expression (log2 scale)")
legend('topright', levels(factor(pd1$organ)), 
       col = 1:2, lty = 1)

qs_norm_e1 <- qsmooth(object = counts1, group_factor = pd1$organ)
qs_norm_e1 
## qsmooth: Smooth quantile normalization
##    qsmoothWeights (length = 18581 ): 
## 0.933 0.933 0.933 ... 0.887 0.887 0.887 
##    qsmoothData (nrows = 18581 , ncols = 16 ): 
##                     SRR1169975  SRR1169977  SRR1169979  SRR1169981  SRR1170009
## ENSRNOG00000000007 2174.512012 1801.779468 1713.209769 1947.183795 2009.479744
## ENSRNOG00000000010  408.563196  340.283829  438.121141  495.806792  488.897971
## ENSRNOG00000000012    2.280523    2.118090    1.471557    3.036711    1.431085
## ENSRNOG00000000017    5.506024    6.860026    5.087043    1.605515    6.692905
## ENSRNOG00000000021   93.035004  109.028768   98.549182  113.522437   89.975354
## ENSRNOG00000000024   20.060899   30.947871   35.232957   37.354217   28.396052
##                     SRR1170011   SRR1170013  SRR1170015  SRR1170239  SRR1170241
## ENSRNOG00000000007 1753.942781 1830.5599808 1836.751451   3.5632320   3.9628398
## ENSRNOG00000000010  533.290776  479.6564095  500.205500   0.6699115   1.5252205
## ENSRNOG00000000012    1.821102    0.7560356    2.969323   0.6699115   2.7658111
## ENSRNOG00000000017    1.193240    5.8792008    6.034787   2.5774451   0.2429941
## ENSRNOG00000000021  100.365576   92.8453323   88.179494  27.8363393  37.4890104
## ENSRNOG00000000024   19.681690   25.2376300   35.361946 178.5266277 197.0539613
##                     SRR1170243  SRR1170245 SRR1170271 SRR1170273  SRR1170275
## ENSRNOG00000000007   4.5519846   3.7718210   6.134075   3.780457   6.2792639
## ENSRNOG00000000010   0.2404039   0.2911755   1.949617   2.618445   0.1800509
## ENSRNOG00000000012   2.5981055   1.5737439   1.949617   0.236370   0.1800509
## ENSRNOG00000000017   2.5981055   1.5737439   1.949617   2.618445   0.1800509
## ENSRNOG00000000021  32.4504495  44.2178479  32.026870  40.259346  32.0963011
## ENSRNOG00000000024 192.9318750 206.2424800 199.157984 179.002150 135.9229221
##                     SRR1170277
## ENSRNOG00000000007   0.1944369
## ENSRNOG00000000010   1.3573017
## ENSRNOG00000000012   0.1944369
## ENSRNOG00000000017   4.8631190
## ENSRNOG00000000021  48.9281491
## ENSRNOG00000000024 171.2001833
##  .......
matboxplot(log2(qsmoothData(qs_norm_e1)+1), 
           groupFactor = pd1$organ, xaxt="n",
           main = "qsmooth normalized data", 
           ylab = "Expression (log2 scale)")
axis(1, at=seq_len(length(pd1$organ)), labels=FALSE)
text(seq_len(length(pd1$organ)), par("usr")[3] -2, 
     labels = pd1$organ, srt = 90, pos = 1, xpd = TRUE)

matdensity(log2(qsmoothData(qs_norm_e1)+1), groupFactor = pd1$organ,
           main = "qsmooth normalized data",
           xlab = "Expression (log2 scale)", ylab = "density")
legend('topright', levels(factor(pd1$organ)), col = 1:2, lty = 1)