Did you know that, for a vector x, if you take the order of the order of x and use that to index the vector x ordered by order(x), you end up with the original vector x? This fun fact helps us to implement ‘quantile normalization’, an approach used in early microarray analyses to make the distribution of gene expression values in two samples identical (consistent with underlying model assumptions) without eliminating between-sample variation of individual genes. We write a short script to perform quantile normalization on two samples, and then modify our work to illustrate approaches to robust and efficient code. We (a) encapsulate the script in a function, to allow convenient re-use; (b) write a simple ‘unit test’ to make sure that our function calculates the correct answer; (c) generalize the script to operate on any number of samples, using apply() on a matrix rather than referring to specific columns in a data frame; (d) improve the efficiency of the script by replacing an expensive apply(), which makes potentially tens of thousands of calls to mean(), with a single call to rowMeans(); and (e) make our function more robust by validating the inputs before performing computations. At each of these stages we use unit tests to make sure that our improvements do not undermine the correctness of the original approach. Finally, we take a quick look at performance to identify possible future improvements, and take a quick pass through a ‘tidy’ implementation of quantile normalization.
A fun party trick is to take a vector and sort it using order()
x <- runif(5)
x
## [1] 0.7809753 0.5210077 0.8932159 0.5110514 0.2235770
xo <- x[order(x)]
xo
## [1] 0.2235770 0.5110514 0.5210077 0.7809753 0.8932159
The original order can then be retrieved by applying order() twice, and using the result to index xo
xo[order(order(x))]
## [1] 0.7809753 0.5210077 0.8932159 0.5110514 0.2235770
Neat, eh? The ordered vector xo can be transformed in ways that would be awkward with the original data, for instance finding differences between successive ordered values
diff(c(0, xo))[order(order(x))]
## [1] 0.259967575 0.009956273 0.112240639 0.287474455 0.223576978
or transforming numeric values into letters, such that the smallest value of x is given the first letter, etc.
LETTERS[seq_along(xo)][order(order(x))]
## [1] "D" "C" "E" "B" "A"
Create a data.frame with two columns representing ‘gene expression’ of 1000 (in real experiments, tens of thousands of) genes in two samples. The idea is that ‘baseline’ expression follows some kind of exponential distribution
set.seed(123)
n_genes <- 1000
base <- 2 + rexp(n_genes)
and that the expression observed in each sample is the baseline plus some (normally distributed) error. The following assumes that the size of the measurement error decreases as gene expression increases; sample y has (e.g., because of sample processing) a consistently stronger signal than sample x.
expression <- data.frame(
    x = base + rnorm(n_genes, 0.3) / base,
    y = 1.1 * base + rnorm(n_genes, 0.4) / base
)
Visualize the expression of each gene in the two samples…
library(ggplot2)
ggplot(expression, aes(x, y)) + geom_point()
and the distribution of expression values across all genes
df <- reshape::melt(expression)
## Using as id variables
ggplot(df, aes(x = value, colour = factor(variable))) +
    geom_density()
A common hypothesis in, e.g., RNA-seq, is that the expression of most genes is unchanged, and differences between distributions are due to technical artifacts. Under this assumption, we expect the distributions in the second figure, above, to be identical. So we could try to normalize samples so that the distributions of each sample look about the same. Quantile normalization is an extreme form of this normalization, and makes the distributions exactly the same.
Place each column in order
x <- expression$x
xo <- x[order(x)]
y <- expression$y
yo <- y[order(y)]
For our sample of 1000 genes, xo[1] is the first quantile, xo[2] the second, etc.
Calculate the mean (or a more robust statistic, e.g., the median) for each quantile. I’ll do this by using cbind() to create a two-column matrix, and then apply() to iterate over each row to calculate the mean
mo <- cbind(xo, yo)
row_mean <- apply(mo, 1, mean)
Use the mean of each quantile in the original vector
normalized <- data.frame(
    x = row_mean[order(order(x))],
    y = row_mean[order(order(y))]
)
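A quick sanity check: after quantile normalization, the sorted values of the two samples should be exactly equal
identical(sort(normalized$x), sort(normalized$y))
## [1] TRUE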
Visualize normalized data – same overall relationship between expression in two samples
ggplot(normalized, aes(x, y)) + geom_point()
…but now identical distributions
df <- reshape::melt(normalized)
## Using as id variables
ggplot(df, aes(x = value, colour = factor(variable))) +
    geom_density()
We could write a script, copy/pasting it each time we wanted to use it
x <- expression$x
xo <- x[order(x)]
y <- expression$y
yo <- y[order(y)]
mo <- cbind(xo, yo)
row_mean <- apply(mo, 1, mean)
normalized <- data.frame(
    x = row_mean[order(order(x))],
    y = row_mean[order(order(y))]
)
It would be better to write a function
quantile_normalize <-
    function(expression)
{
    x <- expression$x
    xo <- x[order(x)]
    y <- expression$y
    yo <- y[order(y)]
    mo <- cbind(xo, yo)
    row_mean <- apply(mo, 1, mean)
    data.frame(
        x = row_mean[order(order(x))],
        y = row_mean[order(order(y))]
    )
}
and use that on different data sets.
normalized <- quantile_normalize(expression)
We’re going to update our function to be more efficient, general, and robust, but we want to have some confidence that the function still works as expected. So we’ll make a much smaller data set that we can verify ‘by hand’ and use as a test case.
expression <- data.frame(
    x = c(1, 2, 4, 7),
    y = c(3, 2, 5, 8)
)
quantile_normalize(expression)
## x y
## 1 1.5 2.5
## 2 2.5 1.5
## 3 4.5 4.5
## 4 7.5 7.5
Working this out by hand: the sorted columns are (1, 2, 4, 7) and (2, 3, 5, 8), so the quantile means are (1.5, 2.5, 4.5, 7.5), and each original value is replaced by the mean of its quantile. We can use a formal concept called a ‘unit test’ to check that our code is correct
library(testthat)
test_that("quantile_normalize() works", {
## calculate simple example 'by hand'...
x <- c(1, 2, 4, 7)
y <- c(3, 2, 5, 8)
xo <- x[order(x)]
yo <- y[order(y)]
mo <- apply(cbind(xo, yo), 1, mean)
expected <- data.frame(x = mo[order(order(x))], y = mo[order(order(y))])
## Compare to outcome of our function
expression <- data.frame(x = x, y = y)
observed <- quantile_normalize(expression)
expect_equal(observed, expected) # other expectations possible
})
Our function assumes that there are exactly two samples, and that the two samples are labelled x and y. Let’s try to generalize our function, using apply() to order an arbitrary number of columns,
m <- apply(expression, 2, function(v) v[order(v)])
m
## x y
## [1,] 1 2
## [2,] 2 3
## [3,] 4 5
## [4,] 7 8
and apply the same sort of logic to reconstruct the quantiles
quantile_normalize <-
    function(x)
{
    m <- apply(x, 2, function(v) v[order(v)])
    row_mean <- apply(m, 1, mean)
    apply(x, 2, function(v, row_mean) row_mean[order(order(v))], row_mean)
}
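Applying the generalized function to our small example should reproduce the values from before, now as a matrix (output approximately as shown)
quantile_normalize(expression)
## x y
## [1,] 1.5 2.5
## [2,] 2.5 1.5
## [3,] 4.5 4.5
## [4,] 7.5 7.5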
Our unit test will fail, because the original function returned a data.frame, whereas the current function returns a matrix. Let’s update the unit test (we could also coerce the matrix to a data.frame before returning from quantile_normalize()).
test_that("quantile_normalize() works", {
x <- c(1, 2, 4, 7)
y <- c(3, 2, 5, 8)
xo <- x[order(x)]
yo <- y[order(y)]
mo <- apply(cbind(xo, yo), 1, mean)
expected <- cbind(x = mo[order(order(x))], y = mo[order(order(y))])
expression <- data.frame(x = x, y = y)
observed <- quantile_normalize(expression)
expect_equal(observed, expected)
})
apply() is an iteration – the function is called once for each row of m. We could instead use rowMeans(), which is vectorized and makes only one function call for the entire matrix. This is more efficient in R, so let’s try to modify our function
quantile_normalize <-
    function(x)
{
    m <- apply(x, 2, function(v) v[order(v)])
    row_mean <- rowMeans(m)
    apply(x, 2, function(v, row_mean) row_mean[order(order(v))], row_mean)
}
Our unit test still works
test_that("quantile_normalize() works", {
x <- c(1, 2, 4, 7)
y <- c(3, 2, 5, 8)
xo <- x[order(x)]
yo <- y[order(y)]
mo <- apply(cbind(xo, yo), 1, mean)
expected <- cbind(x = mo[order(order(x))], y = mo[order(order(y))])
expression <- data.frame(x = x, y = y)
observed <- quantile_normalize(expression)
expect_equal(observed, expected)
})
so we have some confidence that we’ve changed our implementation without changing the outcome.
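To see the size of the speed-up for ourselves, a quick illustrative sketch (output not shown; exact timings will vary by machine) might time the two approaches on a larger matrix
m <- matrix(rnorm(1e6), ncol = 10)
system.time(apply(m, 1, mean)) # one call to mean() per row
system.time(rowMeans(m))       # a single vectorized call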
Good functions provide something like a contract – if certain types of arguments are provided, then the function will return a certain type of value. In our case we might think of the contract as “if you provide me with a data.frame with all numeric columns, or a numeric matrix, I’ll provide you with a matrix of quantile-normalized columns”. There could be additional constraints on the contract, for instance “the input cannot have NA values” or “dimnames of the input are the same as dimnames of the output”. Checking that the inputs satisfy the contract greatly simplifies reasoning about our function, and typically provides the user with a helpful indication before things go wrong in a cryptic way. So it seems like we should validate the inputs of the function as soon as possible.
quantile_normalize <-
    function(x)
{
    ## so long as the input can be coerced to a matrix...
    x <- as.matrix(x)
    ## ...and can be validated to conform to our contract...
    stopifnot(
        is.numeric(x),
        !anyNA(x)
    )
    ## ...we have confidence that we will satisfy the return value
    m <- apply(x, 2, function(v) v[order(v)])
    row_mean <- rowMeans(m)
    apply(x, 2, function(v, row_mean) row_mean[order(order(v))], row_mean)
}
We can check that our original unit test is satisfied, but also add additional tests…
test_that("'quantile_normalize()' validates inputs", {
m <- cbind(letters, LETTERS)
expect_error(quantile_normalize(m))
df <- data.frame(x=rnorm(26), y = letters)
expect_error(quantile_normalize(df))
m <- matrix(rnorm(10), nrow = 5)
m[1,1] <- NA
expect_error(quantile_normalize(m))
})
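For example, invalid input now fails early, with an informative message (approximately as shown)
quantile_normalize(cbind(letters, LETTERS))
## Error in quantile_normalize(cbind(letters, LETTERS)): is.numeric(x) is not TRUE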
It seems like our function should preserve dimnames on the original object. The function requires modification, and we can write a new unit test
quantile_normalize <-
    function(x)
{
    ## validate inputs
    x <- as.matrix(x)
    stopifnot(
        is.numeric(x),
        !anyNA(x)
    )
    ## quantile normalization
    m <- apply(x, 2, function(v) v[order(v)])
    row_mean <- rowMeans(m)
    result <-
        apply(x, 2, function(v, row_mean) row_mean[order(order(v))], row_mean)
    ## propagate dimnames
    dimnames(result) <- dimnames(x)
    result
}
test_that("'quantile_normalize()' propagates dimnames", {
m <- matrix(rnorm(10), 5, dimnames=list(LETTERS[1:5], letters[1:2]))
observed <- quantile_normalize(m)
expect_identical(dimnames(observed), dimnames(m))
})
We can also check that our function is robust to less common cases, like inputs with 0 rows and/or 0 columns. These tests lead to additional modifications – apply() does not always return a matrix!
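For instance, when the input has a single row, apply() simplifies the column-wise results to a plain vector
m1 <- matrix(1:2, nrow = 1)
apply(m1, 2, function(v) v[order(v)]) # a vector, not a 1 x 2 matrix
## [1] 1 2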
quantile_normalize <-
    function(x)
{
    ## validate inputs
    x <- as.matrix(x)
    stopifnot(
        is.numeric(x),
        !anyNA(x)
    )
    ## quantile normalize
    m <- apply(x, 2, function(v) v[order(v)])
    dim(m) <- dim(x) # apply() doesn't always return a matrix!
    row_mean <- rowMeans(m)
    result <- apply(
        x, 2, function(v, row_mean) row_mean[order(order(v))], row_mean
    )
    dim(result) <- dim(x)
    ## propagate dimnames
    dimnames(result) <- dimnames(x)
    result
}
test_that("'quantile_normalize()' works with edge cases", {
m <- matrix(rnorm(5), nrow = 5)
expect_identical(m, quantile_normalize(m))
m <- matrix(rnorm(5), ncol = 5)
expect_identical(matrix(mean(m), ncol = 5), quantile_normalize(m))
m <- matrix(0, 0, 0)
expect_identical(m, quantile_normalize(m))
})
We’d like to get a sense of how our algorithm performs for problems of various sizes. We could use system.time() to evaluate expressions, but a slightly more sophisticated approach replicates each timing, to average over differences unrelated to our algorithm.
library(microbenchmark)
n_genes <- 10000
g10 <- matrix(rnorm(n_genes * 10), ncol = 10)
g100 <- matrix(rnorm(n_genes * 100), ncol = 100)
g1000 <- matrix(rnorm(n_genes * 1000), ncol = 1000)
g10000 <- matrix(rnorm(n_genes * 10000), ncol = 10000)
times <- microbenchmark(
    quantile_normalize(g10),
    quantile_normalize(g100),
    quantile_normalize(g1000),
    quantile_normalize(g10000),
    times = 10
)
times
## Unit: milliseconds
## expr min lq mean median
## quantile_normalize(g10) 13.62116 16.41978 35.16609 20.94962
## quantile_normalize(g100) 141.18307 158.67358 209.78565 167.93833
## quantile_normalize(g1000) 1443.09392 1620.43055 2649.53222 2817.20918
## quantile_normalize(g10000) 18012.71678 20354.24486 22825.75161 21764.57106
## uq max neval
## 26.83651 138.9633 10
## 241.73374 394.6181 10
## 3298.01893 4772.0140 10
## 26625.01458 27220.2120 10
plot(times, log = "y")
The results show that the time to execute the algorithm scales approximately linearly with the number of samples; even 1000 samples take only a few seconds to normalize. Our implementation is ‘fast enough’ for many purposes.
Rprof() can be used to see where our algorithm spends its time
Rprof() # start profiling
result <- quantile_normalize(g10000)
Rprof(NULL) # stop profiling
summaryRprof()$by.total
The output looks like
total.time total.pct self.time self.pct
"quantile_normalize" 32.30 100.00 0.00 0.00
"apply" 31.92 98.82 2.92 9.04
"FUN" 23.42 72.51 6.22 19.26
"order" 17.20 53.25 16.50 51.08
"aperm.default" 2.46 7.62 2.46 7.62
"aperm" 2.46 7.62 0.00 0.00
"unlist" 1.80 5.57 1.80 5.57
"array" 1.32 4.09 1.32 4.09
"vapply" 0.34 1.05 0.24 0.74
"match.arg" 0.34 1.05 0.16 0.50
"rowMeans" 0.22 0.68 0.22 0.68
"anyNA" 0.16 0.50 0.16 0.50
"...elt" 0.16 0.50 0.00 0.00
"stopifnot" 0.16 0.50 0.00 0.00
"formals" 0.10 0.31 0.02 0.06
"match.fun" 0.08 0.25 0.08 0.25
"eval" 0.08 0.25 0.04 0.12
"sys.function" 0.08 0.25 0.02 0.06
"sys.parent" 0.06 0.19 0.06 0.19
"all" 0.02 0.06 0.02 0.06
"c" 0.02 0.06 0.02 0.06
"is.list" 0.02 0.06 0.02 0.06
"logical" 0.02 0.06 0.02 0.06
The top-to-bottom order represents the approximate total time spent ‘inside’ each function. Note the large amount of time ‘inside’ order(), and its high self.pct, versus, e.g., ‘inside’ aperm() (our code doesn’t use aperm(), but some code our code calls does…). This suggests that about 50% of the time is spent performing order(), and if we could identify ways to avoid calling order(), or to call order() more efficiently (e.g., once, rather than 10000 times), our algorithm might run faster. Such cleverness usually increases the complexity of the code, and in our case there is no real value in pursuing faster execution time.
There are several issues apparent in the code:
- order() is called multiple times, including twice on the same data; it seems like we should be able to do a better job of that (see the sketch after this list).
- apply() is an iteration, and therefore slower than ‘vectorized’ calls. It seems like we could get around the use of apply() with clever use of functions in the matrixStats package.
- We make several copies of the data – each return value is a copy, and updating dim() and dimnames() likely also makes copies – and this will be expensive both in terms of memory use and time (to allocate and clean up the additional memory).
If it were important enough, each of these areas could be the subject of more detailed effort.
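To illustrate the first point (a minimal sketch only, not pursued here): order(order(v)) is the rank of each element, and the rank can be recovered from a single call to order() by inverting the permutation
v <- c(3, 1, 2)
o <- order(v)
r <- integer(length(v))
r[o] <- seq_along(v) # invert the permutation; r equals order(order(v))
identical(r, order(order(v)))
## [1] TRUE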
Our approach uses ‘base’ R functionality. Generally, this requires a pretty high level of understanding and abstract thinking, e.g., about apply() and its FUN and ... arguments. We also have to think about different data structures – data.frame, matrix, vector, …. A different and recently popular approach is to use ‘tidy’ data. The idea is to always aim for a data representation where a single type of measurement (e.g., gene expression) occupies a single column, with other columns serving to partition the data into groups; the %>% operator is a ‘pipe’ that takes the output of the left-hand side and uses it as the input to the right-hand side
library(tidyverse)
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.1 ✔ purrr 0.3.2
## ✔ tidyr 0.8.3 ✔ dplyr 0.8.0.1
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::is_null() masks testthat::is_null()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::matches() masks testthat::matches()
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
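As a small illustration of the pipe, the following two expressions compute the same value
sqrt(sum(1:4))
## [1] 3.162278
(1:4) %>% sum() %>% sqrt()
## [1] 3.162278
Recall our small example data set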
expression
## x y
## 1 1 3
## 2 2 2
## 3 4 5
## 4 7 8
tidy <- melt(expression) %>% as_tibble()
## Using as id variables
tidy
## # A tibble: 8 x 2
## variable value
## <fct> <dbl>
## 1 x 1
## 2 x 2
## 3 x 4
## 4 x 7
## 5 y 3
## 6 y 2
## 7 y 5
## 8 y 8
Tidy analyses emphasize a few ‘verbs’ that have very stereotyped behavior, always returning a ‘tibble’ in long form. So we can use group_by() and mutate() to update our tibble with a column representing the rank (the order(order()) step) of our normalization
tidy <- tidy %>%
    group_by(variable) %>%
    mutate(o = order(order(value))) # rank of each value within its group
tidy
## # A tibble: 8 x 3
## # Groups: variable [2]
## variable value o
## <fct> <dbl> <int>
## 1 x 1 1
## 2 x 2 2
## 3 x 4 3
## 4 x 7 4
## 5 y 3 2
## 6 y 2 1
## 7 y 5 3
## 8 y 8 4
Quantile means can be calculated by group using summarize()
quantile_mean <- tidy %>%
    group_by(o) %>%
    summarize(m = mean(value))
quantile_mean
## # A tibble: 4 x 2
## o m
## <int> <dbl>
## 1 1 1.5
## 2 2 2.5
## 3 3 4.5
## 4 4 7.5
The quantile means can then be joined to our tidy data, with the join using the common column o to match rows between the two tibbles
left_join(tidy, quantile_mean)
## Joining, by = "o"
## # A tibble: 8 x 4
## # Groups: variable [2]
## variable value o m
## <fct> <dbl> <int> <dbl>
## 1 x 1 1 1.5
## 2 x 2 2 2.5
## 3 x 4 3 4.5
## 4 x 7 4 7.5
## 5 y 3 2 2.5
## 6 y 2 1 1.5
## 7 y 5 3 4.5
## 8 y 8 4 7.5
This gives us our normalized data in the column m! As a function, we might have
tidy_quantile_normalize <-
    function(tbl)
{
    ## need to validate...
    ## quantile normalize
    tidy <- tbl %>%
        group_by(variable) %>%
        mutate(o = order(order(value)))
    quantile_mean <- tidy %>%
        group_by(o) %>%
        summarize(m = mean(value))
    left_join(tidy, quantile_mean) %>%
        ## prepare result
        select(variable, value = m) %>%
        ungroup()
}
In action, we have
tbl <- melt(expression) %>% as_tibble()
tidy_quantile_normalize(tbl)
## # A tibble: 8 x 2
## variable value
## <fct> <dbl>
## 1 x 1.5
## 2 x 2.5
## 3 x 4.5
## 4 x 7.5
## 5 y 2.5
## 6 y 1.5
## 7 y 4.5
## 8 y 7.5
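In the spirit of the earlier sections, we might also add a unit test for the tidy implementation (a sketch; the expected values are taken from the worked example above)
test_that("tidy_quantile_normalize() works", {
    expression <- data.frame(x = c(1, 2, 4, 7), y = c(3, 2, 5, 8))
    tbl <- melt(expression) %>% as_tibble()
    observed <- tidy_quantile_normalize(tbl)
    expect_equal(observed$value, c(1.5, 2.5, 4.5, 7.5, 2.5, 1.5, 4.5, 7.5))
})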
For reproducibility, the session information used to produce this document:
## R Under development (unstable) (2019-04-04 r76312)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ka36530_ca/R-stuff/bin/R-devel/lib/libRblas.dylib
## LAPACK: /Users/ka36530_ca/R-stuff/bin/R-devel/lib/libRlapack.dylib
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape_0.8.8 forcats_0.4.0 stringr_1.4.0
## [4] dplyr_0.8.0.1 purrr_0.3.2 readr_1.3.1
## [7] tidyr_0.8.3 tibble_2.1.1 tidyverse_1.2.1
## [10] microbenchmark_1.4-6 testthat_2.0.1 ggplot2_3.1.1
## [13] BiocStyle_2.11.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.6 haven_2.1.0
## [4] lattice_0.20-38 colorspace_1.4-1 generics_0.0.2
## [7] htmltools_0.3.6 yaml_2.2.0 utf8_1.1.4
## [10] rlang_0.3.4 pillar_1.3.1 glue_1.3.1
## [13] withr_2.1.2 modelr_0.1.4 readxl_1.3.1
## [16] plyr_1.8.4 munsell_0.5.0 gtable_0.3.0
## [19] cellranger_1.1.0 rvest_0.3.3 codetools_0.2-16
## [22] evaluate_0.13 knitr_1.22 fansi_0.4.0
## [25] broom_0.5.2 Rcpp_1.0.1 scales_1.0.0
## [28] backports_1.1.4 BiocManager_1.30.4 jsonlite_1.6
## [31] hms_0.4.2 digest_0.6.18 stringi_1.4.3
## [34] bookdown_0.9 grid_3.7.0 cli_1.1.0
## [37] tools_3.7.0 magrittr_1.5 lazyeval_0.2.2
## [40] crayon_1.3.4 pkgconfig_2.0.2 xml2_1.2.0
## [43] lubridate_1.7.4 assertthat_0.2.1 rmarkdown_1.12
## [46] httr_1.4.0 rstudioapi_0.10 R6_2.4.0
## [49] nlme_3.1-139 compiler_3.7.0