Good scientific practice: Design of High Throughput Experiments and their Analysis

Susan Holmes, Wolfgang Huber

2022-06-24

Design of High Throughput Experiments

RA Fisher Pioneer of Experimental Design
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
(R.A. Fisher, Presidential Address to the First Indian Statistical Congress, 1938. Sankhyā 4, 14-17; see also Fisher 1935.)

Goals for this Lecture

The Art of “Good Enough”

Types of studies / experiments

Experiment

Retrospective observational studies

Prospective, controlled studies

Meta-analysis

Illustration: experiment

Well-characterized cell line growing in laboratory conditions on defined media, temperature and atmosphere.

We administer a precise amount of a drug, and after 72h we measure the activity of a specific pathway reporter.

Illustration: challenges with studies

We recruit 200 patients who have a disease and fulfill inclusion criteria (e.g., age, comorbidities, mental capacity), and we ask them to take a drug each day at exactly 6 am. After 3 months, we take an MRI scan and many other biomarkers to see whether and how the disease has changed, and whether there were any side effects.

What to do about this?

Examples

What is a good normalization method?

library("DESeq2")
library("airway")
library("ggplot2")
library("dplyr")
library("gridExtra")
data("airway")
aw = DESeqDataSet(airway, design = ~ cell + dex) %>% estimateSizeFactors
sizeFactors(aw)

samples = c("SRR1039513", "SRR1039517") 

myScatterplot = function(x) {
  as_tibble(x) %>% 
  mutate(rs = rowSums(x)) %>%
  filter(rs >= 2) %>%
  ggplot(aes(x = asinh(SRR1039513), 
             y = asinh(SRR1039517))) + geom_hex(bins = 50) +
    coord_fixed() + 
    geom_abline(slope = 1, intercept = 0, col = "orange") + 
    theme(legend.position = "none")
}

grid.arrange(
  myScatterplot(counts(aw)),
  myScatterplot(counts(aw, normalized = TRUE)),
  ncol = 2)

What do we want from a good normalization method:

Possible figure of merit?
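One possible choice (a sketch, not from the original slides; the filtering threshold and the asinh scale simply mirror the plotting code above): for two comparable samples, compute the root-mean-square deviation from the diagonal y = x on the asinh scale, before and after normalization. A smaller value after normalization would support the method.

rmsFromDiagonal = function(x) {
  x = x[rowSums(x) >= 2, samples]      # same filtering as in the plots above
  d = asinh(x[, 1]) - asinh(x[, 2])    # deviation from the diagonal
  sqrt(mean(d^2))
}
rmsFromDiagonal(counts(aw))
rmsFromDiagonal(counts(aw, normalized = TRUE))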

Occam’s razor

William of Ockham

If one can explain a phenomenon without assuming this or that hypothetical entity, there is no ground for assuming it.

One should always opt for an explanation in terms of the fewest possible causes, factors, or variables.

Error models: Noise is in the eye of the beholder

The efficiency of most biochemical or physical processes involving DNA-polymers depends on their sequence content, for instance, occurrences of long homopolymer stretches, palindromes, GC content.

These effects are not universal, but can also depend on factors like concentration, temperature, which enzyme is used, etc.

When looking at RNA-Seq data, should we treat GC content as noise or as bias?

One person’s noise can be another’s bias

We may think that the outcome of tossing a coin is completely random.

If we meticulously registered the initial conditions of the coin flip and solved the mechanical equations, we could predict which side has a higher probability of coming up: noise becomes bias.

Coin-tossing machine of P. Diaconis, S. Holmes, et al.

We use probabilistic modelling as a method to bring some order into our ignorance – but let’s keep in mind that these models are a reflection of our subjective ignorance, not an objective property of the world.

Biological versus technical replicates

Imagine we want to test whether a weight loss drug works. Which of the following designs is better?

This example shows the difference between technical and biological replicates.

Some design decisions in HT biology are similar, if more subtle:

Quiz

For reliable variant calling with current sequencing technology, you need ca. \(30\times\) coverage per genome.

In the 1000 Genomes Project, the average depth of the data produced was 5.1× for 1,092 individuals. Why was that study design chosen?

What exactly is a technical versus biological replicate?

In fact, the terminology is too simplistic. Error can creep in at many different levels:

We will use the notion of blocks.

If we know about these nuisance factors and have kept track of them, we can (and should) include them explicitly as bias terms in our models.

If we did not keep track, we can try to use latent factor models (SVA, PEER, RUV) to identify them from the data.
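As a hedged illustration (not part of the original slides), surrogate variables could be estimated with the sva package, here applied to the airway data loaded earlier; the estimated factors could then be added to the DESeq2 design formula.

library("sva")
dat  = counts(aw, normalized = TRUE)
dat  = dat[rowSums(dat) > 0, ]                 # drop all-zero rows
cd   = as.data.frame(colData(aw))
mod  = model.matrix(~ dex, data = cd)          # model with the factor of interest
mod0 = model.matrix(~ 1,   data = cd)          # null model
sv   = svaseq(dat, mod, mod0)                  # estimated surrogate variables in sv$sv

The columns of sv$sv could then be appended to colData(aw) and included as covariates in the design.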

A lack of units

Pre-modern measurement systems measured lengths in feet, arms, inches (first joint of an index finger), weights in stones, volumes in multiples of the size of a wine jar, etc.

In the International System of Units: meters, seconds, kilograms, Ampères, … are defined based on universal physical constants. A meter measured by a lab in Australia using one instrument has the same meaning as a meter measured a year later by a lab in Stanford using a different instrument, by a space probe in the Kuiper belt, or ETs on Proxima Cen b.

Measurements in biology are, unfortunately, rarely that comparable.

Often, absolute values are not reported (these would require units), but only fold changes with regard to some local reference.

Even when absolute values exist (e.g., read counts in an RNA-Seq experiment) they usually do not translate into universal units such as molecules per cell or mole per milliliter.

Regular and catastrophic noise

Regular noise can be modelled by simple probability models such as independent normal distributions, Poissons, or mixtures such as Gamma-Poisson or Laplace.

We can use relatively straightforward methods to take noise into account in our data analyses.

The real world is more complicated: measurements can be completely off the scale (a sample swap, a contamination or a software bug), and multiple errors often come together: e.g., a whole microtiter plate went bad, affecting all data measured from it.

Such events are harder to model or even correct for – our best chance to deal with them is data quality assessment, outlier detection and documented removal.

Keeping track: Dailies

A film director views the daily takes to catch lighting or shooting problems before they affect too much footage. Likewise, it is a good idea not to wait until all runs of an experiment have been finished before looking at the data.

Intermediate data analyses and visualizations reveal unexpected sources of variation in the data as they arise and enable you to adjust the protocol.

It is important to be aware of such sources of variation as they occur and to adjust for them.

RNAi screen example

Basic principles in the design of experiments

Clever combinations and balancing: a motivating example

A pharmacist's balance: the weighing analogy of Hotelling (1944) and Mood (1946).

Experimental design aims to make the best use of the available resources. One strategy is to capitalize on cancellations and symmetries. An instructive example is a weighing scheme devised by Hotelling.

Consider a set of eight objects of unknown weights \(m=(m_1,\ldots,m_8)\).

For our simulation study, we create a vector of true \(m\) randomly.

library("magrittr")   # provides the tee pipe %T>%
m = (sample(seq(2, 16, by = 2), 8) + round(rnorm(8, 1), 1)) %T>% print
## [1]  8.6 10.9  4.4 16.9  3.5 15.0  7.8 13.6

Method 1: Naive method, eight weighings

First, let’s weigh each \(m_i\) individually.

Suppose our scale has errors distributed normally with a SD of 0.1. We simulate the measurements (incl. individual errors) as follows. We also compute the root of the mean of their squares (RMS) as a measure of overall error.

X_1to8 = m + rnorm(length(m), mean = 0, sd = 0.1)
X_1to8
## [1]  8.359209 10.868363  4.298189 16.896361  3.454035 14.967027
## [7]  7.857463 13.452812
errors_1to8 = X_1to8 - m
errors_1to8
## [1] -0.240791027 -0.031636891 -0.101811022 -0.003639091 -0.045965424
## [6] -0.032972940  0.057463277 -0.147188095
rms = function(x) sqrt(mean(x^2))
rms(errors_1to8)
## [1] 0.1104119

Method 2: Hotelling’s method, eight weighings

library("survey")
h8 = hadamard(6)
coef8 = 2 * h8 - 1
coef8
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    1    1    1    1    1    1    1
## [2,]    1   -1    1   -1    1   -1    1   -1
## [3,]    1    1   -1   -1    1    1   -1   -1
## [4,]    1   -1   -1    1    1   -1   -1    1
## [5,]    1    1    1    1   -1   -1   -1   -1
## [6,]    1   -1    1   -1   -1    1   -1    1
## [7,]    1    1   -1   -1   -1   -1    1    1
## [8,]    1   -1   -1    1   -1    1    1   -1

We use the columns of this matrix to guide our new weighing scheme.

The first column says we should put all 8 pieces on one side of the balance and weigh them, call this Y[1].

The second column says: place objects 1, 3, 5, 7 on one side of the balance and objects 2, 4, 6, 8 on the other, and record the difference. Call the result Y[2].

And so on.

Each of the eight weighings has a normal error with sd = 0.1.

Y = m  %*% coef8 + rnorm(length(m), mean = 0, sd = 0.1)
mhat = Y %*% t(coef8) / 8

Now, because we know the true \(m\), we can compute the errors and their sum of squares:

errors_hotel = as.vector(mhat) - m
errors_hotel
## [1]  0.052211813  0.035408972 -0.051291112  0.027410736 -0.028415118
## [6] -0.006958278  0.034178628 -0.049007538
rms(errors_hotel)
## [1] 0.03838579
rms(errors_1to8) / rms(errors_hotel)
## [1] 2.876375

Proper simulation

We saw that the second method had RMS about 3 times smaller than the first. Were we just lucky?

tcoef8 = t(coef8) / 8
errs = replicate(10000, {
  m = sample(seq(2, 16, by = 2), 8) + round(rnorm(8, 1), 1)
  X_1to8 = m + rnorm(length(m), mean = 0, sd = 0.1)
  err_1to8 = X_1to8 - m
  Y = coef8 %*% m + rnorm(length(m), mean = 0, sd = 0.1)
  mhat = tcoef8 %*% Y
  err_hotel = mhat - m
  c(rms(err_1to8), rms(err_hotel))  
})

library("ggplot2")
library("tibble")
ggplot(tibble(ratio = errs[1,] / errs[2,]), aes(x = log2(ratio))) +
  geom_histogram(bins =  50)

We say that the second scheme is more efficient than the first by a factor of \(\sqrt{8}\) because the RMS errors generated by the measurement are \(\sqrt{8}\) times lower (\(\log_2\sqrt{8}\approx\) 1.5).
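A short calculation shows where this factor comes from. Writing \(\hat{m}_i=\frac{1}{8}\sum_{j=1}^{8}c_{ij}Y_j\) with \(c_{ij}=\pm 1\) and independent measurement errors of standard deviation \(\sigma\),
\[
\text{sd}(\hat{m}_i)=\frac{1}{8}\sqrt{\sum_{j=1}^{8}c_{ij}^2\,\sigma^2}=\frac{\sigma}{\sqrt{8}},
\]
whereas in the naive scheme each estimate has standard deviation \(\sigma\).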

One factor at a time?

Ibn Sina (Avicenna), Physician and Scientist
His Canon of Medicine (1020) lists seven rules of experimental design, including the need for controls and replication, the danger of confounding, and the necessity of changing only one factor at a time.

This dogma was overthrown in the 20th century by RA Fisher.

Comparing two levels of one factor: healthy or diseased.

# p0, p0effect, p0batch, p1, pN1 (ggplot objects) and batdef, dfN (data frames)
# are created in setup chunks not shown on these slides
grid.arrange(p0, p0effect, ncol = 2)

However, suppose the experiment was done in two “batches”, and we color the data according to batch:

grid.arrange(p0, p0batch, ncol = 2)

We cannot draw a conclusion: disease state and batch are confounded.

Now suppose the experiment has higher noise levels and the same number of points as in the previous study. Then the sample size (2 × 6) is not enough. With the same error and a bigger sample size (2 × 60), it looks better.

grid.arrange(p1, pN1, ncol=2)

The experiment at \(n_1=n_2=6\) is not powerful enough.

t.test(`Biomarker B` ~ states, data = batdef)
## 
##  Welch Two Sample t-test
## 
## data:  Biomarker B by states
## t = -2.7115, df = 9.9937, p-value = 0.02189
## alternative hypothesis: true difference in means between group healthy and group tumor is not equal to 0
## 95 percent confidence interval:
##  -1.1592407 -0.1133796
## sample estimates:
## mean in group healthy   mean in group tumor 
##              1.549076              2.185387

With the same effect size and a larger sample size, we have the power to see the difference:

t.test(`Biomarker C`~ states_60, data = dfN)
## 
##  Welch Two Sample t-test
## 
## data:  Biomarker C by states_60
## t = -7.822, df = 116.83, p-value = 2.56e-12
## alternative hypothesis: true difference in means between group healthy and group tumor is not equal to 0
## 95 percent confidence interval:
##  -0.5516679 -0.3287520
## sample estimates:
## mean in group healthy   mean in group tumor 
##              1.493643              1.933853

Lessons from these toy examples

Whether we can detect an effect depends on (see the sketch after this list):

  1. Effect size (unchangeable)
  2. Control and documentation of all factors (block effects, date / time / operator etc.).
  3. Noise (variability in measurements)
  4. Sample size: remember that the variance of the mean is \(\sigma^2/n\), so the standard error of the mean is \(\sigma/\sqrt{n}\)
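These dependencies can be explored directly with the built-in stats::power.t.test function (a quick sketch; the effect size delta and sd below are arbitrary illustrative values):

# power of an unpaired two-sample t-test for several per-group sample sizes
sapply(c(6, 15, 60), function(n)
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power)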

Decomposition of variability: analysis of variance.

Blocking: the case of paired experiments.

Each pot in Darwin's Zea mays experiment is a block: only the factor of interest (pollination method) should differ; all other factors should be kept equal within a block.
A balanced design is an experimental design where all the different factor combinations have the same number of observation replicates. Such data are particularly easy to analyse because the effect of each factor is identifiable.
When there are (likely) nuisance factors, it is good to make sure they are balanced with respect to the factors of interest. Sometimes this is inconvenient or impractical for logistical or economic reasons, but in such cases analysts are on thin ice and need to proceed with caution.

Comparing paired versus unpaired design

When comparing various possible designs, we do power simulations.

Here, we suppose the sample size is 15 in each group and the effect size is 0.2. We also need to make assumptions about the variability of the measurements: here, both groups have the same measurement noise (sd = 0.25), and there is additional pot-to-pot variation (sd = 1) shared by the two plants in a pot.

n = 15
effect = 0.2
pots   = rnorm(n, 0, 1)      # pot-to-pot variation, shared by the two plants in a pot
noiseh = rnorm(n, 0, 0.25)   # measurement noise, hybrid plants
noisea = rnorm(n, 0, 0.25)   # measurement noise, autozygous plants
autoz  = pots + noisea
hybrid = pots + effect + noiseh

Perform a simple and a paired \(t\)-test. Which is more powerful in this case?

t.test(hybrid, autoz, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  hybrid and autoz
## t = 0.58292, df = 27.908, p-value = 0.5646
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6118805  1.0985542
## sample estimates:
##  mean of x  mean of y 
##  0.2016872 -0.0416497
t.test(hybrid, autoz, paired = TRUE)
## 
##  Paired t-test
## 
## data:  hybrid and autoz
## t = 3.3113, df = 14, p-value = 0.005145
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.08572565 0.40094807
## sample estimates:
## mean difference 
##       0.2433369

Maybe we were just lucky (due to randomness in our simulation) here.

To confirm, run a parametric bootstrap experiment: generate data as above B = 1000 times and compute the fraction of simulations in which each test rejects at the false positive rate \(\alpha=0.05\).

pvals = replicate(1000, {
  pots   = rnorm(15,0,1)
  noiseh = rnorm(15,0,0.25)
  noisea = rnorm(15,0,0.25)
  autoz  = pots + noisea
  hybrid = pots + effect + noiseh
  sapply(c(FALSE, TRUE), function(prd) 
    setNames(t.test(hybrid, autoz, paired = prd)$p.value, paste(prd))
  )
}) 
rowMeans(pvals < 0.05)
## FALSE  TRUE 
## 0.000 0.538

We can plot the p-values obtained using both methods:

library("tidyr")
pivot_longer(as_tibble(t(pvals)), cols = all_of(rownames(pvals))) %>%
  transform(name = paste("paired =", name)) %>%
  ggplot(aes(x = value, fill = name)) + 
    geom_histogram(binwidth = 0.02, alpha = exp(-1))

Exercises

powercomparison = function(potsd = 1, noisesd = 0.25, n = 15, effect = 0.2,
                           B = 1000, pthresh = 0.05) {
  pv = replicate(B, {
    pots   = rnorm(n, 0, potsd)
    noiseh = rnorm(n, 0, noisesd)
    noisea = rnorm(n, 0, noisesd)
    autoz  = pots + noisea
    hybrid = pots + effect + noiseh
    sapply(c(FALSE, TRUE), function(prd) 
      setNames(t.test(hybrid, autoz, paired = prd)$p.value, paste(prd))
    )
  })
  rowMeans(pv < pthresh)
}

Here are a few examples showing that when the pot-to-pot sd is smaller than the measurement noise sd, pairing hardly makes a difference; when the pot-to-pot variability is larger than the measurement noise, pairing makes a big difference.

powercomparison(potsd = 0.5,  noisesd = 0.25)
## FALSE  TRUE 
## 0.037 0.515
powercomparison(potsd = 0.25, noisesd = 0.5)
## FALSE  TRUE 
## 0.134 0.175
powercomparison(potsd = 0.25, noisesd = 0.1)
## FALSE  TRUE 
## 0.511 0.998
powercomparison(potsd = 0.1,  noisesd = 0.25)
## FALSE  TRUE 
## 0.472 0.510

For 100 plants of each type and the two SDs being 0.5, the power of the paired test is nearly 80%.

powercomparison(potsd = 0.5, noisesd = 0.5, n = 100)
## FALSE  TRUE 
## 0.512 0.787

Lesson: take into account a natural pairing of the observations – for instance, studies of patients before and after a treatment.

What can be done when pairing is not available?

Try to create pairs of subjects that are as similar as possible, by matching age, sex, background health, etc. One is treated; the other serves as a control.

“Block what you can, randomize what you cannot”

(George Box, 1978)

Often we don’t know which nuisance factors will be important, or we cannot plan for them ahead of time.

In such cases, randomization is a practical strategy: at least in the limit of large enough sample size, the effect of any nuisance factor should average out.

Randomized Complete Block Design

“The design space is divided into uniform units to account for any variation so that observed differences within units are largely due to true differences between treatments. Treatments are then assigned at random to the subjects in the blocks - once in each block. The defining feature of the Randomized Complete Block Design is that each block sees each treatment exactly once.” (Trudi Grant)
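A minimal sketch of generating such an assignment in R (the treatment and block labels are made up for illustration):

treatments = c("A", "B", "C", "D")
blocks     = paste0("block", 1:5)
# within each block, every treatment occurs exactly once, in random order
design = sapply(blocks, function(b) sample(treatments))
rownames(design) = paste0("unit", seq_along(treatments))
design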

Balanced Incomplete Block Design

Complete Factorial Latin Squares

Sudoku, anyone?

(Latin square: n × n array filled with n different symbols, each occurring exactly once in each row and exactly once in each column)
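A cyclic construction gives a Latin square of any size (a sketch; in practice one would also randomize rows, columns and symbols):

k = 4
latin = outer(0:(k - 1), 0:(k - 1), function(i, j) (i + j) %% k + 1)
latin   # each of the k symbols occurs exactly once per row and per column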

Randomization decreases bias

Randomization helps inference

Random does not mean haphazardly

Controls, positive and negative: why?

We often need to remove variation due to unknown factors, or decompose variability according to its different sources; this is classically done using analysis of variance, which can accommodate factors such as subject or batch effects.

Usually these decompositions require at least 3 replicate measurements in each cell.

Removal of effects from unknown sources can only be done through the use of negative controls.

Calibration of the effect size in an experiment also requires positive controls: spike-ins (for instance, the External RNA Control Consortium controls used in Risso et al. (2014)), in which a known quantity or a known expression level aids calibration; they are a standard part of many experimental protocols.

Validation in independent data / independent lines of argument

If it is too good to be true, it usually is….

Synonymous mutations in representative yeast genes are mostly strongly non-neutral

How many replicates do I need?

We saw how we can use preliminary data and simulations to calculate how many replicates are needed to achieve, say, an 80% true positive rate, if we know the effect size.

Now, recall the discussion of experiments versus studies.

For the cell line experiment, we might get the correct result already from one replicate; usually we’ll do two or three to be sure.

On the other hand, for the study, our intuition tells us that there is so much uncontrolled variability that 20 is likely far too few, and we may need 200 or 2,000 patients for a reliable result. The number depends on the amount of uncontrolled variability, and the effect size. A pragmatic approach is to check out previous successful (or unsuccessful) experiments or studies that did something comparable and use simulations, subsampling or bootstrapping to get an estimate of our proposed new study’s power. Here are more details about how to go about this in practice.

Power depends on sample sizes, effect sizes and variability.

Effective sample size for dependent data.

Dependent data require larger samples (intuition: each sample contains less information)

A sample of independent observations is more informative than the same number of dependent observations.

Suppose you want to do an opinion poll by knocking on people's doors and asking them a question. Scenario 1: you pick addresses completely at random. Scenario 2: you pick a few addresses at random and also poll two immediate neighbors of each.

In both cases, the number of people polled is \(n\), but if we assume that people living in the same neighborhood are more likely to have the same opinion, the data from Scenario 2 are (positively) correlated. To explore this, let's do a simulation.

doPoll = function(n = 100, numPeoplePolled = 12) {
  opinion = sort(rnorm(n))           # sorted: neighboring houses hold similar opinions
  i1 = sample(n, numPeoplePolled)    # Scenario 1: independent random sample
  i2 = sample(seq(3, n, by = 3), numPeoplePolled / 3)
  i2 = c(i2, i2 - 1, i2 - 2)         # Scenario 2: each sampled address plus its two neighbors
  c(independent = mean(opinion[i1]), correlated = mean(opinion[i2]))
}
responses = replicate(5000, doPoll())
ggplot(pivot_longer(as_tibble(t(responses)), cols = all_of(rownames(responses))), 
       aes(x = value, col = name)) + geom_density() +
  geom_vline(xintercept = 0) + xlab("Opinion poll result")
Density estimates for the polling result using the two sampling methods. The correlated method has higher spread. The truth is indicated by the vertical line.

There are 100 people in the country, of which in the first approach (i1) we randomly sample 12. In the second approach, we sample 4 people as well as two neighbors for each (i2). The opinion in our case is a real number, normally distributed in the population with mean 0 and standard deviation 1.

We model the spatio-sociological structure of our country by sorting the houses from most negative to most positive opinion in the first line of the doPoll function.

Time course: equalize the dependence structure by taking more frequent samples when there is more “going on”.

Response to a transient perturbation

Mean-Variance Relationships and Transformations

You’ve seen examples of data transformations that compress or stretch the space of quantitative measurements in such a way that the measurements’ noise variance (not the signal variance) is more similar throughout, i.e., no longer highly dependent on the mean value.

The mean-variance relationship of a measurement technology can in principle be any function, but in many cases one of the following prototypic relationships holds:

  1. constant: the variance is independent of the mean (“additive noise”, \(Y=\mu+\varepsilon\))

  2. Poisson: the variance is proportional to the mean (“shot noise”, \(N\sim\text{Poi}(\mu)\))

  3. quadratic: the standard deviation is proportional to the mean, therefore the variance grows quadratically (“multiplicative noise”, \(Y=\mu \cdot e^\varepsilon\))

Do you know examples for biological assays (or measurement technologies) whose data show these types of relationships?

Real data can also be affected by a combination of these. For instance, microarray intensities show a combination of additive and multiplicative noise (Rocke and Durbin 2001).

We recall from Chapter 5 that for data with a mean-variance relationship \(v(\mu)\) the variance-stabilizing transformation \(g:\mathbb{R}\to\mathbb{R}\) fulfills the condition \[g'(x)=\frac{1}{\sqrt{v(x)}}\]
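As a reminder of how this condition is used: in the Poisson (shot noise) case, \(v(\mu)=\mu\), so \(g'(x)=1/\sqrt{x}\) and hence \(g(x)=2\sqrt{x}+\text{const}\); the square-root transformation stabilizes shot noise.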

  1. What are the variance-stabilizing transformations associated with the above three prototypic mean-variance relationships?
  2. What is the variance stabilizing transformation that is appropriate for \(v(\mu) = v_0 + c\,\mu^2\), where \(v_0>0\) is a positive constant?

Data quality assessment and quality control

These pervade all phases of an analysis, from assembling the raw data, through transformation, summarization, model fitting, hypothesis testing or screening for “hits”, to interpretation. QA-related questions include:

For the last two sets of questions, heatmaps, principal component plots and other ordination plots are useful.

Henry Ford’s (possibly apocryphal) quote “If I had asked people what they wanted, they would have said faster horses.” (https://corporate.ford.com/history.html) expresses the view of quality as fitness for purpose, versus adherence to specifications.

It is not easy to nail down (mathematically define) quality, and the word is used with many meanings. See also http://en.wikipedia.org/wiki/Quality_(business)

Longitudinal Data

Longitudinal data have time as a covariate.

Don’t Pretend You Are Dumb

There is some attraction to seemingly “unbiased” approaches that analyse the data at hand without any reference to what is already known. Such tendencies are reinforced by the fact that statistical methods are often developed to be generic, for instance, working on a general matrix without specific reference to what the rows and columns signify.

Generic approaches are a good way to get started, and for analyses that are highly powered and straightforward, such an approach might work out. But often it is wasteful. Recall the example of an RNA-Seq experiment for differential expression. As we saw in the RNA-Seq and Hypothesis Testing lectures, we could perform a hypothesis test for differential expression for each annotated gene, regardless of its read counts, and then run a multiple testing method that treats all tests as exchangeable. But this is inefficient: we can improve our detection power by filtering out, or downweighting, the hypotheses with lower test power, or with lower prior probability of being alternative.

Other examples include:

Leaky pipelines and statistical sufficiency

Data analysis pipelines in high-throughput biology often work as funnels that successively summarise and compress the data. In high-throughput sequencing, we may start with individual sequencing reads, then align them to a reference, then only count the aligned reads for each position, summarise positions to genes (or other kinds of regions), then “normalize” these numbers by library size, etc.

At each step, we lose some information, and it is important to make sure we still have enough information for the task at hand. The problem is particularly acute if we use a data pipeline built from a series of separate components without enough care being taken to ensure that all the information necessary for later steps is retained.

Statisticians have a concept for whether certain summaries enable the reconstruction of all the relevant information in the data: sufficiency. E.g., in a Bernoulli random experiment with a known number of trials, \(n\), the number of successes is a sufficient statistic for estimating the probability of success \(p\).
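To see this, note that the Bernoulli likelihood depends on the data only through the success count:
\[
L(p;\,x_1,\ldots,x_n)=\prod_{i=1}^{n}p^{x_i}(1-p)^{1-x_i}=p^{\,s}(1-p)^{\,n-s},
\qquad s=\sum_{i=1}^{n}x_i .
\]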

In a 4-state Markov chain (A,C,G,T) such as the one we saw in Lecture 2, what are the sufficient statistics for the estimation of the transition probabilities?

Iterative approaches akin to what we saw when we used the EM algorithm can sometimes help avoid information loss. For instance, when analyzing mass spectrometry data, a first pass detects peaks individually in every sample. After this preliminary peak-spotting, another iteration borrows strength from the other samples to recover peaks that may initially have been dismissed as noise.

Sharpen Your Tools: Reproducible Research

Analysis projects often begin with a simple script, perhaps to try out a few initial ideas or explore the quality of the pilot data. Then more ideas are added, more data come in, other datasets are integrated, more people become involved. Eventually the paper needs to be written, figures be redone properly, and the analysis be saved for the scientific record and to document its integrity. Here are a few tools that can help with such a process.

Use an integrated development environment. RStudio is a great choice; there are also other platforms such as Emacs or Eclipse.

Use literate programming tools such as Rmarkdown or Jupyter. This is more readable (for yourself and for others) than burying explanations and usage instructions in comments in the source code or in separate README files, in addition you can directly embed figures and tables in these documents. Such documents are often good starting points for the supplementary material of your paper.

Anticipate re-engineering of the data formats and the software. The first version of how you choose to represent the data and structure the analysis workflows will rarely be the best. Don’t be afraid to make a clean cut and redesign them as soon as you notice that you are doing a lot of awkward data manipulations or repetitive steps. This is time well-invested. Sometimes it also helps to unearth bugs.

Reuse existing tools. Don’t reinvent the wheel; rather spend your time on things that are actually new. Before implementing a ‘heuristic’ or a temporary hack that analyses your data in a certain way, spend a couple of minutes researching whether something like this has been done before. More often than not, it has, and there is a clean, scalable, and already tested solution.

Use version control, such as git. This takes some time to learn, but in the long run will be infinitely better than all your self-grown attempts at managing evolving code with version numbers, switches and the like. Moreover, this is also the sanest option for collaborative work on code, and it provides an extra backup of your codebase, especially if the server is distinct from your workplace machine.

Use functions rather than copy-pasting stretches of code.

Use the R package system. Soon you’ll notice recurring function or variable definitions that you want to share between your individual scripts. It is fine to manage them initially with a script that you source(), but it is never too early to move them into your own package, at the latest when you find yourself starting to write README files or long emails explaining to others how to use some script or another. Assembling existing code into an R package is not hard by any means, and offers you many goodies, including standardized and convenient ways to provide documentation, to show code usage examples, to test the correct functioning of your code, and to version it. Quite likely you’ll soon appreciate the benefit of using namespaces.

Centralize the location of the raw data files and streamline the derivation of intermediate data. Store the input data on a centralized file server that is professionally backed up. Mark the files as read-only. Have a clear and linear workflow for computing the derived data (e.g., normalised, summarised, transformed etc.) from the raw files, and store these in a separate directory. Anticipate that this workflow will need to be re-run several times, and version it. Use a synchronization tool such as rsync to mirror these files on your personal computer.

Integration. When developing downstream analysis ideas that bring together several different data types, you don’t want to convert from data-type-specific formats to the representations that machine learning or generic statistical methods use on an ad hoc basis each time. Have a recipe script that assembles the different ingredients and cooks them up as an easily consumable matrix, data frame or Bioconductor object (such as a SummarizedExperiment).

Keep a hyperlinked webpage with an index of all analyses. This is helpful for collaborators (especially if the page and the analysis can be accessed via a web browser) and also a good starting point for the methods part of your paper. Structure it in chronological or logical order, or a combination of both.

Data representation

Combining all the data so it is ready for analysis or visualisation often involves a lot of shuffling around of the data, until they are in the right shape and format for an analytical algorithm or a graphics routine.

Errors can occur: labels get lost, information gets lost. Be safe: redundancy is good.

Wide vs long table format

Recall Hiiragi data (for space reasons we print only the first five rows and columns):

data("x", package = "Hiiragi2013")
dfw = as.data.frame(exprs(x))
dfw[1:5, 1:5]
##                1 E3.25   2 E3.25  3 E3.25   4 E3.25   5 E3.25
## 1415670_at    4.910459  7.526672 6.956328  6.424048  5.105808
## 1415671_at    9.768979  9.144228 9.295010 11.059831  9.376749
## 1415672_at   10.411893 10.918942 9.495738 10.317996 11.143684
## 1415673_at    5.618108  6.439416 6.730465  4.914527  5.619778
## 1415674_a_at  7.541891  8.380285 8.480580  7.977363  8.650312
dim(dfw)
## [1] 45101   101

This dataframe has one column of data for each sample (annotated by the column names). Its rows correspond to the probes, annotated by the probe identifiers. This is an example of a data table in wide format.

dfw$probeid = rownames(dfw)
dfl = pivot_longer(dfw, 
            cols = all_of(colnames(x)),
            names_to = "Embryonic Day") %T>% print
## # A tibble: 4,555,201 × 3
##    probeid    `Embryonic Day` value
##    <chr>      <chr>           <dbl>
##  1 1415670_at 1 E3.25          4.91
##  2 1415670_at 2 E3.25          7.53
##  3 1415670_at 3 E3.25          6.96
##  4 1415670_at 4 E3.25          6.42
##  5 1415670_at 5 E3.25          5.11
##  6 1415670_at 6 E3.25          5.86
##  7 1415670_at 7 E3.25          5.06
##  8 1415670_at 8 E3.25          4.57
##  9 1415670_at 9 E3.25          8.12
## 10 1415670_at 10 E3.25         5.46
## # … with 4,555,191 more rows

In dfl, each row corresponds to exactly one measured value, stored in the column named value. Then there are additional columns that store the associated covariates.

Now suppose we want to store somewhere not only the probe identifiers but also the associated gene symbols.

In dfw, we could stick them as an additional column, and perhaps also throw in the genes’ ENSEMBL identifier for good measure. But now we have a problem: the dataframe now has some columns that represent different samples, and others that refer to information for all samples (the gene symbol and identifier) and we somehow have to know this when interpreting the dataframe. This is what Hadley Wickham calls untidy data.

In contrast, in dfl, we can add these columns, yet still know that each row forms exactly one observation, and all information associated with that observation is in the same row.
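A hedged sketch of how such annotation could be added in the long format, using a toy annotation table (the probe-to-symbol mapping below is made up; real symbols would come from an annotation package):

annotation = tibble(
  probeid = c("1415670_at", "1415671_at"),
  symbol  = c("symbolA", "symbolB")     # placeholder values, not real gene symbols
)
dfl_annot = dplyr::left_join(dfl, annotation, by = "probeid")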

Tidy data

In tidy data

  1. each variable forms a column,

  2. each observation forms a row,

  3. each type of observational unit forms a table.

Using tidy data wisely

See also https://www.huber.embl.de/msmb/Chap-Design.html#tidy-data-using-it-wisely

Efficiency (lack of)

Even though there are only 45,101 probe-to-gene-symbol relationships, the long table dfl would store this information in each of its 4,555,201 rows, i.e., each relationship repeated once per sample.
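A quick way to see this cost (not shown in the original slides) is to compare the in-memory sizes of the two representations:

format(object.size(dfw), units = "Mb")
format(object.size(dfl), units = "Mb")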

Standardization (lack of)

We may choose to call these columns probeid and geneid, but the next person might call them probe_id and gene_id, or something completely different. When we find a dataframe that was made by someone else and that contains such-named columns, we may hope, but have no guarantee, that these are valid gene symbols.

Addressing such issues is behind the object-oriented design of the specialized data structures in Bioconductor, such as the class SummarizedExperiment.

Matrices versus dataframes

A matrix representation may also simply be more natural and mathematically stringent, e.g., if we want to apply PCA / dimension reduction or classification.
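For instance, PCA expects a samples × features matrix rather than a long table (a minimal sketch, assuming the Hiiragi2013 ExpressionSet x loaded above):

library("Biobase")
pca = prcomp(t(exprs(x)), center = TRUE, scale. = FALSE)  # samples in rows; may take a moment
head(pca$sdev)                                            # component standard deviations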

Summary

There are two types of variation in an experiment or study: those of interest and those that are unwanted.

We usually cannot rid ourselves of all the unwanted variation, but we saw how balanced, randomized designs and data transformations can mitigate it.

Reproducible workflows

Open science: make data and code available

Further Reading

Book chapter

This lecture and chapter presented only a pragmatic introduction to design; there are many book-length treatments of the subject that offer detailed advice on setting up experiments: Wu and Hamada (2011); Box, Hunter, and Hunter (1978); Glass (2007)

References

Box, George EP, William G Hunter, and J Stuart Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley & Sons.
Fisher, Ronald Aylmer. 1935. The Design of Experiments. Oliver & Boyd.
Glass, David J. 2007. Experimental Design for Biologists. Cold Spring Harbor Laboratory Press.
Hotelling, Harold. 1944. “Some Improvements in Weighing and Other Experimental Techniques.” The Annals of Mathematical Statistics 15 (3): 297–306.
Mood, Alexander M. 1946. “On Hotelling’s Weighing Problem.” The Annals of Mathematical Statistics, 432–46.
Risso, Davide, John Ngai, Terence P Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-Seq Data Using Factor Analysis of Control Genes or Samples.” Nature Biotechnology 32 (9): 896–902.
Rocke, David M, and Blythe Durbin. 2001. “A Model for Measurement Error for Gene Expression Arrays.” Journal of Computational Biology 8 (6): 557–69.
Wu, CF Jeff, and Michael S Hamada. 2011. Experiments: Planning, Analysis, and Optimization. Vol. 552. John Wiley & Sons.