Resampling Methods

Levi Waldron, CUNY School of Public Health

levi.waldron@sph.cuny.edu

waldronlab.github.io / waldronlab.org

June 15, 2017

Outline and introduction

ISLR Chapter 5: James, G. et al. An Introduction to Statistical Learning: with Applications in R. (Springer, 2013). This book can be downloaded for free at http://www-bcf.usc.edu/~gareth/ISL/getbook.html

Why do regression?

Inference

Bootstrap, permutation tests

Why do regression? (cont’d)

Prediction

Cross-validation

Cross-validation

Why cross-validation?

Figure 2.9 B

Under-fitting, over-fitting, and optimal fitting

K-fold cross-validation approach

  1. Randomly assign a fraction \(1/K\) of the observations (sampled without replacement) to the validation set
  2. Use the remaining observations as the training set
  3. Fit the model on the training set and estimate its accuracy on the validation set
  4. Repeat \(K\) times, each time with a different validation set, so that every observation is used for validation exactly once
  5. Average the validation accuracy over the \(K\) validation sets
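
The steps above can be sketched in a few lines of R. This is a minimal illustration, not the code behind the slides: the built-in mtcars data, the model mpg ~ wt, the choice K = 5, and mean squared error as the accuracy measure are all assumptions made for the example.

```r
# Sketch of K-fold cross-validation (dataset, model and K are illustrative choices)
set.seed(1)
K = 5
n = nrow(mtcars)
# Randomly assign each observation to one of K folds of nearly equal size
fold = sample(rep(1:K, length.out = n))
cv_mse = sapply(1:K, function(k) {
  train = mtcars[fold != k, ]   # training set: all folds except k
  valid = mtcars[fold == k, ]   # validation set: fold k
  fit = lm(mpg ~ wt, data = train)
  pred = predict(fit, newdata = valid)
  mean((valid$mpg - pred)^2)    # validation mean squared error
})
cv_mse        # one estimate per fold
mean(cv_mse)  # cross-validated estimate
```

Assigning fold labels once, rather than drawing a new validation sample in each round, guarantees that the K validation sets do not overlap.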

3-fold CV

Variability in cross-validation

Variability of 2-fold cross-validation (ISLR Figure 5.2)
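
One way to see this variability directly is to repeat 2-fold cross-validation with different random splits and compare the resulting estimates. The sketch below is only an illustration under assumed choices (the built-in mtcars data, the model mpg ~ wt, ten repeats); it does not reproduce the ISLR figure.

```r
# Repeat 2-fold CV with different random splits (illustrative data and model)
cv2 = function(seed) {
  set.seed(seed)
  n = nrow(mtcars)
  fold = sample(rep(1:2, length.out = n))  # random split into two halves
  mean(sapply(1:2, function(k) {
    fit = lm(mpg ~ wt, data = mtcars[fold != k, ])
    pred = predict(fit, newdata = mtcars[fold == k, ])
    mean((mtcars$mpg[fold == k] - pred)^2)  # validation MSE for fold k
  }))
}
estimates = sapply(1:10, cv2)  # ten different random splits
range(estimates)               # spread of the CV estimates
```

The spread of these estimates reflects only the randomness of the split, since the data and the model are the same in every repeat.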

Cross-validation summary

Cross-validation caveats

http://hunch.net/?p=22

Cross-validation caveats (cont’d)

Waldron et al.: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399–3406.

Cross-validation caveats (cont’d)

Cross-validation vs. cross-study validation in breast cancer prognosis

Bernau C et al.: Cross-study validation for the assessment of prediction algorithms. Bioinformatics 2014, 30:i105–12.

Permutation test

Steps of permutation test:

  1. Calculate the test statistic (e.g. a t statistic, T) in the observed sample
  2. Permutation:
    1. Permute the response values (\(Y\)) by sampling them without replacement, keeping \(X\) fixed
    2. Re-compute and store the test statistic T
    3. Repeat R times, storing the results as a vector \(T_R\)
  3. Calculate the empirical p-value: the proportion of permuted statistics \(T_R\) whose absolute value exceeds that of the observed T

Calculating a p-value

\[ P = \frac{\mathrm{sum}\left( \mathrm{abs}(T_R) > \mathrm{abs}(T) \right) + 1}{\mathrm{length}(T_R) + 1} \]
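
As a small worked example, the formula can be wrapped in a helper function. The name empirical_p and the toy numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical helper implementing the empirical p-value formula above
empirical_p = function(T_obs, T_perm) {
  (sum(abs(T_perm) > abs(T_obs)) + 1) / (length(T_perm) + 1)
}
# Toy example: 2 of 4 permuted statistics exceed |T| = 2, so P = (2 + 1) / (4 + 1) = 0.6
empirical_p(2.0, c(-3, 0.5, 1, 2.5))
# With 999 permutations and none more extreme, P = 1 / 1000 = 0.001
empirical_p(10, rep(0, 999))
```

Adding 1 to the numerator and denominator means the p-value can never be exactly zero, so the smallest attainable value is set by the number of permutations.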

Calculating a False Discovery Rate

Permutation test - pros and cons

Example from the sleep data, summary(sleep):

##      extra        group        ID   
##  Min.   :-1.600   1:10   1      :2  
##  1st Qu.:-0.025   2:10   2      :2  
##  Median : 0.950          3      :2  
##  Mean   : 1.540          4      :2  
##  3rd Qu.: 3.400          5      :2  
##  Max.   : 5.500          6      :2  
##                          (Other):8

t-test for difference in mean sleep

## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

Permutation test instead of t-test

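set.seed(1)
# Observed t statistic
Tactual = t.test(extra ~ group, data=sleep)$statistic
permT = function(){
  # Permute the group labels, keeping the responses fixed
  index = sample(1:nrow(sleep), replace=FALSE)
  t.test(extra ~ group[index], data=sleep)$statistic
}
Tr = replicate(999, permT())  # 999 permuted t statistics
# Two-sided empirical p-value
(sum(abs(Tr) > abs(Tactual)) + 1) / (length(Tr) + 1)
## [1] 0.079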

Bootstrap

The Bootstrap

Schematic of the Bootstrap

ISLR Figure 5.11: Schematic of the bootstrap

Uses of the Bootstrap

How to perform the Bootstrap

Example: bootstrap in the sleep dataset

t.test(extra ~ group, data=sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

Example: bootstrap in the sleep dataset (cont’d)

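set.seed(2)
bootDiff = function(){
  # Resample rows of sleep with replacement
  boot = sleep[sample(1:nrow(sleep), replace = TRUE), ]
  # Difference in mean extra sleep, group 1 minus group 2
  mean(boot$extra[boot$group==1]) - 
    mean(boot$extra[boot$group==2])
}
bootR = replicate(1000, bootDiff())
# 25th and 975th smallest of the 1000 values: a 95% percentile interval
bootR[match(c(25, 975), rank(bootR))]
## [1] -3.32083333  0.02727273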

Note: in practice it is better to use the boot package, library(boot).
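
A minimal sketch of the same analysis with the boot package might look like the following. The function name diffMeans is hypothetical, and the percentile interval is just one of the interval types boot.ci() offers; the numbers will not exactly match the hand-rolled interval above because the resamples differ.

```r
library(boot)
set.seed(2)
# Statistic for boot(): takes the data and a vector of resampled row indices
diffMeans = function(data, indices) {
  d = data[indices, ]
  # Difference in mean extra sleep, group 1 minus group 2
  mean(d$extra[d$group == 1]) - mean(d$extra[d$group == 2])
}
b = boot(data = sleep, statistic = diffMeans, R = 1000)
ci = boot.ci(b, type = "perc")  # 95% percentile interval by default
ci
```

Here rows are resampled from the whole dataset, mirroring the example above. Passing strata = sleep$group to boot() would instead resample within each group, which keeps the group sizes fixed in every bootstrap sample.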

Example: oral carcinoma recurrence risk

Reis PP, Waldron L, et al.: A gene signature in histologically normal surgical margins is predictive of oral carcinoma recurrence. BMC Cancer 2011, 11:437.
