Introduction to Linear Models

Levi Waldron, CUNY School of Public Health

June 14, 2017

Outline for Introduction to Linear Models

Based on:

  1. Love and Irizarry, Data Analysis for the Life Sciences, Chapter 5
  2. Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models (Statistics for Biology and Health). Springer; 2005.

Introduction to Linear Models

Example: friction of spider legs

table(spider$leg,spider$type)
##     
##      pull push
##   L1   34   34
##   L2   15   15
##   L3   52   52
##   L4   40   40
summary(spider)
##  leg        type        friction     
##  L1: 68   pull:141   Min.   :0.1700  
##  L2: 30   push:141   1st Qu.:0.3900  
##  L3:104              Median :0.7600  
##  L4: 80              Mean   :0.8217  
##                      3rd Qu.:1.2400  
##                      Max.   :1.8400

What are linear models?

Multiple linear regression model

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

Multiple linear regression model

Systematic plus random components of model:

\(y_i = E[y_i|x_i] + \epsilon_i\)

Assumptions of linear models: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

Continuous predictors

Binary predictors (2 levels)

Multilevel categorical predictors (ordinal or nominal)

Model formulae in R

Model formulae in R

Model formulae tutorial

> response variable ~ explanatory variables

Regression with a single predictor

Model formula for simple linear regression:

> y ~ x

Return to the spider legs

Friction coefficient for leg type of first leg pair:

spider.sub <- spider[spider$leg=="L1", ]
fit <- lm(friction ~ type, data=spider.sub)
summary(fit)
## 
## Call:
## lm(formula = friction ~ type, data = spider.sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33147 -0.10735 -0.04941 -0.00147  0.76853 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.92147    0.03827  24.078  < 2e-16 ***
## typepush    -0.51412    0.05412  -9.499  5.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2232 on 66 degrees of freedom
## Multiple R-squared:  0.5776, Adjusted R-squared:  0.5711 
## F-statistic: 90.23 on 1 and 66 DF,  p-value: 5.698e-14

Regression on spider leg type

Regression coefficients for friction ~ type for the first leg pair (L1):

fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     0.9215      0.0383    24.08    0.0000
typepush       -0.5141      0.0541    -9.50    0.0000

Interpretation of coefficients

Diagram of the estimated coefficients in the linear model. The green arrow indicates the Intercept term, which goes from zero to the mean of the reference group (here the 'pull' samples). The orange arrow indicates the difference between the push group and the pull group, which is negative in this example. The circles show the individual samples, jittered horizontally to avoid overplotting.

Regression on spider leg position

Remember there are positions 1-4

fit <- lm(friction ~ leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     0.6644      0.0538    12.34    0.0000
legL2           0.1719      0.0973     1.77    0.0784
legL3           0.1605      0.0693     2.32    0.0212
legL4           0.2813      0.0732     3.84    0.0002

Regression with multiple predictors

Additional explanatory variables can be added as follows:

> y ~ x + z

Note that "+" does not have its usual arithmetic meaning here; to regress on the literal sum of x and z, wrap it in I():

> y ~ I(x + z)
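A minimal sketch of the distinction, using simulated data (not the spider dataset): y ~ x + z estimates a separate slope for each variable, while y ~ I(x + z) estimates a single slope for the arithmetic sum.

```r
## Illustrative sketch with simulated covariates (hypothetical data)
set.seed(1)
x <- rnorm(100)
z <- rnorm(100)
y <- 2 * x + 3 * z + rnorm(100)
fit2 <- lm(y ~ x + z)      # separate coefficients for x and z
fit1 <- lm(y ~ I(x + z))   # one coefficient for the sum x + z
length(coef(fit2))  # 3: intercept, x, z
length(coef(fit1))  # 2: intercept, I(x + z)
```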

Regression on spider leg type and position

Remember there are positions 1-4

fit <- lm(friction ~ type + leg, data=spider)
fit.table <- xtable::xtable(fit, label=NULL)
print(fit.table, type="html")
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     1.0539      0.0282    37.43    0.0000
typepush       -0.7790      0.0248   -31.38    0.0000
legL2           0.1719      0.0457     3.76    0.0002
legL3           0.1605      0.0325     4.94    0.0000
legL4           0.2813      0.0344     8.18    0.0000

Interaction (effect modification)

Interaction between coffee and time of day on performance

Image credit: http://personal.stevens.edu/~ysakamot/

Interaction (effect modification)

Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 \]
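In R's formula syntax, the product term is requested with ":" or, together with the main effects, with "*". A sketch on simulated data (hypothetical, not the spider dataset) showing that x1 * x2 expands to x1 + x2 + x1:x2:

```r
## Sketch with simulated covariates: "*" is shorthand for
## main effects plus their interaction
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y <- 1 + x1 + x2 + 0.5 * x1 * x2 + rnorm(50)
fit.star  <- lm(y ~ x1 * x2)
fit.colon <- lm(y ~ x1 + x2 + x1:x2)
all.equal(coef(fit.star), coef(fit.colon))  # TRUE: identical models
```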

Summary: model formulae

symbol  example          meaning
+       + x              include this variable
-       - x              delete this variable
:       x : z            include the interaction
*       x * z            include these variables and their interactions
^       (u + v + w)^3    include these variables and all interactions up to three-way
1       - 1              intercept ("- 1" deletes the intercept)

Summary: types of standard linear models

lm(y ~ u + v)

u and v factors: ANOVA
u and v numeric: multiple regression
one factor, one numeric: ANCOVA

The Design Matrix

The Design Matrix

Recall the multiple linear regression model:

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)

The Design Matrix

Matrix notation for the linear regression model (shown here with a single predictor):

\[ \, \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots&\vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

or simply:

\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
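A sketch of what this notation buys us, on simulated data: the least-squares solution is \(\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\), and solving it directly reproduces the estimates that lm() computes (internally via a QR decomposition).

```r
## Sketch: solve the normal equations by hand and compare with lm()
set.seed(1)
N <- 20
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)
X <- cbind(1, x)                           # design matrix: intercept column + x
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
fit <- lm(y ~ x)
cbind(beta.hat, coef(fit))                 # two routes, same estimates
```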

Choice of design matrix

group <- factor( c(1, 1, 2, 2) )
model.matrix(~ group)
##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      1
## 4           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Choice of design matrix

What if we forgot to code group as a factor?

group <- c(1, 1, 2, 2)
model.matrix(~ group)
##   (Intercept) group
## 1           1     1
## 2           1     1
## 3           1     2
## 4           1     2
## attr(,"assign")
## [1] 0 1

More groups, still one variable

group <- factor(c(1,1,2,2,3,3))
model.matrix(~ group)
##   (Intercept) group2 group3
## 1           1      0      0
## 2           1      0      0
## 3           1      1      0
## 4           1      1      0
## 5           1      0      1
## 6           1      0      1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Changing the baseline group

group <- factor(c(1,1,2,2,3,3))
group <- relevel(x=group, ref=3)
model.matrix(~ group)
##   (Intercept) group1 group2
## 1           1      1      0
## 2           1      1      0
## 3           1      0      1
## 4           1      0      1
## 5           1      0      0
## 6           1      0      0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

More than one variable

diet <- factor(c(1,1,1,1,2,2,2,2))
sex <- factor(c("f","f","m","m","f","f","m","m"))
model.matrix(~ diet + sex)
##   (Intercept) diet2 sexm
## 1           1     0    0
## 2           1     0    0
## 3           1     0    1
## 4           1     0    1
## 5           1     1    0
## 6           1     1    0
## 7           1     1    1
## 8           1     1    1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

With an interaction term

model.matrix(~ diet + sex + diet:sex)
##   (Intercept) diet2 sexm diet2:sexm
## 1           1     0    0          0
## 2           1     0    0          0
## 3           1     0    1          0
## 4           1     0    1          0
## 5           1     1    0          0
## 6           1     1    0          0
## 7           1     1    1          1
## 8           1     1    1          1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$diet
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

Design matrix to contrast what we want

fitX <- lm(friction ~ type * leg, data=spider)
                Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       0.9215      0.0327    28.21    0.0000
typepush         -0.5141      0.0462   -11.13    0.0000
legL2             0.2239      0.0590     3.79    0.0002
legL3             0.3524      0.0420     8.39    0.0000
legL4             0.4793      0.0444    10.79    0.0000
typepush:legL2   -0.1039      0.0835    -1.24    0.2144
typepush:legL3   -0.3838      0.0594    -6.46    0.0000
typepush:legL4   -0.3959      0.0628    -6.30    0.0000

What if we want to ask whether the push vs. pull effect differs between L3 and L2?

Design matrix to contrast what we want

What if we want to contrast…

typepush:legL3 - typepush:legL2

There are many ways to construct this contrast; one is with the multcomp package:

names(coef(fitX))
## [1] "(Intercept)"    "typepush"       "legL2"          "legL3"         
## [5] "legL4"          "typepush:legL2" "typepush:legL3" "typepush:legL4"
C <- matrix(c(0, 0, 0, 0, 0, -1, 1, 0), nrow=1)  # -1 on typepush:legL2, +1 on typepush:legL3
L3vsL2interaction <- multcomp::glht(fitX, linfct=C)

Design matrix to contrast what we want

Is there a difference in pushing friction for L3 vs L2?

summary(L3vsL2interaction)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: lm(formula = friction ~ type * leg, data = spider)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0 -0.27988    0.07893  -3.546  0.00046 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Summary: applications of model matrices

Generalized Linear Models

Generalized Linear Models

Components of a GLM

\[ g\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]

Linear Regression as GLM

Logistic Regression as GLM

\[ g(P(x)) = \mathrm{logit}(P(x)) = \log\left( \frac{P(x)}{1-P(x)} \right) \]

\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
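A minimal sketch on simulated binary outcomes (hypothetical data, not the spider dataset): glm() with family = binomial fits logistic regression, where the link \(g\) is the logit and its inverse \(g^{-1}\) is plogis() in R.

```r
## Sketch: logistic regression as a GLM on simulated 0/1 outcomes
set.seed(1)
x <- rnorm(200)
p <- plogis(-0.5 + 1.2 * x)           # inverse-logit of the linear predictor
y <- rbinom(200, size = 1, prob = p)  # Bernoulli outcomes with probability p
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                             # estimates of beta_0 and beta_1
```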

Log-linear GLM

The systematic part of the GLM is:

\[ \log\left( E[y|x] \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \log(t_i) \]
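A sketch on simulated count data (hypothetical): the \(\log(t_i)\) exposure term enters glm() as an offset, a predictor whose coefficient is fixed at 1, turning the model into a rate model.

```r
## Sketch: log-linear (Poisson) GLM with a log-exposure offset
set.seed(1)
t <- runif(100, 1, 10)                 # hypothetical exposure times t_i
x <- rnorm(100)
mu <- t * exp(0.2 + 0.5 * x)           # E[y|x] proportional to exposure
y <- rpois(100, lambda = mu)
fit <- glm(y ~ x, family = poisson(link = "log"), offset = log(t))
coef(fit)                              # beta_0 and beta_1; the offset gets no coefficient
```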

Poisson error model

\[ f(k, \lambda) = e^{-\lambda} \frac{\lambda^k}{k!} \]

Negative Binomial error model

\[ f(k; \lambda, \theta) = \frac{\Gamma\left(k + \frac{1}{\theta}\right)}{k! \, \Gamma\left(\frac{1}{\theta}\right)} \left(\frac{\theta \lambda}{1+\theta \lambda}\right)^k \left(1+\theta \lambda\right)^{-1/\theta} \quad\text{for }k = 0, 1, 2, \dotsc \]

* where \(f\) is still the probability of \(k\) events (e.g. # of reads counted),
* \(\lambda\) is still the mean number of events, so \(E[y|x]\)
* An additional dispersion parameter \(\theta\) is estimated:
    + \(\theta \rightarrow 0\): Poisson distribution
    + \(\theta \rightarrow \infty\): Gamma distribution
* The Poisson model can be considered as nested within the Negative Binomial model
    + A likelihood ratio test comparing the two models is possible
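A sketch of that nested comparison on simulated overdispersed counts, using MASS::glm.nb (note that MASS's size parameter is \(1/\theta\) in the parameterization above):

```r
## Sketch: likelihood ratio comparison of Poisson vs. Negative Binomial
## on simulated overdispersed counts (hypothetical data)
library(MASS)
set.seed(1)
x <- rnorm(200)
mu <- exp(1 + 0.5 * x)
y <- rnbinom(200, mu = mu, size = 2)   # size = 1/theta in the text's notation
fit.pois <- glm(y ~ x, family = poisson)
fit.nb <- glm.nb(y ~ x)
lrt <- as.numeric(2 * (logLik(fit.nb) - logLik(fit.pois)))
## p-value on one df (conservative: theta sits on the boundary of its range)
pchisq(lrt, df = 1, lower.tail = FALSE)
```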

Zero Inflation

Additive vs. multiplicative models

Summary