Contents

Original Authors: Martin Morgan, Sonali Arora
Presenting Authors: Martin Morgan, Lori Shepherd
Date: 22 July, 2019
Back: Monday labs

Objective: Gain confidence working with base R commands and data structures.

Lessons learned:

1 R

1.1 Language and environment for statistical computing and graphics

  • Full-featured programming language
  • Interactive and interpretted – convenient and forgiving
  • Coherent, extensive documentation
  • Statistical, e.g. factor(), NA
  • Extensible – CRAN, Bioconductor, github, …

1.2 Vector, class, object

  • Efficient vectorized calculations on ‘atomic’ vectors logical, integer, numeric, complex, character, byte
  • Atomic vectors are building blocks for more complicated objects
    • matrix – atomic vector with ‘dim’ attribute
    • data.frame – list of equal length atomic vectors
  • Formal classes represent complicated combinations of vectors, e.g., the return value of lm(), below

1.3 Function, generic, method

  • Functions transform inputs to outputs, perhaps with side effects, e.g., rnorm(1000)
    • Argument matching first by name, then by position
    • Functions may define (some) arguments to have default values
  • Generic functions dispatch to specific methods based on class of argument(s), e.g., print().
  • Methods are functions that implement specific generics, e.g., print.factor; methods are invoked indirectly, via the generic.
  • Many but not all functions able to manipulate a particular class are methods, e.g., abline() used below is a plain-old-funciton.

1.4 Programming

Iteration:

  • lapply()

    args(lapply)
    ## function (X, FUN, ...) 
    ## NULL
    • Meaning: for a vector X (typically a list()), apply a function FUN to each vector element, returning the result as a list. ... are additional arguments to FUN.
    • FUN can be built-in, or a user-defined function
    lst <- list(a=1:2, b=2:4)
    lapply(lst, log)      # 'base' argument default; natural log
    ## $a
    ## [1] 0.0000000 0.6931472
    ## 
    ## $b
    ## [1] 0.6931472 1.0986123 1.3862944
    lapply(lst, log, 10)  # '10' is second argument to 'log()', i.e., log base 10
    ## $a
    ## [1] 0.00000 0.30103
    ## 
    ## $b
    ## [1] 0.3010300 0.4771213 0.6020600
    • sapply() – like lapply(), but simplify the result to a vector, matrix, or array, if possible.
    • vapply() – like sapply(), but requires that the return type of FUN is specified; this can be safer – an error when the result is of an unexpected type.
  • mapply() (also Map())

    args(mapply)
    ## function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) 
    ## NULL
    • ... are one or more vectors, recycled to be of the same length. FUN is a function that takes as many arguments as there are components of .... mapply returns the result of applying FUN to the elements of the vectors in ....

      mapply(seq, 1:3, 4:6, SIMPLIFY=FALSE) # seq(1, 4); seq(2, 5); seq(3, 6)
      ## [[1]]
      ## [1] 1 2 3 4
      ## 
      ## [[2]]
      ## [1] 2 3 4 5
      ## 
      ## [[3]]
      ## [1] 3 4 5 6
    • apply()

      args(apply)
      ## function (X, MARGIN, FUN, ...) 
      ## NULL
    • For a matrix or array X, apply FUN to each MARGIN (dimension, e.g., MARGIN=1 means apply FUN to each row, MARGIN=2 means apply FUN to each column)

  • Traditional iteration programming constructs repeat {}, for () {}

    • Almost always more error-prone, less efficient, and harder to understand than lapply() !

Conditional

if (test) {
    ## code if TEST == TRUE
} else {
    ## code if TEST == FALSE
}

Functions (see table below for a few favorites)

  • Easy to define your own functions
fun <- function(x) {
    length(unique(x))
}
## list of length 5, each containsing a sample (with replacement) of letters
lets <- replicate(5, sample(letters, 50, TRUE), simplify=FALSE)
sapply(lets, fun)
## [1] 22 22 17 24 20

1.5 Introspection & Help

Introspection

  • General properties, e.g., class(), str()
  • Class-specific properties, e.g., dim()

Help

  • ?"print": help on the generic print
  • ?"print.data.frame": help on print method for objects of class data.frame.
  • help(package="GenomeInfoDb")
  • browseVignettes("GenomicRanges")
  • methods("plot")
  • methods(class="lm")

1.6 Examples

R vectors, vectorized operations, data.frame(), formulas, functions, objects, class and method discovery (introspection).

x <- rnorm(1000)                     # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y)           # object of class 'data.frame'
plot(y ~ x, df)                      # generic plot, method plot.formula
fit <- lm(y ~x, df)                  # object of class 'lm'
methods(class=class(fit))            # introspection
##  [1] add1           alias          anova          case.names    
##  [5] coerce         confint        cooks.distance deviance      
##  [9] dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula       
## [17] hatvalues      influence      initialize     kappa         
## [21] labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print         
## [29] proj           qr             residuals      rstandard     
## [33] rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code
anova(fit)
## Analysis of Variance Table
## 
## Response: y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## x           1 1026.55 1026.55    4236 < 2.2e-16 ***
## Residuals 998  241.85    0.24                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(y ~ x, df)                      # methods(plot); ?plot.formula
abline(fit, col="red", lwd=3, lty=2) # a function, not generic.method

Programming example – group 1000 SYMBOLs into GO identifiers

## example data
fl <- file.choose()      ## symgo.csv
symgo <- read.csv(fl, row.names=1, stringsAsFactors=FALSE)
head(symgo)
##      SYMBOL         GO EVIDENCE ONTOLOGY
## 1   PPIAP28       <NA>     <NA>     <NA>
## 2     PTLAH       <NA>     <NA>     <NA>
## 3 HIST1H2BC GO:0000786      NAS       CC
## 4 HIST1H2BC GO:0000788      IBA       CC
## 5 HIST1H2BC GO:0002227      IDA       BP
## 6 HIST1H2BC GO:0003677      IBA       MF
dim(symgo)
## [1] 5041    4
length(unique(symgo$SYMBOL))
## [1] 1000
## split-sapply
go2sym <- split(symgo$SYMBOL, symgo$GO)
len1 <- sapply(go2sym, length)          # compare with lapply, vapply
## built-in functions for common actions
len2 <- lengths(go2sym)
identical(len1, len2)
## [1] TRUE
## smarter built-in functions, e.g., omiting NAs
len3 <- aggregate(SYMBOL ~ GO, symgo, length)
head(len3)
##           GO SYMBOL
## 1 GO:0000049      3
## 2 GO:0000050      2
## 3 GO:0000060      1
## 4 GO:0000077      1
## 5 GO:0000086      3
## 6 GO:0000118      1
## more fun with aggregate()
head(aggregate(GO ~ SYMBOL, symgo, length))
##     SYMBOL GO
## 1    ABCD4 15
## 2    ABCG2 22
## 3      ACE 57
## 4 ADAMTSL2  6
## 5  ALDH1L2 11
## 6    ALOX5 19
head(aggregate(SYMBOL ~ GO, symgo, c))
##           GO                SYMBOL
## 1 GO:0000049  YARS2, YARS2, EEF1A1
## 2 GO:0000050              ASL, ASL
## 3 GO:0000060                 OPRD1
## 4 GO:0000077                 PEA15
## 5 GO:0000086 TUBB4A, CENPF, CLASP1
## 6 GO:0000118                  CIR1
## your own function -- unique, lower-case identifiers
uidfun  <- function(x) {
    unique(tolower(x))
}
head(aggregate(SYMBOL ~ GO , symgo, uidfun))
##           GO                SYMBOL
## 1 GO:0000049         yars2, eef1a1
## 2 GO:0000050                   asl
## 3 GO:0000060                 oprd1
## 4 GO:0000077                 pea15
## 5 GO:0000086 tubb4a, cenpf, clasp1
## 6 GO:0000118                  cir1
## as an 'anonymous' function
head(aggregate(SYMBOL ~ GO, symgo, function(x) {
    unique(tolower(x))
}))
##           GO                SYMBOL
## 1 GO:0000049         yars2, eef1a1
## 2 GO:0000050                   asl
## 3 GO:0000060                 oprd1
## 4 GO:0000077                 pea15
## 5 GO:0000086 tubb4a, cenpf, clasp1
## 6 GO:0000118                  cir1

2 Case studies

These case studies serve as refreshers on R input and manipulation of data.

2.1 ALL phenotypic data

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

fname <- file.choose()   ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)

Check out the help page ?read.delim for input options, and explore basic properties of the object you’ve created, for instance…

class(pdata)
## [1] "data.frame"
colnames(pdata)
##  [1] "id"             "diagnosis"      "sex"            "age"           
##  [5] "BT"             "remission"      "CR"             "date.cr"       
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"         
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"         
## [17] "ccr"            "relapse"        "transplant"     "f.u"           
## [21] "date.last.seen"
dim(pdata)
## [1] 127  21
head(pdata)
##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE
##   cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr relapse
## 1       FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE   FALSE
## 2       FALSE  simple alt.      NEG           <NA> POS dyploid FALSE    TRUE
## 3          NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE    TRUE
## 4       FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE    TRUE
## 5       FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE    TRUE
## 6       FALSE complex alt.      NEG           <NA> NEG hyperd. FALSE    TRUE
##   transplant               f.u date.last.seen
## 1       TRUE BMT / DEATH IN CR           <NA>
## 2      FALSE               REL      8/28/2000
## 3      FALSE               REL     10/15/1999
## 4      FALSE               REL      1/23/1998
## 5      FALSE               REL      11/4/1997
## 6      FALSE               REL     12/15/1997
summary(pdata$sex)
##    F    M NA's 
##   42   83    2
summary(pdata$cyto.normal)
##    Mode   FALSE    TRUE    NA's 
## logical      69      24      34

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]
##   sex age
## 1   M  53
## 2   M  19
## 3   F  52
## 4   M  38
## 5   M  57
pdata[1:5, ]
##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
##   cyto.normal       citog mol.biol fusion.protein mdr   kinet   ccr relapse
## 1       FALSE     t(9;22)  BCR/ABL           p210 NEG dyploid FALSE   FALSE
## 2       FALSE simple alt.      NEG           <NA> POS dyploid FALSE    TRUE
## 3          NA        <NA>  BCR/ABL           p190 NEG dyploid FALSE    TRUE
## 4       FALSE     t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE    TRUE
## 5       FALSE     del(6q)      NEG           <NA> NEG dyploid FALSE    TRUE
##   transplant               f.u date.last.seen
## 1       TRUE BMT / DEATH IN CR           <NA>
## 2      FALSE               REL      8/28/2000
## 3      FALSE               REL     10/15/1999
## 4      FALSE               REL      1/23/1998
## 5      FALSE               REL      11/4/1997
head(pdata[, 3:5])
##   sex age BT
## 1   M  53 B2
## 2   M  19 B2
## 3   F  52 B4
## 4   M  38 B1
## 5   M  57 B2
## 6   M  17 B1
tail(pdata[, 3:5], 3)
##     sex age BT
## 125   M  19 T2
## 126   M  30 T3
## 127   M  29 T2
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
##      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1  1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 3  3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4  4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5  4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 10 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE
## 11 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE
##    cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1        FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 3           NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4        FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5        FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 10       FALSE     del(p15)  BCR/ABL           p190 NEG    <NA> FALSE
## 11       FALSE del(p15/p16)  BCR/ABL      p190/p210 NEG dyploid FALSE
##    relapse transplant               f.u date.last.seen
## 1    FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3     TRUE      FALSE               REL     10/15/1999
## 4     TRUE      FALSE               REL      1/23/1998
## 5     TRUE      FALSE               REL      11/4/1997
## 10    TRUE      FALSE               REL      7/11/1997
## 11   FALSE       TRUE BMT / DEATH IN CR           <NA>

It seems from below that there are 17 females over 40 in the data set, but when sub-setting pdata to contain just those individuals 19 rows are selected. Why? What can we do to correct this?

idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
## idx
## FALSE  TRUE 
##   108    17
dim(pdata[idx,])
## [1] 19 21

Use the mol.biol column to subset the data to contain just individuals with ‘BCR/ABL’ or ‘NEG’, e.g.,

bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]

The mol.biol column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?

bcrabl$mol.biol <- factor(bcrabl$mol.biol)

The BT column is a factor describing B- and T-cell subtypes

levels(bcrabl$BT)
##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T

table(bcrabl$BT)
## 
##  B B1 B2 B3 B4  T T1 T2 T3 T4 
##  4  9 35 22  9  4  1 15  9  2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
## 
##  B  T 
## 79 31

Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

xtabs(~ BT + mol.biol, bcrabl)
##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  31

Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

aggregate(age ~ mol.biol + sex, bcrabl, mean)
##   mol.biol sex      age
## 1  BCR/ABL   F 39.93750
## 2      NEG   F 30.42105
## 3  BCR/ABL   M 40.50000
## 4      NEG   M 27.21154

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

t.test(age ~ mol.biol, bcrabl)
## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##              40.25000              28.07042
boxplot(age ~ mol.biol, bcrabl)

2.2 Weighty matters

This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Center for Disease Control’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.

Input the data using read.csv(), creating a variable brfss to hold it. Use file.choose() to locate the data file BRFSS-subset.csv

fname <- file.choose()   ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)

Base plotting functions

  1. Explore the data using class(), dim(), head(), summary(), etc. Use xtabs() to summarize the number of males and females in the study, in each of the two years.

  2. Use aggregate() to summarize the average weight in each sex and year.

  3. Create a scatterplot showing the relationship between the square root of weight and height, using the plot() function and the main argument to annotate the plot. Note the transformed Y-axis. Experiment with different plotting symbols (try the command example(points) to view different points).

    plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")