Original Authors: Martin Morgan, Sonali Arora
Presenting Author: Martin Morgan (martin.morgan@roswellpark.org)
Date: 11 July, 2016
Back: Monday labs
Objective: Gain confidence working with base R commands and data structures.
Lessons learned:
factor()
, NA
logical
, integer
, numeric
, complex
, character
, byte
matrix
– atomic vector with ‘dim’ attributedata.frame
– list of equal length atomic vectorslm()
, belowrnorm(1000)
print()
.print.factor
; methods are invoked indirectly, via the generic.abline()
used below is a plain-old-funciton.Iteration:
lapply()
args(lapply)
## function (X, FUN, ...)
## NULL
X
(typically a list()
), apply a function FUN
to each vector element, returning the result as a list. ...
are additional arguments to FUN
.FUN
can be built-in, or a user-defined function
lst <- list(a=1:2, b=2:4)
lapply(lst, log) # 'base' argument default; natural log
## $a
## [1] 0.0000000 0.6931472
##
## $b
## [1] 0.6931472 1.0986123 1.3862944
lapply(lst, log, 10) # '10' is second argument to 'log()', i.e., log base 10
## $a
## [1] 0.00000 0.30103
##
## $b
## [1] 0.3010300 0.4771213 0.6020600
sapply()
– like lapply()
, but simplify the result to a vector, matrix, or array, if possible.vapply()
– like sapply()
, but requires that the return type of FUN
is specified; this can be safer – an error when the result is of an unexpected type.
mapply()
(also Map()
)
args(mapply)
## function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
## NULL
...
are one or more vectors, recycled to be of the same length. FUN
is a function that takes as many arguments as there are components of ...
. mapply
returns the result of applying FUN
to the elements of the vectors in ...
.
mapply(seq, 1:3, 4:6, SIMPLIFY=FALSE) # seq(1, 4); seq(2, 5); seq(3, 6)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 2 3 4 5
##
## [[3]]
## [1] 3 4 5 6
apply()
args(apply)
## function (X, MARGIN, FUN, ...)
## NULL
For a matrix or array X
, apply FUN
to each MARGIN
(dimension, e.g., MARGIN=1
means apply FUN
to each row, MARGIN=2
means apply FUN
to each column)
Traditional iteration programming constructs repeat {}
, for () {}
Almost always more error-prone, less efficient, and harder to understand than lapply()
!
Conditional
if (test) {
## code if TEST == TRUE
} else {
## code if TEST == FALSE
}
Functions (see table below for a few favorites)
fun <- function(x) {
length(unique(x))
}
## list of length 5, each containsing a sample (with replacement) of letters
lets <- replicate(5, sample(letters, 50, TRUE), simplify=FALSE)
sapply(lets, fun)
## [1] 19 22 23 24 24
Introspection
class()
, str()
dim()
Help
?"print"
: help on the generic print?"print.data.frame"
: help on print method for objects of class data.frame.help(package="GenomeInfoDb")
browseVignettes("GenomicRanges")
methods("plot")
methods(class="lm")
R vectors, vectorized operations, data.frame()
, formulas, functions, objects, class and method discovery (introspection).
x <- rnorm(1000) # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y) # object of class 'data.frame'
plot(y ~ x, df) # generic plot, method plot.formula
fit <- lm(y ~x, df) # object of class 'lm'
methods(class=class(fit)) # introspection
## [1] add1 alias anova case.names
## [5] coerce confint cooks.distance deviance
## [9] dfbeta dfbetas drop1 dummy.coef
## [13] effects extractAIC family formula
## [17] fortify hatvalues influence initialize
## [21] kappa labels logLik model.frame
## [25] model.matrix nobs plot predict
## [29] print proj qr residuals
## [33] rstandard rstudent show simulate
## [37] slotsFromS3 summary variable.names vcov
## see '?methods' for accessing help and source code
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 913.25 913.25 3581.5 < 2.2e-16 ***
## Residuals 998 254.48 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(y ~ x, df) # methods(plot); ?plot.formula
abline(fit, col="red", lwd=3, lty=2) # a function, not generic.method
Programming example – group 1000 SYMBOLs into GO identifiers
## example data
fl <- file.choose() ## symgo.csv
symgo <- read.csv(fl, row.names=1, stringsAsFactors=FALSE)
head(symgo)
## SYMBOL GO EVIDENCE ONTOLOGY
## 1 PPIAP28 <NA> <NA> <NA>
## 2 PTLAH <NA> <NA> <NA>
## 3 HIST1H2BC GO:0000786 NAS CC
## 4 HIST1H2BC GO:0000788 IBA CC
## 5 HIST1H2BC GO:0002227 IDA BP
## 6 HIST1H2BC GO:0003677 IBA MF
dim(symgo)
## [1] 5041 4
length(unique(symgo$SYMBOL))
## [1] 1000
## split-sapply
go2sym <- split(symgo$SYMBOL, symgo$GO)
len1 <- sapply(go2sym, length) # compare with lapply, vapply
## built-in functions for common actions
len2 <- lengths(go2sym)
identical(len1, len2)
## [1] TRUE
## smarter built-in functions, e.g., omiting NAs
len3 <- aggregate(SYMBOL ~ GO, symgo, length)
head(len3)
## GO SYMBOL
## 1 GO:0000049 3
## 2 GO:0000050 2
## 3 GO:0000060 1
## 4 GO:0000077 1
## 5 GO:0000086 3
## 6 GO:0000118 1
## more fun with aggregate()
head(aggregate(GO ~ SYMBOL, symgo, length))
## SYMBOL GO
## 1 ABCD4 15
## 2 ABCG2 22
## 3 ACE 57
## 4 ADAMTSL2 6
## 5 ALDH1L2 11
## 6 ALOX5 19
head(aggregate(SYMBOL ~ GO, symgo, c))
## GO SYMBOL
## 1 GO:0000049 YARS2, YARS2, EEF1A1
## 2 GO:0000050 ASL, ASL
## 3 GO:0000060 OPRD1
## 4 GO:0000077 PEA15
## 5 GO:0000086 TUBB4A, CENPF, CLASP1
## 6 GO:0000118 CIR1
## your own function -- unique, lower-case identifiers
uidfun <- function(x) {
unique(tolower(x))
}
head(aggregate(SYMBOL ~ GO , symgo, uidfun))
## GO SYMBOL
## 1 GO:0000049 yars2, eef1a1
## 2 GO:0000050 asl
## 3 GO:0000060 oprd1
## 4 GO:0000077 pea15
## 5 GO:0000086 tubb4a, cenpf, clasp1
## 6 GO:0000118 cir1
## as an 'anonymous' function
head(aggregate(SYMBOL ~ GO, symgo, function(x) {
unique(tolower(x))
}))
## GO SYMBOL
## 1 GO:0000049 yars2, eef1a1
## 2 GO:0000050 asl
## 3 GO:0000060 oprd1
## 4 GO:0000077 pea15
## 5 GO:0000086 tubb4a, cenpf, clasp1
## 6 GO:0000118 cir1
These case studies serve as refreshers on R input and manipulation of data.
Input a file that contains ALL (acute lymphoblastic leukemia) patient information
fname <- file.choose() ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)
Check out the help page ?read.delim
for input options, and explore basic properties of the object you’ve created, for instance…
class(pdata)
## [1] "data.frame"
colnames(pdata)
## [1] "id" "diagnosis" "sex" "age"
## [5] "BT" "remission" "CR" "date.cr"
## [9] "t.4.11." "t.9.22." "cyto.normal" "citog"
## [13] "mol.biol" "fusion.protein" "mdr" "kinet"
## [17] "ccr" "relapse" "transplant" "f.u"
## [21] "date.last.seen"
dim(pdata)
## [1] 127 21
head(pdata)
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 6 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 6 FALSE complex alt. NEG <NA> NEG hyperd. FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 6 TRUE FALSE REL 12/15/1997
summary(pdata$sex)
## F M NA's
## 42 83 2
summary(pdata$cyto.normal)
## Mode FALSE TRUE NA's
## logical 69 24 34
Remind yourselves about various ways to subset and access columns of a data.frame
pdata[1:5, 3:4]
## sex age
## 1 M 53
## 2 M 19
## 3 F 52
## 4 M 38
## 5 M 57
pdata[1:5, ]
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
head(pdata[, 3:5])
## sex age BT
## 1 M 53 B2
## 2 M 19 B2
## 3 F 52 B4
## 4 M 38 B1
## 5 M 57 B2
## 6 M 17 B1
tail(pdata[, 3:5], 3)
## sex age BT
## 125 M 19 T2
## 126 M 30 T3
## 127 M 29 T2
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 10 8001 1/15/1997 M 40 B2 CR CR 3/26/1997 FALSE FALSE
## 11 8011 8/21/1998 M 33 B3 CR CR 10/8/1998 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 10 FALSE del(p15) BCR/ABL p190 NEG <NA> FALSE
## 11 FALSE del(p15/p16) BCR/ABL p190/p210 NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 10 TRUE FALSE REL 7/11/1997
## 11 FALSE TRUE BMT / DEATH IN CR <NA>
It seems from below that there are 17 females over 40 in the data set, but when sub-setting pdata
to contain just those individuals 19 rows are selected. Why? What can we do to correct this?
idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
## idx
## FALSE TRUE
## 108 17
dim(pdata[idx,])
## [1] 19 21
Use the mol.biol
column to subset the data to contain just individuals with ‘BCR/ABL’ or ‘NEG’, e.g.,
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
The mol.biol
column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
The BT
column is a factor describing B- and T-cell subtypes
levels(bcrabl$BT)
## [1] "B" "B1" "B2" "B3" "B4" "T" "T1" "T2" "T3" "T4"
How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T
table(bcrabl$BT)
##
## B B1 B2 B3 B4 T T1 T2 T3 T4
## 4 9 35 22 9 4 1 15 9 2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
##
## B T
## 79 31
Use xtabs()
(cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups
xtabs(~ BT + mol.biol, bcrabl)
## mol.biol
## BT BCR/ABL NEG
## B 37 42
## T 0 31
Use aggregate()
to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.
aggregate(age ~ mol.biol + sex, bcrabl, mean)
## mol.biol sex age
## 1 BCR/ABL F 39.93750
## 2 NEG F 30.42105
## 3 BCR/ABL M 40.50000
## 4 NEG M 27.21154
Use t.test()
to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot()
. In both cases, use the formula
interface. Consult the help page ?t.test
and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?
t.test(age ~ mol.biol, bcrabl)
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40.25000 28.07042
boxplot(age ~ mol.biol, bcrabl)
This case study is a second walk through basic data manipulation and visualization skills. We use data from the US Center for Disease Control’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 10000 observations from each of 1990 and 2010.
Input the data using read.csv()
, creating a variable brfss
to hold it. Use file.choose()
to locate the data file BRFSS-subset.csv
fname <- file.choose() ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read.csv(fname)
Base plotting functions
Explore the data using class()
, dim()
, head()
, summary()
, etc. Use xtabs()
to summarize the number of males and females in the study, in each of the two years.
Use aggregate()
to summarize the average weight in each sex and year.
Create a scatterplot showing the relationship between the square root of weight and height, using the plot()
function and the main
argument to annotate the plot. Note the transformed Y-axis. Experiment with different plotting symbols (try the command example(points)
to view different points).
plot(sqrt(Weight) ~ Height, brfss, main="All Years, Both Sexes")
Color the female and male points differently. To do this, use the col
argument to plot()
. Provide as a value to that argument a vector of colors, subset by brfss$Sex
.
brfss2010 <- brfss[brfss$Year == "2010", ]
Create the figure below (two panels in a single figure). Do this by using the par()
function with the mfcol
argument before calling plot()
. You’ll need to create two more subsets of data, perhaps when you are providing the data to the function plot
.
opar <- par(mfcol=c(1, 2))
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Female", ],
main="2010, Female")
plot(sqrt(Weight) ~ Height, brfss2010[brfss2010$Sex == "Male", ],
main="2010, Male")
par(opar) # reset 'par' to original value
Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Experiment with arguments to plot()
to address over-plotting, e.g., pch='.'
or alpha=.4
. Try using the smoothScatter()
function (the data have to be presented as x
and y
, rather than as a formula). Try adding the hexbin library to your R session (using library()
) and creating a hexbinplot()
.
ggplot2 graphics
Create a scatterplot showing the relationship between the square root of weight and height, using the ggplot2 library, and the annotate the plot. Two equivalent ways to create the plot are show in the solution.
library(ggplot2)
## 'quick' plot
qplot(Height, sqrt(Weight), data=brfss)
## Warning: Removed 735 rows containing missing values (geom_point).
## specify the data set and 'aesthetics', then how to plot
ggplot(brfss, aes(x=Height, y=sqrt(Weight))) +
geom_point()
## Warning: Removed 735 rows containing missing values (geom_point).
qplot()
gives us a warning which states that it has removed rows containing missing values. This is actually very helpful because we find out that our dataset contains NA
’s and we can take a design decision here about what we’d like to do these NA
’s. We can find the indicies of the rows containing NA
using is.na()
, and count the number of rows with NA
values using sum()
:
sum(is.na(brfss$Height))
## [1] 184
sum(is.na(brfss$Weight))
## [1] 649
drop <- is.na(brfss$Height) | is.na(brfss$Weight)
sum(drop)
## [1] 735
Remove the rows which contain NA
’s in Height and Weight.
brfss <- brfss[!drop,]
Plot is annotated with
qplot(Height, sqrt(Weight), data=brfss) +
ylab("Square root of Weight") +
ggtitle("All Years, Both Sexes")
Color the female and male points differently.
ggplot(brfss, aes(x=Height, y=sqrt(Weight), color=Sex)) +
geom_point()
One can also change the shape of the points for the female and male groups
ggplot(brfss, aes(x=Height, y = sqrt(Weight), color=Sex, shape=Sex)) +
geom_point()
or plot Male and Female in different panels using facet_grid()
ggplot(brfss, aes(x=Height, y = sqrt(Weight), color=Sex)) +
geom_point() +
facet_grid(Sex ~ .)
Create a subset of the data containing only observations from 2010 and make density curves for male and female groups. Use the fill
aesthetic to indicate that each sex is to be calculated separately, and geom_density()
for the density plot.
brfss2010 <- brfss[brfss$Year == "2010", ]
ggplot(brfss2010, aes(x=sqrt(Weight), fill=Sex)) +
geom_density(alpha=.25)
## Warning in grid.Call.graphics(L_polygon, x$x, x$y, index): semi-
## transparency is not supported on this device: reported only once per page
Plotting large numbers of points means that they are often over-plotted, potentially obscuring important patterns. Make the points semi-transparent using alpha. Here we make them 60% transparent. The solution illustrates a nice feature of ggplot2 – a partially specified plot can be assigned to a variable, and the variable modified at a later point.
sp <- ggplot(brfss, aes(x=Height, y=sqrt(Weight)))
sp + geom_point(alpha=.4)
## Warning in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size): semi-
## transparency is not supported on this device: reported only once per page
Add a fitted regression model to the scatter plot.
sp + geom_point() + stat_smooth(method=lm)
## Warning in grid.Call.graphics(L_polygon, x$x, x$y, index): semi-
## transparency is not supported on this device: reported only once per page
By default, stat_smooth()
also adds a 95% confidence region for the regression fit. The confidence interval can be changed by setting level, or it can be disabled with se=FALSE
.
sp + geom_point() + stat_smooth(method=lm + level=0.95)
sp + geom_point() + stat_smooth(method=lm, se=FALSE)
How do you fit a linear regression line for each group? First we’ll make the base plot object sps, then we’ll add the linear regression lines to it.
sps <- ggplot(brfss, aes(x=Height, y=sqrt(Weight), colour=Sex)) +
geom_point() +
scale_colour_brewer(palette="Set1")
sps + geom_smooth(method="lm")
## Warning in grid.Call.graphics(L_polygon, x$x, x$y, index): semi-
## transparency is not supported on this device: reported only once per page