--- title: "A.2 -- Data Input and Manipulation" author: "Martin Morgan " date: "11 - 12 September 2017" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{A.2 -- Data Input and Manipulation} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) suppressPackageStartupMessages({ library(tidyverse) }) ``` # The tidyverse We'll adopt a particular approach to data input, wrangling, and basic analysis known as the 'tidyverse'. Start by loading the [tidyverse][] package. ```{r} library(tidyverse) ``` We'll cover the following functions: - Data input - `read_csv()`: data input from a comma-separate value file, as a (`data.frame`-like) `tibble`. - Pipes - `%>%`: 'pipe' data from a source to a function. - `%$%`: extract a column in a pipe. - `.`: refer to the incoming data. - Data manipulation - `group_by()`: define groups of rows based on column values - `summarize()`: apply functions to groups of data to produce a summary of the data. - `filter()`: filter rows to match criteria in columns - `select()`: select columns for subsequent use - `mutate()`: update or add columns of data. - Other functions and concepts - `%in%`: identify elements of the left-hand vector that are elements of the set defined by the right-hand vector. - `t.test()`: perform a t-test. - `boxplot()`, `hist()`: basic visualization. - `~`: specify a formula describing the relationship between a dependent (left-hand side) variable and independent (right-hand side) variable(s). [tidyverse]: https://cran.r-project.org/package=tidyverse [magrittr]: https://cran.r-project.org/package=magrittr # Exercise 1: BRFSS Survey Data We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System ([BRFSS][]) telephone survey. Check out the link for more information. We'll look at a subset of the data. 1. Use `file.choose()` to find the path to the file 'BRFSS-subset.csv' ```{r file.choose, eval=FALSE} path <- file.choose() ``` 2. Input the data using `read_csv()`, assigning to a variable `brfss` and visualizing the first few rows. ```{r read.csv} brfss <- read_csv(path) brfss ``` 3. From looking at the data... - How many individuals aer in the sample - What variables have been measured? - Can you guess at the units used for, e.g., Weight and Height? 4. The tidyverse uses a 'pipe', `%>%` to send data from one command to another. There are small number of key functions for manipulating data. We'll use `group_by()` to group the data by `Sex`, and then `summarize(n=n())` to count the number of observations in each group. ```{r brfss-sex} brfss %>% group_by(Sex) %>% summarize(N = n()) ``` 5. Use `group_by(Year, Sex)` and `summarize(N = n())` to summarize the number of individuals from each year and sex. ```{r brfss-sex-year} brfss %>% group_by(Year, Sex) %>% summarize(N = n()) ``` 6. Calculate the average age in each year and sex by adding the argument `Age = mean(Age, na.rm=TRUE)` to `summarize()` ```{r brfss-mean-age} brfss %>% group_by(Year, Sex) %>% summarize(N = n(), Age = mean(Age, na.rm=TRUE)) ``` 6. `Year` is input as an integer vector, and Sex as a character vector. Actually, though, these are both factors. Use `mutate()` and `factor()` to update the type of these columns. Re-assign the updated tibble to `brfss` ```{r brfss-mutate} brfss %>% mutate(Year = factor(Year), Sex = factor(Sex)) brfss <- brfss %>% mutate(Year = factor(Year), Sex = factor(Sex)) ``` 7. There are several other pipes available (see also the [magrittr][] package). `%$%` extracts a column. Here we look at the `levels()` of the factor that we created. ```{r brfss-levels} brfss %$% Sex %>% levels() brfss %$% Year%>% levels() ``` 8. It's usually better to 'clean' data as soon as possible. Visit the help page `?read_csv`, look at the `col_types =` argument, and the help pages `?cols` and `?col_factor`. Input the data in it's correct format, with Sex and Year as factors ```{r brfss-read-cols} col_types <- cols( Age = col_integer(), Weight = col_double(), Sex = col_factor(c("Female", "Male")), Height = col_double(), Year = col_factor(c("1990", "2010")) ) brfss <- read_csv(path, col_types = col_types) brfss brfss %>% summary() ``` 9. Use `filter()` to create a subset of the data consisting of only the 1990 observations (`Year` in the set that consists of the single element `1990`, `Year %in% 1990`). Optionally, save this to a new variable `brfss_1990`. ```{r filter} brfss %>% filter(Year %in% 1990) brfss_1990 <- brfss %>% filter(Year %in% 1990) ``` 10. Pipe this subset to `t.test()` to ask whether Weight depends on Sex. The first argument to `t.test` is a 'formula' describing the relation between dependent and independent variables; we use the formula `Weight ~ Sex`. The second argument to `t.test` is the data set to use -- indicate the data from the pipe with `data = .` ```{r t-test-1990} brfss %>% filter(Year %in% 1990) %>% t.test(Weight ~ Sex, data = .) ``` What about differences between weights of males (or females) in 1990 versus 2010? 11. Use `boxplot()` to plot the weights of the Male individuals. Can you transform weight, e.g., taking the square root, before plotting? Interpret the results. Do similar boxplots for the t-tests of the previous question. ```{r brfss-boxplot, fig.width=5, fig.height=5} brfss %>% filter(Sex %in% "Male") %>% boxplot(Weight ~ Year, data = .) brfss %>% filter(Sex %in% "Male") %>% mutate(SqrtWeight = sqrt(Weight)) %>% boxplot(SqrtWeight ~ Year, data = .) ``` 12. Use `hist()` to plot a histogram of weights of the 1990 Female individuals. From `?hist`, the function is expecting a vector of values, so use `%$%` to select the `Weight` column and pipe to `hist()`. ```{r brfss-hist, fig.width=5, fig.height=5} brfss %>% filter(Year %in% "1990", Sex %in% "Female") %$% Weight %>% hist(main="1990 Female Weight") ``` [BRFSS]: http://www.cdc.gov/brfss/about/index.htm # Exercise 2: ALL Phenotypic Data This data comes from an (old) Acute Lymphoid Leukemia microarray data set. Choose the file that contains ALL (acute lymphoblastic leukemia) patient information and input the date using `read.csv()`; for `read.csv()`, use `row.names=1` to indicate that the first column contains row names. ```{r ALL-choose, eval=FALSE} path <- file.choose() # look for ALL-phenoData.csv ``` ```{r ALL-input} stopifnot(file.exists(path)) pdata <- read_csv(path) pdata ``` Use `select()` to select some columns, e.g., `mol.biol` and `BT`. Use `filter()` to filter to include only females over 40. ```{r All-select} pdata %>% select(mol.biol, BT) pdata %>% filter(sex %in% "F", age > 40) ``` Use the `mol.biol` column to filter the data to contain individuals in the set `c("BCR/ABL", "NEG")` (i.e., they have `mol.biol` equal to `BCR/ABL` or `NEG`)) ```{r ALL-BCR/ABL-subset} bcrabl <- pdata %>% filter(mol.biol %in% c("BCR/ABL", "NEG")) ``` We'd like to tidy the data by mutating `mol.biol` to be a factor. We'd also like to mutate the `BT` column (B- or T-cell subtypes) to be just `B` or `T`, using `substr(BT, 1, 1)` (i.e., for each element of `BT`, taking the substring that starts at letter 1 and goes to letter 1 -- the first letter) ```{r bcrabl-mutate} bcrabl <- bcrabl %>% mutate( mol.biol = factor(mol.biol), B_or_T = factor(substr(BT, 1, 1)) ) ``` How many bcrabl samples have B- and T-cell types in each of the BCR/ABL and NEG groups? ```{r ALL-BCR/ABL-BT} bcrabl %>% group_by(B_or_T, mol.biol) %>% summarize(N = n()) ``` Calculate the average age of males and females in the BCR/ABL and NEG treatment groups. ```{r ALL-aggregate} bcrabl %>% group_by(sex, mol.biol) %>% summarize(age = mean(age, na.rm=TRUE)) ``` Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface and `.` to refer to the incoming data set. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change? ```{r ALL-age} bcrabl %>% t.test(age ~ mol.biol, .) bcrabl %>% boxplot(age ~ mol.biol, .) ```