---
title: "A.2 -- Data Input and Manipulation"
author: "Martin Morgan <Martin.Morgan@RoswellPark.org>"
date: "11 - 12 September 2017"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{A.2 -- Data Input and Manipulation}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
suppressPackageStartupMessages({
    library(tidyverse)
})
```

# The tidyverse

We'll adopt a particular approach to data input, wrangling, and basic
analysis known as the 'tidyverse'. Start by loading the [tidyverse][]
package.

```{r}
library(tidyverse)
```

We'll cover the following functions:

- Data input
    - `read_csv()`: data input from a comma-separate value file, as a
      (`data.frame`-like) `tibble`.
- Pipes
    - `%>%`: 'pipe' data from a source to a function.
    - `%$%`: extract a column in a pipe.
    - `.`: refer to the incoming data.
- Data manipulation
    - `group_by()`: define groups of rows based on column values
    - `summarize()`: apply functions to groups of data to produce a
      summary of the data.
    - `filter()`: filter rows to match criteria in columns
    - `select()`: select columns for subsequent use
    - `mutate()`: update or add columns of data.
- Other functions and concepts
    - `%in%`: identify elements of the left-hand vector that are
      elements of the set defined by the right-hand vector.
    - `t.test()`: perform a t-test.
    - `boxplot()`, `hist()`: basic visualization.
    - `~`: specify a formula describing the relationship between a
      dependent (left-hand side) variable and independent (right-hand
      side) variable(s).

[tidyverse]: https://cran.r-project.org/package=tidyverse
[magrittr]: https://cran.r-project.org/package=magrittr

# Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its
extensive Behavioral Risk Factor Surveillance System ([BRFSS][])
telephone survey. Check out the link for more information. We'll look
at a subset of the data.

1. Use `file.choose()` to find the path to the file 'BRFSS-subset.csv'

    ```{r file.choose, eval=FALSE}
    path <- file.choose()
    ```

<!--
    ```{r file.path, echo=FALSE}
    path <- file.path("extdata", "BRFSS-subset.csv")
    ```
-->

2. Input the data using `read_csv()`, assigning to a variable `brfss`
   and visualizing the first few rows.

    ```{r read.csv}
    brfss <- read_csv(path)
    brfss
    ```

3. From looking at the data...

    - How many individuals aer in the sample

    - What variables have been measured?

    - Can you guess at the units used for, e.g., Weight and Height?
    
4. The tidyverse uses a 'pipe', `%>%` to send data from one command to
   another. There are small number of key functions for manipulating
   data. We'll use `group_by()` to group the data by `Sex`, and then
   `summarize(n=n())` to count the number of observations in each
   group.

    ```{r brfss-sex}
    brfss %>% group_by(Sex) %>% summarize(N = n())
    ```
    
5. Use `group_by(Year, Sex)` and `summarize(N = n())` to summarize the
   number of individuals from each year and sex.

    ```{r brfss-sex-year}
    brfss %>% group_by(Year, Sex) %>% summarize(N = n())
    ```

6. Calculate the average age in each year and sex by adding the
   argument `Age = mean(Age, na.rm=TRUE)` to `summarize()`


    ```{r brfss-mean-age}
    brfss %>% group_by(Year, Sex) %>% 
        summarize(N = n(), Age = mean(Age, na.rm=TRUE))
    ```

6. `Year` is input as an integer vector, and Sex as a character
   vector. Actually, though, these are both factors. Use `mutate()`
   and `factor()` to update the type of these columns. Re-assign the
   updated tibble to `brfss`
   
    ```{r brfss-mutate}
    brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))
    brfss <- brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))
    ```
   
7. There are several other pipes available (see also the [magrittr][]
   package). `%$%` extracts a column. Here we look at the `levels()`
   of the factor that we created.
   
    ```{r brfss-levels}
    brfss %$% Sex %>% levels()
    brfss %$% Year%>% levels()
    ```

8. It's usually better to 'clean' data as soon as possible. Visit the
   help page `?read_csv`, look at the `col_types =` argument, and the
   help pages `?cols` and `?col_factor`. Input the data in it's
   correct format, with Sex and Year as factors
   
    ```{r brfss-read-cols}
    col_types <- cols(
        Age = col_integer(),
        Weight = col_double(),
        Sex = col_factor(c("Female", "Male")),
        Height = col_double(),
        Year = col_factor(c("1990", "2010"))
    )
    brfss <- read_csv(path, col_types = col_types)
    brfss
    brfss %>% summary()
    ```
       
9. Use `filter()` to create a subset of the data consisting of only
   the 1990 observations (`Year` in the set that consists of the
   single element `1990`, `Year %in% 1990`). Optionally, save this to
   a new variable `brfss_1990`.

    ```{r filter}
    brfss %>% filter(Year %in% 1990)
    brfss_1990 <- brfss %>% filter(Year %in% 1990)
    ```
   
10. Pipe this subset to `t.test()` to ask whether Weight depends on
    Sex. The first argument to `t.test` is a 'formula' describing the
    relation between dependent and independent variables; we use the
    formula `Weight ~ Sex`. The second argument to `t.test` is the
    data set to use -- indicate the data from the pipe with `data = .`

    ```{r t-test-1990}
    brfss %>% filter(Year %in% 1990) %>% t.test(Weight ~ Sex, data = .)
    ```
   What about differences between weights of males (or females) in
   1990 versus 2010?

11. Use `boxplot()` to plot the weights of the Male individuals.  Can
    you transform weight, e.g., taking the square root, before
    plotting?  Interpret the results. Do similar boxplots for the
    t-tests of the previous question.

    ```{r brfss-boxplot, fig.width=5, fig.height=5}
    brfss %>% filter(Sex %in% "Male") %>% boxplot(Weight ~ Year, data = .)
    brfss %>% filter(Sex %in% "Male") %>% mutate(SqrtWeight = sqrt(Weight)) %>%
        boxplot(SqrtWeight ~ Year, data = .)
    ```

12. Use `hist()` to plot a histogram of weights of the 1990 Female
    individuals. From `?hist`, the function is expecting a vector of
    values, so use `%$%` to select the `Weight` column and pipe to
    `hist()`.

    ```{r brfss-hist, fig.width=5, fig.height=5}
    brfss %>% filter(Year %in% "1990", Sex %in% "Female") %$% Weight %>% 
        hist(main="1990 Female Weight")
    ```

[BRFSS]: http://www.cdc.gov/brfss/about/index.htm

# Exercise 2: ALL Phenotypic Data

This data comes from an (old) Acute Lymphoid Leukemia microarray data
set.

Choose the file that contains ALL (acute lymphoblastic leukemia)
patient information and input the date using `read.csv()`; for
`read.csv()`, use `row.names=1` to indicate that the first column
contains row names.

```{r ALL-choose, eval=FALSE}
path <- file.choose()    # look for ALL-phenoData.csv
```

<!--
```{r echo=FALSE}
path <- file.path("extdata", "ALL-phenoData.csv")
```
-->

```{r ALL-input}
stopifnot(file.exists(path))
pdata <- read_csv(path)
pdata
```

Use `select()` to select some columns, e.g., `mol.biol` and `BT`. Use
`filter()` to filter to include only females over 40.

```{r All-select}
pdata %>% select(mol.biol, BT)
pdata %>% filter(sex %in% "F", age > 40)
```

Use the `mol.biol` column to filter the data to contain individuals in
the set `c("BCR/ABL", "NEG")` (i.e., they have `mol.biol` equal to
`BCR/ABL` or `NEG`))

```{r ALL-BCR/ABL-subset}
bcrabl <- pdata %>% filter(mol.biol %in% c("BCR/ABL", "NEG"))
```

We'd like to tidy the data by mutating `mol.biol` to be a factor. We'd
also like to mutate the `BT` column (B- or T-cell subtypes) to be just
`B` or `T`, using `substr(BT, 1, 1)` (i.e., for each element of `BT`,
taking the substring that starts at letter 1 and goes to letter 1 --
the first letter)

```{r bcrabl-mutate}
bcrabl <- bcrabl %>% mutate(
    mol.biol = factor(mol.biol), 
    B_or_T = factor(substr(BT, 1, 1))
)
```

How many bcrabl samples have B- and T-cell types in each of the
BCR/ABL and NEG groups?

```{r ALL-BCR/ABL-BT}
bcrabl %>% group_by(B_or_T, mol.biol) %>% summarize(N = n())
```

Calculate the average age of males and females in the BCR/ABL and NEG
treatment groups.

```{r ALL-aggregate}
bcrabl %>% group_by(sex, mol.biol) %>% summarize(age = mean(age, na.rm=TRUE))
```

Use `t.test()` to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using `boxplot()`. In both cases,
use the `formula` interface and `.` to refer to the incoming data
set. Consult the help page `?t.test` and re-do the test assuming that
variance of ages in the two groups is identical. What parts of the
test output change?

```{r ALL-age}
bcrabl %>% t.test(age ~ mol.biol, .)
bcrabl %>% boxplot(age ~ mol.biol, .)
```