Contents

1 Start your markdown document

Create a new .Rmd file

Process the .Rmd file

Anatomy

More help on markdown under “Help” -> "Markdown Quick Reference

Exercise To get going, leave the YAML header, but remove everything else.

2 Working with data

2.1 library(), install.packages(), and BiocManager::install()

We’ll use the following libraries; attach them to your current R session.

library(readr)
library(dplyr)

If R responds

Error in library(readr) : there is no package called 'readr'

or similar, it means that the library needs to be installed. The ‘Bioconductor’ way of library installation (able to install CRAN and Bioconductor packages) is to use the BiocManager package. Make sure you have the BiocManager package installed

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

then install the packages you need

BiocManager::install(c("readr", "dplyr"))

2.2 read_csv()

Find the data file “ALL-sample-sheet.csv”

pdata_file <- file.choose()

then input the data using read_csv()

pdata <- read_csv(pdata_file)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   age = col_double(),
##   `t(4;11)` = col_logical(),
##   `t(9;22)` = col_logical(),
##   cyto.normal = col_logical(),
##   ccr = col_logical(),
##   relapse = col_logical(),
##   transplant = col_logical()
## )
## See spec(...) for full column specifications.

If R responds

Error in read_csv() : could not find function "read_csv"

this means that the readr package has not been attached to the R session; remedy with library(readr).

Take a peak at the data to see what we have

pdata
## # A tibble: 128 x 22
##    sample cod   diagnosis sex     age BT    remission CR    date.cr
##    <chr>  <chr> <chr>     <chr> <dbl> <chr> <chr>     <chr> <chr>  
##  1 01005  1005  5/21/1997 M        53 B2    CR        CR    8/6/19…
##  2 01010  1010  3/29/2000 M        19 B2    CR        CR    6/27/2…
##  3 03002  3002  6/24/1998 F        52 B4    CR        CR    8/17/1…
##  4 04006  4006  7/17/1997 M        38 B1    CR        CR    9/8/19…
##  5 04007  4007  7/22/1997 M        57 B2    CR        CR    9/17/1…
##  6 04008  4008  7/30/1997 M        17 B1    CR        CR    9/27/1…
##  7 04010  4010  10/30/19… F        18 B1    CR        CR    1/7/19…
##  8 04016  4016  2/10/2000 M        16 B1    CR        CR    4/17/2…
##  9 06002  6002  3/19/1997 M        15 B2    CR        CR    6/9/19…
## 10 08001  8001  1/15/1997 M        40 B2    CR        CR    3/26/1…
## # … with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## #   `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## #   `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## #   relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>

This represents 22 variables collected on 128 samples from a classic microarray experiment. Some of the column names are cryptic; we’ll only focus on a few.

3 Update your markdown document

Add a top-level section title (start a line with #) like

# Day 1: R / Bioconductor practical

Add notes about the key functions you’ve learned about so far, using a bulleted list like

Key functions: 

- `file.choose()`: interactively choose a file location.
- `library(readr)`: attach the readr package to the current _R_ session
- `read_csv()`

Press the ‘Knit’ button, debugging any errors that occur.

Create a code chunk containing instructions for data input, using a ‘fence’ followed by {r}

```{r}
library(readr)
pdata_file = "path/to/file"
pdata <- readr(pdata_file)
pdata
```

Press the ‘Knit’ button. Note that the code is actually evaluated, and when correct will import the data from the pdata_file.

4 Working with data (continued): exploration

4.1 arrange()

Use arrange() to order the rows from youngest to oldest individual

pdata %>% arrange(age)
## # A tibble: 128 x 22
##    sample cod   diagnosis sex     age BT    remission CR    date.cr
##    <chr>  <chr> <chr>     <chr> <dbl> <chr> <chr>     <chr> <chr>  
##  1 08018  8018  8/27/1999 M         5 B3    CR        CR    10/18/…
##  2 19017  19017 3/17/2000 M        14 T2    CR        CR    6/15/2…
##  3 06002  6002  3/19/1997 M        15 B2    CR        CR    6/9/19…
##  4 25003  25003 5/22/1998 M        15 B2    CR        CR    8/4/19…
##  5 04016  4016  2/10/2000 M        16 B1    CR        CR    4/17/2…
##  6 24010  24010 6/3/1997  F        16 B2    CR        CR    8/11/1…
##  7 24019  24019 5/4/1999  M        16 B4    CR        CR    7/28/1…
##  8 28001  28001 10/19/19… M        16 B3    REF       REF   <NA>   
##  9 28044  28044 7/20/1999 M        16 B3    CR        CR    9/20/1…
## 10 01007  1007  9/30/1998 F        16 T3    CR        CR    11/30/…
## # … with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## #   `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## #   `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## #   relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>

If R responds with

> pdata %>% arrange(age)
Error in pdata %>% summarize(n = n()) : could not find function "%>%"

then we need to attach the dplyr package, library(dplyr).

Arrange from oldest to youngest with

pdata %>% arrange(desc(age))
## # A tibble: 128 x 22
##    sample cod   diagnosis sex     age BT    remission CR    date.cr
##    <chr>  <chr> <chr>     <chr> <dbl> <chr> <chr>     <chr> <chr>  
##  1 16004  16004 4/19/1997 F        58 B1    CR        CR    7/15/1…
##  2 20002  20002 5/9/1997  F        58 B2    CR        CR    8/19/1…
##  3 04007  4007  7/22/1997 M        57 B2    CR        CR    9/17/1…
##  4 24017  24017 9/15/1998 M        57 B2    CR        CR    12/7/1…
##  5 08012  8012  10/22/19… M        55 B3    CR        CR    1/9/19…
##  6 28021  28021 3/18/1998 F        54 B3    CR        DEAT… 5/22/1…
##  7 30001  30001 1/16/1997 F        54 B3    <NA>      DEAT… <NA>   
##  8 43007  43007 10/14/19… M        54 B4    CR        CR    12/30/…
##  9 62002  62002 1/15/1998 M        54 B4    <NA>      DEAT… <NA>   
## 10 01005  1005  5/21/1997 M        53 B2    CR        CR    8/6/19…
## # … with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## #   `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## #   `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## #   relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>

Exercise How would you use arrange() to order the samples from youngest to oldest, with individuals of equal age ordered by sex?

4.2 summarize() and group_by()

Let’s ask the average age of males and females. First, use summarize() and the n() function to create a tibble with the number of samples of each sex.

pdata %>% summarize(n = n())
## # A tibble: 1 x 1
##       n
##   <int>
## 1   128

We now know how many samples there are. Use group_by() to tell dplyr to separately consider each sex

pdata %>% group_by(sex)

and then perform the summary operation…

pdata %>% group_by(sex) %>% summarize(n = n())
## # A tibble: 3 x 2
##   sex       n
##   <chr> <int>
## 1 F        42
## 2 M        83
## 3 <NA>      3

Exercise What is the NA sex? Use filter() to identify the 3 records with is.na(sex).

Exercise Use arrange() and desc() to arrange the summary table from most to least frequent.

Exercise Add another column ave_age to summarize(), using mean() to calculate the average age by sex. You’ll need to use the na.rm = TRUE argument to calculate an average age in the face of missing values.

4.3 t.test(), an untidy function

We’d like to compare the average age of males and females in the study using t.test(). t.test() takes a formula age ~ sex (age as a function of sex) describing the comparison that we’d like to make. It also takes an argument data= that contains the data we’d like to perform the t-test on. Unlike functions we’ve encountered so far where the data to be processed is the first argument, t.test() expects the data as the second argument. To adapt t.test() for use, we need to explicitly indicate that the data should be the second argument. One way of doing this is to use the special symbol . to represent the location of the incoming data, invoking t.test(age ~ sex, data = .):

pdata %>% t.test(age ~ sex, data = .)

Exercise Perform a t-test to ask whether there is evidence of differences in ages between the sexes. How can we change the default value of var.equal to TRUE? Is this appropriate?

Exercise (advanced) The return value of t.test() doesn’t fit well with tidy data analysis, because it is a complicated object that is not represented as a tibble and hence cannot be computed upon using the common tidy verbs. Write a simple function t_test() that is more tidy-friendly, accepting data = as it’s first argument, using t.test() internally to compute results, and returning a tibble containing results formatted for subsequent computation. Explore the broom::tidy() function as a way to transform many base R objects into tidy-friendly data structures.

4.4 filter() and %in%

Take a look at the mol.biol column, using group_by() and the convenience function count()

pdata %>%
    group_by(mol.biol) %>%
    count()
## # A tibble: 6 x 2
## # Groups:   mol.biol [6]
##   mol.biol     n
##   <chr>    <int>
## 1 ALL1/AF4    10
## 2 BCR/ABL     37
## 3 E2A/PBX1     5
## 4 NEG         74
## 5 NUP-98       1
## 6 p15/p16      1

These represent chromosomal events such as the presence of the BCR/ABL fusion gene, or NEG (standard) chromosomes.

Exercise Use filter() to create a new data set bcrabl that contains all 22 columns but only the BCR/ABL and NEG rows. It will be helpful to use the %in% operator, which returns TRUE when an object in the set on the left-hand side is present in the object on the right-hand side.

c("a", "b", "c", "b", "a") %in% c("a", "c")
## [1]  TRUE FALSE  TRUE FALSE  TRUE

Exercise Use t.test() to ask whether there are age differences between these molecular biologies.

5 Update your markdown document

Continue your markdown document by summarizing in a bulleted list the key functions covered in this section. Include examples of code inside fences, and make sure the code is evaluated when the “Knit” button is pressed.

6 Working with data (continued): visualization

6.1 ggplot()

Load the ggplot2 library.

library(ggplot2)

Plots are created by providing a source for the data to be displayed, as well as ‘aesthetics’ aes() describing the data to be used.

ggplot(pdata, aes(x = sex, y = age))

6.2 geom_boxplot()

This is enough to determine the overall dimensions of the display, but we still need to add ‘geometric’ geom_*() objects that display the relationships of interest. Add a boxplot geom for an initial visualization.

ggplot(pdata, aes(x = sex, y = age)) + 
    geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

6.3 geom_jitter()

Add a second layer that superimposes the actual data using geom_jitter() invoked so that the sex value is displaced a random amount about its horizontal location (width = .1) while the age is plotted at its actual vertical location (height = 0).

ggplot(pdata, aes(x = sex, y = age)) + 
    geom_boxplot() + 
    geom_jitter(width = .1, height = 0)
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
## Warning: Removed 5 rows containing missing values (geom_point).

Exercise How might pdata be filtered so that the samples with unknown age are excluded, and hence there are only two categories on the x-axis?

Exercise Are there other displays that convey both the summary of results and the actual data?

7 Update your markdown document

8 Working with data (continued): Bioconductor objects

Create a short character vector of DNA sequences.

sequences <- c("AAATCGA", "ATACAACAT", "TTGCCA")

In base R we can ask about properties of these sequences and perform some operations

sequences
## [1] "AAATCGA"   "ATACAACAT" "TTGCCA"
length(sequences)
## [1] 3
nchar(sequences)
## [1] 7 9 6
sequences[c(1, 3)]
## [1] "AAATCGA" "TTGCCA"
sample(sequences)
## [1] "TTGCCA"    "AAATCGA"   "ATACAACAT"

Base R has no notion of operations relevant to DNA sequences, e.g.,

reverseComplement(sequences)

fails. Likewise, we can name a variable anything, the semantic meaning of the variable name is not enforced by R

my_sequence <-
    "All the world's a stage, And all the men and women merely players"

8.1 library(Biostrings)

Load the Biostrings library

library(Biostrings)

If R responds with

> library(Biostrings)
Error in library(Biostrings) : there is no package called 'Biostrings'

then it is necessary to install the package first (make sure you’ve spelt the package name correctly, including capitalization!)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)

8.2 DNAStringSet()

Create a DNAStringSet from our character vector

dna <- DNAStringSet(sequences)
dna
##   A DNAStringSet instance of length 3
##     width seq
## [1]     7 AAATCGA
## [2]     9 ATACAACAT
## [3]     6 TTGCCA

Exercise Does the object dna support the operations illustrated above for a character vector, especially length(), nchar(), [, and sample()?

Exercise Prove to yourself that at least some other useful, DNA-specific, functions exist, e.g., reverse() and reverseComplement().

Exercise What happens when you try to create a DNAStringSet() from an object such as my_sequence, defined above, that does not contain a DNA sequence? Warning: the message is quite cryptic, can you provide a ‘human’ translation?

Exercise Why does DNAStringSet("ACGTMRW") not create an error, since MRW are not standard nucleotides? For hints, see the section ’The DNA alphabet:" on the help page ?DNAString.

8.3 methods(), help, and browseVignettes()

The function DNAStringSet() returns an object that has a particular class

class(dna)
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"

Associated with the class are a series of methods that operate on the class.

Exercise Discover many (unfortunately, not all) methods acting on an object of class DNAStringSet using methods(class = "DNAStringSet"). Verify that reverseComplement is among those methods.

Help pages describing a particular method can be found using ?, with the search query quoted and with tab-completion providing hints on what the appropriate help topic is.

Exercise Find the help page for the reverseComplement method operating on a DNAStringSet object, using ?"reverseComplement,DNAStringSet-method".

Help pages provide a description of the technical details required for creating classes and using methods. Vignettes provide a more narrative description of overall package use.

Exercise Use browseVignettes(package = "Biostrings") to see vignettes available for this package; explore a few vignettes to get a sense of possible content.

8.4 readDNAStringSet()

It is unlikely that we would enter 1000’s of DNA sequences ‘by hand’. Instead, we might read the data from a standard file format. For DNA sequences the standard file format is often a ‘FASTA’ file, sometimes abbreviated with an extension .fa and often compressed with an additional extension .fa.gz. An example of a FASTA file containing DNA sequences of the 2000bp upstream nucleotides of all genes annotated in the Drosophila melanogaster dm3 genome build, is distributed with the Biostrings package. Here’s the path to the FASTA file.

fa_file <-
    system.file(package="Biostrings", "extdata", "dm3_upstream2000.fa.gz")

Exercise Take a peak at the structure of a FASTA file by looking at the first five lines.

readLines(fa_file, 5)
## [1] ">NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736"
## [2] "gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt"         
## [3] "gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg"         
## [4] "tttttgtgcttttcgaacaaaaaattgggagcagaatattggattcgctt"         
## [5] "ttttcgcatcgaatatcttaagggagcaagggaagaaccatctagaataa"

The first line is an identifier, containing information about the gene NM_078863 as well as the genomic coordinates of the sequence chr2L:16764737-16766736. The next lines are the DNA sequence. After a certain number of lines, a new record starts.

tail(readLines(fa_file, 44), 5)
## [1] "cacgcacaccgatcgtcgaatcgaaaagctttcggggtcttacgggatcc"         
## [2] "atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt"         
## [3] ">NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454"
## [4] "ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg"         
## [5] "cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg"

We could fairly ‘easily’ write our own parser for this format, but this would be error-prone and unnecessary.

Exercise Input the file

dna <- readDNAStringSet(fa_file)
dna
##   A DNAStringSet instance of length 26454
##         width seq                                      names               
##     [1]  2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
##     [2]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
##     [3]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
##     [4]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
##     [5]  2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
##     ...   ... ...
## [26450]  2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453]  2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454]  2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...

Exercise Query the object for basic properties, e.g., it’s length() and that the number of characters in each sequence is 2000 unique(nchar(dna)).

8.5 letterFrequency(): calculate GC content

Exercise Use letterFrequency() to determine GC content of each of the DNA sequences in dna. The letters argument should be "GC"; as.prob = TRUE returns values between 0 and 1. The data is returned as a matrix with 1 column.

Exercise Plot the distribution of GC frequencies in the dna object using base graphics hist() and plot(density()), and using ggplot().

8.6 Tidy Bioconductor?

Although Bioconductor emphasizes formal objects like DNAStringSet rather than tibble-like data frames, some of the ways one interacts with tidy data can be applied to Bioconductor objects. For instance, the GC content example might be written in ‘traditional’ form as

gc <- letterFrequency(dna, "GC", as.prob = TRUE)

but could be written using pipes and to reesult in a tibble for easier down-stream manipulation

gc <- 
    dna %>%
    letterFrequency("GC", as.prob = TRUE) %>%
    tibble::as_tibble()
gc
## # A tibble: 26,454 x 1
##    `G|C`
##    <dbl>
##  1 0.378
##  2 0.43 
##  3 0.43 
##  4 0.43 
##  5 0.43 
##  6 0.43 
##  7 0.43 
##  8 0.43 
##  9 0.43 
## 10 0.43 
## # … with 26,444 more rows

Exercise Revise other code, above, to follow this style of analysis. Are there merits / limitations of base verses ‘tidy’ approaches?

Exercise Are there reasons why we did not just put our sequences vector of DNA sequences into a tibble, and go from there?

9 Update your markdown document

Add further notes to the markdown document summarizing the functions and objects you’ve learned in this practical.

10 Provenance: finish your markdown document

Finish your markdown document with a fenced section that evaluates the sessionInfo() command.

```{r}
sessionInfo()
```

Here are the packages used for this practical:

sessionInfo()
## R version 3.6.0 Patched (2019-04-26 r76431)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] Biostrings_2.53.0   XVector_0.25.0      IRanges_2.19.10    
## [4] S4Vectors_0.23.17   BiocGenerics_0.31.4 ggplot2_3.2.0      
## [7] dplyr_0.8.2         readr_1.3.1         BiocStyle_2.13.2   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1           pillar_1.4.2         compiler_3.6.0      
##  [4] BiocManager_1.30.5.1 zlibbioc_1.31.0      tools_3.6.0         
##  [7] zeallot_0.1.0        digest_0.6.19        evaluate_0.14       
## [10] tibble_2.1.3         gtable_0.3.0         pkgconfig_2.0.2     
## [13] rlang_0.4.0          cli_1.1.0            yaml_2.2.0          
## [16] xfun_0.8             withr_2.1.2          stringr_1.4.0       
## [19] knitr_1.23           vctrs_0.1.0          hms_0.4.2           
## [22] grid_3.6.0           tidyselect_0.2.5     glue_1.3.1          
## [25] R6_2.4.0             fansi_0.4.0          rmarkdown_1.13      
## [28] bookdown_0.11        purrr_0.3.2          magrittr_1.5        
## [31] backports_1.1.4      scales_1.0.0         codetools_0.2-16    
## [34] htmltools_0.3.6      assertthat_0.2.1     colorspace_1.4-1    
## [37] labeling_0.3         utf8_1.1.4           stringi_1.4.3       
## [40] lazyeval_0.2.2       munsell_0.5.0        crayon_1.3.4