---
title: "04: Practical: Organizing data with _SummarizedExperiment_"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{04: Practical: Organizing data with SummarizedExperiment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
options(width = 75)
```

# What is a `SummarizedExperiment`?

![Alt SummarizedExperiment](our_figures/SummarizedExperiment.png)

- `assay()`, `assays()`: A `matrix`-like or list of `matrix`-like
  objects of identical dimension
  - `matrix`-like: implements `dim()`, `dimnames()`, and 2-dimensional
    `[`, `[<-` methods.
  - rows: genes, genomic coordinates, etc.
  - columns: samples, cells, etc.
- `colData()`: Annotations on each column, as a `DataFrame`.
  - E.g., description of each sample
- `rowData()` and / or `rowRanges()`: Annotations on each row.
  - E.g., `rowRanges()`: coordinates of gene / exons in transcripts /
    etc.
  - E.g., `rowData()`:_P_-values and log-fold change of each gene
    after differential expresssion analysis.
- `metadata()`: List of unstructured metadata describing the overall
  content of the object.


# Working with an existing `SummarizedExperiment` object

```{r, message = FALSE}
library(SummarizedExperiment)
```

The [airway][] experiment data package summarizes an RNA-seq
experiment investigating human smooth-muscle airway cell lines treated
with dexamethasone. Load the library and data set.

```{r}
library(airway)
data(airway)
airway
```

## Three main parts of a `SummarizedExperiment`

**Exercise** What are the `dim()` and `dimnames()` of the airway
object? What do the rows and columns correspond to in terms of the
original experiment? Hint: use `dplyr::glimpse()` to look at the first
few elements of each `dimnames`.

**Exercise** Use the functions `assay()`, `colData()`, `rowRanges()`
to extract the three major components of the `airway` data
set. `assay()` is large, so interogate it using `class()`, `dim()` and
`head()` rather than simply printing it to the screen. What do each of
these parts corresspond to in terms of the RNASeq experiment
summarized in this object?

## The matrix-like behavior of `SummarizedExperiment`

As a quick refresher, an _R_ `matrix`

```{r}
set.seed(123)
m <- matrix(
    rnorm(12), nrow = 4, ncol = 3,
    dimnames = list(letters[1:4], LETTERS[1:3])
)
m
```

has three main components to it's interface, `dim()`, `dimnames()`,
and two-dimensional `[`-subsetting.

```{r}
dim(m)
dimnames(m)
m[1:2, 2:1]
m[1:2,]
```

A tricky, and unfortunate, behavior is that subsetting a single row or
column of a matrix results, by default in the matrix being coerced to
a vector

```{r}
m[1,]
m[,1]
```

Avoid this behavior using the `drop = FALSE` argument.

```{r}
m[ 1, , drop = FALSE]
```

`SummarizedExperiment` implements the 'matrix-like' interface. This
means that it supports the functions `dim()`, `dimnames()`, and
two-dimensional `[`-subsetting.

**Exercise** Subset `airway` to contain the first 5 rows and first 4
columns. Check out the `colData()`, `rowRanges()`, and `assay()` of
the resulting object.

## Basic summaries of `assay()` values

Returning to our simple matrix

```{r}
m
```

It is straight-forward to transform a matrix, e.g., by adding 1

```{r}
m + 1
```

or applying a transformation

```{r}
abs(m + 1)
```

There are special functions for row- and column-wise operations that
are often performed on matrices (check out the [matrixStats][] package
for further mathematical operations), e.g.,

```{r}
colSums(m)
rowMeans(m)
```

[matrixStats]: https://cran.r-project.org/package=matrixStats

**Exercise** Extract the `assay()` component of `airway` and calculate
the column sums. What do these numbers represent?

**Exercise** Transform the `assay()` matrix by adding 1 and taking the
natural `log()`, then create a vector representing the average of each
row of this transformed matrix. What does this vector represent?
Visualize it in a histogram, density plot, or other representation.

## Subsetting `SummarizedExperiment`

We mentioned that the summarized experiment implements the matrix-like
two-dimensional `[` subsetting interface. 

The `colData()` of `airway` contains information about experimental
design, for instance the cell line and dexamethasone treatment level
of each sample.

```{r}
colData(airway)
```

There's a short-cut to accessing each column, e.g.,

```{r}
airway$dex
```

**Exercies** Use this shortcut to quickly subset `airway` so that it
has only the `untrt` samples present. Verify that the `assay()` data
have also been subset correctly.

When working with a matrix, common scenario is to manipulate groups
of rows based on some criterion, e.g., if one wished to keep rows with
mean above 0, one might

```{r}
ridx <- rowMeans(m) > 0
m[ridx, , drop = FALSE]
```

**Exercise** Remove all the rows from `airway` that had no gene
expression across all samples. How many rows had no expression?
Re-calculate the histogram of log-transoformed average expression with
these rows excluded.

## The list-like interface of `GRangesList`

An _R_ `list()` is an object that contain different types of vectors,
including other lists. Here's a simple example

```{r}
l <- list(a = 1:5, b = month.abb)
```

The interface to a list allows for `length()`, `[` (to return a subset
of the original list) and `[[` (to extract a single element of the
list, either by name or by position). The elements of the list can,
but do not have to be, named.

```{r}
names(l)
length(l)
l[c(2, 1)]
l[2]        # list of length 1, containing element 2 of original list
l[[2]]      # element 2 of original list
```

One useful function is `lengths()`, which returns a vector of the
length of each element in the lsit

```{r}
lengths(l)
```

The [IRanges][], [S4Vectors][], and [GenomicRanges][] packages
implement several classes that implement this list-like interface; the
classes all end with name `*List`, e.g., `GRangesList`.

**Exercise** Extract the row ranges from `airway`, use `class()` to
discover the type of object. What does each element of the list
represent? What does the entire list represent?

```{r}
r <- rowRanges(airway)
```

**Exercise** Prove to yourself that the object implements a list-like
interface, supporting `length()`, `[` and `[[` subsetting.

**Exercise** Use `lengths()` to determine the number of exons in each
gene. How many genes are there with a single exon? What gene has the
largest number of exons?

## Provenance in `GRangesList`

`GRanges` / `GRangesList` objects contain provenance information about
the genome to which the ranges refer, accessible using `seqinfo()`. 

**Exercise** Use `seqinfo(r)` to extract the information about the
sequences the genomic ranges are based on.

**Exercise** Note that the `genome` is always `NA`. Take a quick look
at the vignette `browseVignettes("airway")`, scaning down to the
'Aligning reads' section. We're told that the reference genome used
for alignments is "GRCh37". Update the row range to contain this
information

```{r}
genome(r) <- "GRCh37"
seqinfo(r)
```

Update the `rowRanges()` of `airway` with this more complete
information.

```{r}
rowRanges(airway) <- r
```

Note that `SummarizedExperiment` has convenience functions that
allow direct access to provenance

```{r}
seqinfo(airway)
```

**Exercise** Use `as()` to coerce the sequence info to a `GRanges`
object, and select the genomic range corressponding to chromosome 14.

```{r, message = FALSE}
library(dplyr)    # %>%
```

```{r}
chr14 <- 
    seqinfo(r) %>%
    as("GRanges") %>%
    subset(seqnames == "14")
```

## Subsetting by overlaps

The `%over%` function returns TRUE for each range on the left of the
operator that overlaps a range on the right of the operator. Thus the
row ranges overlapping chromosome 14 are

```{r}
idx <- r %over% chr14
r[idx]
```

**Exercise** How many ranges of `r` overlap chromosome 14?

**Exercise** Use `idx` to subset `airway` to contain only genes on
chromosome 14.

**Exercise** We took a long route to get to this subset-by-overlap;
summarize our story with two lines of code that allow us to subset
`airway` by overlap.

# Constructing a `SummarizedExperiment` object 'by hand'

```{r, message = FALSE}
library(readr)
library(tibble)
```

```{r, eval = FALSE}
pdata_file <- file.choos()    # airway-sample-sheet.csv
counts_file <- file.choose()  # airway-read-counts.csv
```
```{r, echo = FALSE}
pdata_file <-
    system.file(package="BiocIntro", "extdata", "airway-sample-sheet.csv")
counts_file <-
    system.file(package="BiocIntro", "extdata", "airway-read-counts.csv")
```

**Exercise** Use `read_csv()` to import the sample sheet and
read counts.

```{r}
pdata <- read_csv(pdata_file)
counts <- read_csv(counts_file)
```

**Exercise** Assemble `SummarizedExperiment` from transposed `counts`
and `pdata`. The `"Run"` column needs to be made into rownames on both
the `pdata` and `counts` object.

```{r}
pdata <- column_to_rownames(pdata, "Run")
counts <- column_to_rownames(counts, "Run")
se <- SummarizedExperiment(t(counts), colData = pdata)
se
```

**Exercise** Following examples from the lectures, calculate total
library size and plot average log gene expression from you
newly-created `SummarizedExperiment`.

# Provenance

```{r}
sessionInfo()
```

[IRanges]: https://bioconductor.org/packages/IRanges
[S4Vectors]: https://bioconductor.org/packages/S4Vectors
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[MultiAssayExperiment]: https://bioconductor.org/packages/MultiAssayExperiment
[airway]: https://bioconductor.org/packages/airway