--- title: "04: Practical: Organizing data with _SummarizedExperiment_" author: - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteIndexEntry{04: Practical: Organizing data with SummarizedExperiment} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: number_sections: yes toc: true --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) options(width = 75) ``` # What is a `SummarizedExperiment`? ![Alt SummarizedExperiment](our_figures/SummarizedExperiment.png) - `assay()`, `assays()`: A `matrix`-like or list of `matrix`-like objects of identical dimension - `matrix`-like: implements `dim()`, `dimnames()`, and 2-dimensional `[`, `[<-` methods. - rows: genes, genomic coordinates, etc. - columns: samples, cells, etc. - `colData()`: Annotations on each column, as a `DataFrame`. - E.g., description of each sample - `rowData()` and / or `rowRanges()`: Annotations on each row. - E.g., `rowRanges()`: coordinates of gene / exons in transcripts / etc. - E.g., `rowData()`:_P_-values and log-fold change of each gene after differential expresssion analysis. - `metadata()`: List of unstructured metadata describing the overall content of the object. # Working with an existing `SummarizedExperiment` object ```{r, message = FALSE} library(SummarizedExperiment) ``` The [airway][] experiment data package summarizes an RNA-seq experiment investigating human smooth-muscle airway cell lines treated with dexamethasone. Load the library and data set. ```{r} library(airway) data(airway) airway ``` ## Three main parts of a `SummarizedExperiment` **Exercise** What are the `dim()` and `dimnames()` of the airway object? What do the rows and columns correspond to in terms of the original experiment? Hint: use `dplyr::glimpse()` to look at the first few elements of each `dimnames`. **Exercise** Use the functions `assay()`, `colData()`, `rowRanges()` to extract the three major components of the `airway` data set. `assay()` is large, so interogate it using `class()`, `dim()` and `head()` rather than simply printing it to the screen. What do each of these parts corresspond to in terms of the RNASeq experiment summarized in this object? ## The matrix-like behavior of `SummarizedExperiment` As a quick refresher, an _R_ `matrix` ```{r} set.seed(123) m <- matrix( rnorm(12), nrow = 4, ncol = 3, dimnames = list(letters[1:4], LETTERS[1:3]) ) m ``` has three main components to it's interface, `dim()`, `dimnames()`, and two-dimensional `[`-subsetting. ```{r} dim(m) dimnames(m) m[1:2, 2:1] m[1:2,] ``` A tricky, and unfortunate, behavior is that subsetting a single row or column of a matrix results, by default in the matrix being coerced to a vector ```{r} m[1,] m[,1] ``` Avoid this behavior using the `drop = FALSE` argument. ```{r} m[ 1, , drop = FALSE] ``` `SummarizedExperiment` implements the 'matrix-like' interface. This means that it supports the functions `dim()`, `dimnames()`, and two-dimensional `[`-subsetting. **Exercise** Subset `airway` to contain the first 5 rows and first 4 columns. Check out the `colData()`, `rowRanges()`, and `assay()` of the resulting object. ## Basic summaries of `assay()` values Returning to our simple matrix ```{r} m ``` It is straight-forward to transform a matrix, e.g., by adding 1 ```{r} m + 1 ``` or applying a transformation ```{r} abs(m + 1) ``` There are special functions for row- and column-wise operations that are often performed on matrices (check out the [matrixStats][] package for further mathematical operations), e.g., ```{r} colSums(m) rowMeans(m) ``` [matrixStats]: https://cran.r-project.org/package=matrixStats **Exercise** Extract the `assay()` component of `airway` and calculate the column sums. What do these numbers represent? **Exercise** Transform the `assay()` matrix by adding 1 and taking the natural `log()`, then create a vector representing the average of each row of this transformed matrix. What does this vector represent? Visualize it in a histogram, density plot, or other representation. ## Subsetting `SummarizedExperiment` We mentioned that the summarized experiment implements the matrix-like two-dimensional `[` subsetting interface. The `colData()` of `airway` contains information about experimental design, for instance the cell line and dexamethasone treatment level of each sample. ```{r} colData(airway) ``` There's a short-cut to accessing each column, e.g., ```{r} airway$dex ``` **Exercies** Use this shortcut to quickly subset `airway` so that it has only the `untrt` samples present. Verify that the `assay()` data have also been subset correctly. When working with a matrix, common scenario is to manipulate groups of rows based on some criterion, e.g., if one wished to keep rows with mean above 0, one might ```{r} ridx <- rowMeans(m) > 0 m[ridx, , drop = FALSE] ``` **Exercise** Remove all the rows from `airway` that had no gene expression across all samples. How many rows had no expression? Re-calculate the histogram of log-transoformed average expression with these rows excluded. ## The list-like interface of `GRangesList` An _R_ `list()` is an object that contain different types of vectors, including other lists. Here's a simple example ```{r} l <- list(a = 1:5, b = month.abb) ``` The interface to a list allows for `length()`, `[` (to return a subset of the original list) and `[[` (to extract a single element of the list, either by name or by position). The elements of the list can, but do not have to be, named. ```{r} names(l) length(l) l[c(2, 1)] l[2] # list of length 1, containing element 2 of original list l[[2]] # element 2 of original list ``` One useful function is `lengths()`, which returns a vector of the length of each element in the lsit ```{r} lengths(l) ``` The [IRanges][], [S4Vectors][], and [GenomicRanges][] packages implement several classes that implement this list-like interface; the classes all end with name `*List`, e.g., `GRangesList`. **Exercise** Extract the row ranges from `airway`, use `class()` to discover the type of object. What does each element of the list represent? What does the entire list represent? ```{r} r <- rowRanges(airway) ``` **Exercise** Prove to yourself that the object implements a list-like interface, supporting `length()`, `[` and `[[` subsetting. **Exercise** Use `lengths()` to determine the number of exons in each gene. How many genes are there with a single exon? What gene has the largest number of exons? ## Provenance in `GRangesList` `GRanges` / `GRangesList` objects contain provenance information about the genome to which the ranges refer, accessible using `seqinfo()`. **Exercise** Use `seqinfo(r)` to extract the information about the sequences the genomic ranges are based on. **Exercise** Note that the `genome` is always `NA`. Take a quick look at the vignette `browseVignettes("airway")`, scaning down to the 'Aligning reads' section. We're told that the reference genome used for alignments is "GRCh37". Update the row range to contain this information ```{r} genome(r) <- "GRCh37" seqinfo(r) ``` Update the `rowRanges()` of `airway` with this more complete information. ```{r} rowRanges(airway) <- r ``` Note that `SummarizedExperiment` has convenience functions that allow direct access to provenance ```{r} seqinfo(airway) ``` **Exercise** Use `as()` to coerce the sequence info to a `GRanges` object, and select the genomic range corressponding to chromosome 14. ```{r, message = FALSE} library(dplyr) # %>% ``` ```{r} chr14 <- seqinfo(r) %>% as("GRanges") %>% subset(seqnames == "14") ``` ## Subsetting by overlaps The `%over%` function returns TRUE for each range on the left of the operator that overlaps a range on the right of the operator. Thus the row ranges overlapping chromosome 14 are ```{r} idx <- r %over% chr14 r[idx] ``` **Exercise** How many ranges of `r` overlap chromosome 14? **Exercise** Use `idx` to subset `airway` to contain only genes on chromosome 14. **Exercise** We took a long route to get to this subset-by-overlap; summarize our story with two lines of code that allow us to subset `airway` by overlap. # Constructing a `SummarizedExperiment` object 'by hand' ```{r, message = FALSE} library(readr) library(tibble) ``` ```{r, eval = FALSE} pdata_file <- file.choos() # airway-sample-sheet.csv counts_file <- file.choose() # airway-read-counts.csv ``` ```{r, echo = FALSE} pdata_file <- system.file(package="BiocIntro", "extdata", "airway-sample-sheet.csv") counts_file <- system.file(package="BiocIntro", "extdata", "airway-read-counts.csv") ``` **Exercise** Use `read_csv()` to import the sample sheet and read counts. ```{r} pdata <- read_csv(pdata_file) counts <- read_csv(counts_file) ``` **Exercise** Assemble `SummarizedExperiment` from transposed `counts` and `pdata`. The `"Run"` column needs to be made into rownames on both the `pdata` and `counts` object. ```{r} pdata <- column_to_rownames(pdata, "Run") counts <- column_to_rownames(counts, "Run") se <- SummarizedExperiment(t(counts), colData = pdata) se ``` **Exercise** Following examples from the lectures, calculate total library size and plot average log gene expression from you newly-created `SummarizedExperiment`. # Provenance ```{r} sessionInfo() ``` [IRanges]: https://bioconductor.org/packages/IRanges [S4Vectors]: https://bioconductor.org/packages/S4Vectors [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment [MultiAssayExperiment]: https://bioconductor.org/packages/MultiAssayExperiment [airway]: https://bioconductor.org/packages/airway