Contents

1 Introduction

Description: This workshop will introduce you to the Bioconductor collection of R packages for statistical analysis and comprehension of high-throughput genomic data. The emphasis is on data exploration, using RNA-sequence gene expression experiments as a motivating example. How can I access common sequence data formats from R? How can I use information about gene models or gene annotations in my analysis? How do the properties of my data influence the statistical analyses I should perform? What common workflows can I perform with R and Bioconductor? How do I deal with very large data sets in R? These are the sorts of questions that will be tackled in this workshop.

Requirements: You will need to bring your own laptop. The workshop will use cloud-based resources, so your laptop will need a web browser and WiFi capabilities. Participants should have used R and RStudio for tasks such as those covered in introductory workshops earlier in the week. Some knowledge of the biology of gene expression and of concepts learned in a first course in statistics will be helpful.

Relevance: This workshop is relevant to anyone eager to explore genomic data in R. The workshop will help connect ‘core’ R concepts for working with data (e.g., data management via data.frame(), statistical modelling with lm() or t.test(), visualization using plot() or ggplot()) to the special challenges of working with large genomic data sets. It will be especially helpful to those who have or will have their own genomic data, and are interested in more fully understanding how to work with it in R.

1.1 Our goal

RNA-seq

  • Designed experiment, e.g., 8 samples from four cell lines exposed to two treatments (based on Himes et al., PMID: 24926665; details in the airway package vignette).
              cell   dex
SRR1039508  N61311 untrt
SRR1039509  N61311   trt
SRR1039512 N052611 untrt
SRR1039513 N052611   trt
SRR1039516 N080611 untrt
SRR1039517 N080611   trt
SRR1039520 N061011 untrt
SRR1039521 N061011   trt
  • Library preparation: mRNA to stable double-stranded DNA
  • DNA sequencing of ‘short’ mRNA-derived fragments
  • Alignment to a reference genome or transcriptome

source: http://bio.lundberg.gu.se/courses/vt13/rnaseq.html

  • End result: a matrix of ‘counts’ – reads aligning to genes across samples.
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000005          0          0          0          0          0
ENSG00000000419        467        515        621        365        587
...                    ...        ...        ...        ...        ...
                SRR1039517 SRR1039520 SRR1039521
ENSG00000000003       1047        770        572
ENSG00000000005          0          0          0
ENSG00000000419        799        417        508
...                    ...        ...        ...

Research question

  • Which gene counts are most different between dexamethasone untrt and trt experimental treatments?
  • We’ll try to understand how to accomplish this, without going into statistical details.

Our goal

  • Visualize 20 most differentially expressed genes as a heatmap.

2 Data gathering, input, representation, and cleaning

2.1 Base R data structures

Sample information

  • Simple ‘tab-separated value’ text file, e.g., from Excel export.
  • Input using base R command read.table()
  • ‘Atomic’ vectors, e.g., integer()
  • factor() and NA
  • data.frame(): coordinated management
    • Column access with $
    • Subset with [ , ]
samples_file <-
    system.file(package="BiocIntro", "extdata", "samples.tsv")
samples <- read.table(samples_file)
samples
##               cell   dex avgLength
## SRR1039508  N61311 untrt       126
## SRR1039509  N61311   trt       126
## SRR1039512 N052611 untrt       126
## SRR1039513 N052611   trt        87
## SRR1039516 N080611 untrt       120
## SRR1039517 N080611   trt       126
## SRR1039520 N061011 untrt       101
## SRR1039521 N061011   trt        98

samples$dex <- relevel(samples$dex, "untrt")

Counts

  • Another tsv file. Many rows, so use head() to view the first few.
  • Row names: gene identifiers. Column names: sample identifiers.
  • All columns are the same (numeric) type; represent as a matrix() rather than data.frame()
counts_file <-
    system.file(package="BiocIntro", "extdata", "counts.tsv")
counts <- read.table(counts_file)
dim(counts)
## [1] 63677     8

head(counts)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0

counts <- as.matrix(counts)

2.2 Genomic ranges (GRanges)

Row annotations.

  • ‘GTF’ files contain information about gene models.
  • The GTF file relevant to this experiment – same organism (Homo sapiens), genome (GRCh37) and gene model annotations (Ensembl release-75) as used in the alignment and counting step – is
url <- "ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz"
  • Use BiocFileCache to download the resource once to a location that persists across R sessions.
library(BiocFileCache)
gtf_file <- bfcrpath(rnames = url)
## Using temporary cache /var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//RtmpLUzYG5/BiocFileCache
## adding rname 'ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz'
library(rtracklayer)
gtf <- import(gtf_file)
  • A GRanges object

    • Range-specific information
    • Annotations on each range
    • Use functions to access core elements: seqnames() (e.g., chromosome), start() / end() / width(), strand(), etc.
    • Use $ or mcols()$ to access annotations on ranges.
    • Bioconductor conventions: 1-based, closed intervals (like Ensembl) rather than 0-based, 1/2 open intervals (like UCSC).
  • Filter the information to gene-level annotations, keeping only some of the information about each genomic range. Use the gene_id column as names().
rowidx <- gtf$type == "gene"
colidx <- c("gene_id", "gene_name", "gene_biotype")
genes <- gtf[rowidx, colidx]
names(genes) <- genes$gene_id
genes$gene_id <- NULL

genes
## GRanges object with 63677 ranges and 2 metadata columns:
##                   seqnames      ranges strand |   gene_name   gene_biotype
##                      <Rle>   <IRanges>  <Rle> | <character>    <character>
##   ENSG00000223972        1 11869-14412      + |     DDX11L1     pseudogene
##   ENSG00000227232        1 14363-29806      - |      WASH7P     pseudogene
##   ENSG00000243485        1 29554-31109      + |  MIR1302-10        lincRNA
##   ENSG00000237613        1 34554-36081      - |     FAM138A        lincRNA
##   ENSG00000268020        1 52473-54936      + |      OR4G4P     pseudogene
##               ...      ...         ...    ... .         ...            ...
##   ENSG00000198695       MT 14149-14673      - |      MT-ND6 protein_coding
##   ENSG00000210194       MT 14674-14742      - |       MT-TE        Mt_tRNA
##   ENSG00000198727       MT 14747-15887      + |      MT-CYB protein_coding
##   ENSG00000210195       MT 15888-15953      + |       MT-TT        Mt_tRNA
##   ENSG00000210196       MT 15956-16023      - |       MT-TP        Mt_tRNA
##   -------
##   seqinfo: 265 sequences from an unspecified genome; no seqlengths

2.3 Coordinated management (SummarizedExperiment)

Three different data sets

  • counts: results of the RNAseq workflow
  • samples: sample and experimental design information
  • genes: information about the genes that we’ve assayed.

Coordinate our manipulation

  • Avoid ‘bookkeeping’ errors when, e.g., we subset one part of the data in a way different from another.
  • Use the SummarizedExperiment package and data representation.

    • Two-dimensional structure, so subset with [,]
    • Use functions to access components: assay(), rowData(), rowRanges(), colData(), etc.

library(SummarizedExperiment)
  • Make sure the order of the samples rows match the order of the samples in the columns of the counts matrix, and the order of the genes rows match the order of the rows of the counts matrix.
  • Create a SummarizedExperiment to coordinate our data manipulation.
samples <- samples[colnames(counts),]
genes <- genes[rownames(counts),]
se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowRanges = genes, colData = samples
)

se
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(0):
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(2): gene_name gene_biotype
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): cell dex avgLength

3 Analysis & visualization

3.1 Differential expression analysis

Gestalt

  • Perform a t.test() for each row of the count matrix, asking whether the trt samples have on average counts that differ from the untrt samples.
  • Many nuanced statistical issues

The DESeq2 pacakge

  • Implements efficient, ‘correct’, robust algorithms for performing RNA-seq differential expression analysis of moderate-sized experiments.
library(DESeq2)
  • Specify our experimental design, perform the analysis taking account of the nuanced statistical issues, and get a summary of the results. The details of this step are beyond the scope of this workshop.
dds <- DESeqDataSet(se, ~ cell + dex)
fit <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
destats <- results(fit)

destats
## log2 fold change (MLE): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 63677 rows and 6 columns
##                          baseMean      log2FoldChange             lfcSE
##                         <numeric>           <numeric>         <numeric>
## ENSG00000000003  708.602169691234   -0.38125388742934 0.100654430181804
## ENSG00000000005                 0                  NA                NA
## ENSG00000000419  520.297900552084   0.206812715390398 0.112218674568195
## ENSG00000000457  237.163036796015  0.0379205923946151  0.14344471633862
## ENSG00000000460  57.9326331250967 -0.0881676962628265 0.287141995236272
## ...                           ...                 ...               ...
## ENSG00000273489 0.275899382507797    1.48372584344306  3.51394515550546
## ENSG00000273490                 0                  NA                NA
## ENSG00000273491                 0                  NA                NA
## ENSG00000273492 0.105978355992386  -0.463691271907546  3.52308373749196
## ENSG00000273493 0.106141666408122  -0.521381077922898  3.53139001322807
##                               stat               pvalue                padj
##                          <numeric>            <numeric>           <numeric>
## ENSG00000000003  -3.78775069056286 0.000152017272514002 0.00128292609656079
## ENSG00000000005                 NA                   NA                  NA
## ENSG00000000419   1.84294384322566   0.0653372100662581   0.196469601297369
## ENSG00000000457   0.26435684326705    0.791504962999781    0.91141814384918
## ENSG00000000460 -0.307052600196215     0.75880333554496   0.895006448013164
## ...                            ...                  ...                 ...
## ENSG00000273489  0.422239328669782    0.672850337762336                  NA
## ENSG00000273490                 NA                   NA                  NA
## ENSG00000273491                 NA                   NA                  NA
## ENSG00000273492 -0.131615171950935    0.895288684444562                  NA
## ENSG00000273493 -0.147641884914972     0.88262539793309                  NA
  • Add the results to our SummarizedExperiment, so that we can manipulate these in a coordianted fashion too.
rowData(se) <- cbind(rowData(se), destats)

3.2 Heatmap

  • Use order() and head() to identify the row indexes of the top 20 most differentially expressed (based on adjusted P-value) genes.
  • Subset the our SummarizedExperiment to contain just these rows.
  • Dispaly the assay() of our subset as a heatmap
top20idx <- head( order(rowData(se)$padj), 20)
top20 <- se[top20idx,]
heatmap(assay(top20))