--- title: "Tutorial: Introduction to High Throughput DNA Sequence Data Analysis Using R / Bioconductor" author: Martin Morgan date: "Tuesday, 8 March, 2016" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{Tutorial: Introduction to High Throughput DNA Sequence Data Analysis Using R / Bioconductor} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} options(width=100) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r setup, echo=FALSE} suppressPackageStartupMessages({ library(ENAR2016) library(ggplot2) library(airway) library(DESeq2) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## library(BSgenome.Hsapiens.UCSC.hg19) library(AnnotationHub) }) ``` # Introduction ## _R_ - [r-project.org][]: "A free software environment for statistical computing and graphics" Functions ```{r} rnorm(10) # sample 10 random normal deviates ``` - zero or more arguments - named or unnames `rnorm(10)` the same as `rnorm(n=10)` - required or optional, e.g., `rnorm(n=10, mean=0, sd=1)` - Help via `?rnorm` Vectors - logical (`c(TRUE, FALSE)`), integer (`1:5`), numeric (`rnorm(10)`), character (`c("alpha", "beat")`), complex - statistical concepts: missing values (`NA`), factors (`factor(c("Female", "Male"))`), formulas (`y ~ x`) - vectorized ```{r vectors} x <- rnorm(1000) y <- x + rnorm(1000) plot(y ~ x) ``` Objects: classes and methods - coordinated manipulation of related data - simple _classes_, e.g., `data.frame()`, `matrix()` - more complicated classes, e.g., `lm()` - manipulation via _methods_ ```{r classes} df <- data.frame(Y = y, X = x) fit <- lm(Y ~ X, df) plot(Y ~ X, df) abline(fit, col="red", lwd=2) anova(fit) ``` Packages - base & recommended - e.g., `stats`, `graphics`, `Matrix` - contributed - e.g., `ggplot2`, `dplyr`, `ade` - discovery: [CRAN][] landing pages and task views - installation: `install.packages()` (or `biocLite()`, below) ```{r ggplot2, warning=FALSE} library(ggplot2) ggplot(df, aes(x=X, y=Y)) + geom_point() + geom_smooth(color="blue") + geom_smooth(method="lm", color="red") ``` Help! - `help.start()` - `?rnorm` - `class(fit)`; `methods(plot)`; `methods(class=class(fit))` - `?"plot...` ## _Bioconductor_ - [bioconductor.org][]: "Analysis and comprehension of high-throughput genomic data" Biological domains - Sequence analysis -- RNA-seq, ChIP-seq, SNPs, structural variants, ... - Microarrays -- expression, methylation, copy number, ... - Flow cytometry; proteomics; ... Principles - Interoperability - Formal approach to data -- 'S4' object system - Example: DNA sequences represented as a `DNAStringSet()`, rather than character vector. - Reproducibility - Hallmark of scientific endeavor - Example: Package vignettes - Example: 'release' (bug fixes only; end user oriented) and 'devel' (new features, new packages; developer playground) versions Installation and use - Discovery: [biocViews][], package landing pages (e.g., [DESeq2][]) - Installation: `biocLite("DESeq2")`; also works for CRAN & github packages - Help! -- [support.bioconductor.org][] ## Sequence Analysis Six stages - Experimental design - Wet-lab preparation - Sequencing - Alignment / Reduction - Statistical analysis - Comprehension ![](figures/SequencingEcosystem.png) Key packages: data access - [Biostrings][]: DNA sequences - [ShortRead][]: FastQ - [GenomicAlignments][] / [Rsamtools][]: BAM alignments - [VariantAnnotation][]: VCF files - [rtracklayer][]: BED, GFF / GTF, BigWig, ... Coordinated management: [SummarizedExperiment][] - Matrix of feature x sample assay(s) - Column data descrbing samples - Row ranges descrbing features # Case Study: RNA-Seq Differential Expression This is derived from: [RNA-Seq workflow][]: gene-level exploratory analysis and differential expression, by Michael Love, Simon Anders, Wolfgang Huber; modified by Martin Morgan, October 2015. [RNA-Seq workflow]: http://bioconductor.org/help/workflows/rnaseqGene/ We walk through an end-to-end RNA-Seq differential expression workflow, using [DESeq2][] along with other _Bioconductor_ packages. The complete work flow starts from the FASTQ files, but we will start after reads have been aligned to a reference genome and reads overlapping known genes have been counted. Our main focus is on differential gene expression analysis with [DESeq2][]. Other _Bioconductor_ packages are important in statistical inference of differential expression at the gene level, including [Rsubread][], [edgeR][], [limma][], [BaySeq][], and others. The data are from an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is: Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665). GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). ## Preparation Counts - A feature x sample matrix of **raw** count data, typically counts of reads overlapping features - Derived from sequence reads - Via alignment and summarization (via [Rsubread][], [SummarizedExperiment][] `summarizeOverlaps()`, etc.) - Directly from sequence data via *kallisto* or *sailfish*, using [tximport][] (in Bioc-devel, soon in release!) [kallisto]: http://pachterlab.github.io/kallisto `SummarizedExperiment` ```{r} library(airway) data("airway") se <- airway ``` Assay data ```{r} head(assay(se)) ``` - Features x samples - Raw counts - Row-wise tests Experimental design ```{r} colData(se) design = ~ cell + dex se$dex <- relevel(se$dex, "untrt") # 'untrt' as reference level ``` - Covariate (`cell`) plus treatment (`dex`) - _R_'s formula interface allows for complicated designs, contrasts, etc Features (genes) ```{r} rowRanges(se) ``` - Genomic coordinates of gene models ## Analysis Pipeline Create `DESeqDataSet` object ```{r} library(DESeq2) dds <- DESeqDataSet(se, design = ~ cell + dex) ``` Run the pipeline ```{r} dds <- DESeq(dds) ``` - Estimate size factors ('library size') - Estimate dispersion coeficient of negative binomial model - Fit GLM using negative binomial error - Wald test Summarize results ```{r} res <- results(dds) head(res) mcols(res) # what does each column mean? head(res[order(res$padj),]) # 'top table' by p-adj ``` ## Some Considerations ### Experimental design Keep it simple - Classical experimental designs - Time series - Without missing values, where possible - Intended analysis must be feasbile -- can the available samples and hypothesis of interest be combined to formulate a testable statistical hypothesis? Replicate - Extent of replication determines nuance of biological question. - No replication (1 sample per treatment): qualitative description with limited statistical options. - 3-5 replicates per treatment: designed experimental manipulation with cell lines or other well-defined entities; 2-fold (?) change in average expression between groups. - 10-50 replicates per treatment: population studies, e.g., cancer cell lines. - 1000's of replicates: prospective studies, e.g., SNP discovery - One resource: [RNASeqPower][] Avoid confounding experimental factors with other factors - Common problems: samples from one treatment all on the same flow cell; samples from treatment 1 processed first, treatment 2 processed second, etc. Be aware of _batch effects_ - Known - Phenotypic covariates, e.g., age, gender - Experimental covariates, e.g., lab or date of processing - Incorporate into linear model, at least approximately - Unknown - Or just unexpected / undetected - Characterize using, e.g., [sva][] - Surrogate variable analysis - Leek et al., 2010, Nature Reviews Genetics 11 [733-739](http://www.nature.com/nrg/journal/v11/n10/abs/nrg2825.html), Leek & Story PLoS Genet 3(9): [e161](http://dx.doi.org/10.1371/journal.pgen.0030161). - Scientific finding: pervasive batch effects - Statistical insights: surrogate variable analysis: identify and build surrogate variables; remove known batch effects - Benefits: reduce dependence, stabilize error rate estimates, and improve reproducibility - _combat_ software / [sva][] _Bioconductor_ package ![](figures/nrg2825-f2.jpg) HapMap samples from one facility, ordered by date of processing. ### Wet-lab Confounding factors - Record or avoid Artifacts of your _particular_ protocols - Sequence contaminants - Enrichment bias, e.g., non-uniform transcript representation. - PCR artifacts -- adapter contaminants, sequence-specific amplification bias, ... ### Sequencing Axes of variation - Single- versus paired-end - Length: 50-200nt - Number of reads per sample Application-specific, e.g., - RNA-seq, known genes: single- or paired-end reads - RNA-seq, transcripts or novel variants: paired-end reads - ChIP-seq: short, single-end reads are usually sufficient - Copy number: single- or paired-end reads - Structural variants: paired-end reads - Variants: depth via longer, paired-end reads - Microbiome: long paired-end reads (overlapping ends) ### Alignment / Reduction Alignment - _de novo_ - No reference genome; considerable sequencing and computational resources - Genome - Established reference genome - Splice-aware aligners - Novel transcript discovery - Transcriptome - Established reference genome; reliable gene model - Simple aligners - Known gene / transcript expression Splice-aware aligners (and _Bioconductor_ wrappers) - [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2) ([Rbowtie][]) - [STAR](http://bowtie-bio.sourceforge.net/bowtie2) ([doi](http://dx.doi.org/10.1093/bioinformatics/bts635)) - [subread](http://dx.doi.org/10.1093/nar/gkt214) ([Rsubread][]) - Systematic evaluation (Engstrom et al., 2013, [doi](http://dx.doi.org/10.1038/nmeth.2722)) Reduction to 'count tables' - Use known gene model to count aligned reads overlapping regions of interest / gene models - Gene model can be public (e.g., UCSC, NCBI, ENSEMBL) or _ad hoc_ (gff file) - `GenomicAlignments::summarizeOverlaps()` - `Rsubread::featureCount()` - [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html), [htseq-count](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) #### (Bowtie2 / tophat / Cufflinks / Cuffdiff / etc) - [tophat](http://ccb.jhu.edu/software/tophat) uses Bowtie2 to perform basic single- and paired-end alignments, then uses algorithms to place difficult-to-align reads near to their well-aligned mates. - [Cufflinks](http://cole-trapnell-lab.github.io/cufflinks/) ([doi](http://dx.doi.org/10.1038/nprot.2012.016)) takes _tophat_ output and estimate existing and novel transcript abundance. [How Cufflinks Works](http://cole-trapnell-lab.github.io/cufflinks/papers) - [Cuffdiff](http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/) assesses statistical significance of estimated abundances between experimental groups - [RSEM](http://www.biomedcentral.com/1471-2105/12/323) includes de novo assembly and quantification #### (kallisto / sailfish) - 'Next generation' differential expression tools; transcriptome alignment - E.g., [kallisto](http://pachterlab.github.io/kallisto) takes a radically different approach: from FASTQ to count table without BAM files. - Very fast, almost as accurate. - Hints on [how it works](https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/); [arXiv](http://arxiv.org/abs/1505.02710) - Integration with gene-level analyses -- [Soneson et al][]. [Soneson et al]: http://f1000research.com/articles/4-1521/v1 ### Statistical Considerations Unique statistical aspects - Large data, few samples - Comparison of each gene, across samples; _univariate_ measures - Each gene is analyzed by the _same_ experimental design, under the _same_ null hypothesis Summarization - Counts _per se_, rather than a summary (RPKM, FPKM, ...), are relevant for analysis - For a given gene, larger counts imply more information; RPKM etc., treat all estimates as equally informative. - Comparison is across samples at _each_ region of interest; all samples have the same region of interest, so modulo library size there is no need to correct for, e.g., gene length or mapability. Normalization - Libraries differ in size (total counted reads per sample) for un-interesting reasons; we need to account for differences in library size in statistical analysis. - Total number of counted reads per sample is _not_ a good estimate of library size. It is un-necessarily influenced by regions with large counts, and can introduce bias and correlation across genes. Instead, use a robust measure of library size that takes account of skew in the distribution of counts (simplest: trimmed geometric mean; more advanced / appropriate encountered in the lab). - Library size (total number of counted reads) differs between samples, and should be included _as a statistical offset_ in analysis of differential expression, rather than 'dividing by' the library size early in an analysis. - Detail - [DESeq2][] `estimateSizeFactors()`, Anders and Huber, [2010](http://genomebiology.com/2010/11/10/r106) - For each gene: geometric mean of all samples. - For each sample: median ratio of the sample gene over the geometric mean of all samples - Functions other than the median can be used; control genes can be used instead Appropriate error model - Count data is _not_ distributed normally or as a Poisson process, but rather as negative binomial. - Result of a combination Poisson (`shot' noise, i.e., within-sample technical and sampling variation in read counts) with variation between biological samples. - A negative binomial model requires estimation of an additional parameter ('dispersion'), which is estimated poorly in small samples. - Basic strategy is to moderate per-gene estimates with more robust local estimates derived from genes with similar expression values (a little more on borrowing information is provided below). Pre-filtering - Naively, a statistical test (e.g., t-test) could be applied to each row of a counts table. However, we have relatively few samples (10's) and very many comparisons (10,000's) so a naive approach is likely to be very underpowered, resulting in a very high _false discovery rate_ - A simple approach is perform fewer tests by removing regions that could not possibly result in statistical significance, regardless of hypothesis under consideration. - Example: a region with 0 counts in all samples could not possibly be significant regradless of hypothesis, so exclude from further analysis. - Basic approaches: 'K over A'-style filter -- require a minimum of A (normalized) read counts in at least K samples. Variance filter, e.g., IQR (inter-quartile range) provides a robust estimate of variability; can be used to rank and discard least-varying regions. - More nuanced approaches: `r Biocpkg("edgeR")` vignette; work flow today. Borrowing information - Why does low statistical power elevate false discovery rate? - One way of developing intuition is to recognize a t-test (for example) as a ratio of variances. The numerator is treatment-specific, but the denominator is a measure of overall variability. - Variances are measured with uncertainty; over- or under-estimating the denominator variance has an asymmetric effect on a t-statistic or similar ratio, with an underestimate _inflating_ the statistic more dramatically than an overestimate deflates the statistic. Hence elevated false discovery rate. - Under the null hypothesis used in microarray or RNA-seq experiments, the expected overall variability of a gene is the same, at least for genes with similar average expression - The strategy is to estimate the denominator variance as the between-group variance for the gene, _moderated_ by the average between-group variance across all genes. - This strategy exploits the fact that the same experimental design has been applied to all genes assayed, and is effective at moderating false discovery rate. ## Comprehension: placing differentially expressed regions in context - Gene names associated with genomic ranges - Gene set enrichment and similar analysis - Proximity to regulatory marks - Integrate with other analyses, e.g., methylation, copy number, variants, ... 'Annotation' packages - 'org': map between gene identifiers; also [GO.db][], [KEGGREST][] ```{r} library(org.Hs.eg.db) ttbl <- head(res[order(res$padj),]) # 'top table' by p-adj (ensid <- rownames(ttbl)) mapIds(org.Hs.eg.db, ensid, "SYMBOL", "ENSEMBL") select(org.Hs.eg.db, ensid, c("SYMBOL", "GENENAME"), "ENSEMBL") columns(org.Hs.eg.db) ``` - 'TxDb' ```{r} ## gene models, organized by entrez identifier library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## from Ensembl to Entrez identifiers egid <- mapIds(org.Hs.eg.db, ensid, "ENTREZID", "ENSEMBL") transcriptsBy(txdb, "gene")[egid] ``` - 'BSgenome': [BSgenome.Hsapiens.UCSC.hg19][] - [AnnotationHub][] - Many consortium-scale resources - Cached on disk - Imported in appropriate format ```{r, eval=FALSE} library(AnnotationHub) hub = AnnotationHub() query(hub, c("ensembl", "gtf", "release-83")) ``` # Challenges & Solutions ## Interoperability: Standard Representations Genomic Ranges: [GenomicRanges][] - Standard representation of genomic coordiantes - Describes annotations (exons, genes, transcripts, regulatory elements, ...) and data (peaks, reads, ...) - A _ton_ of functionality, especially `findOverlaps()` ```{r} gr <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100), strand=c("+", "-", "*"), score=c(10, 20, 30)) gr ``` Data / metadata integration: [SummarizedExperiment][] - Simple matrix, with more complicated row (feature) and column (sample) annotations. - Very easy to coordinate data manipulation -- avoid those embarassing 'off-by-one' and other errors! ```{r} library(airway) data(airway) airway ``` ## Scalability Efficient _R_ programming - _R_ works best when functions are _vectorized_, i.e., operating on vectors rather than iterating. - n.b., `*apply()` are forms of _iteration_, not vectorization - _R_ works best when memory is _pre-allocated_ - Worst: no pre-allocation, no vectorization; scales _quadratically_ with number of elements ```{r} result <- integer() for (i in 1:10) result[i] = sqrt(i) ``` - Better: pre-allocate and fill; scales _linearly_ with number of elements, but at the _interpretted_ level ```{r} result <- integer(10) for (i in 1:10) result[[i]] = sqrt(i) ### same as the much more compact, expressive, and robust... result <- sapply(1:10, sqrt) ``` - Best: vectorize; scales linearly, but at the _compiled_ level so 100x faster ```{r} result <- sqrt(1:10) ``` Parallel evaluation: [BiocParallel][] - _After_ writing efficient _R_ code - Very common paradigm: `lapply`-like - Different 'back-ends', e.g., cores on a computer; computers in a cluster; instances in a cloud ```{r bioc-parallel} library(BiocParallel) system.time({ res <- lapply(1:8, function(i) { Sys.sleep(1); i }) }) system.time({ res <- bplapply(1:8, function(i) { Sys.sleep(1); i }) }) ``` - In _Bioc-devel_ ```{r, eval=FALSE} X <- list(1, "2", 3) res <- bptry(bplapply(X, sqrt)) X <- list(1, 2, 3) res <- bplapply(X, sqrt, BPREDO=res) ``` ## Reproducibility: From Script to Package Scripts - ...Mature into re-usable functions - 'Fit in your head' - Poetry Vignettes - Like this document! - Classic -- LaTeX-based - Light weight / accessible: R-flavored markdown --- title: "A title" author: "An Author" vignette: > % \VignetteEngine{knitr::rmarkdown} --- # Heading ## Sub-heading Some text. ```{r r-code} x <- rnorm(1000) plot(x) ``` - Check out the [knitr][] package! Packages - Standard on-disk representation /MyPackage /MyPackage/DESCRIPTION /MyPackage/NAMESPACE /MyPackage/R/fun_one.R /MyPackage/R/fun_another.R /MyPackage/vignettes - Very easy to create - Very easy to share, e.g., github # Acknowledgments Individuals - Current and recent team members: Valerie Obenchain, Herve Pages, Dan Tenenbaum, Jim Hester, Jim Java, Brian Long, Sonali Arora, Nate Hayden, Paul Shannon - Technical Advisory Board: Vince Carey, Wolfgang Huber, Sean Davis, Michael Lawrence, Levi Waldron, Raphael Irizzary, Robert Gentleman - Scientific Advisory Board: Simon Tavare, Robert Gentleman, Vince Carey, Raphael Irizzary, Wolfgang Huber, Simon Urbanek. Annual conference: https://bioconductor.org/BioC2016 - Stanford, June 24 - 26 - Morning scientific lectures - Afternoon workshops - Social and other activities - Space filling up! ```{r sessionInfo} sessionInfo() ```

[r-project.org]: https://r-project.org [CRAN]: https://cran.r-project.org [bioconductor.org]: https://bioconductor.org [biocViews]: https://bioconductor.org/packages/release/BiocViews.html#___Software [support.bioconductor.org]: https://support.bioconductor.org [Annual conference]: http://bioconductor.org/BioC2016 [DESeq2]: https://bioconductor.org/packages/DESeq2 [Biostrings]: https://bioconductor.org/packages/Biostrings [ShortRead]: https://bioconductor.org/packages/ShortRead [GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments [Rsamtools]: https://bioconductor.org/packages/Rsamtools [VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation [rtracklayer]: https://bioconductor.org/packages/rtracklayer [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment [Rsubread]: https://bioconductor.org/packages/Rsubread [edgeR]: https://bioconductor.org/packages/edgeR [limma]: https://bioconductor.org/packages/limma [BaySeq]: https://bioconductor.org/packages/BaySeq [tximport]: https://bioconductor.org/packages/tximport [RNASeqPower]: https://bioconductor.org/packages/RNASeqPower [sva]: https://bioconductor.org/packages/sva [Rbowtie]: https://bioconductor.org/packages/Rbowtie [GO.db]: https://bioconductor.org/packages/GO.db [KEGGREST]: https://bioconductor.org/packages/KEGGREST [org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg19.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [BSgenome.Hsapiens.UCSC.hg19]: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19 [AnnotationHub]: https://bioconductor.org/packages/AnnotationHub [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges [BiocParallel]: https://bioconductor.org/packages/BiocParallel [knitr]: https://cran.r-project.org/package=knitr