--- title: "2. Case Studies" author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
19 October, 2015" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{2. Case Studies} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ``` ```{r packages, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE} suppressPackageStartupMessages({ library(BiocEMBO2015) }) ``` Version: `r packageDescription("BiocEMBO2015")$Version`
Compiled: `r date()` # _R_ data manipulation This case study servers as a refresher / tutorial on basic input and manipulation of data. Input a file that contains ALL (acute lymphoblastic leukemia) patient information ```{r echo=TRUE, eval=FALSE} fname <- file.choose() ## "ALLphenoData.tsv" stopifnot(file.exists(fname)) pdata <- read.delim(fname) ``` ```{r echo=FALSE} fname <- "ALLphenoData.tsv" stopifnot(file.exists(fname)) pdata <- read.delim(fname) ``` Check out the help page `?read.delim` for input options, and explore basic properties of the object you've created, for instance... ```{r ALL-properties} class(pdata) colnames(pdata) dim(pdata) head(pdata) summary(pdata$sex) summary(pdata$cyto.normal) ``` Remind yourselves about various ways to subset and access columns of a data.frame ```{r ALL-subset} pdata[1:5, 3:4] pdata[1:5, ] head(pdata[, 3:5]) tail(pdata[, 3:5], 3) head(pdata$age) head(pdata$sex) head(pdata[pdata$age > 21,]) ``` It seems from below that there are 17 females over 40 in the data set, but when sub-setting `pdata` to contain just those individuals 19 rows are selected. Why? What can we do to correct this? ```{r ALL-subset-NA} idx <- pdata$sex == "F" & pdata$age > 40 table(idx) dim(pdata[idx,]) ``` Use the `mol.biol` column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g., ```{r ALL-BCR/ABL-subset} bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),] ``` The `mol.biol` column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels? ```{r ALL-BCR/ABL-drop-unused} bcrabl$mol.biol <- factor(bcrabl$mol.biol) ``` The `BT` column is a factor describing B- and T-cell subtypes ```{r ALL-BT} levels(bcrabl$BT) ``` How might one collapse B1, B2, ... to a single type B, and likewise for T1, T2, ..., so there are only two subtypes, B and T ```{r ALL-BT-recode} table(bcrabl$BT) levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1) table(bcrabl$BT) ``` Use `xtabs()` (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups ```{r ALL-BCR/ABL-BT} xtabs(~ BT + mol.biol, bcrabl) ``` Use `aggregate()` to calculate the average age of males and females in the BCR/ABL and NEG treatment groups. ```{r ALL-aggregate} aggregate(age ~ mol.biol + sex, bcrabl, mean) ``` Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change? ```{r ALL-age} t.test(age ~ mol.biol, bcrabl) boxplot(age ~ mol.biol, bcrabl) ``` # Short read quality assessment Option 1: `fastqc` 1. Start _fastqc_ 2. Select fastq.gz files from the File --> Open menu. Files are in `/mnt/nfs/practicals/day1/martin_morgan/` 3. Press `OK` 4. Study plots and the Help -> Contents menu Option 2: [ShortRead][] ```{r ShortRead, messages=FALSE} ## 1. attach ShortRead and BiocParallel library(ShortRead) library(BiocParallel) ## 2. create a vector of file paths ## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/' fls <- dir("bigdata", pattern="*fastq.gz", full=TRUE) stopifnot(all(file.exists(fls))) ## 3. collect statistics stats <- qa(fls) ## 4. generate and browse the report browseURL(report(stats)) ``` Check out the qa report from all lanes ```{r ShortRead-qa-all} ## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/' load("bigdata/qa_all.Rda") browseURL(report(qa_all)) ``` # Annotation `org` packages - symbol mapping ```{r org} library(airway) data(airway) library(org.Hs.eg.db) ensid <- head(rownames(airway)) mapIds(org.Hs.eg.db, ensid, "SYMBOL", "ENSEMBL") keytypes(org.Hs.eg.db) ``` `TxDb` packages - known gene models, as _GRanges_ / _GRangesList_ - Easy to make your own, from GFF files `GenomicFeatures::makeTxDbFromGFF()` & friends ```{r txdb} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene exons(txdb) exonsBy(txdb, "tx") p <- promoters(txdb) ``` `BSgenome` - Whole-genome sequences - Possible to make your own, or use other formats -- `Rsamtools::FaFile()`, `tracklayer::TwoBitFile()` ```{r BSgenome} library(BSgenome.Hsapiens.UCSC.hg19) bsgenome <- BSgenome.Hsapiens.UCSC.hg19 ps <- getSeq(bsgenome, p) ps hist(letterFrequency(ps, "GC", as.prob=TRUE)) ``` [AnnotationHub][] - easily accessible genome-scale resources Example: Ensembl 'GTF' files to _R_ / _Bioconductor_ GRanges and _TxDb_ ```{r annotationhub-gtf, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() hub query(hub, c("Ensembl", "80", "gtf")) ## ensgtf = display(hub) # visual choice hub["AH47107"] gtf <- hub[["AH47107"]] gtf txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf) ``` Example: non-model organism `OrgDb` packages ```{r annotationhub-orgdb, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() query(hub, "OrgDb") ``` Example: Map Roadmap epigenomic marks to hg38 - Roadmap BED file as _GRanges_ ```{r annotationhub-roadmap, eval=FALSE} library(AnnotationHub) hub <- AnnotationHub() query(hub , c("EpigenomeRoadMap", "E126", "H3K4ME2")) E126 <- hub[["AH29817"]] ``` - UCSC 'liftOver' file to map coordinates ```{r annotationhub-liftover, eval=FALSE} query(hub , c("hg19", "hg38", "chainfile")) chain <- hub[["AH14150"]] ``` - lift over -- possibly one-to-many mapping, so _GRanges_ to _GRangesList_ ```{r liftover, eval=FALSE} library(rtracklayer) E126hg38 <- liftOver(E126, chain) E126hg38 ``` # Alignments Integrative Genomics Viewer 1. Create an 'igv' directory (if it does not already exist) and add the file hg19_alias.tab to it. This is a simple tab-delimited file that maps between the sequence names used by the alignment, and the sequence names known to IGV. 2. Start igv. 3. Choose hg19 from the drop-down menu at the top left of the screen 4. Use File -> Load from File menu to load a bam file, e.g., `/mnt/nfs/practicals/day1/martin_morgan/SRR1039508_sorted.bam` 5. Zoom in to a particular gene, e.g., SPARCL1, by entering the gene symbol in the box toward the center of the browser window. Adjust the zoom until reads come in to view, and interpret the result. ``` mkdir -p ~/igv/genomes cp bigdata/hg19_alias.tab ~/igv/genomes/ igv ``` _Bioconductor_: we'll explore how to map between different types of identifiers, how to navigate genomic coordinates, and how to query BAM files for aligned reads. 1. Attach 'Annotation' packages containing information about gene symbols `r Biocannopkg("org.Hs.eg.db")` and genomic coordinates (e.g., genes, exons, cds, transcripts) `r Biocannopkg(TxDb.Hsapiens.UCSC.hg19.knownGene)`. Arrange for the 'seqlevels' (chromosome names) in the TxDb package to match those in the BAM files. 2. Use the `org.*` package to map from gene symbol to Entrez gene id, and the `TxDb.*` package to retrieve gene coordinates of the SPARCL1 gene. N.B. -- The following uses a single gene symbol, but we could have used 1, 2, or all gene symbols in a _vectorized_ fashion. 3. Attach the [GenomicAlignments][] package for working with aligned reads. Use `range()` to get the genomic coordinates spanning the first and last exon of SPARCL1. Input paired reads overlapping SPARCL1. 4. What questions can you easily answer about these alignments? E.g., how many reads overlap this region of interest? ```{r setup-view, message=FALSE, warning=FALSE} ## 1.a 'Annotation' packages library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## 1.b -- map 'seqlevels' as recorded in the TxDb file to those in the ## BAM file fl <- "~/igv/genomes/hg19_alias.tab" map <- with(read.delim(fl, header=FALSE, stringsAsFactors=FALSE), setNames(V1, V2)) seqlevels(txdb, force=TRUE) <- map ## 2. Symbol -> Entrez ID -> Gene coordinates sym2eg <- mapIds(org.Hs.eg.db, "SPARCL1", "ENTREZID", "SYMBOL") exByGn <- exonsBy(txdb, "gene") sparcl1exons <- exByGn[[sym2eg]] ## 3. Aligned reads library(GenomicAlignments) ## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/' fl <- "bigdata/SRR1039508_sorted.bam" sparcl1gene <- range(sparcl1exons) param <- ScanBamParam(which=sparcl1gene) aln <- readGAlignmentPairs(fl, param=param) ``` 5. As another exercise we ask how many of the reads we've input are compatible with the known gene model. We have to find the transcripts that belong to our gene, and then exons grouped by transcript. ```{r compatibleAlignments, warning=FALSE} ## 5.a. exons-by-transcript for our gene of interest txids <- select(txdb, sym2eg, "TXID", "GENEID")$TXID exByTx <- exonsBy(txdb, "tx")[txids] ## 5.b compatible alignments hits <- findCompatibleOverlaps(query=aln, subject=exByTx) good <- seq_along(aln) %in% queryHits(hits) table(good) ``` 6. Finally, let's go from gene model to protein coding sequence. (a) Extract CDS regions grouped by transcript, select just transcripts we're interested in, (b) attach and then extract the coding sequence from the appropriate reference genome. Translating the coding sequences to proteins. ```{r coding-sequence, warning=FALSE} ## reset seqlevels restoreSeqlevels(txdb) ## a. cds coordinates, grouped by transcript txids <- mapIds(txdb, sym2eg, "TXID", "GENEID") cdsByTx <- cdsBy(txdb, "tx")[txids] ## b. coding sequence from relevant reference genome library(BSgenome.Hsapiens.UCSC.hg19) dna <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cdsByTx) protein <- translate(dna) ``` # _biomaRt_ annotations **Exercises** Visit the [biomart](http://biomart.org) web service to explore the diversity of annotation offerings available. Load the [biomaRt][] package and list the available marts. Choose the _ensembl_ mart and list the datasets for that mart. Set up a mart to use the _ensembl_ mart and the _hsapiens_gene_ensembl_ dataset. A [biomaRt][] dataset can be accessed via `getBM()`. In addition to the mart to be accessed, this function takes filters and attributes as arguments. Use `filterOptions()` and `listAttributes()` to discover values for these arguments. Call `getBM()` using filters and attributes of your choosing. **Solutions** ```{r biomaRt1, eval=FALSE, results="hide"} library(biomaRt) head(listMarts(), 3) ## list the marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" head(filterOptions(myFilter, ensembl), 3) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl) ``` [AnnotationHub]: https://bioconductor.org/packages/AnnotationHub [BiocParallel]: https://bioconductor.org/packages/BiocParallel [Biostrings]: https://bioconductor.org/packages/Biostrings [DESeq2]: https://bioconductor.org/packages/DESeq2 [GenomicFiles]: https://bioconductor.org/packages/GenomicFiles [GenomicRanges]: https://bioconductor.org/packages/GenomicRanges [GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments [ShortRead]: https://bioconductor.org/packages/ShortRead [biomaRt]: https://bioconductor.org/packages/biomaRt