These notes were created during the course, and serve as a transcript of the topics covered.

Intro to sequencing

Workflow

Experimental design
Wet lab sample prep, etc
Sequencing
- FASTQ file of reads and their quality scores
- Quality assessment (FastQC), trimming or removing contaminants, removing optical duplicates (FASTX, Trimmomatic)
- Quality with respect to your research question
Alignment / (assembly)
- BAM file of aligned reads to a known reference genome
- Aligners: vary from simple to use to hard to use, from ‘good enough’ alignments (for RNA-seq of known genes, ChIP-seq) to high-quality (e.g., DNA-seq calling variants)
- Bowtie2 (easy, good enough), gmap (excellent, hard to use).
- Purpose-built tools that align and reduce. E.g., RNA-seq known gene differential expression – kallisto, Sailfish
Reduction
- BED of called peaks in a ChIP-seq experiment (e.g., MACS, FindPeaks)
- VCF of called variants (GATK, bcftools)
- Count table (e.g., tsv) in an RNA-seq experiment (Python HTSeq's htseq-count; GenomicAlignments::summarizeOverlaps())
(Statistical) analysis
- Why statistical analysis? The data are very large; biological questions are framed in terms of classical statistics, e.g., designed experiments, hypothesis testing; technical and other artifacts have to be accounted for, e.g., GC bias, mappability, batch effects
- Appropriate tools: able to cope with statistics; access to advanced statistical methods; analysis has to be reproducible (some sort of scripting); processing large amounts of data is not the primary criterion.
- R / Bioconductor is the best most awesome tool.
Comprehension
- .Rmd or similar documenting the workflow, including inputs, analysis steps, tables, figures, interpretation…

FASTQ and BAM files

View from the Linux command line…

zcat *fastq.gz | less
samtools view -h *bam | less

… or within R / Bioconductor: fastq files

library(ShortRead)

## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: BiocParallel
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: Rsamtools
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)
fq

## class: ShortReadQ
## length: 100000 reads; width: 63 cycles

sread(fq)

##   A DNAStringSet instance of length 100000
##          width seq
##      [1]    63 CATTGCTGATACCAANNNNNNNNGCATTC...GTCTTCCTCCTTCCCTTACGGAATTACA
##      [2]    63 CCCTGGACTGCTTCTTGAAAAGTGCCATC...CTATCTTTGGGGAGAGTATGATAGAGAT
##      [3]    63 TCGATCCATCGATTGGAAGGCACTGATCT...TCAGGTTGGTGGTCTTATTTGCAAGTCC
##      [4]    63 GAAGAGTTAGCAGCGACCGTGACAGACCA...GCTCCCAACTCCAGGGTGCCAATCCGAT
##      [5]    63 CGTGCAGGAGATCATGATCCCCGCGGGCA...GCCTGGTCATTGGCAAGGGCGGGGAGAC
##      ...   ... ...
##  [99996]    63 GAGAGAAGCTTTGTATGGCTGTCATGCTT...TGATTCCTGCAACTTGACCTTCAGGCTG
##  [99997]    63 TTATGGTGCAGACATGGCCAAGTCCAAGA...CCACACACAACCAGTCCCGAAAATGGCA
##  [99998]    63 TTAAAGTAGAGCATCTAGTTTGAGAAATA...AATTATTAAAGATGTCTTTTTTCTACCC
##  [99999]    63 TCCCAACTGTAGGCTGAGTGACCTGAAGG...AGACTGCCGAAGTCCAAAAGCTTCAGCA
## [100000]    63 GTGTTTTCTGGTATCGTCCCTTCGTGGTT...AAAAAATGGTACTGGAAAGGGGTCCCAA

quality(fq)

## class: FastqQuality
## quality:
##   A BStringSet instance of length 100000
##          width seq
##      [1]    63 HJJJJJJJJJJJJJJ########00?GHI...JIJJJJJJJJJJJJJJJJJHHHFFFFFD
##      [2]    63 HJJJJJJJJJJJJJJJJJIIJIGHIJJJJ...JJJJJJJJJJJJGHHIDHIJJHHHHHHF
##      [3]    63 HJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...GHIJJBGIJCGIAHIJHHHHHHHFFFFF
##      [4]    63 HIJJJJIIJJJJJJJJJJJIJJJJJJJJJ...IHHHHHHFFFFEEEEDC@DDDDDDDDDD
##      [5]    63 HIGGIIIIIIIGHIIIGIHIIIIJGIFAC...@@DDBDDCCDECCDDDB?BBBBBD@B;<
##      ...   ... ...
##  [99996]    63 HJJJJJJJJJJJJGIJJJJJJGGIJJGHH...CHJJJGGHIJJJJJIJJJJJJJJIHHHH
##  [99997]    63 HJJJIJHHIIJJJJIJJJJJIJIJJIJJI...HHFFFFDDDDDDDDCDDDDD@DDDDDDD
##  [99998]    63 HJJJJJJHIJJJJJJJJJJJJJIJJJJJJ...JJJJJJJJJJJJJJJJJJJJJJJJJJIJ
##  [99999]    63 HJJJJJJJJHIJJJJJJJGHIJJJJJJJJ...JJJJJJJJJJJJIJJJJHHHHFFFFFFF
## [100000]    63 HAEFHIJJJJJHIJJJJJJJJJJIHIJFH...IJJJJJIJHHHHHHFFFFFEDD>BDDDD
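
The letters in each quality string encode per-base Phred scores. A minimal base-R sketch of the decoding, assuming the Phred+33 encoding used by Illumina 1.8+ (an assumption about this file, not something stated in the course material). The helper name `decode_phred` is hypothetical; the input is a short stretch of the quality string of read 1 above.

```r
## Assumption: Phred+33 encoding, i.e., score = ASCII code - 33.
## 'decode_phred' is a hypothetical helper, not part of ShortRead.
decode_phred <- function(qual, offset = 33L) {
    utf8ToInt(qual) - offset
}

q <- decode_phred("#00?GHI")   # stretch of the quality string of read 1
q                              # 2 15 15 30 38 39 40

## Phred definition: Q = -10 * log10(p), so p = 10^(-Q / 10)
error_prob <- 10^(-q / 10)
signif(error_prob, 2)
```

The run of `#` characters (score 2) lines up with the run of `N` bases in read 1. ShortRead can do the conversion for a whole object with `as(quality(fq), "matrix")`.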

R

Statistical programming language
Vectorized (works efficiently on vectors; vector notation is very expressive and compact)
Objects help to coordinate management of related data
Introspection helps discover what can be done with objects.
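
A small sketch of what 'vectorized' means in practice: the loop and the single vector expression below compute the same thing, but the second is shorter and avoids the explicit iteration. The data and the seed are arbitrary and only for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(5)

## Element-by-element loop
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
    loop_result[i] <- x[i]^2 + 1
}

## Vectorized: one expression operates on the whole vector
vec_result <- x^2 + 1

all.equal(loop_result, vec_result)
```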

x = rnorm(1000)
y = x + rnorm(1000, sd=.5)
df = data.frame(x=x, y=y)
plot(y ~ x, df)

fit = lm(y ~ x, df)
class(fit)

## [1] "lm"

methods(class=class(fit))

##  [1] add1           alias          anova          case.names    
##  [5] coerce         confint        cooks.distance deviance      
##  [9] dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula       
## [17] hatvalues      influence      initialize     kappa         
## [21] labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print         
## [29] proj           qr             residuals      rstandard     
## [33] rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code

methods("anova")

## [1] anova.glm*     anova.glmlist* anova.lm*      anova.lmlist* 
## [5] anova.loess*   anova.mlm*     anova.nls*    
## see '?methods' for accessing help and source code

Help!

?log
?plot    # generic 'plot'
?plot.lm # plot for objects of class 'lm'

Bioconductor

Main web site, including biocViews
Package landing pages, e.g., ChIPseeker
The support forum
1100+ packages for analysis and comprehension of high-throughput genomic data: sequencing (RNA, ChIP, variants, …), microarray (expression, methylation, copy number, etc), flow cytometry, proteomics, imaging, …

Extensive use of ‘S4’ classes

fit (from lm()) is an example of an S3 class
sread(fq) returned a DNAStringSet, an example of an S4 class
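
To make the S3 / S4 distinction concrete, here is a hedged sketch using only base R and the methods package. The class names `region3` and `Region4` and the generic `regionWidth` are invented for illustration; they are not part of Bioconductor.

```r
library(methods)

## S3: the class is just an attribute; methods are functions named generic.class
p3 <- structure(list(name = "chr1", width = 100L), class = "region3")
print.region3 <- function(x, ...) {
    cat("region3:", x$name, "width", x$width, "\n")
    invisible(x)
}
print(p3)                        # dispatches to print.region3

## S4: formal class definition with typed slots and a validity check
setClass("Region4",
    representation(name = "character", width = "integer"),
    validity = function(object) {
        if (any(object@width < 0L)) "width must be non-negative" else TRUE
    })
setGeneric("regionWidth", function(x) standardGeneric("regionWidth"))
setMethod("regionWidth", "Region4", function(x) x@width)

p4 <- new("Region4", name = "chr1", width = 100L)
regionWidth(p4)                  # accessor, like sread() for ShortReadQ
isS4(p3)
isS4(p4)
```

With S4, `new("Region4", name = "chr1", width = "100")` fails because the slot type is checked, whereas nothing stops an S3 object from carrying a malformed list.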

library(ShortRead)
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)          # 'ShortReadQ' S4 class
class(fq)                 # introspection

## [1] "ShortReadQ"
## attr(,"package")
## [1] "ShortRead"

methods(class=class(fq))

##  [1] [                 [<-               alphabetByCycle  
##  [4] alphabetScore     append            clean            
##  [7] coerce            detail            dustyScore       
## [10] id                length            narrow           
## [13] pairwiseAlignment qa                renew            
## [16] renewable         reverse           reverseComplement
## [19] show              srdistance        srduplicated     
## [22] sread             srorder           srrank           
## [25] srsort            tables            trimEnds         
## [28] trimLRPatterns    trimTails         trimTailw        
## [31] width             writeFasta        writeFastq       
## see '?methods' for accessing help and source code

reads = sread(fq)         # accessor -- get the reads
reads                     # 'DNAStringSet' S4 class

##   A DNAStringSet instance of length 100000
##          width seq
##      [1]    63 CATTGCTGATACCAANNNNNNNNGCATTC...GTCTTCCTCCTTCCCTTACGGAATTACA
##      [2]    63 CCCTGGACTGCTTCTTGAAAAGTGCCATC...CTATCTTTGGGGAGAGTATGATAGAGAT
##      [3]    63 TCGATCCATCGATTGGAAGGCACTGATCT...TCAGGTTGGTGGTCTTATTTGCAAGTCC
##      [4]    63 GAAGAGTTAGCAGCGACCGTGACAGACCA...GCTCCCAACTCCAGGGTGCCAATCCGAT
##      [5]    63 CGTGCAGGAGATCATGATCCCCGCGGGCA...GCCTGGTCATTGGCAAGGGCGGGGAGAC
##      ...   ... ...
##  [99996]    63 GAGAGAAGCTTTGTATGGCTGTCATGCTT...TGATTCCTGCAACTTGACCTTCAGGCTG
##  [99997]    63 TTATGGTGCAGACATGGCCAAGTCCAAGA...CCACACACAACCAGTCCCGAAAATGGCA
##  [99998]    63 TTAAAGTAGAGCATCTAGTTTGAGAAATA...AATTATTAAAGATGTCTTTTTTCTACCC
##  [99999]    63 TCCCAACTGTAGGCTGAGTGACCTGAAGG...AGACTGCCGAAGTCCAAAAGCTTCAGCA
## [100000]    63 GTGTTTTCTGGTATCGTCCCTTCGTGGTT...AAAAAATGGTACTGGAAAGGGGTCCCAA

methods(class=class(reads))

##  [1] !                         !=                       
##  [3] [                         [[                       
##  [5] [[<-                      [<-                      
##  [7] %in%                      <                        
##  [9] <=                        ==                       
## [11] >                         >=                       
## [13] $                         $<-                      
## [15] aggregate                 alphabetFrequency        
## [17] anyNA                     append                   
## [19] as.character              as.complex               
## [21] as.data.frame             as.env                   
## [23] as.integer                as.list                  
## [25] as.logical                as.matrix                
## [27] as.numeric                as.raw                   
## [29] as.vector                 c                        
## [31] chartr                    clean                    
## [33] coerce                    compact                  
## [35] compare                   compareStrings           
## [37] complement                consensusMatrix          
## [39] consensusString           countOverlaps            
## [41] countPattern              countPDict               
## [43] dinucleotideFrequencyTest do.call                  
## [45] droplevels                duplicated               
## [47] dustyScore                elementLengths           
## [49] elementMetadata           elementMetadata<-        
## [51] elementType               endoapply                
## [53] eval                      expand                   
## [55] extractAt                 extractROWS              
## [57] Filter                    Find                     
## [59] findOverlaps              hasOnlyBaseLetters       
## [61] head                      high2low                 
## [63] ifelse                    intersect                
## [65] is.na                     is.unsorted              
## [67] isEmpty                   isMatchingEndingAt       
## [69] isMatchingStartingAt      lapply                   
## [71] length                    lengths                  
## [73] letterFrequency           Map                      
## [75] match                    
##  [ reached getOption("max.print") -- omitted 102 entries ]
## see '?methods' for accessing help and source code

gc = letterFrequency(reads, "GC", as.prob=TRUE)
hist(gc)
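
As a rough base-R illustration of the quantity computed above (the fraction of each read that is G or C), using made-up toy sequences rather than the course data. The function name `gc_fraction` is hypothetical; in real work the Biostrings call is the one to use.

```r
## 'gc_fraction' is a hypothetical helper for illustration only
gc_fraction <- function(seqs) {
    ## count the characters that are G or C in each sequence
    n_gc <- lengths(regmatches(seqs, gregexpr("[GC]", seqs)))
    n_gc / nchar(seqs)
}

toy <- c("GATTACA", "GGCC", "ATAT")   # toy sequences, not course data
gc_fraction(toy)
```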

Help!

?DNAStringSet      # class, and frequently used methods
?letterFrequency   # generic
methods("letterFrequency")
?"letterFrequency,XStringSet-method"

And…

Key software packages…

ShortRead for FASTQ files
GenomicAlignments for aligned reads
VariantAnnotation for VCF files
rtracklayer import() to import BED, WIG, GFF, GTF, …, files
Gviz for visualization of genomic data; ReportingTools for reports; shiny for interactive visualizations

… and classes

DNAStringSet, DNAString for sequence data
GRanges, GRangesList for representing coordinates in genome space
SummarizedExperiment (like ExpressionSet) as an integrated data container: rows x columns (features x samples), with
- assays()
- rowRanges() for annotations on rows
- colData() for column annotations
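
A sketch of how the three parts fit together, built from invented toy data (4 features, 3 samples). It assumes the SummarizedExperiment package is installed; all gene names, coordinates and treatments are made up.

```r
library(SummarizedExperiment)

## Toy data: 4 features (rows) x 3 samples (columns); all values invented
counts <- matrix(1:12, nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

## Row annotation: one genomic range per feature
ranges <- GRanges("chr1", IRanges(start = c(100, 200, 300, 400), width = 50),
    strand = "+")
names(ranges) <- rownames(counts)

## Column annotation: one row of sample information per column of the assay
samples <- DataFrame(treatment = c("control", "treated", "treated"),
    row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
    rowRanges = ranges, colData = samples)

dim(se)
head(assay(se, "counts"), 2)
rowRanges(se)
colData(se)

## Subsetting keeps the assay and both annotations in sync
se[, se$treatment == "treated"]
```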

Annotation

Pure ‘data’ packages
Identifier mapping org.* packages
Gene models with TxDb.* packages
Whole genome sequences BSgenome.* packages
biomaRt for accessing ENSEMBL-based biomarts; AnnotationHub for genome-scale annotation resources

Strategies for working with big data

Write efficient R code – vectorized
Process data in chunks, e.g., FastqStreamer(), Rsamtools::BamFile(..., yieldSize=1000000); GenomicFiles::reduceByYield() (see examples on ?reduceByYield)
Process in parallel, e.g., with BiocParallel
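
The chunk-wise idea can be sketched in base R alone: open a connection, read a fixed number of records, reduce them to a small summary, and repeat until nothing is left. This mirrors the open / yield / reduce pattern of FastqStreamer() but is not its implementation; the toy file and the summary (counting reads that contain an N) are invented for illustration.

```r
## Write a tiny toy FASTQ file (4 lines per record); contents are invented
fl <- tempfile(fileext = ".fastq")
records <- rep(c("@read", "ACGTN", "+", "IIII#"), times = 10)
writeLines(records, fl)

chunk_size <- 4L                 # records per chunk; tiny here, large in practice
con <- file(fl, open = "r")
n_reads <- 0L
n_with_N <- 0L
repeat {
    lines <- readLines(con, n = 4L * chunk_size)
    if (length(lines) == 0L) break                   # end of file
    seqs <- lines[seq(2L, length(lines), by = 4L)]   # 2nd line of each record
    ## reduce: only the running totals are kept in memory
    n_reads <- n_reads + length(seqs)
    n_with_N <- n_with_N + sum(grepl("N", seqs, fixed = TRUE))
}
close(con)
unlink(fl)

c(reads = n_reads, with_N = n_with_N)
```

Memory use is bounded by the chunk size rather than the file size, which is the point of yieldSize in Rsamtools::BamFile() as well.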

All material on the course materials page

Course Notes

Martin Morgan

19/10/2015
