Sequence Data Representations

useR! 2014
Author: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora
Date: 30 June, 2014

High-throughput sequence data

Alt Files and the Bioconductor packages that input them

DNA / amino acid sequences: FASTA files

Input & manipulation: Biostrings

>NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736
gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt
gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg
...
atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt
>NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454
ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg
cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg
...

Reads: FASTQ files

Input & manipulation: ShortRead readFastq(), FastqStreamer(), FastqSampler()

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################

Aligned reads: BAM files (e.g., ERR127306_chr14.bam)

Input & manipulation: 'low-level' Rsamtools, scanBam(), BamFile(); 'high-level' GenomicAlignments

Called variants: VCF files

Input and manipulation: VariantAnnotation readVcf(), readInfo(), readGeno() selectively with ScanVcfParam().

Genome annotations: BED, WIG, GTF, etc. files

Input: rtracklayer import()

Data representation

Background: Ranges

Alt Ranges Algebra

Ranges

Intra-range methods

Inter-range methods

Between-range methods

Example

require(GenomicRanges)
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
shift(gr, 1)                            # 1-based coordinates!
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [11, 15]      +
##   [2]        A  [21, 25]      +
##   [3]        A  [23, 27]      +
##   ---
##   seqlengths:
##     A
##    NA
range(gr)                               # intra-range
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 26]      +
##   ---
##   seqlengths:
##     A
##    NA
reduce(gr)                              # inter-range
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [10, 14]      +
##   [2]        A  [20, 26]      +
##   ---
##   seqlengths:
##     A
##    NA
coverage(gr)
## RleList of length 1
## $A
## integer-Rle of length 26 with 6 runs
##   Lengths: 9 5 5 2 3 2
##   Values : 0 1 0 1 2 1
setdiff(range(gr), gr)                  # 'introns'
## GRanges with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        A  [15, 19]      +
##   ---
##   seqlengths:
##     A
##    NA

IRangesList, GRangesList

Reference

Biostrings (DNA or amino acid sequences)

Classes

Methods –

Related packages

Example

GenomicAlignments (Aligned reads)

Classes – GenomicRanges-like behaivor

Methods

Example

VariantAnnotation (Called variants)

Classes – GenomicRanges-like behavior

Functions and methods

Example

Related packages

Reference

rtracklayer (Genome annotations)

Big data

Restriction

Iteration

Compression

Parallel processing

Reference

Exercises

Summarize overlaps

The goal is to count the number of reads overlapping exons grouped into genes. This type of count data is the basic input for RNASeq differential expression analysis, e.g., through DESeq2 and edgeR.

  1. Identify the regions of interest. We use a 'TxDb' package with gene models alreaddy defined

    require(TxDb.Hsapiens.UCSC.hg19.knownGene)
    exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
    ## only chromosome 14
    seqlevels(exByGn, force=TRUE) = "chr14"
    
  2. Identify the sample BAM files.

    require(RNAseqData.HNRNPC.bam.chr14)
    length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
    
    ## [1] 8
    
  3. Summarize overlaps, optionally in parallel

    ## next 2 lines optional; non-Windows
    library(BiocParallel)
    register(MulticoreParam(workers=detectCores()))
    olaps <- summarizeOverlaps(exByGn, RNAseqData.HNRNPC.bam.chr14_BAMFILES)
    
  4. Explore our handiwork, e.g., library sizes (column sums), relationship between gene length and number of mapped reads, etc.

    olaps
    
    ## class: SummarizedExperiment 
    ## dim: 779 8 
    ## exptData(0):
    ## assays(1): counts
    ## rownames(779): 10001 100113389 ... 9950 9985
    ## rowData metadata column names(0):
    ## colnames(8): ERR127306 ERR127307 ... ERR127304 ERR127305
    ## colData names(0):
    
    head(assay(olaps))
    
    ##           ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303
    ## 10001           103       139       109       125       152       168
    ## 100113389         0         0         0         0         0         0
    ## 100113391         0         0         0         0         0         0
    ## 100124539         0         0         0         0         0         0
    ## 100126297         0         0         0         0         0         0
    ## 100126308         0         0         0         0         0         0
    ##           ERR127304 ERR127305
    ## 10001           181       150
    ## 100113389         0         0
    ## 100113391         0         0
    ## 100124539         0         0
    ## 100126297         0         0
    ## 100126308         0         0
    
    colSums(assay(olaps))                # library sizes
    
    ## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 
    ##    340646    373268    371639    331518    313800    331135    331606 
    ## ERR127305 
    ##    329647
    
    plot(sum(width(olaps)), rowMeans(assay(olaps)), log="xy")
    
    ## Warning: 252 y values <= 0 omitted from logarithmic plot
    

    plot of chunk summarizeOverlaps-explore

  5. As an advanced exercise, investigate the relationship between GC content and read count

    require(BSgenome.Hsapiens.UCSC.hg19)
    sequences <- getSeq(BSgenome.Hsapiens.UCSC.hg19, rowData(olaps))
    gcPerExon <- letterFrequency(unlist(sequences), "GC")
    gc <- relist(as.vector(gcPerExon), sequences)
    gc_percent <- sum(gc) / sum(width(olaps))
    plot(gc_percent, rowMeans(assay(olaps)), log="y")
    
    ## Warning: 252 y values <= 0 omitted from logarithmic plot
    

    plot of chunk summarizeOverlaps-gc