3 Core classes

3.1 Case study: IRanges and GRanges

The IRanges package defines an important class for specifying integer ranges, e.g.,

library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir

## IRanges of length 3
##     start end width
## [1]    10  14     5
## [2]    20  24     5
## [3]    30  34     5

There are many interesting operations to be performed on ranges, e.g., flank() identifies adjacent ranges

flank(ir, 3)

## IRanges of length 3
##     start end width
## [1]     7   9     3
## [2]    17  19     3
## [3]    27  29     3

The IRanges class is part of a class hierarchy. To see this, ask R for the class of ir, and for the class definition of the IRanges class

class(ir)

## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"

getClass(class(ir))

## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                                                       
## Name:            start           width           NAMES     elementType elementMetadata
## Class:         integer         integer characterORNULL       character DataTableORNULL
##                       
## Name:         metadata
## Class:            list
## 
## Extends: 
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges"

Notice that IRanges extends the Ranges class.

Now try entering ?flank (if not using RStudio, enter ?"flank,<tab>" where <tab> means to press the tab key to ask for tab completion). You can see that there are help pages for flank operating on several different classes. Select the completion

?"flank,Ranges-method"

and verify that you’re at the page that describes the method relevant to an IRanges instance. Explore other range-based operations.
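As a starting point for that exploration, the following minimal sketch applies a few other operations to the same ir object. The object named overlapping is made up here only to illustrate reduce().

```r
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)

## intra-range operations: each range is transformed independently
shift(ir, 2)               # move every range 2 positions to the right
resize(ir, 3)              # keep each start, set the width to 3

## inter-range operations: the result depends on all ranges together
range(ir)                  # a single range spanning 10 to 34
overlapping <- IRanges(start=c(1, 3, 10), width=c(5, 5, 2))
reduce(overlapping)        # merges [1, 5] and [3, 7] into [1, 7]
```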

The GenomicRanges package extends the notion of ranges to include features relevant to application of ranges in sequence analysis, particularly the ability to associate a range with a sequence name (e.g., chromosome) and a strand. Create a GRanges instance based on our IRanges instance, as follows

library(GenomicRanges)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The notion of flanking sequence has a more nuanced meaning in biology. In particular we might expect that flanking sequence on the + strand would precede the range, but on the minus strand would follow it. Verify that flank applied to a GRanges object has this behavior.

flank(gr, 3)

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Discover what classes GRanges extends, and find the help page documenting the behavior of flank when applied to a GRanges object.
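One possible way to approach this exercise from the command line is sketched below. It recreates gr so that it runs on its own, and relies on is() and showMethods() from the methods package; the start=FALSE call is an extra illustration of strand-aware flanking.

```r
library(GenomicRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))

is(gr)                       # all classes that a GRanges instance extends
is(gr, "GenomicRanges")      # TRUE, so flank() methods for GenomicRanges apply
showMethods("flank")         # signatures for which flank() methods are defined

## start=FALSE asks for the region following each range; on the minus
## strand that region has smaller coordinates than the range itself
flank(gr, 3, start=FALSE)
```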

There appear to be many helpful methods for working with genomic ranges. We can discover some of them from the command line; note that methods() reports only methods defined in packages that are on the current search() path

methods(class="GRanges")

##   [1] aggregate           anyNA               <=                  <                  
##   [5] ==                  >=                  >                   !=                 
##   [9] append              as.character        as.complex          as.data.frame      
##  [13] as.env              as.factor           as.integer          as.list            
##  [17] as.logical          as.numeric          as.raw              BamViews           
##  [21] bamWhich<-          blocks              browseGenome        c                  
##  [25] chrom<-             chrom               coerce              coerce<-           
##  [29] compare             countOverlaps       coverage            disjoin            
##  [33] disjointBins        distance            distanceToNearest   duplicated         
##  [37] elementMetadata<-   elementMetadata     end<-               end                
##  [41] eval                expand              export              extractROWS        
##  [45] extractUpstreamSeqs findOverlaps        flank               follow             
##  [49] gaps                [<-                 [                   $<-                
##  [53] $                   getPromoterSeq      granges             head               
##  [57] high2low            %in%                intersect           isDisjoint         
##  [61] length              lengths             liftOver            mapCoords          
##  [65] mapFromAlignments   mapFromTranscripts  mapToAlignments     mapToTranscripts   
##  [69] match               mcols<-             mcols               metadata<-         
##  [73] metadata            mstack              names<-             names              
##  [77] narrow              nearest             NROW                Ops                
##  [81] order               overlapsAny         parallelSlotNames   parallelVectorNames
##  [85] pgap                pintersect          pmapCoords          pmapFromAlignments 
##  [89] pmapFromTranscripts pmapToAlignments    pmapToTranscripts   precede            
##  [93] promoters           psetdiff            punion              range              
##  [97] ranges<-            ranges              rank                reduce             
## [101] relistToClass       relist              rename              rep.int            
## [105] replaceROWS         rep                 resize              restrict           
## [109] rev                 ROWNAMES            rowRanges<-         ScanBamParam       
## [113] ScanBcfParam        scanFa              scanTabix           score<-            
## [117] score               seqinfo<-           seqinfo             seqlevelsInUse     
## [121] seqnames<-          seqnames            setdiff             shiftApply         
## [125] shift               showAsCell          show                sort               
## [129] split               split<-             start<-             start              
## [133] strand<-            strand              subsetByOverlaps    subset             
## [137] summarizeOverlaps   table               tail                tapply             
## [141] tile                trim                union               unique             
## [145] update              updateObject        values<-            values             
## [149] width<-             width               window<-            window             
## [153] with                xtfrm              
## see '?methods' for accessing help and source code

Notice that the available flank() methods have been augmented by the methods defined in the GenomicRanges package, including those that are relevant (via inheritance) to the GRanges class.

grep("flank", methods(class="GRanges"), value=TRUE)

## [1] "flank,GenomicRanges-method"

Verify that the help page documents the behavior we just observed.

?"flank,GenomicRanges-method"

Use help() to list the help pages in the GenomicRanges package, and vignette() to view and access available vignettes; these are also available in the RStudio ‘Help’ tab.

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")

3.2 GenomicRanges

3.2.1 The `GRanges` and `GRangesList` classes

Aside: ‘TxDb’ packages provide an R representation of gene models

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

exons(): GRanges

exons(txdb)

## GRanges object with 289969 ranges and 1 metadata column:
##                  seqnames         ranges strand   |   exon_id
##                     <Rle>      <IRanges>  <Rle>   | <integer>
##        [1]           chr1 [11874, 12227]      +   |         1
##        [2]           chr1 [12595, 12721]      +   |         2
##        [3]           chr1 [12613, 12721]      +   |         3
##        [4]           chr1 [12646, 12697]      +   |         4
##        [5]           chr1 [13221, 14409]      +   |         5
##        ...            ...            ...    ... ...       ...
##   [289965] chrUn_gl000241 [35706, 35859]      -   |    289965
##   [289966] chrUn_gl000241 [36711, 36875]      -   |    289966
##   [289967] chrUn_gl000243 [11501, 11530]      +   |    289967
##   [289968] chrUn_gl000243 [13608, 13637]      +   |    289968
##   [289969] chrUn_gl000247 [ 5787,  5816]      -   |    289969
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

[Figure: Genomic Ranges]

exonsBy(): GRangesList

exonsBy(txdb, "tx")

## GRangesList object of length 82960:
## $1 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand |   exon_id   exon_name exon_rank
##          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 [11874, 12227]      + |         1        <NA>         1
##   [2]     chr1 [12613, 12721]      + |         3        <NA>         2
##   [3]     chr1 [13221, 14409]      + |         5        <NA>         3
## 
## $2 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12595, 12721]      + |       2      <NA>         2
##   [3]     chr1 [13403, 14409]      + |       6      <NA>         3
## 
## $3 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12646, 12697]      + |       4      <NA>         2
##   [3]     chr1 [13221, 14409]      + |       5      <NA>         3
## 
## ...
## <82957 more elements>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
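Because a GRangesList behaves like a list of GRanges, per-transcript summaries are short to write. The sketch below assumes the same TxDb package as above; the variable names n_exons and tx_width are introduced here for illustration. For the first transcript shown, the three exon widths are 354 + 109 + 1189 = 1652.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

ex_by_tx <- exonsBy(txdb, "tx")
n_exons <- lengths(ex_by_tx)          # number of exons in each transcript
tx_width <- sum(width(ex_by_tx))      # total exonic width of each transcript
head(n_exons)
head(tx_width)
```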

[Figure: Genomic Ranges List]

GRanges / GRangesList are incredibly useful

Represent annotations – genes, variants, regulatory elements, copy number regions, …
Represent data – aligned reads, ChIP peaks, called variants, …
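To make the two classes concrete without a large annotation package, here is a small hand-built sketch. The group names tx1 and tx2 and the score column are made up for illustration.

```r
library(GenomicRanges)
gr <- GRanges(c("chr1", "chr1", "chr2"),
              IRanges(start=c(10, 20, 30), width=5),
              strand=c("+", "-", "+"))
mcols(gr)$score <- c(1.5, 2.0, 0.5)   # metadata columns travel with the ranges

## a GRangesList groups ranges, e.g., exons grouped by transcript
grl <- GRangesList(tx1=gr[1:2], tx2=gr[3])
length(grl)      # number of list elements (groups)
lengths(grl)     # number of ranges in each group
unlist(grl)      # back to a single GRanges
```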

3.2.2 Algebra of genomic ranges

Many biologically interesting questions represent operations on ranges

Count overlaps between aligned reads and known genes – GenomicAlignments::summarizeOverlaps()
Genes nearest to regulatory regions – GenomicRanges::nearest(), ChIPseeker
Called variants relevant to clinical phenotypes – VariantFiltering

GRanges Algebra

Intra-range methods
- Independent of other ranges in the same object
- GRanges variants strand-aware
- shift(), narrow(), flank(), promoters(), resize(), restrict(), trim()
- See ?"intra-range-methods"
Inter-range methods
- Depends on other ranges in the same object
- range(), reduce(), gaps(), disjoin()
- coverage() (!)
- see ?"inter-range-methods"
Between-range methods
- Functions of two (or more) range objects
- findOverlaps(), countOverlaps(), …, %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()
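The three families of operations listed above can be tried on a toy example. This is a minimal sketch; the ranges in gr and query are made up so that two of the three ranges overlap.

```r
library(GenomicRanges)
gr <- GRanges("chr1", IRanges(start=c(10, 12, 30), width=5), strand="+")
query <- GRanges("chr1", IRanges(start=13, width=1), strand="+")

## intra-range: each range is handled on its own
shift(gr, 1)
## inter-range: [10, 14] and [12, 16] overlap, so they are merged
reduce(gr)
## between-range: compare two objects
countOverlaps(query, gr)    # how many ranges in gr does the query hit?
gr %over% query             # which ranges in gr overlap the query?
```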

[Figure: Ranges Algebra]

3.3 Biostrings (DNA or amino acid sequences)

Classes

XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads)

Methods

Cheat sheet
Manipulation, e.g., reverseComplement()
Summary, e.g., letterFrequency()
Matching, e.g., matchPDict(), matchPWM()

Related packages

BSgenome
Whole-genome representations
Model and custom
ShortRead
FASTQ files

Example

Whole-genome sequences are distributed by ENSEMBL, NCBI, and others as FASTA files; model organism whole genome sequences are packaged into more user-friendly BSgenome packages. The following calculates GC content across chr14.

library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range <- GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
chr14_dna <- getSeq(Hsapiens, chr14_range)
letterFrequency(chr14_dna, "GC", as.prob=TRUE)

##           G|C
## [1,] 0.336276
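The same functions can be tried on toy sequences without installing a genome package. In this sketch the sequences and the names r1 and r2 are made up for illustration.

```r
library(Biostrings)
dna <- DNAString("AAACCG")                  # a single toy sequence
reverseComplement(dna)                      # CGGTTT
letterFrequency(dna, "GC", as.prob=TRUE)    # 3 of 6 letters are G or C

## a set of sequences, e.g., reads; one row of output per sequence
reads <- DNAStringSet(c(r1="ACGT", r2="GGGC"))
letterFrequency(reads, "GC", as.prob=TRUE)
```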

3.4 GenomicAlignments (Aligned reads)

Classes – GenomicRanges-like behavior

GAlignments, GAlignmentPairs, GAlignmentsList

Methods

readGAlignments(), readGAlignmentsList()
Easy to restrict input, iterate in chunks
summarizeOverlaps()

Example

Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14

library(GenomicRanges)
library(GenomicAlignments)
library(Rsamtools)

## our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1)) 
## sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
## alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]

## supporting reads
paln[j_overlap$revmap[[1]]]

## GAlignmentsList object of length 8:
## [[1]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand      cigar qwidth    start      end width njunc
##   [1]    chr14      -  66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      + 7M1270N65M     72 19652348 19653689  1342     1
## 
## [[2]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      - 66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      +       72M     72 19653686 19653757    72     0
## 
## [[3]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      +       72M     72 19653675 19653746    72     0
##   [2]    chr14      - 65M120N7M     72 19653708 19653899   192     1
## 
## ...
## <5 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome
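The arithmetic behind the junction position (19653707 + 66M = 19653773) can be checked with a single hand-built alignment. This is a sketch that copies the start position and CIGAR string of the first supporting read; the object name gal is introduced here for illustration.

```r
library(GenomicAlignments)
## one alignment with the start and CIGAR of the first supporting read
gal <- GAlignments(seqnames=Rle(factor("chr14")), pos=19653707L,
                   cigar="66M120N6M", strand=strand("-"))

## 66 matched + 120 skipped + 6 matched positions on the reference
cigarWidthAlongReferenceSpace(cigar(gal))
end(gal)          # start + reference width - 1
junctions(gal)    # the skipped region (120N), starting 66 bases after start
```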

3.5 VariantAnnotation (Called variants)

Classes – GenomicRanges-like behavior

VCF – ‘wide’
VRanges – ‘tall’

Functions and methods

I/O and filtering: readVcf(), readGeno(), readInfo(), readGT(), writeVcf(), filterVcf()
Annotation: locateVariants() (variants overlapping ranges), predictCoding(), summarizeVariants()
SNPs: genotypeToSnpMatrix(), snpSummary()

Example

Read variants from a VCF file, and annotate with respect to a known gene model

## input variants
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf) <- "chr22"
## known gene model
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
coding <- locateVariants(rowRanges(vcf),
    TxDb.Hsapiens.UCSC.hg19.knownGene,
    CodingVariants())
head(coding)

## GRanges object with 6 ranges and 9 metadata columns:
##     seqnames               ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
##        <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
##   1    chr22 [50301422, 50301422]      - |   coding       939       939        24       75253
##   2    chr22 [50301476, 50301476]      - |   coding       885       885        25       75253
##   3    chr22 [50301488, 50301488]      - |   coding       873       873        26       75253
##   4    chr22 [50301494, 50301494]      - |   coding       867       867        27       75253
##   5    chr22 [50301584, 50301584]      - |   coding       777       777        28       75253
##   6    chr22 [50302962, 50302962]      - |   coding       698       698        57       75253
##             CDSID      GENEID       PRECEDEID        FOLLOWID
##     <IntegerList> <character> <CharacterList> <CharacterList>
##   1        218562       79087                                
##   2        218562       79087                                
##   3        218562       79087                                
##   4        218562       79087                                
##   5        218562       79087                                
##   6        218563       79087                                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
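The ‘wide’ VCF object follows the SummarizedExperiment layout, with variants as rows and samples as columns. The sketch below reads the same example file and looks at its main components with accessors from VariantAnnotation.

```r
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")

dim(vcf)                 # variants (rows) x samples (columns)
head(rowRanges(vcf), 3)  # genomic position of each variant
geno(vcf)$GT[1:3, ]      # genotype calls, one column per sample
info(vcf)[1:3, 1:4]      # variant-level INFO fields
```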

Related packages

ensemblVEP
Forward variants to Ensembl Variant Effect Predictor
VariantTools, h5vc
Call variants

Reference

Obenchain, V, Lawrence, M, Carey, V, Gogarten, S, Shannon, P, and Morgan, M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics, first published online March 28, 2014 doi:10.1093/bioinformatics/btu168

3.6 rtracklayer (Genome annotations)

import(): BED, GTF, WIG, 2bit, etc
export(): GRanges to BED, GTF, WIG, …
Access UCSC genome browser
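A round trip through a temporary BED file illustrates import() and export(). This is a minimal sketch; the ranges are the toy example used earlier and the file is created with tempfile().

```r
library(rtracklayer)
gr <- GRanges(c("chr1", "chr1", "chr2"),
              IRanges(start=c(10, 20, 30), width=5),
              strand=c("+", "-", "+"))

bed <- tempfile(fileext=".bed")
export(gr, bed, format="BED")     # BED is 0-based on disk; export() converts
gr2 <- import(bed, format="BED")  # import() converts back to 1-based ranges
gr2
```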

3.7 SummarizedExperiment

Integrate experimental data with sample, feature, and experiment-wide annotations
Matrix where rows are indexed by genomic ranges, columns by a DataFrame.

[Figure: SummarizedExperiment]

Functions and methods

Accessors: assay() / assays(), rowData() / rowRanges(), colData(), metadata()
Range-based operations, especially subsetByOverlaps()
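A small hand-built object shows how the pieces fit together. In this sketch the count matrix, the sample names s1 and s2, and the treatment column are all made up for illustration.

```r
library(SummarizedExperiment)
counts <- matrix(1:6, nrow=3, dimnames=list(NULL, c("s1", "s2")))
rowRanges <- GRanges(c("chr1", "chr1", "chr2"),
                     IRanges(start=c(10, 20, 30), width=5),
                     strand=c("+", "-", "+"))
colData <- DataFrame(treatment=c("ctl", "trt"), row.names=c("s1", "s2"))

se <- SummarizedExperiment(assays=list(counts=counts),
                           rowRanges=rowRanges, colData=colData)
dim(se)          # features x samples
assay(se)        # the count matrix
colData(se)      # sample annotations

## range-based subsetting keeps assay rows and row ranges in sync
roi <- GRanges("chr1", IRanges(1, 25))
subsetByOverlaps(se, roi)
```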

4 Input & representation of standard file formats

4.1 BAM files of aligned reads – `GenomicAlignments`

Recall: overall workflow

Experimental design
Wet-lab preparation
High-throughput sequencing
Alignment
- Whole genome, vs. transcriptome
Summary
Statistical analysis
Comprehension

BAM files of aligned reads

Header

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
...
@SQ     SN:chrY LN:59373566
@PG     ID:TopHat       VN:2.0.8b       CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq

Alignments

ID, flag, alignment and mate

ERR127306.7941162       403     chr14   19653689        3       72M             =       19652348        -1413  ...
ERR127306.22648137      145     chr14   19653692        1       72M             =       19650044        -3720  ...

Sequence and quality

... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC        *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG        '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****

Tags

... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:2  CC:Z:chr22      CP:i:16189276   HI:i:0
... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:3  CC:Z:=  CP:i:19921600   HI:i:0

Typically, sorted (by position) and indexed (‘.bai’ files)
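The flag column (403 and 145 in the two records above) packs several yes/no properties into one integer. The sketch below decodes it with base R only. The bit values come from the SAM specification, the names are borrowed from the arguments of Rsamtools::scanBamFlag(), and decode_flag() is a hypothetical helper, not part of any package.

```r
## bit values from the SAM specification; names follow scanBamFlag()
flag_bits <- c(isPaired=1L, isProperPair=2L, isUnmappedQuery=4L,
               hasUnmappedMate=8L, isMinusStrand=16L, isMateMinusStrand=32L,
               isFirstMateRead=64L, isSecondMateRead=128L,
               isSecondaryAlignment=256L, isNotPassingQualityControls=512L,
               isDuplicate=1024L)

## hypothetical helper: return the names of the bits that are set
decode_flag <- function(flag)
    names(flag_bits)[(flag %/% flag_bits) %% 2L == 1L]

decode_flag(403L)   # 403 = 256 + 128 + 16 + 2 + 1
decode_flag(145L)   # 145 = 128 + 16 + 1
```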

GenomicAlignments

Use an example BAM file (fl could be the path to your own BAM file)

## example BAM data
library(RNAseqData.HNRNPC.bam.chr14)
## one BAM file
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
## Let R know that this is a BAM file, not just a character vector
library(Rsamtools)
bfl <- BamFile(fl)

Input the data into R

aln <- readGAlignments(bfl)
aln

## GAlignments object with 800484 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width     njunc
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##        [1]    chr14      +         72M        72  19069583  19069654        72         0
##        [2]    chr14      +         72M        72  19363738  19363809        72         0
##        [3]    chr14      -         72M        72  19363755  19363826        72         0
##        [4]    chr14      +         72M        72  19369799  19369870        72         0
##        [5]    chr14      -         72M        72  19369828  19369899        72         0
##        ...      ...    ...         ...       ...       ...       ...       ...       ...
##   [800480]    chr14      -         72M        72 106989780 106989851        72         0
##   [800481]    chr14      +         72M        72 106994763 106994834        72         0
##   [800482]    chr14      -         72M        72 106994819 106994890        72         0
##   [800483]    chr14      +         72M        72 107003080 107003151        72         0
##   [800484]    chr14      -         72M        72 107003171 107003242        72         0
##   -------
##   seqinfo: 93 sequences from an unspecified genome

readGAlignmentPairs() / readGAlignmentsList() if paired-end data
Lots of things to do, including all the GRanges / GRangesList operations

methods(class=class(aln))

##   [1] aggregate              anyNA                  <=                     <                     
##   [5] ==                     >=                     >                      !=                    
##   [9] append                 as.character           as.complex             as.data.frame         
##  [13] as.env                 as.integer             as.list                as.logical            
##  [17] as.numeric             as.raw                 c                      cigar                 
##  [21] coerce                 compare                countOverlaps          coverage              
##  [25] duplicated             elementMetadata<-      elementMetadata        end                   
##  [29] eval                   expand                 export                 extractROWS           
##  [33] findCompatibleOverlaps findOverlaps           findSpliceOverlaps     granges               
##  [37] grglist                head                   high2low               %in%                  
##  [41] junctions              length                 lengths                mapCoords             
##  [45] mapFromAlignments      mapToAlignments        match                  mcols<-               
##  [49] mcols                  metadata<-             metadata               mstack                
##  [53] names<-                names                  narrow                 njunc                 
##  [57] NROW                   overlapsAny            parallelSlotNames      pintersect            
##  [61] pmapCoords             pmapFromAlignments     pmapToAlignments       qnarrow               
##  [65] qwidth                 ranges                 rank                   relistToClass         
##  [69] relist                 rename                 rep.int                replaceROWS           
##  [73] rep                    rev                    rglist                 rname<-               
##  [77] rname                  ROWNAMES               seqinfo<-              seqinfo               
##  [81] seqlevelsInUse         seqnames<-             seqnames               shiftApply            
##  [85] showAsCell             show                   sort                   split                 
##  [89] split<-                start                  strand<-               strand                
##  [93] subsetByOverlaps       subset                 summarizeOverlaps      table                 
##  [97] tail                   tapply                 unique                 update                
## [101] updateObject           values<-               values                 [<-                   
## [105] [                      width                  window<-               window                
## [109] with                   xtfrm                 
## see '?methods' for accessing help and source code

Caveat emptor: BAM files are large. Normally you will restrict the input to particular genomic ranges, or iterate through the BAM file. Key Bioconductor functions (e.g., GenomicAlignments::summarizeOverlaps()) do this data management step for you. See next section!
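Both strategies can be sketched with Rsamtools, using the same example BAM file. The region of interest and the yieldSize of 100000 are arbitrary choices for illustration, and the loop body simply counts records where a real analysis would compute a per-chunk summary.

```r
library(GenomicAlignments)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]

## (1) restrict input to a region of interest (requires the '.bai' index)
roi <- GRanges("chr14", IRanges(19653707, 19653898))
aln_roi <- readGAlignments(BamFile(fl), param=ScanBamParam(which=roi))

## (2) iterate through the file in chunks of yieldSize records
bfl <- BamFile(fl, yieldSize=100000)
open(bfl)
n <- 0L
while (length(chunk <- readGAlignments(bfl))) {
    n <- n + length(chunk)     # replace with any per-chunk summary
}
close(bfl)
n                              # total number of alignments seen
```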

4.2 Other formats and packages

[Figure: Files and the Bioconductor packages that input them]

Genomic Ranges For Genome-Scale Data And Annotation

Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015

Contents

1 Bioconductor ‘infrastructure’ for sequence analysis

1.1 Classes, methods, and packages

1.2 Motivation

2 Core packages

3 Core classes

3.1 Case study: IRanges and GRanges

3.2 GenomicRanges

3.2.1 The `GRanges` and `GRangesList` classes

3.2.2 Algebra of genomic ranges

3.3 Biostrings (DNA or amino acid sequences)

3.4 GenomicAlignments (Aligned reads)

3.5 VariantAnnotation (Called variants)

3.6 rtracklayer (Genome annotations)

3.7 SummarizedExperiment

4 Input & representation of standard file formats

4.1 BAM files of aligned reads – `GenomicAlignments`

4.2 Other formats and packages

5 Resources

5.1 `sessionInfo()`
