Introduction to Bioconductor

22 July | CSAMA 2019

Introduction

Started 2002 as a platform for understanding analysis of microarray data

More than 1,741 packages. Domains of expertise:

Sequencing (RNASeq, ChIPSeq, single-cell, called variants, …)
Microarrays (methylation, expression, copy number, …)
flow cytometry
proteomics
…

Important themes

Reproducible research
Interoperability between packages & work kflows
Usability

Resources

https://bioconductor.org
https://bioconductor.org/packages – software, annotation, experiment, workflow
https://support.bioconductor.org

High-throughput sequence work flow

Genomic Sequences

Biostrings

Valid data, e.g., alphabet
'Vector' interface: length(), [, …
Specialized operations, e.g,. reverseComplement()

Classes

BString / BStringSet
DNAString / DNAStringSet
RNAString / RNAStringSet
AAString / AAStringSet

DNA Sequence Example

library(Biostrings)

dna <- DNAStringSet(c("AAACTG", "CCCTTCAAC", "TACGAA"))
dna

##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 AAACTG
## [2]     9 CCCTTCAAC
## [3]     6 TACGAA

length(dna)

## [1] 3

dna[c(1, 3, 1)]

##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 AAACTG
## [2]     6 TACGAA
## [3]     6 AAACTG

DNA Sequence Example cont.

width(dna)

## [1] 6 9 6

as.character(dna[c(1,2)])

## [1] "AAACTG"    "CCCTTCAAC"

reverseComplement(dna)

##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 CAGTTT
## [2]     9 GTTGAAGGG
## [3]     6 TTCGTA

Importing Genomic Sequence

Import Methods from FASTA/FASTQ

readBStringSet / readDNAStringSet / readRNAStringSet / readAAStringSet

filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
x1 <- readDNAStringSet(filepath1)
x1

##   A DNAStringSet instance of length 7
##     width seq                                          names               
## [1]  5573 ACTTGTAAATATATCTTTTAT...ATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
## [2]  5825 TTCCAAGGCCGATGAATTCGA...AAATTTTTTTCTATTCTCTT YAL002W VPS8 SGDI...
## [3]  2987 CTTCATGTCAGCCTGCACTTC...TACTCATGTAGCTGCCTCAT YAL003W EFB1 SGDI...
## [4]  3929 CACTCATATCGGGGGTCTTAC...CCCGAAACACGAAAAAGTAC YAL005C SSA1 SGDI...
## [5]  2648 AGAGAAAGAGTTTCACTTCTT...TAATTTATGTGTGAACATAG YAL007C ERP2 SGDI...
## [6]  2597 GTGTCCGGGCCTCGCAGGCGT...TTTTGGCAGAATGTACTTTT YAL008W FUN14 SGD...
## [7]  2780 CAAGATAATGTCAAAGTTAGT...AAGGAAGAAAAAAAAATCAC YAL009W SPO7 SGDI...

Importing Genomic Sequences

Utilizing BSgenome Packages

BSgenome packages contain sequence information for a given species/build. There are many such packages - you can get a listing using BiocManager::available("BSgenome")

library(BSgenome)
head(BiocManager::available("BSgenome"))

## [1] "BSgenome"                               
## [2] "BSgenome.Alyrata.JGI.v1"                
## [3] "BSgenome.Amellifera.BeeBase.assembly4"  
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"

BSgenome

We can load and inspect a BSgenome package

library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19

## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Subsetting a BSgenome

The main accessor is getSeq, and you can get data by sequence (e.g., entire chromosome or unplaced scaffold), or by passing in a GRanges object, to get just a region.

getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1")

##   249250621-letter "DNAString" instance
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

getSeq(BSgenome.Hsapiens.UCSC.hg19, GRanges("chr6:35310335-35395968"))

##   A DNAStringSet instance of length 1
##     width seq
## [1] 85634 GCGGAGCGTGTGACGCTGCGGCCGCCGCGGA...TGGGAATTAAATATTTAAGAGCTGACTGGAA

What is a GRanges object?

giphy.com

Genomic Ranges

GenomicRanges

GenomicRanges objects allow for easy selection and subsection of data based on genomic position information.

Where are GenomicRanges used?

Everywhere…

Data (e.g., aligned reads, called peaks, copy number)
Annotations (e.g., genes, exons, transcripts, TxDb)
Close relation to BED files (see rtracklayer::import.bed() and HelloRanges)
Anywhere there is Genomic positioning information

Utility

Also vector interface – length(), [, etc.
Tidyverse: plyranges

GenomicRanges

library(GenomicRanges)
gr <- GRanges(c("chr1:100-120", "chr1:115-130"))
gr

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   100-120      *
##   [2]     chr1   115-130      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr <- GRanges(c("chr1:100-120", "chr1:115-130", "chr2:200-220"),
strand=c("+","+","-"), GC = seq(1, 0, length=3), id = paste0("id",1:3))
gr

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |        GC          id
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr1   100-120      + |         1         id1
##   [2]     chr1   115-130      + |       0.5         id2
##   [3]     chr2   200-220      - |         0         id3
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

There are lots of utility functions for GRange objects

Help!

methods(class="GRanges")
methods(class="GRangesList")
?"GRanges"
?"GRangesList"

GRanges functions

Intra-range operations

e.g., shift(), narrow(), flank()

Inter-range operations

e.g., reduce(), coverage(), gaps(), disjoin()

Between-range operations

e.g., countOverlaps(), findOverlaps(), summarizeOverlaps()

GRanges Example

shift(gr, 1)

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |        GC          id
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr1   101-121      + |         1         id1
##   [2]     chr1   116-131      + |       0.5         id2
##   [3]     chr2   201-221      - |         0         id3
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

reduce(gr)

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   100-130      +
##   [2]     chr2   200-220      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

anno <- GRanges(c("chr1:110-150", "chr2:150-210"))
countOverlaps(anno, gr)

## [1] 2 1

Lists of Genomic Ranges

- e.g., exons-within-transcripts, alignments-within-reads

More examples

Returning to BSGenome: Get the sequences for three UTR regions? threeUTRsByTranscript() returns a GRangesList

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

threeUTR <-  threeUTRsByTranscript(txdb, use.names=TRUE)
threeUTR_seq <- extractTranscriptSeqs(Hsapiens, threeUTR)
options(showHeadLines = 3, showTailLines = 2)
threeUTR_seq

##   A DNAStringSet instance of length 60740
##         width seq                                      names               
##     [1]   770 TGCCCGTTGGAGAAAACAG...GCACACTGTTGGTTTCTG uc010nxq.1
##     [2]  1333 CTGTGAGGCCATTTCCAGG...CCTCCCACGCGGACAGAG uc009vjk.2
##     [3]  2976 CTGTGAGGCCATTTCCAGG...GAAAAATATCGCCCACGA uc001aau.3
##     ...   ... ...
## [60739]  1191 GCACCCTGACCCTATTCAG...GGAATTTGTATTTAATAA uc011mgh.2
## [60740]  1191 GCACCCTGACCCTATTCAG...GGAATTTGTATTTAATAA uc011mgi.2

More examples

How many genes are between 2858473 and 3271812 on chr2 in the hg19 genome?

gns <- genes(txdb)
gns[gns %over%  GRanges("chr2:2858473-3271812")]

## GRanges object with 1 range and 1 metadata column:
##        seqnames          ranges strand |     gene_id
##           <Rle>       <IRanges>  <Rle> | <character>
##   7260     chr2 3192741-3381653      - |        7260
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

## OR

subsetByOverlaps(gns, GRanges("chr2:2858473-3271812"))

## GRanges object with 1 range and 1 metadata column:
##        seqnames          ranges strand |     gene_id
##           <Rle>       <IRanges>  <Rle> | <character>
##   7260     chr2 3192741-3381653      - |        7260
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Importing a BED file

We said earlier, GRanges are closely related to bed files. Lets look at the example in the rtracklayer::import.bed help page:

library(rtracklayer)
test_bed <- system.file(package = "rtracklayer", "tests", "test.bed")
test <- import(test_bed)
test

## UCSC track 'ItemRGBDemo'
## UCSCData object with 5 ranges and 5 metadata columns:
##       seqnames              ranges strand |        name     score
##          <Rle>           <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr7 127471197-127472363      + |        Pos1         0
##   [2]     chr7 127472364-127473530      + |        Pos2         2
##   [3]     chr7 127473531-127474697      - |        Neg1         0
##   [4]     chr9 127474698-127475864      + |        Pos3         5
##   [5]     chr9 127475865-127477031      - |        Neg2         5
##           itemRgb               thick                  blocks
##       <character>           <IRanges>           <IRangesList>
##   [1]     #FF0000 127471197-127472363 1-300,501-700,1068-1167
##   [2]     #FF0000 127472364-127473530          1-250,668-1167
##   [3]     #FF0000 127473531-127474697                  1-1167
##   [4]     #FF0000 127474698-127475864                  1-1167
##   [5]     #0000FF 127475865-127477031                  1-1167
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Bed file continued

In fact this class Extends the GenomicRange GRange class

is(test, "GRanges")

## [1] TRUE

So you can use GRange functions

subsetByOverlaps(test, GRanges("chr7:127471197-127472368"))

## UCSC track 'ItemRGBDemo'
## UCSCData object with 2 ranges and 5 metadata columns:
##       seqnames              ranges strand |        name     score
##          <Rle>           <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr7 127471197-127472363      + |        Pos1         0
##   [2]     chr7 127472364-127473530      + |        Pos2         2
##           itemRgb               thick                  blocks
##       <character>           <IRanges>           <IRangesList>
##   [1]     #FF0000 127471197-127472363 1-300,501-700,1068-1167
##   [2]     #FF0000 127472364-127473530          1-250,668-1167
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Side Note:

Utilizing Bioconductor recommended import/export methods, classes, etc. has other benefits as well…

BED files have 0-based half-open intervals (left end point included, right endpoint 'after' the end of the range),

whereas in other parts of the bioinformatic community and in bioc the coordinates are 1-based and closed

Using import() converts BED coordinates into bioc coordinates.

Working with BAM files

library(GenomicAlignments)
fls <- list.files(system.file("extdata", package="GenomicAlignments"),
                  recursive=TRUE, pattern="*bam$", full=TRUE)
names(fls) <- basename(fls)
bf <- BamFileList(fls, index=character(), yieldSize=1000)
genes <- GRanges(
    seqnames = rep(c("chr2L", "chr2R", "chr3L"), c(4, 5, 2)),
    ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600, 
                       4000, 7500, 5000, 5400), 
                     width=c(rep(500, 3), 600, 900, 500, 300, 900, 
                             300, 500, 500))) 
se <- summarizeOverlaps(genes, bf)
se

## class: RangedSummarizedExperiment 
## dim: 11 2 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(2): sm_treated1.bam sm_untreated1.bam
## colData names(0):

# Start differential expression analysis with the DESeq2 or edgeR

What is a RangedSummarizedExperiment?

giphy.com

SummarizedExperiments

SummarizedExperiment objects are popular objects for representing expression data and other rectangular data (feature x sample data). Incoming packages are now strongly recommended to use this class representation instead of ExpressionSet.

three components: underlying 'matrix' 'row' annotations (genomic features) 'column' annotations (sample descriptions)

ExpressionSet

giphy.com

Summarized Experiments

Components 1.

Main matrix of values / count data / expression values / etc …

counts <- as.matrix(read.csv("lecture-02-data/airway_counts.csv", row.names=1))
head(counts, 3)

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229

Component 2.

Sample data / Phenotypic data / Sample specific information / etc …

colData <- read.csv("lecture-02-data/airway_colData.csv", row.names=1)
colData[, 1:4]

##            SampleName    cell   dex albut
## SRR1039508 GSM1275862  N61311 untrt untrt
## SRR1039509 GSM1275863  N61311   trt untrt
## SRR1039512 GSM1275866 N052611 untrt untrt
## SRR1039513 GSM1275867 N052611   trt untrt
## SRR1039516 GSM1275870 N080611 untrt untrt
## SRR1039517 GSM1275871 N080611   trt untrt
## SRR1039520 GSM1275874 N061011 untrt untrt
## SRR1039521 GSM1275875 N061011   trt untrt

Component 3.

Genomic position information / Information about features / etc …

rowRanges <- readRDS("lecture-02-data/airway_rowRanges.rds")
options(showHeadLines = 3, showTailLines = 2)
rowRanges

## GRangesList object of length 33469:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    ...      ...               ...    ... .       ...             ...
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome
## 
## ...
## <33468 more elements>

Benefit of Containers

Could manipulate independently…

cidx <- colData$dex == "trt"
plot(
    rowMeans(1 + counts[, cidx]) ~ rowMeans(1 + counts[, !cidx]),
    log="xy"
)

very fragile, e.g., what if colData rows had been re-ordered?

SummarizedExperiment

Solution: coordinate in a single object – SummarizedExperiment

library(SummarizedExperiment)

se <- SummarizedExperiment(counts, rowRanges = rowRanges, colData = colData)
cidx <- se$dex == "trt"
plot(
    rowMeans(1 + assay(se)[, cidx]) ~ rowMeans(1 + assay(se)[, !cidx]),
    log="xy"
)

Benefits

Much more robust to 'bookkeeping' errors
matrix-like interface: dim(), two-dimensional [, …
accessors: assay(), rowData() / rowRanges(), colData(), …

dim(se)

## [1] 33469     8

colnames(se[1:2,3:4])

## [1] "SRR1039512" "SRR1039513"

names(colData)

## [1] "SampleName" "cell"       "dex"        "albut"      "Run"       
## [6] "avgLength"  "Experiment" "Sample"     "BioSample"

Help

In R:

methods(class="SummarizedExperiment")
?"SummarizedExperiment"
browseVignettes(package="SummarizedExperiment")

Support site, https://support.bioconductor.org
Bioc-devel, bioc-devel@r-project.org

Website: https://bioconductor.org

Introduction

High-throughput sequence work flow

Genomic Sequences

Genomic Sequences

DNA Sequence Example

DNA Sequence Example cont.

Importing Genomic Sequence

Importing Genomic Sequences

BSgenome

Subsetting a BSgenome

What is a GRanges object?

Genomic Ranges

Where are GenomicRanges used?

GenomicRanges

GRanges functions

GRanges Example

Lists of Genomic Ranges

More examples

More examples

Importing a BED file

Bed file continued

Side Note:

Working with BAM files

What is a RangedSummarizedExperiment?

SummarizedExperiments

ExpressionSet

Summarized Experiments

Components 1.

Component 2.

Component 3.

Benefit of Containers

SummarizedExperiment

Benefits

Popular

Help