22 July | CSAMA 2019

Introduction

Started 2002 as a platform for understanding analysis of microarray data

More than 1,741 packages. Domains of expertise:

  • Sequencing (RNASeq, ChIPSeq, single-cell, called variants, …)
  • Microarrays (methylation, expression, copy number, …)
  • flow cytometry
  • proteomics

Important themes

  • Reproducible research
  • Interoperability between packages & work kflows
  • Usability

Resources

High-throughput sequence work flow

Genomic Sequences

Genomic Sequences

Biostrings

  • Valid data, e.g., alphabet
  • 'Vector' interface: length(), [, …
  • Specialized operations, e.g,. reverseComplement()

Classes

  • BString / BStringSet
  • DNAString / DNAStringSet
  • RNAString / RNAStringSet
  • AAString / AAStringSet

DNA Sequence Example

library(Biostrings)

dna <- DNAStringSet(c("AAACTG", "CCCTTCAAC", "TACGAA"))
dna
##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 AAACTG
## [2]     9 CCCTTCAAC
## [3]     6 TACGAA
length(dna)
## [1] 3
dna[c(1, 3, 1)]
##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 AAACTG
## [2]     6 TACGAA
## [3]     6 AAACTG

DNA Sequence Example cont.

width(dna)
## [1] 6 9 6
as.character(dna[c(1,2)])
## [1] "AAACTG"    "CCCTTCAAC"
reverseComplement(dna)
##   A DNAStringSet instance of length 3
##     width seq
## [1]     6 CAGTTT
## [2]     9 GTTGAAGGG
## [3]     6 TTCGTA

Importing Genomic Sequence

Import Methods from FASTA/FASTQ

  • readBStringSet / readDNAStringSet / readRNAStringSet / readAAStringSet
filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
x1 <- readDNAStringSet(filepath1)
x1
##   A DNAStringSet instance of length 7
##     width seq                                          names               
## [1]  5573 ACTTGTAAATATATCTTTTAT...ATCGACCTTATTGTTGATAT YAL001C TFC3 SGDI...
## [2]  5825 TTCCAAGGCCGATGAATTCGA...AAATTTTTTTCTATTCTCTT YAL002W VPS8 SGDI...
## [3]  2987 CTTCATGTCAGCCTGCACTTC...TACTCATGTAGCTGCCTCAT YAL003W EFB1 SGDI...
## [4]  3929 CACTCATATCGGGGGTCTTAC...CCCGAAACACGAAAAAGTAC YAL005C SSA1 SGDI...
## [5]  2648 AGAGAAAGAGTTTCACTTCTT...TAATTTATGTGTGAACATAG YAL007C ERP2 SGDI...
## [6]  2597 GTGTCCGGGCCTCGCAGGCGT...TTTTGGCAGAATGTACTTTT YAL008W FUN14 SGD...
## [7]  2780 CAAGATAATGTCAAAGTTAGT...AAGGAAGAAAAAAAAATCAC YAL009W SPO7 SGDI...

Importing Genomic Sequences

Utilizing BSgenome Packages

BSgenome packages contain sequence information for a given species/build. There are many such packages - you can get a listing using BiocManager::available("BSgenome")

library(BSgenome)
head(BiocManager::available("BSgenome"))
## [1] "BSgenome"                               
## [2] "BSgenome.Alyrata.JGI.v1"                
## [3] "BSgenome.Amellifera.BeeBase.assembly4"  
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"

BSgenome

We can load and inspect a BSgenome package

library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Subsetting a BSgenome

The main accessor is getSeq, and you can get data by sequence (e.g., entire chromosome or unplaced scaffold), or by passing in a GRanges object, to get just a region.

getSeq(BSgenome.Hsapiens.UCSC.hg19, "chr1")
##   249250621-letter "DNAString" instance
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
getSeq(BSgenome.Hsapiens.UCSC.hg19, GRanges("chr6:35310335-35395968"))
##   A DNAStringSet instance of length 1
##     width seq
## [1] 85634 GCGGAGCGTGTGACGCTGCGGCCGCCGCGGA...TGGGAATTAAATATTTAAGAGCTGACTGGAA

What is a GRanges object?