Using Bioconductor for Sequence Data

Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM, gff, bed, and wig files, among others. Packages support common and advanced sequence manipulation operations such as trimming, transformation, and alignment. Domain-specific analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other approaches. Bioconductor includes an interface to the Sequence Read Archive (via the SRAdb package).

Sample Workflow

This is a simple work flow for single-end RNA-seq looking at differential representation of known genes. The data is in a directory dataDir, as implied by the commands below (<...> indicates information to be provide by the user).

dataDir <- <...>
fastqDir <- file.path(dataDir, "fastq")
bamDir <- file.path(dataDir, "bam")
outputDir <- file.path(dataDir, "output")

Sequencing is done outside R; at the start of the work flow we have access to fastq files, in the fastqDir directory. A first step after sequencing might use ShortRead to produce a quality assessment report. Down-sample fastq files if they are big.

fls <- list.files(fastqDir, "fastq$", full=TRUE)
names(fls) <- sub(".fastq", "", basename(fls))
## use FastqSampler if fastq files are large
qas <- lapply(seq_along(fls),
              function(i, fls) qa(readFastq(fls[i]), names(fls)[i]),
qa <-, qas)
save(qa, file=file.path(outputDir, "qa.rda")

At this stage, one might use tools from ShortRead or Biostrings to remediate low quality reads (e.g., trim low quality tails or remove artifacts of sample preparation).

Alignment is usually done outside R. Output is a BAM file, one per sample. Biostrings::matchPDict (in some circumsances) and the Rsubread package might also be used.

Differential representation typically proceeds from a collection of known gene locations, including information about gene structure. Known gene information can come from a variety of sources, conveniently from UCSC or biomaRt using the GenomicFeatures package. These can be saved to disk as sqlite data bases for future use or to be shared with lab mates. With R >= 2.14.0, there are semi-annual snapshots available, as used here.

txdb <- Scerevisiae_UCSC_sacCer2_ensGene_TxDb
gnModel <- exonsBy(txdb, "gene")

There are different ways to count, some of which are implemented in GenomicRanges::summarizeOverlaps (available with R >= 2.15). Here is a simple function that counts any kind of overlap between known genes as a 'hit', and that discards a read hitting more than one gene. This is not always satisfactory, but illustrates the flexibility available.

bamFls <- list.files(bamDir, "bam$", full=TRUE)
names(bamFls) <- sub("\\..*", "", basename(bamFls))
counter <- function(fl, gnModel)
    aln <- readGappedAlignments(fl)
    strand(aln) <- "*" # for strand-blind sample prep protocol
    hits <- countOverlaps(aln, gnModel)
    counts <- countOverlaps(gnModel, aln[hits==1])
    names(counts) <- names(gnModel)
counts <- sapply(bamFls, counter, gnModel)
save(counts, file=file.path(outputDir, "counts.rda"))

edgeR and DESeq are mature packages that take a sophisticated statistical approach to the analysis of differential representation; DEXSeq is a recent package (available in R >= 2.14) to identify differential exon use. An edgeR work flow is


## identify treatment groups
grp <- factor(<...>)

## create data structures
dge <- DGEList(counts, group=grp)
dge <- calcNormFactors(dge)

## filter uniformative genes
m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size)
ridx <- rowSums(m > 1) >= 2
dge <- dge[ridx,]

## comparison between groups
design <- model.matrix( ~ grp )
dge <- estimateCommonDisp(dge, design)
fit <- glmFit(dge, design, dispersion=dge$common.dispersion)
lrTest <- glmLRT(dge, fit, coef=2)
tt <- topTags(lrTest, Inf)
save(tt, file=file.path(dataDir, "tt.rda"))

Two sanity checks along the way are that the treatment groups are relatively distinct in an MDS (multi-dimensional scaling) plot, and that the differentially representation tags really are.

       function(x, grp) tapply(counts[x,], grp, mean), 

Some annotation information is available from the org* model organism packages; the biomaRt package is an alternative. Here we get the sgd ids from the top table, and use these to add gene name annotations

ttids <- rownames(tt$table)
sgd2gnname <- mget(ttids, org.Sc.sgdGENENAME, ifnotfound=NA)
ttAnno <- cbind(tt$table, genename=unlist(sgd2gnname))

A next step might involve gene set analyses, as in the goseq package.

[ Back to top ]

Installation and Use

Follow installation instructions to start using these packages. To install the ShortRead package and all of its dependencies, evaluate the commands

> source("")
> biocLite("ShortRead")

Package installation is required only once per R installation. View a full list of available packages.

To use the ShortRead package, evaluate the command

> library("ShortRead")

This instruction is required once in each R session.

[ Back to top ]

Exploring Package Content

Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like

> help(package="ShortRead")
> ?readFastq

to obtain an overview of help on the ShortRead package, and the readFastq function, and

> browseVignettes(package="ShortRead")

to view vignettes (providing a more comprehensive introduction to package functionality; the edgeR vignette is an excellent example of this) in the ShortRead package. Use

> help.start()

to open a web page containing comprehensive help resources.

[ Back to top ]

Sequencing Resources

The following packages illustrate the diversity of functionality available; all are in the release version of Bioconductor.

Additional packages are classified in the biocViews hierarchy, for instance under Software : AssayTechnologies : HighThroughputSequencing.

[ Back to top ]

Fred Hutchinson Cancer Research Center