Bioconductor Newsletter

posted by Valerie Obenchain, July 2014

Contents

Software Infrastructure

BSgenome packages 2bit conversion

Current use cases for BSgenome packages emphasize quering many smaller regions across the entire genome, e.g., assembling many transcript sequences from the underlying coding sequence genomic ranges. To efficiently enable this use case, many BSgenome data packages are now using the UCSC 2bit format to store the sequences on disk.

One limitation of the 2bit format is that it does not support genomes that contain letters other than As, Cs, Gs, Ts, or Ns. These genomes (e.g. hg17, hg18, GRCh38, Ecoli, TAIR.04232008, and TAIR.TAIR9) use the previous .rda storage format, with fast whole-chromosome access but somewhat slower but still very usable random access.

The 2bit format is currently available in devel, but will be part of the next release of Bioconductor. Hervé and Martin worked on this project with thanks to Michael for supporting the 2bit format in rtracklayer.

S4 Vectors package

In April Hervé started to split out the low level functions from IRanges and move them to the new S4Vectors package. IRanges had grown to 90 classes, 157 generics and 844 methods and was becoming difficult to maintain. The plan is to move code that does not involve ranges, e.g., the Vector and List virtual classes, and DataFrame, Rle and Hits concrete classes. This is a work in progress and estimated to be about 30% done.

pileups

Nate has been working on a pileup() function which computes pileup statistics in BAM files. Design goals were versatile record filtering and flexible presentation of results for downstream analyses. Filtering is achieved through the ScanBamParam and PileupParam objects; output is a data.frame with variable columns based on the filtering applied.

pileup() is available in the Rsamtools package in devel. Other Bioconductor packages that offer pileup-like functions include gmapR (bam_tally), deepSNV (bam2R) and Rsubread (featureCounts). All have slightly different input requirements and output formats, see the man pages for details.

Git-SVN Bridge

The Bioconductor project uses a Subversion (SVN) source control system. SVN is effective for version control but does not offer social coding features such as GitHub’s issue tracking, pull requests or ease in granting permissions.

In response to popular request, Dan created the Git-SVN Bridge to allow Github repositories to sync with the Bioconductor SVN repository. Commits made in SVN are propagated to GitHub and vice versa. The service has been well received, with 73 bridges created as of June 2014.

To create a bridge see the Git-SVN Bridge HOWTO.

Bioconductor Amazon Machine Image (AMI)

The Bioconductor AMI has been overhauled and is now compatible with StarCluster. These enhancements make it straightforward to spin up a cluster with nodes that communicate via MPI, SSH or Sun Grid Engine. Details available at the AMI page.

The process of creating the AMI has been automated using Vagrant and Chef. Our scripts are publicly available, and can be used to provision an AMI, or a virtual machine (using Virtualbox or VMware) or even a physical machine.

Education and Outreach

Web site re-design

The Bioconductor home page now has what we hope is a more intuitive and user friendly interface. The ‘Install’, ‘Learn’, ‘Use’ and ‘Develop’ fields organize resources for the novice through the advanced developer.

Have a look at the new design.

biocViews

Sonali continued her work on biocViews this quarter. A new function, recommendBiocViews() is available in the biocViews package. The function looks at words in the DESCRIPTION, man pages and vignette, and suggests possible terms for use in the biocViews: field of package DESCRIPTION files. The function also identifies invalid biocViews terms (e.g., mis-spellings) present in the package DESCRIPTION file.

recommendBiocViews() has been incorporated into the Single Package Builder that checks new package submissions; new package authors are encouraged to run it before submitting a package. Remember that biocViews are case-sensitive and branch-sensitive (i.e., terms for a Software package must come from the Software branch of biocViews).

Sonali distributed recommended views for all devel software packages to the mailing list in June. For the complete list see this post.

Instructional videos

We are looking into short, single-topic videos as an interactive complement to traditional vignettes and workflows.

The plan is to create a series of 5 minute videos that encapsulate a HOWTO skill or overview a project aspect. You can tour the website with Dan or do a ‘quick start’ with Martin’s overview of key packages and classes. Watch Marc slice and dice an AnnotationDb object or read BAM and VCF files with Sonali and Valerie.

Sonali and Martin have led this effort and plan to unveil the first videos at BioC 2014 in Boston.

Build System

Branching the experimental data Subversion repository

Historically, only the Subversion repository for the Bioconductor software packages had a distinct branch for each release. Subversion repositories for experimental data and annotation packages had a trunk with no branches.

Starting with the Spring 2014 release a branch was created in the Subversion repository for the experimental data. The motivation was to allow software and experimental data packages to evolve together in release and devel build environments. It was often the case that updates to a software package broke the companion experimental data. Changes made to the experimental data were committed to trunk and propagated to both release and devel builds creating incompatibilities in one place or the other.

A consequence of creating the new branch is the need to bump ‘y’ of the ‘x.y.z’ version numbering scheme at release time (as we do for software). The Annotations have not changed; they are not under Subversion, do not go through automated builds and do not have a version policy.

Changes in how the experimental data Subversion repository is manged are relevant for developers only. The public repositories remain the same with a separate repository for each Bioconductor version. There is no visible change for users accessing the public repositories.

New Mac OS X Mavericks build machines

An R 3.1.0 binary for Mac OS X 10.9 (Mavericks) is now available from R Core. This R as been built with Xcode 5 to leverage new compilers and functionalities in Mavericks not available in earlier OS X versions.

To provide compatible Bioconductor package binaries we needed new build machines. Dan has configured two new Mavericks, one in release (morelia) and one in devel (oaxaca).

The introduction of Xcode and the clang compiler resulted in new errors for packages with C and C++ code. Nate and Dan spent many hours troubleshooting with package authors and came up with a list of common problems and solutions. Lessons learned were distilled into the C++/Mavericks Best Practices document.

Quarterly Project Statistics

There were 86953 downloads of Bioconductor software packages over the past quarter (April - June). During this time 41 new software packages were accepted. A full summary of package download stats is available here.

The web site saw approximately 119,000 visitors (26% increase from the previous year) from 180 countries, with the US, China, United Kingdom, Germany, and Canada at the head of the pack.

Resources, Courses and Conferences

Data Analysis for Genome Biology (CSAMA)

This one week intensive course is offered each year in Brixen-Bressanone, Italy and focuses on statistical and computational analysis of large-scale biological experiments. The course is intended for researchers with basic familiarity with the experimental technologies and who are interested in developing their own advanced data analyses.

Topics this year included RNASeq differential expression, variant calling and ChIP-Seq as well as the essentials of statistical testing, machine learning, visualization and of course using R. Michael Lawrence presented a Scalable Genomics lab which covered topics of limiting resource consumption, using iteration when appropriate, and scaling genome graphics. Much of the material is based on a manuscript currently in press at the journal Statistical Science.

Materials from the June 2014 course are available on the web.

Community Resources

New Community Resources links include the book and lab from MOOC: PH525x Data Analysis for Genomics. This online course was offered in April 2014 by Rafael Irizarry and Michael Love. Course goals were to enable students to analyze and interpret data generated by modern genomics technology, specifically microarray and next generation sequencing. Applications included gene expression, association of genomic variants to disease, and measuring epigenetic marks.

Also on the Community page are links to YouTube videos made by community members, tips on getting started with R/Bioconductor by Thomas Girke, analysis of 23andme data by Vince Buffalo and Sean Davis’ R/Bioconductor blog.

BioC 2014

The annual meeting is in Boston this year (July 30 - August 1). See the web site for a list of speakers and workshops.

Please send comments or questions to Valerie at vobencha@fhcrc.org.