
Google Summer of Code Ideas List for 2013

This page is the main information point for Bioconductor's participation in the Google Summer of Code program this year (2013).

Overview

Each selected student (mentee) will be paid USD$5000 to work on a Bioconductor project for 3 months during the summer.

Students should look at the list of projects and see whether any interest them. To express interest, email the project mentors and describe any prior experience.

Students with ideas for Bioconductor projects not listed below are encouraged to email any of the mentors listed below with project ideas.

Students will submit project applications directly to Google.

Google will award a certain number of student slots to the Bioconductor project.

The Bioconductor administrators and mentors will rank projects in order of importance to the project, and the top projects will be funded.

Any selected students will be expected to register with the bioconductor and bioc-devel mailing lists.

There is a timeline posted at Google explaining how this works. Students are encouraged to review it and make sure that they can commit to the schedule. There is also a FAQ for questions that are not addressed here.

Here are our suggested ideas:

ExperimentHub project

Background/Motivation: As very large genomic data sets become more and more common, computational biologists are spending inordinate time transforming data from the format of the original resource to a format amenable to computation in their programming language of choice. The R / Bioconductor community needs programmatic access to cloud-based experimental data resources that can be readily incorporated into their own work flows.

Goal

AnnotationHub and its supporting packages are primed to support such a project. AnnotationHub provides infrastructure to make well-curated resources available to R software clients, but it needs the addition of a web interface to allow addition of user-supplied resources, including transformation of data into formats amenable to direct use by R clients.

Specific Aim

Work with us to create a web accessible interface that does the following:

Allow the user to add large genomic resources and associated metadata to a NoSQL back-end database.
Transform genomic resources into R / Bioconductor data structures to simplify end-user access via the existing AnnotationHub infrastructure. For instance, the user might select from a list of 'recipes' (functions that transform raw data into a serialized R object) and/or contribute their own recipes for data transformation.

In addition, the project might:

Allow users to upload literate programming documents illustrating the use of their data; the web application would evaluate the documents' code and report any errors.
Implement social networking concepts such as tags and rating systems to help guide users to the most useful resources.
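
To make the idea of a 'recipe' concrete, here is a minimal sketch using only base R. The function name tsvToRdsRecipe, its arguments, and the tab-delimited input are assumptions made for illustration; they are not part of the AnnotationHub API, and a real recipe would likely produce a richer Bioconductor object.

```r
# Hypothetical recipe: turn a raw tab-delimited file into a serialized R object.
# The function name and arguments are illustrative, not AnnotationHub API.
tsvToRdsRecipe <- function(inputFile, outputFile) {
  # Read the raw resource in the format supplied by the data provider
  raw <- read.delim(inputFile, stringsAsFactors = FALSE)
  # Serialize the parsed object so R clients can load it directly
  saveRDS(raw, outputFile)
  invisible(outputFile)
}

# Usage with a small made-up input file
inputFile <- tempfile(fileext = ".tsv")
writeLines(c("chrom\tstart\tend", "chr1\t100\t200", "chr2\t300\t400"), inputFile)
outputFile <- tsvToRdsRecipe(inputFile, tempfile(fileext = ".rds"))
resource <- readRDS(outputFile)
```

A client would then only need readRDS() (or an equivalent hub accessor) to obtain the object, without repeating the parsing step.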

Skills required

Familiarity with R and with AJAX Web 2.0 programming.

Test

Using the R packages Rook and rmongodb, write a web form that asks the user for their name and age and stores this information in a MongoDB collection. A second page asks the user for an age and then displays all records for people of that age or older. This should be sent to us as an R package so that all we have to do (once we've installed MongoDB) is install the package and run:

library(testPkg)
run()

...and we'll see the desired functionality.

Mentors

Dan Tenenbaum dtenenba@fhcrc.org, Marc Carlson mcarlson@fhcrc.org

Backup Mentor

Martin Morgan mtmorgan@fhcrc.org

Shiny Bioconductor Objects Project

Background/Motivation

The Shiny package allows for easy creation of interactive web graphics from R objects. Bioconductor packages have many objects that represent biological data or results. For each of these Bioconductor objects, there exists a typical set of visualizations to help users explore their data. Normally, these visuals are replotted several times until certain parameters are tweaked to show the image in a way that conveys a specific insight. This project pairs these standard Bioconductor objects with more user-friendly Shiny visualizations via new display() methods.

Tasks

Subscribe to the bioconductor and bioc-devel mailing lists.
Using the mailing lists (which are searchable here) and the documentation within the following packages, familiarize yourself with the following Bioconductor objects: Biobase::ExpressionSet, GenomicRanges::SummarizedExperiment, GenomicRanges::GRanges, GenomicRanges::GRangesList.

You will also explore the following topics:

How to annotate a set of IDs (see the vignette here).
How to use ggbio to draw a generic display from the GRanges or GRangesList objects.

Create display() methods for each of the four object types mentioned above.
These methods should use Shiny to display a multi-tabbed view of each object, where each tab offers an alternate view. Each method should let the caller view a table of annotations in one tab and visualize the data in another tab using either a heat map or a Gviz plot (as appropriate).

Make sure that the R code for these four very similar methods is written so that shared logic is not repeated.
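
One way to avoid repetition across the methods is to route every method through a single shared helper. The sketch below is only an illustration under stated assumptions: the classes ToyExpressionSet and ToyRanges and the helper buildTabs are hypothetical stand-ins, not Bioconductor classes or functions, and the Shiny rendering step is deliberately left out.

```r
library(methods)

# Toy stand-ins for real Bioconductor classes (hypothetical, for illustration only)
setClass("ToyExpressionSet",
         representation(annotation = "data.frame", values = "matrix"))
setClass("ToyRanges",
         representation(annotation = "data.frame", starts = "integer"))

# Shared helper: every display() method describes its tabs through this function,
# so the tab layout is defined in exactly one place
buildTabs <- function(annotation, plotFun) {
  list(Annotations = annotation, Plot = plotFun)
}

setGeneric("display", function(object, ...) standardGeneric("display"))

# Each method supplies only what differs: the data and the plotting closure
setMethod("display", "ToyExpressionSet", function(object, ...) {
  buildTabs(object@annotation, function() heatmap(object@values))
})

setMethod("display", "ToyRanges", function(object, ...) {
  buildTabs(object@annotation, function() plot(object@starts))
})

# Usage with made-up data
eset <- new("ToyExpressionSet",
            annotation = data.frame(id = c("g1", "g2")),
            values = matrix(1:4, nrow = 2))
tabs <- display(eset)
```

In a real implementation the list returned by the helper would be handed to Shiny to build the tabbed interface; here it only records what each tab would contain.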

Integrate these methods into the appropriate Bioconductor packages along with proper documentation. Which package is appropriate will be determined later as the project develops.

Alternate Tasks

Most of the tasks above describe how you could approach making a display() method for a fairly standard Bioconductor object. These four objects are mentioned because we consider them the most important to do first. But consider implementing a useful display() method for one or more other objects. What should it look like, and why would you choose it? Choose an object of personal interest to you and make a display() method for it as well.

Skills required

Familiarity with R. Understanding of basic computational biology, or a willingness and ability to learn about it as needed.

Test

Find the example on the manual page for Biobase::ExpressionSet and run it. Then plot the relevant data contained in the generated ExpressionSet object as a heatmap, save the output, and send the plot to the mentors. BONUS: make sure that all the labels in your plot are fully visible.

Mentors

Marc Carlson mcarlson@fhcrc.org, Michael Lawrence lawrence.michael@gene.com

Backup Mentor

Martin Morgan mtmorgan@fhcrc.org

BiocParallel / BatchJobs integration

Background / Motivation

High-throughput sequencing generates data sets consisting of hundreds of millions of sequence reads per sample. As with any large data, timely processing depends on parallel computing. The Bioconductor project has developed the BiocParallel package, an abstraction around several parallel implementations in R. The API is tailored to typical use cases in biological data analysis and integrates with existing Bioconductor data structures. Another package, BatchJobs, executes R functions as scheduled cluster jobs, through an abstraction that has been implemented for several popular schedulers, including LSF, PBS and SGE.

As sequencing data pipelines are typically executed on managed clusters, there is a need for BiocParallel to interact with cluster schedulers. We aim to add a new backend to BiocParallel that delegates to BatchJobs for this interaction.

Tasks

Subscribe to the mailing lists for bioconductor and bioc-devel.
Install the BatchJobs package from CRAN.
Explore the functionality of BatchJobs. If you have no access to a managed cluster, there is a multi-core backend that works on any multi-core machine running Mac OS X or Linux.
Check out the source code for BiocParallel from GitHub.
Familiarize yourself with the BiocParallel API and the existing implementations.
Propose a design for a new BiocParallel backend relying on BatchJobs.
After discussions with mentors, implement the backend.
Test the functionality through a collaboration with the authors of the HTSeqGenie Bioconductor package, a typical sequencing pipeline.

Skills required

Proficiency in the R programming language.
Familiarity with parallel computing; experience with at least one job scheduler is a plus.
Exposure to biological/sequencing use cases is a plus.

Tests

Come up with a simple, embarrassingly parallel problem and solve it using both serial and parallel approaches in R.

Mentors

Michael Lawrence lawrence.michael@gene.com
Backup: Martin Morgan mtmorgan@fhcrc.org
