The goal of megadepth is to provide an R interface to the command line tool Megadepth for BigWig and BAM related utilities created by Christopher Wilks (Wilks, Zhang, Collado-Torres, and Langmead, 2020). This R package enables fast processing of BigWig files on downstream packages such as dasper and recount3. The Megadepth software also provides utilities for processing BAM files and extracting coverage information from them.

Here is an illustration on how fast megadepth is compared to other tools for processing local and remote BigWig files.

Timing results. Timing comparison for processing 1,000 genomic regions on a bigWig file that is available on the local disk or on a remote resource. We compared megadepth against rtracklayer and pyBigWig. megadepth is typically faster that these other software solutions for computing the mean coverage across a set of 1,000 input genomic regions. Check <https://github.com/LieberInstitute/megadepth/tree/master/analysis> for more details.

Figure 1: Timing results
Timing comparison for processing 1,000 genomic regions on a bigWig file that is available on the local disk or on a remote resource. We compared megadepth against rtracklayer and pyBigWig. megadepth is typically faster that these other software solutions for computing the mean coverage across a set of 1,000 input genomic regions. Check https://github.com/LieberInstitute/megadepth/tree/master/analysis for more details.

Throughout the documentation we use a capital M to refer to the software by Christopher Wilks and a lower case m to refer to this R/Bioconductor package.

1 Basics

1.1 Install megadepth

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. megadepth is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install megadepth by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("megadepth")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

1.2 Required knowledge

megadepth is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data and high-throughput sequencing data in general. You might benefit from being familiar with the BigWig file format and the rtracklayer for importing those files into R as well as exporting BED files (Lawrence, Gentleman, and Carey, 2009). If you are working with annoation files, GenomicFeatures and GenomicRanges will also be useful to you.

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

1.3 Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the megadepth tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

1.4 Citing megadepth

We hope that megadepth will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("megadepth")
#> 
#> Zhang D, Collado-Torres L (2021). _megadepth: BigWig and BAM related
#> utilities_. doi: 10.18129/B9.bioc.megadepth (URL:
#> https://doi.org/10.18129/B9.bioc.megadepth),
#> https://github.com/LieberInstitute/megadepth - R package version 1.0.3,
#> <URL: http://www.bioconductor.org/packages/megadepth>.
#> 
#> Wilks C, Zhang D, Collado-Torres L, Langmead B (2020). "megadepth:
#> BigWig and BAM related utilities." _bioRxiv_. doi: 10.1101/TODO (URL:
#> https://doi.org/10.1101/TODO), <URL:
#> https://www.biorxiv.org/content/10.1101/TODO>.
#> 
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.

2 Quick start

To get started, we need to load the megadepth package into our R session. This will load all the required dependencies.

library("megadepth")

Once we have the R package loaded, we need to install the Megadepth software. We can do so with install_megadepth(), which downloads a binary for your OS (Linux, Windows or macOS) 1 Please check Megadepth for instructions on how to compile the software from source if the binary version doesn’t work for you.. We can then use with an example BigWig file to compute the coverage at a set of regions.

## Install the latest version of Megadepth
install_megadepth(force = TRUE)
#> The latest megadepth version is 1.1.0b
#> megadepth has been installed to /home/biocbuild/bin

Next, we might want to use megadepth to quantify the coverage at a set of regions of the genome of interest to us. Here we will use two example files that are include in the package for illustration and testing purposes. One of them is a bigWig file that contains the base-pair coverage information for a sample of interest and the second one is BED file which contains the genomic region coordinates of interest. So we first locate them.

## Next, we locate the example BigWig and annotation files
example_bw <- system.file("tests", "test.bam.all.bw",
    package = "megadepth", mustWork = TRUE
)
annotation_file <- system.file("tests", "testbw2.bed",
    package = "megadepth", mustWork = TRUE
)

## Where are they?
example_bw
#> [1] "/tmp/Rtmpdknf3e/Rinst23ca20f7b4f8/megadepth/tests/test.bam.all.bw"
annotation_file
#> [1] "/tmp/Rtmpdknf3e/Rinst23ca20f7b4f8/megadepth/tests/testbw2.bed"

Once we have located the example files we can proceed to calculating the base-pair coverage for our genomic regions of interest. There are several ways to do this with megadepth, but here we use the user-friendly function get_coverage(). This function will perform a given operation op on the bigWig file for a given set of regions of interest (annotation). One of those operations is to compute the mean base-pair coverage for each input region. This is what we’ll do with our example bigWig file.

## We can then use megadepth to compute the coverage
bw_cov <- get_coverage(
    example_bw,
    op = "mean",
    annotation = annotation_file
)
bw_cov
#> GRanges object with 4 ranges and 1 metadata column:
#>         seqnames          ranges strand |     score
#>            <Rle>       <IRanges>  <Rle> | <numeric>
#>   [1]      chr10            0-10      * |      0.00
#>   [2]      chr10 8756697-8756762      * |     15.85
#>   [3]      chr10 4359156-4359188      * |      3.00
#>   [4] GL000219.1   168500-168620      * |      1.26
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

get_coverage() returns an object that is familiar to GenomicRanges users, that is, a GRanges object that can be used with other Bioconductor software packages (Huber, Carey, Gentleman, Anders, Carlson, Carvalho, Bravo, Davis, Gatto, Girke, Gottardo, Hahne, Hansen, Irizarry, Lawrence, Love, MacDonald, Obenchain, Oleś, Pagès, Reyes, Shannon, Smyth, Tenenbaum, Waldron, and Morgan, 2015).

This example is just the tip of the iceberg, as Megadepth and thus megadepth can do a lot of useful processing operations on BAM and bigWig files.

3 Users guide

3.1 Command interface

Megadepth is very powerful and can do a lot of different things. The R/Bioconductor package provides two functions for interfacing with Megadepth, megadepth_cmd() and megadepth_shell(). For the first one, megadepth_cmd(), you need to know the actual command syntax you want to use and format it accordingly. If you are more comfortable with R functions, megadepth_shell() uses cmdfun to power this interface and capture the standard output stream into R.

To make it easier to use, megadepth includes functions that simplify the number of arguments, read in the output files, and converts them into R/Bioconductor friendly objects, such as get_coverage() illustrated previously in the quick start section.

We hope that you’ll find megadepth and Megadepth useful for your work. If you are interested in checking how fast megadepth is, check out the speed analysis comparison against other tools. Note that the size of the files used and the number of genomic regions queried will affect the speed comparisons.

## R-like interface
## that captures the standard output into R
head(megadepth_shell(help = TRUE))
#> [1] "megadepth 1.1.0b"                 ""                                
#> [3] "BAM and BigWig utility."          ""                                
#> [5] "Usage:"                           "  megadepth <bam|bw|-> [options]"

## Command-like interface
megadepth_cmd("--help")
#> megadepth 1.1.0b
#>  
#>  BAM and BigWig utility.
#>  
#>  Usage:
#>    megadepth <bam|bw|-> [options]
#>  
#>  Options:
#>    -h --help                Show this screen.
#>    --version                Show version.
#>    --threads                # of threads to do: BAM decompression OR compute sums over multiple BigWigs in parallel
#>                              if the 2nd is intended then a TXT file listing the paths to the BigWigs to process in parallel
#>                              should be passed in as the main input file instead of a single BigWig file (EXPERIMENTAL).
#>    --prefix                 String to use to prefix all output files.
#>    --no-auc-stdout          Force all AUC(s) to be written to <prefix>.auc.tsv rather than STDOUT
#>    --no-annotation-stdout   Force summarized annotation regions to be written to <prefix>.annotation.tsv rather than STDOUT
#>    --no-coverage-stdout     Force covered regions to be written to <prefix>.coverage.tsv rather than STDOUT
#>    --keep-order             Output annotation coverage in the order chromosomes appear in the BAM/BigWig file
#>                             The default is to output annotation coverage in the order chromosomes appear in the annotation BED file.
#>                             This is only applicable if --annotation is used for either BAM or BigWig input.
#>  
#>  BigWig Input:
#>  Extract regions and their counts from a BigWig outputting BED format if a BigWig file is detected as input (exclusive of the other BAM modes):
#>                                            Extracts all reads from the passed in BigWig and output as BED format.
#>                                             This will also report the AUC over the annotated regions to STDOUT.
#>                                             If only the name of the BigWig file is passed in with no other args, it will *only* report total AUC to STDOUT.
#>    --annotation <bed>                      Only output the regions in this BED applying the argument to --op to them.
#>    --op <sum[default], mean, min, max>     Statistic to run on the intervals provided by --annotation
#>    --sums-only                             Discard coordinates from output of summarized regions
#>    --bwbuffer <1GB[default]>               Size of buffer for reading BigWig fil