Authors: Laurent Gatto and Johannes Rainer
Compiled: Wed Jan 4 18:51:44 2017
In this vignette, we document various timings and benchmarks of the recent MSnbase development (aka MSnbase2), which focuses on on-disk data access (as opposed to in-memory). More details about the new implementation will be documented elsewhere.
As a benchmarking dataset, we are going to use a subset of a TMT 6-plex experiment acquired on an LTQ Orbitrap Velos, which is distributed with the msdata package:
library("msdata")
f <- msdata::proteomics(full.names = TRUE, pattern = "TMT_Erwinia")
basename(f)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
We need to load the MSnbase package and set the session-wide verbosity flag to
We first read the data using the original readMSData function, which generates an in-memory representation of the MS2-level raw data, and measure the time needed for this operation:
system.time(inmem <- readMSData(f, msLevel = 2, centroided = TRUE))
##    user  system elapsed
##   8.768   0.056   8.933
Next, we use the readMSData2 function to generate an on-disk representation of the same data:
system.time(ondisk <- readMSData2(f, msLevel = 2, centroided = TRUE))
##    user  system elapsed
##   1.756   0.060   1.815
Creating the on-disk experiment is considerably faster and scales to much bigger, multi-file data, in terms of both object creation time and object size (see next section). We must of course make sure that these two datasets are equivalent:
## [1] TRUE
To compare the size these two objects occupy in memory, we are going to use the object_size function from the pryr package, which accounts for the data (the spectra) in the assayData environment (as opposed to the object.size function from the utils package).
## 2.68 MB
## 115 kB
The difference is explained by the fact that for ondisk, the spectra are not created and stored in memory; they are accessed on disk when needed, for example for plotting:
plot(inmem[], full = TRUE)
plot(ondisk[], full = TRUE)
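Why object.size falls short for such objects can be illustrated with base R alone. The sketch below is not MSnbase code: `payload` and `holder` are hypothetical stand-ins for an object that keeps its data in an environment, as described above for assayData. Since utils::object.size does not count what an environment contains, the contents are measured separately here.

```r
# Hypothetical stand-in: a list that keeps its payload in an environment
payload <- new.env()
assign("spectrum1", numeric(1e5), envir = payload)
holder <- list(assayData = payload)

# object.size() sees the environment reference, not its contents
shallow <- as.numeric(object.size(holder))

# Size of the data actually stored in the environment
deep <- sum(vapply(
  ls(payload),
  function(nm) as.numeric(object.size(get(nm, envir = payload))),
  numeric(1)
))

c(shallow_bytes = shallow, deep_bytes = deep)
```

The shallow figure stays small however much data the environment holds, which is why a tool that follows environments is needed for the in-memory object.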
The drawback of the on-disk representation appears when the spectrum data actually has to be accessed. To compare access times, we are going to use the microbenchmark package and repeat each access 10 times, comparing access to all 451 spectra and to a single spectrum, in-memory (i.e. pre-loaded and constructed) and on-disk (i.e. on-the-fly access).
library("microbenchmark")
mb <- microbenchmark(spectra(inmem),
                     inmem[],
                     spectra(ondisk),
                     ondisk[],
                     times = 10)
mb
## Unit: microseconds
##             expr         min          lq         mean       median          uq         max neval cld
##   spectra(inmem)     113.602     227.513     379.4416     424.7175     532.951     558.728    10   a
##          inmem[]      61.139      90.675     106.8239     107.9255     114.403     171.798    10   a
##  spectra(ondisk) 1343787.959 1448279.208 1506550.8157 1497974.6970 1584449.017 1606786.029    10   c
##         ondisk[]  437905.225  538069.598  565057.5258  577406.1840  619962.862  640329.406    10   b
While it takes orders of magnitude more time to access the data on-the-fly than to retrieve a pre-generated spectrum, accessing all 451 spectra takes less than three times as long as accessing a single one (a mean of about 1.5 s versus 0.57 s), as most of the time is spent preparing the file for access, which is done only once.
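The share of time spent on file preparation can be estimated from the mean timings in the benchmark table. The sketch below assumes a simple linear cost model (a fixed overhead plus a constant cost per spectrum); that model is an assumption made for illustration, not something measured in this vignette.

```r
# Mean timings in microseconds, copied from the benchmark table above
t_all    <- 1506550.8157  # spectra(ondisk): all 451 spectra
t_single <- 565057.5258   # a single on-disk spectrum
n        <- 451

# Assumed model (not from the source): time = overhead + n * per_spectrum
per_spectrum <- (t_all - t_single) / (n - 1)
overhead     <- t_single - per_spectrum

round(c(per_spectrum_ms = per_spectrum / 1e3,
        overhead_s      = overhead / 1e6), 2)
```

Under this assumption, each additional spectrum costs roughly 2 ms, while the one-off overhead is a little over half a second.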
On-disk access performance will depend on the read throughput of the disk. A comparison of the data import of the above file from an internal solid state drive and from a USB3-connected hard disk showed only small differences for the readMSData2 call (1.07 vs 1.36 seconds), while no differences were observed for accessing individual or all spectra. Thus, for this particular setup, performance was about the same for SSD and HDD. This might however not apply to settings in which data import is performed in parallel from multiple files.
On-disk data access does not prohibit interactive usage such as plotting: it takes about half a second, and it is a relatively rare operation compared to subsetting and filtering, which are faster for on-disk data:
i <- sample(length(inmem), 100)
system.time(inmem[i])
##    user  system elapsed
##   0.288   0.000   0.290
##    user  system elapsed
##   0.044   0.000   0.046
Operations on the spectra data, such as peak picking, smoothing and cleaning, are queued and only applied when the data is accessed, to minimise file access overhead. Finally, specific operations such as quantitation (see next section) are optimised for speed.
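The idea of deferring operations until the data is accessed can be sketched in a few lines of base R. This is an illustration of the general technique only, not the MSnbase implementation: `make_lazy`, `add_step` and `get_data` are hypothetical names, and `reader` stands in for reading from a file.

```r
# Hypothetical lazy container: a reader function plus a queue of pending steps
make_lazy <- function(reader) list(reader = reader, queue = list())

# Queue a processing step without touching the data
add_step <- function(x, fun) {
  x$queue <- c(x$queue, list(fun))
  x
}

# Read once, then apply all queued steps in order
get_data <- function(x) {
  vals <- x$reader()
  for (step in x$queue) vals <- step(vals)
  vals
}

reads <- 0
reader <- function() {
  reads <<- reads + 1          # count how often the "file" is read
  c(0, 5, 0, 12, 3)            # made-up intensities
}

x <- make_lazy(reader)
x <- add_step(x, function(v) v[v > 0])    # "cleaning": drop zero intensities
x <- add_step(x, function(v) v / max(v))  # "normalising": scale to the maximum
reads_before <- reads                     # still 0: nothing has been read yet

res <- get_data(x)                        # one read, both steps applied
```

Queuing the steps costs nothing, and the data source is read a single time however many steps have accumulated.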
Below, we perform TMT 6-plex reporter ions quantitation on the first 100 spectra and verify that the results are identical (ignoring feature names).
system.time(eim <- quantify(inmem[1:100], reporters = TMT6, method = "max"))
##    user  system elapsed
##   0.328   0.056   4.185
system.time(eod <- quantify(ondisk[1:100], reporters = TMT6, method = "max"))
##    user  system elapsed
##   0.388   0.024   0.415
all.equal(eim, eod, check.attributes = FALSE)
## [1] TRUE
This document focuses on speed and size improvements of the new on-disk MSnExp representation. The extent of these improvements will substantially increase for larger data.
For general functionality of the on-disk MSnExp data class and of MSnbase in general, see the other vignettes available with
vignette(package = "MSnbase")