Package: MsBackendSql
Authors: Johannes Rainer [aut, cre] (https://orcid.org/0000-0002-6977-7147), Chong Tang [ctb], Laurent Gatto [ctb] (https://orcid.org/0000-0002-1520-2268)
Compiled: Thu Mar 7 18:40:41 2024

1 Introduction

The Spectra Bioconductor package provides a flexible and expandable infrastructure for Mass Spectrometry (MS) data. The package supports interchangeable use of different backends that provide additional file support or different ways to store and represent MS data. The MsBackendSql package provides backends to store data from whole MS experiments in SQL databases. The data in such databases can be easily (and efficiently) accessed using Spectra objects that use the MsBackendSql class as an interface to the data in the database. Such Spectra objects have a minimal memory footprint and hence allow analysis of very large data sets even on computers with limited hardware capabilities. For certain operations, the performance of this data representation is superior to that of other low-memory (on-disk) data representations such as Spectra’s MsBackendMzR backend. Finally, the MsBackendSql supports also remote data access to e.g. a central database server hosting several large MS data sets.

2 Installation

The package can be installed with the BiocManager package. To install BiocManager use install.packages("BiocManager") and, after that, BiocManager::install("MsBackendSql") to install this package.

3 Creating and using MsBackendSql SQL databases

MsBackendSql SQL databases can be created either by importing (raw) MS data from MS data files using the createMsBackendSqlDatabase or using the backendInitialize function by providing in addition to the database connection also the full MS data to import as a DataFrame. In the first example we use the createMsBackendSqlDatabase function which takes a connection to an (empty) database and the names of the files from which the data should be imported as input parameters creates all necessary database tables and stores the full data into the database. Below we create an empty SQLite database (in a temporary file) and fill that with MS data from two mzML files (from the r Biocpkg("msdata") package).

library(RSQLite)

dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)

library(MsBackendSql)
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
createMsBackendSqlDatabase(con, fls)
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

By default the m/z and intensity values are stored as BLOB data types in the database. This has advantages on the performance to extract peaks data from the database but would for example not allow to filter peaks by m/z values directly in the database. As an alternative it is also possible to the individual m/z and intensity values in separate rows of the database table. This long table format results however in considerably larger databases (with potentially poorer performance). Note also that the code and backend is optimized for MySQL/MariaDB databases by taking advantage of table partitioning and specialized table storage options. Any other SQL database server is however also supported (also portable, self-contained SQLite databases).

The MsBackendSql package provides two backends to interact with such databases: the (default) MsBackendSql class and the MsBackendOfflineSql, that inherits all properties and functions from the former, but which does not store the connection to the database within the object but connects (and disconnects) to (and from) the database in each function call. This allows to use the latter also for parallel processing setups or to save/load the object (e.g. using save and saveRDS). Thus, for most applications the MsBackendOfflineSql might be used as the preferred backend to SQL databases.

To access the data in the database we create below a Spectra object providing the connection to the database in the constructor call and specifying to use the MsBackendSql as backend using the source parameter.

sps <- Spectra(con, source = MsBackendSql())
sps
## MSn data (Spectra) with 1862 spectra in a MsBackendSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/Rtmpqb6ORv/file3b2c309959876

As an alternative, the MsBackendOfflineSql backend could also be used to interface with MS data in a SQL database. In contrast to the MsBackendSql, the MsBackendOfflineSql does not contain an active (open) connection to the database and hence supports serializing (saving) the object to disk using e.g. the save function, or parallel processing (if supported by the database system). Thus, for most use cases the MsBackendOfflineSql should be used instead of the MsBackendSql. See further below for more information on the MsBackendOfflineSql.

Spectra objects allow also to change the backend to any other backend (extending MsBackend) using the setBackend function. Below we use this function to first load all data into memory by changing from the MsBackendSql to a MsBackendMemory.

sps_mem <- setBackend(sps, MsBackendMemory())
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
sps_mem
## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 34 more variables/columns.
## Processing:
##  Switch backend from MsBackendSql to MsBackendMemory [Thu Mar  7 18:40:49 2024]

With this function it is also possible to change from any backend to a MsBackendSql in which case a new database is created and all data from the originating backend is stored in this database. To change the backend to an MsBackendOfflineSql we need to provide the connection information to the SQL database as additional parameters. These parameters are the same that need to be passed to a dbConnect call to establish the connection to the database. These parameters include the database driver (parameter drv), the database name and eventually the user name, host etc (see ?dbConnect for more information). In the simple example below we store the data into a SQLite database and thus only need to provide the database name, which corresponds SQLite database file. In our example we store the data into a temporary file.

sps2 <- setBackend(sps_mem, MsBackendOfflineSql(), drv = SQLite(),
                   dbname = tempfile())
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.logical(col, optional = optional): Direct call of
## 'as.data.frame.logical()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead

## Warning in as.data.frame.numeric(col, optional = optional): Direct call of
## 'as.data.frame.numeric()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
## Warning in as.data.frame.integer(col, optional = optional): Direct call of
## 'as.data.frame.integer()' is deprecated.  Use 'as.data.frame.vector()' or
## 'as.data.frame()' instead
sps2
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/Rtmpqb6ORv/file3b2c3020a2254e
## Processing:
##  Switch backend from MsBackendSql to MsBackendMemory [Thu Mar  7 18:40:49 2024]
##  Switch backend from MsBackendMemory to MsBackendOfflineSql [Thu Mar  7 18:40:50 2024]

Similar to any other Spectra object we can retrieve the available spectra variables using the spectraVariables function.

spectraVariables(sps)
##  [1] "msLevel"                  "rtime"                   
##  [3] "acquisitionNum"           "scanIndex"               
##  [5] "dataStorage"              "dataOrigin"              
##  [7] "centroided"               "smoothed"                
##  [9] "polarity"                 "precScanNum"             
## [11] "precursorMz"              "precursorIntensity"      
## [13] "precursorCharge"          "collisionEnergy"         
## [15] "isolationWindowLowerMz"   "isolationWindowTargetMz" 
## [17] "isolationWindowUpperMz"   "peaksCount"              
## [19] "totIonCurrent"            "basePeakMZ"              
## [21] "basePeakIntensity"        "ionisationEnergy"        
## [23] "lowMZ"                    "highMZ"                  
## [25] "mergedScan"               "mergedResultScanNum"     
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"  
## [29] "injectionTime"            "filterString"            
## [31] "spectrumId"               "ionMobilityDriftTime"    
## [33] "scanWindowLowerLimit"     "scanWindowUpperLimit"    
## [35] "spectrum_id_"

The MS peak data can be accessed using either the mz, intensity or peaksData functions. Below we extract the peaks matrix of the 5th spectrum and display the first 6 rows.

peaksData(sps)[[5]] |>
head()
##            mz intensity
## [1,] 105.0347         0
## [2,] 105.0362       164
## [3,] 105.0376         0
## [4,] 105.0391         0
## [5,] 105.0405       328
## [6,] 105.0420         0

All data (peaks data or spectra variables) are always retrieved on the fly from the database resulting thus in a minimal memory footprint for the Spectra object.

print(object.size(sps), units = "KB")
## 91.5 Kb

The backend supports also adding additional spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum.

sps$rtime <- sps$rtime + 10

Such operations do however not change the data in the database (which is always considered read-only) but are cached locally within the backend object (in memory). The size in memory of the object is thus higher after changing that spectra variable.

print(object.size(sps), units = "KB")
## 106.2 Kb

Such $<- operations can also be used to cache spectra variables (temporarily) in memory which can eventually improve performance. Below we test the time it takes to extract the MS level from each spectrum from the database, then cache the MS levels in memory using $msLevel <- and test the timing to extract these cached variable.

system.time(msLevel(sps))
##    user  system elapsed 
##   0.012   0.000   0.011
sps$msLevel <- msLevel(sps)
system.time(msLevel(sps))
##    user  system elapsed 
##   0.004   0.000   0.003

We can also use the reset function to reset the data to its original state (this will cause any local spectra variables to be deleted and the backend to be initialized with the original data in the database).

sps <- reset(sps)

To use the MsBackendOfflineSql backend we need to provide all information required to connect to the database along with the database driver to the Spectra function. Which parameters are required to connect to the database depends on the SQL database and the used driver. In our example the data is stored in a SQLite database, hence we use the SQLite() database driver and only need to provide the database name with the dbname parameter. For a MySQL/MariaDB database we would use the MariaDB() driver and would have to provide the database name, user name, password as well as the host name and port through which the database is accessible.

sps_off <- Spectra(dbfile, drv = SQLite(),
                   source = MsBackendOfflineSql())
sps_off
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
##        msLevel precursorMz  polarity
##      <integer>   <numeric> <integer>
## 1            1          NA         1
## 2            1          NA         1
## 3            1          NA         1
## 4            1          NA         1
## 5            1          NA         1
## ...        ...         ...       ...
## 1858         1          NA         1
## 1859         1          NA         1
## 1860         1          NA         1
## 1861         1          NA         1
## 1862         1          NA         1
##  ... 34 more variables/columns.
##  Use  'spectraVariables' to list all of them.
## Database: /tmp/Rtmpqb6ORv/file3b2c309959876

This backend provides the exact same functionality than MsBackendSql with the difference that the connection to the database is opened and closed for each function call. While this leads to a slightly lower performance, it allows to to serialize the object (i.e. save/load the object to/from disk) and to use it (and hence the Spectra object) also in a parallel processing setup. In contrast, for the MsBackendSql parallel processing is disabled since it is not possible to share the active backend connection within the object across different parallel processes.

Below we compare the performance of the two backends. The performance difference is the result from opening and closing the database connection for each call. Note that this will also depend on the SQL server that is being used. For SQLite databases there is almost no overhead.

library(microbenchmark)
microbenchmark(msLevel(sps), msLevel(sps_off))
## Unit: milliseconds
##              expr       min       lq     mean   median       uq      max neval
##      msLevel(sps)  9.586745 10.06017 10.67443 10.34408 10.59225 22.80745   100
##  msLevel(sps_off) 11.654732 12.33772 12.84116 12.64373 13.16487 22.62525   100
##  cld
##   a 
##    b

4 Performance comparison with other backends

The need to retrieve any spectra data on-the-fly from the database will have an impact on the performance of data access function of Spectra objects using the MsBackendSql backends. To evaluate its impact we next compare the performance of the MsBackendSql to other Spectra backends, specifically, the MsBackendMzR which is the default backend to read and represent raw MS data, and the MsBackendMemory backend that keeps all MS data in memory (and is thus not suggested for larger MS experiments). Similar to the MsBackendMzR, also the MsBackendSql keeps only a limited amount of data in memory. These on-disk backends need thus to retrieve spectra and MS peaks data on-the-fly from either the original raw data files (in the case of the MsBackendMzR) or from the SQL database (in the case of the MsBackendSql). The in-memory backend MsBackendMemory is supposed to provide the fastest data access since all data is kept in memory.

Below we thus create Spectra objects from the same data but using the different backends.

sps <- Spectra(con, source = MsBackendSql())
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_im <- setBackend(sps_mzr, backend = MsBackendMemory())

At first we compare the memory footprint of the 3 backends.

print(object.size(sps), units = "KB")
## 91.5 Kb
print(object.size(sps_mzr), units = "KB")
## 386.7 Kb
print(object.size(sps_im), units = "KB")
## 54494.5 Kb

The MsBackendSql has the lowest memory footprint of all 3 backends because it does not keep any data in memory. The MsBackendMzR keeps all spectra variables, except the MS peaks data, in memory and has thus a larger size. The MsBackendMemory keeps all data (including the MS peaks data) in memory and has thus the largest size in memory.

Next we compare the performance to extract the MS level for each spectrum from the 4 different Spectra objects.

library(microbenchmark)
microbenchmark(msLevel(sps),
               msLevel(sps_mzr),
               msLevel(sps_im))
## Unit: microseconds
##              expr      min        lq        mean    median         uq       max
##      msLevel(sps) 9306.745 9809.1190 10479.78978 10431.213 10625.5895 22384.947
##  msLevel(sps_mzr)  637.786  706.4860   731.81586   727.350   751.9045  1059.970
##   msLevel(sps_im)   16.787   22.2475    33.56157    36.038    40.8290    53.019
##  neval cld
##    100 a  
##    100  b 
##    100   c

Extracting MS levels is thus slowest for the MsBackendSql, which is not surprising because both other backends keep this data in memory while the MsBackendSql needs to retrieve it from the database.

We next compare the performance to access the full peaks data from each Spectra object.

microbenchmark(peaksData(sps, BPPARAM = SerialParam()),
               peaksData(sps_mzr, BPPARAM = SerialParam()),
               peaksData(sps_im, BPPARAM = SerialParam()), times = 10)
## Unit: microseconds
##                                         expr        min         lq        mean
##      peaksData(sps, BPPARAM = SerialParam()) 147955.946 177458.377  421428.351
##  peaksData(sps_mzr, BPPARAM = SerialParam()) 670972.575 745351.292 1099580.387
##   peaksData(sps_im, BPPARAM = SerialParam())    540.139    588.955    2866.421
##       median          uq        max neval cld
##  423026.6160  636931.589  740244.31    10 a  
##  789868.2510 1359280.024 2038140.33    10  b 
##     849.8345    1095.169   21275.95    10   c

As expected, the MsBackendMemory has the fasted access to the full peaks data. The MsBackendSql outperforms however the MsBackendMzR providing faster access to the m/z and intensity values.

Performance can be improved for the MsBackendMzR using parallel processing. Note that the MsBackendSql does not support parallel processing and thus parallel processing is (silently) disabled in functions such as peaksData.

m2 <- MulticoreParam(2)
microbenchmark(peaksData(sps, BPPARAM = m2),
               peaksData(sps_mzr, BPPARAM = m2),
               peaksData(sps_im, BPPARAM = m2), times = 10)
## Unit: microseconds
##                              expr        min        lq        mean     median
##      peaksData(sps, BPPARAM = m2) 139730.929 167098.57  179664.016 180113.074
##  peaksData(sps_mzr, BPPARAM = m2) 703332.286 726418.54 1023296.275 826623.375
##   peaksData(sps_im, BPPARAM = m2)    640.247    952.23    1151.078   1120.499
##           uq        max neval cld
##   193796.641  228036.35    10  a 
##  1329625.574 1934590.18    10   b
##     1326.029    1662.89    10  a

We next compare the performance of subsetting operations.

microbenchmark(filterRt(sps, rt = c(50, 100)),
               filterRt(sps_mzr, rt = c(50, 100)),
               filterRt(sps_im, rt = c(50, 100)))
## Unit: microseconds
##                                expr      min        lq      mean    median
##      filterRt(sps, rt = c(50, 100)) 4462.294 4908.8495 5353.3127 5251.3620
##  filterRt(sps_mzr, rt = c(50, 100)) 3279.726 3449.3205 4043.5726 3915.1535
##   filterRt(sps_im, rt = c(50, 100))  733.409  831.4595  940.1345  941.2645
##         uq       max neval cld
##  5662.8955  9466.997   100 a  
##  4166.4515 14844.528   100  b 
##   992.4015  2306.442   100   c

The two on-disk backends MsBackendSql and MsBackendMzR show a comparable performance for this operation. This filtering does involves access to a spectra variables (the retention time in this case) which, for the MsBackendSql needs first to be retrieved from the backend. The MsBackendSql backend allows however also to cache spectra variables (i.e. they are stored within the MsBackendSql object). Any access to such cached spectra variables can eventually be faster because no dedicated SQL query is needed.

To evaluate the performance of a pure subsetting operation we first define the indices of 10 random spectra and subset the Spectra objects to these.

idx <- sample(seq_along(sps), 10)
microbenchmark(sps[idx],
               sps_mzr[idx],
               sps_im[idx])
## Unit: microseconds
##          expr     min       lq      mean   median        uq      max neval cld
##      sps[idx] 192.613  221.330  282.0647  246.351  257.1555 4298.514   100  a 
##  sps_mzr[idx] 984.184 1002.190 1051.0081 1024.006 1104.5905 1228.096   100   b
##   sps_im[idx] 299.761  322.393  353.7656  334.387  358.9305 1452.582   100  a

Here the MsBackendSql outperforms the other backends because it does not keep any data in memory and hence does not need to subset these. The two other backends need to subset the data they keep in memory which is in both cases a data frame with either a reduced set of spectra variables or the full MS data.

At last we compare also the extraction of the peaks data from the such subset Spectra objects.

sps_10 <- sps[idx]
sps_mzr_10 <- sps_mzr[idx]
sps_im_10 <- sps_im[idx]

microbenchmark(peaksData(sps_10),
               peaksData(sps_mzr_10),
               peaksData(sps_im_10),
               times = 10)
## Unit: microseconds
##                   expr        min         lq        mean      median         uq
##      peaksData(sps_10)   2749.108   3894.828   4472.3196   4330.9810   5028.455
##  peaksData(sps_mzr_10) 108885.891 110958.592 115174.1266 115836.0820 117427.259
##   peaksData(sps_im_10)    553.615    580.984    674.5323    691.9875    720.924
##         max neval cld
##    6022.080    10 a  
##  120957.252    10  b 
##     898.104    10   c

The MsBackendSql outperforms the MsBackendMzR while, not unexpectedly, the MsBackendMemory provides fasted access.

4.1 Considerations for database systems/servers

The backends from the MsBackendSql package use standard SQL calls to retrieve MS data from the database and hence any SQL database system (for which an R package is available) is supported. SQLite-based databases would represent the easiest and most user friendly solution since no database server administration and user management is required. Indeed, performance of SQLite is very high, even for very large data sets. Server-based databases on the other hand have the advantage to enable a centralized storage and control of MS data (inclusive user management etc). Also, such server systems would also allow data set or server-specific configurations to improve performance.

A comparison between a SQLite-based with a MariaDB-based MsBackendSql database for a large data set comprising over 8,000 samples and over 15,000,000 spectra is available here. In brief, performance to extract data was comparable and for individual spectra variables even faster for the SQLite database. Only when more complex SQL queries were involved (combining several primary keys or data fields) the more advanced MariaDB database outperformed SQLite.

5 Other properties of the MsBackendSql

The MsBackendSql backend does not support parallel processing since the database connection can not be shared across the different (parallel) processes. Thus, all methods on Spectra objects that use a MsBackendSql will automatically (and silently) disable parallel processing even if a dedicated parallel processing setup was passed along with the BPPARAM method.

Some functions on Spectra objects require to load the MS peak data (i.e., m/z and intensity values) into memory. For very large data sets (or computers with limited hardware resources) such function calls can cause out-of-memory errors. One example is the lengths function that determines the number of peaks per spectrum by loading the peak matrix first into memory. Such functions should ideally be called using the peaksapply function with parameter chunkSize (e.g., peaksapply(sps, lengths, chunkSize = 5000L)). Instead of processing the full data set, the data will be first split into chunks of size chunkSize that are stepwise processed. Hence, only data from chunkSize spectra is loaded into memory in one iteration.

6 Summary

The MsBackendSql provides an MS data representations and storage mode with a minimal memory footprint (in R) that is still comparably efficient for standard processing and subsetting operations. This backend is specifically useful for very large MS data sets, that could even be hosted on remote (MySQL/MariaDB) servers. A potential use case for this backend could thus be to set up a central storage place for MS experiments with data analysts connecting remotely to this server to perform initial data exploration and filtering. After subsetting to a smaller data set of interest, users could then retrieve/download this data by changing the backend to e.g. a MsBackendMemory, which would result in a download of the full data to the user computer’s memory.

7 Session information

sessionInfo()
## R Under development (unstable) (2024-03-06 r86056)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] microbenchmark_1.4.10 RSQLite_2.3.5         MsBackendSql_1.3.4   
## [4] Spectra_1.13.5        ProtGenerics_1.35.3   BiocParallel_1.37.1  
## [7] S4Vectors_0.41.4      BiocGenerics_0.49.1   BiocStyle_2.31.0     
## 
## loaded via a namespace (and not attached):
##  [1] sandwich_3.1-0         sass_0.4.8             MsCoreUtils_1.15.4    
##  [4] lattice_0.22-5         hms_1.1.3              digest_0.6.34         
##  [7] grid_4.4.0             evaluate_0.23          bookdown_0.38         
## [10] mvtnorm_1.2-4          fastmap_1.1.1          blob_1.2.4            
## [13] Matrix_1.6-5           jsonlite_1.8.8         progress_1.2.3        
## [16] mzR_2.37.2             DBI_1.2.2              survival_3.5-8        
## [19] multcomp_1.4-25        BiocManager_1.30.22    TH.data_1.1-2         
## [22] codetools_0.2-19       jquerylib_0.1.4        cli_3.6.2             
## [25] rlang_1.1.3            crayon_1.5.2           Biobase_2.63.0        
## [28] splines_4.4.0          bit64_4.0.5            cachem_1.0.8          
## [31] yaml_2.3.8             tools_4.4.0            parallel_4.4.0        
## [34] memoise_2.0.1          ncdf4_1.22             vctrs_0.6.5           
## [37] R6_2.5.1               zoo_1.8-12             lifecycle_1.0.4       
## [40] fs_1.6.3               IRanges_2.37.1         bit_4.0.5             
## [43] clue_0.3-65            MASS_7.3-60.2          cluster_2.1.6         
## [46] pkgconfig_2.0.3        bslib_0.6.1            data.table_1.15.2     
## [49] Rcpp_1.0.12            xfun_0.42              knitr_1.45            
## [52] htmltools_0.5.7        rmarkdown_2.26         compiler_4.4.0        
## [55] prettyunits_1.2.0      MetaboCoreUtils_1.11.2