1 Introduction

Mass spectrometry-based proteomics and metabolomics are the preferred technologies to study the protein and metabolite landscape of complex biological systems. The Orbitrap mass analyzer is one of the key innovations that propelled the field by providing high-resolution accurate mass (HRAM) data on a chromatographic time scale. Driven by the need to analyze the resulting LC-MS data, several specialized software tools have been developed in the last decade. In the academic environment, MaxQuant(Cox and Mann 2008) and Skyline(MacLean et al. 2010) are by far the most popular ones. These software tools usually offer GUIs that control running predefined analysis templates/workflows, including free parameters that need to be defined by the user. In parallel, projects like OpenMS(Röst et al. 2016) or pyteomics(Goloborodko et al. 2013) chose a fundamentally different approach. They aim at providing software libraries bound to specific programming languages like C++ or Python. Naturally, these offer greater analytical flexibility but require programming skills from the end-user and have therefore not reached the popularity of their GUI counterparts. Proteomics and metabolomics specific libraries have also been developed for the R statistical environment, but these mainly support high-level statistical analysis once the raw measurement data has undergone extensive preprocessing and aggregation by external software tools (often the GUI-based ones listed above). A typical example is the R package MSstats(Choi et al. 2014) for the statistical analysis of LC-MS experiments with complex designs or MSqRob(Goeminne, Gevaert, and Clement 2015). MSstats can process MaxQuant or Skyline outputs and creates protein/peptide level estimates for whether the biological system shows statistically significant regulation. In a nutshell, these tools provide statistical postprocessing. Libraries that support working with the spectral data in R also exist, for instance, the Bioconductor package MSnbase(Gatto and Lilley 2011). However, they require conversion of raw data to exchange formats like mzML, which is primarily supported by the ProteoWizard(Chambers et al. 2012) project and its software tool MSconvert.

We strongly believe that a library providing raw data reading would finally close the gap and facilitate modular end-to-end analysis pipeline development in R. This could be of special interest to research environments/projects dealing with either big data analytics or scientists interested in code prototyping without formal computer science education. Another key aspect regarding multi-omics integration is the fact that high-throughput genomic data analysis is already done mostly in R. This is primarily due to the Bioconductor project(Huber et al. 2015) that currently provides >1900 open-source software packages, training and teaching, and a very active user and developer community. Having these thoughts in mind, we decided to implement our R package named rawrr. rawrr utilizes a vendor-provided API named RawFileReader (Shofstahl 2016) to access spectral data logged in proprietary Thermo Fisher Scientific raw files. These binary files are among others written by all Orbitrap mass spectrometers, unlocking an incredible amount of the recent global LC-MS data, also stored in public repositories like ProteomeExchange. This manuscript presents a first package version/release and showcases its usage for bottom-up proteomics data analysis with a focus on Orbitrap data.

2 Implementation

Our implementation consists of two language layers, the top R layer and the hidden C# layer. Specifically, R functions requesting access to data stored in binary raw files (reader family functions listed in Table 1) invoke compiled C# wrapper methods using a system call. Calling a wrapper method typically results in the execution of methods defined in the RawFileReader dynamic link library provided by Thermo Fisher Scientific. Our precompiled wrapper methods are bundled in the rawrr executable file and shipped with the released R package. Running rawrr.exe requires the Mono environment on non-Microsoft operating systems. Mono is a cross platform, open source .NET framework. On Microsoft Windows the Microsoft .NET framework is typically already installed and sufficient. Our package also contains the C# source code (rawrr.cs), hopefully allowing other developers to follow and improve our code (open source). In order to return extracted data back to the R layer we use file I/O. More specifically, the extracted information is written to a temporary location on the harddrive, read back into memory and parsed into R objects.

Since mass spectrometry typically uses two basic data items, the mass spectrum and the mass chromatogram, we decided to implement corresponding objects following R’s S3 OOP system (Becker, Chambers, and Wilks 1988) named rawrrSpectrum and rawrrChromatogram. These objects function as simplistic interface to almost all data stored in raw-formatted files. The package provides functions to create and validate class instances. While class constructors primarily exist for (unit) testing purposes, instances are typically generated by the reader family of functions enumerated in Table 1 and returned as object sets (rawrrSpectrumSet, rawrrChromatogramSet). The names of objects encapsulated within rawrrSpectrum instances are keys returned by the RawFileReader API and the corresponding values become data parts of the objects, typically vectors of type numeric, logical or character. It needs to be mentioned that the rawrrSpectrum content partially depends on the instrument model and installed instrument control software version. For instance, the keys FAIMS Voltage On: and FAIMS CV: are only written by instruments that support FAIMS acquisition. We also implemented basic generics for printing and plotting of objects in base R to minimize dependencies.

2.0.1 Example data

The example file 20181113_010_autoQC01.raw used throughout this manuscript contains Fourier-transformed Orbitrap spectra (FTMS) recorded on a Thermo Fisher Scientific Q Exactive HF in positive mode (+). The mass spectrometer was operated in line with a nano UPLC and a nano electrospray source (NSI). MS2 spectra were generated by HCD fragmentation at normalized collision energy (NCE) of 27. All spectra were written to file after applying centroiding (c) and lock mass correction. The analyzed sample consisted of the iRT peptide mix (Biognosys) in a tryptic BSA digest (NEB) and was separated applying a 20 min linear gradient on C18 reversed-phase material at a constant flow rate of 300 nl/min. The file is part of the MassIVE dataset MSV000086542 (Kockmann 2020) and can be obtained through the FTP download link (MD5: a1f5df9627cf9e0d51ec1906776957ab). Individual scans have been assigned a universal spectrum identifier (USI)(Deutsch et al. 2020) by MassIVE.

If the environment variable MONO_PATH does not include a directory containing the RawFileReader .NET assemblies are installed in a directory derived by the rawrr::rawrrAssemblyPath() function.

if (isFALSE(rawrr::.checkDllInMonoPath())){
## removing files in directory '/home/biocbuild/.cache/R/rawrr/rawrrassembly'
##                   ThermoFisher.CommonCore.Data.dll 
##                                                  0 
## ThermoFisher.CommonCore.MassPrecisionEstimator.dll 
##                                                  0 
##          ThermoFisher.CommonCore.RawFileReader.dll 
##                                                  0
## MD5 59e3183200a0a835a60777bffe783339 /home/biocbuild/.cache/R/rawrr/rawrrassembly/rawrr.exe
## [1] 0

Additional raw data for demonstration and extended testing is available through the Bioconductor data package tartare (Panse and Kockmann 2019).

# fetch via ExperimentHub
eh <- ExperimentHub::ExperimentHub()
EH4547 <- normalizePath(eh[["EH4547"]])

(rawfile <- paste0(EH4547, ".raw"))
## [1] "/var/cache/biocbuild_cache/R/ExperimentHub/1450d77379d_4590.raw"
if (!file.exists(rawfile)){
  file.copy(EH4547, rawfile)

3 Results

The following sections are inspired by real-world research/infrastructure projects but have been stripped down to the bare scientific essentials to put more emphasis on the software application. We display source code in grey-shaded boxes, including syntax highlights. Corresponding R command line output starts with ## and is shown directly below the code fragment that triggered the output. All figures are generated using the generic plotting functions of the package. Additional graphical elements were added using base R plotting functions to increase the comprehensibility.

3.1 Use Case I - Analyzing Orbitrap Spectra

H <- rawrr::readFileHeader(rawfile = rawfile)

The Orbitrap detector has been a tremendous success story in MS, since it offers HRAM data on a time scale that is compatible with chromatographic analysis (LC-MS)(Makarov 2000) and is therefore heavily used in bottom-up proteomics. However, analyzing Orbitrap data in R has so far only been possible after raw data conversion to exchange formats like mz(X)ML. Unfortunately, conversion is accompanied by a loss of Orbitrap-specific information. This use case shows how easy it is to work directly with raw-formated Orbitrap data after installing our R package rawrr which applies vendor APIs for data access. We use a complete LC-MS run recorded on a NA by parallel reaction monitoring (PRM)(Gallien et al. 2012) for demonstration purposes (File name: NA). The NA min run resulted in NA scans that were written to the file. Already typesetting the above lines uses rawrr functionality, since instrument model, file name, time range of data acquisition, and number of scans is extracted from the binary file header (Note: This manuscript was written in R markdown and combines R code with narration). The respective function is called readFileHeader() and returns a simple R object of type list (see Table 1).

lists rawrr package functions connected to reading functionality. More details can be found in the package documentation (see supporting information, S-20 onwards (Kockmann and Panse 2021)).
Function Name Description Return value
readFileHeader() Reads meta information from a raw file header list
readIndex() Generates a scan index from a raw file data.frame
readSpectrum() Reads spectral data from a raw file rawrrSpectrum(Set)
readChromatogram() Reads chromatographic data from a raw file rawrrChromatogram(Set)

Individual scans or scan collections (sets) can be read by the function readSpectrum() which returns a rawrrSpectrum object or rawrrSpectrumSet. Our package also provides generics for printing and plotting these objects. The following code chunk depicts how a set of scans is read from the raw file (scan numbers were selected based on a database search). The corresponding Figure 1 shows the resulting plot for scan 9594 (USI: mzspec:MSV000086542:20181113_010_autoQC01:scan:9594:LGGNEQVTR/2) assigned to the doubly-charged iRT peptide LGGNEQVTR by MS-GF+ (Score: 144, SpecProb: 1.9e-12, DB E-Value: 4.4e-4, see MassIVE RMSV000000336.1 for details of the search):

i <- c(9594, 11113, 11884, 12788, 12677, 13204, 13868, 14551, 16136, 17193, 17612)
S <- rawrr::readSpectrum(rawfile = rawfile, scan = i)
## [1] "rawrrSpectrumSet"
## [1] "rawrrSpectrum"
## Total Ion Current:    62374688
## Scan Low Mass:    100
## Scan High Mass:   1015
## Scan Start Time (Min):    15.42
## Scan Number:  9594
## Base Peak Intensity:  12894520
## Base Peak Mass:   860.4223
## Scan Mode:    FTMS + c NSI Full ms2 487.2567@hcd27.00 [100.0000-1015.0000]
plot(S[[1]], centroid=TRUE)
Plot of scan number 9594 (USI: mzspec:MSV000086542:20181113_010_autoQC01:scan:9594) showing a centroided tandem mass spectrum of the iRT peptide precursor LGGNEQVTR++ in positive mode. The scan was acquired on an Orbitrap detector including lock mass correction and using a transient of 64 ms (equal to a resolving power of 30'000 at 200 m/z) and an AGC target of 1e5 elementary charges. Peak attributes like m/z, charge (z), and resolution (R) are shown above the peaks.

Figure 1: Plot of scan number 9594 (USI: mzspec:MSV000086542:20181113_010_autoQC01:scan:9594) showing a centroided tandem mass spectrum of the iRT peptide precursor LGGNEQVTR++ in positive mode
The scan was acquired on an Orbitrap detector including lock mass correction and using a transient of 64 ms (equal to a resolving power of 30’000 at 200 m/z) and an AGC target of 1e5 elementary charges. Peak attributes like m/z, charge (z), and resolution (R) are shown above the peaks.

The plot shows typical Orbitrap peak attributes like resolution (R) and charge (z) above the most intense peaks when centroided data is available and selected. Centroided data also makes it possible to graph spectra using signal-to-noise as response value. This is potentially interesting since Orbitrap detectors follow \(S/N \propto charges \cdot \sqrt R\) with \(S/N\) being the signal to noise ratio, \(charges\) meaning elementary charges in the Orbitrap, and \(R\) being the resolving power of the scan (Kelstrup et al. 2014). Signal-to-noise makes judging the signal quantity more intuitive than using arbitrary signal intensity units. Figure 2 shows that all y-ion signals are several ten or even hundred folds above the noise estimate (Note: The spectrum shown in Figure 2 is decorated with y-ions calculated for LGGNEQVTR++ using the protViz package.). By adding some diagnostic information related to the Orbitrap’s advanced gain control (AGC) system, it becomes apparent that the C-trap managed to collect the defined 100,000 charges within 2.8 ms, corresponding to only ~5% of the maximum injection time of 55 ms.

plot(S[[1]], centroid=TRUE, SN = TRUE, diagnostic = TRUE)
# S/N threshold indicator
abline(h = 5, lty = 2, col = "blue")
# decorate plot with y-ion series of target peptide LGGNEQVTR++
(yIonSeries <- protViz::fragmentIon("LGGNEQVTR")[[1]]$y[1:8])
## [1] 175.1190 276.1666 375.2350 503.2936 632.3362 746.3791 803.4006 860.4221
names(yIonSeries) <- paste0("y", seq(1, length(yIonSeries)))
abline(v = yIonSeries, col='#DDDDDD88', lwd=5)
axis(3, yIonSeries, names(yIonSeries))