1 The scp data framework

Our data structure relies on two curated data classes: QFeatures (Gatto and Vanderaa (2023)) and SingleCellExperiment (Amezquita et al. (2020)). QFeatures is dedicated to the manipulation and processing of MS-based quantitative data. It explicitly records the successive processing steps, which allows users to navigate up and down the different MS levels. SingleCellExperiment is a class designed as an efficient data container that serves as an interface to state-of-the-art methods and algorithms for single-cell data. Our framework combines the two classes to inherit their respective advantages.

Because mass spectrometry (MS)-based single-cell proteomics (SCP) captures the proteome of only one to a few tens of single cells in a single run, the data are usually acquired across many MS batches. The data for each run should therefore conceptually be stored in its own container, which we here call a set. The expected input for working with the scp package is quantification data for peptide-to-spectrum matches (PSMs). These data can then be processed to reconstruct peptide and protein data. The links between related features across different sets are stored to facilitate the manipulation and visualization of PSM, peptide and protein data. This is conceptually shown below.

Figure 1: The scp framework relies on SingleCellExperiment and QFeatures objects
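
The idea of storing each run in its own set can be sketched with a few lines of R. This is a minimal illustration under stated assumptions, not the way scp itself imports data: the intensities are made-up toy values, and the helper `make_set()` and the run names `runA` and `runB` are hypothetical. Each run becomes one SingleCellExperiment object, and the sets are then collected in a single QFeatures object.

```r
library(SingleCellExperiment)
library(QFeatures)

## Hypothetical helper: build one toy set (3 PSMs x 2 channels) for a run.
## The intensities are arbitrary numbers used for illustration only.
make_set <- function(run, n_psm = 3) {
    m <- matrix(
        seq_len(n_psm * 2),
        nrow = n_psm,
        dimnames = list(
            paste0("PSM", seq_len(n_psm)),
            paste0(run, "_channel", 1:2)  # column names unique across runs
        )
    )
    SingleCellExperiment(assays = list(intensity = m))
}

## One set per MS run, gathered in a named list
sets <- list(runA = make_set("runA"), runB = make_set("runB"))

## Combine the sets in a single QFeatures object
toy <- QFeatures(sets)

names(toy)  # names of the sets
dims(toy)   # number of features (rows) and samples (columns) per set
```

In this sketch no links between sets exist yet. In a QFeatures object, such links are recorded when features are processed, for example when PSMs are aggregated into peptides.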

The main input table required for starting an analysis with scp is called the assayData.

2 assayData table

The assayData table is generated after the identification and quantification of the MS spectra by pre-processing software such as MaxQuant, ProteomeDiscoverer or MSFragger (many other tools exist). We will use here, as an example, a data table generated by MaxQuant. The table is available from the scp package and is called mqScpData (for MaxQuant-generated SCP data).

library(scp)
data("mqScpData")
dim(mqScpData)
#> [1] 1361  149

In this toy example, there are 1361 rows corresponding to features (quantified PSMs) and 149 columns corresponding to different data fields recorded by MaxQuant during the processing of the MS spectra. There are three types of columns:

  • Quantification columns (quantCols): 1 to n (depending on technology)
  • Run identifier column (runCol): e.g. file name
  • Feature annotations: e.g. peptide sequence, ion charge, protein name
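
The three types of columns can be located in mqScpData along the following lines. This is a sketch that assumes the usual MaxQuant naming exported to R: reporter-ion intensities in columns named `Reporter.intensity.<channel number>` and the MS run name in a column called `Raw.file`. Both names are assumptions to check against your own table, and the variable `annotCols` is introduced here only for illustration.

```r
library(scp)
data("mqScpData")

## Quantification columns. Assumption: one column per channel, named
## "Reporter.intensity.1", "Reporter.intensity.2", ...
quantCols <- grep(
    "^Reporter\\.intensity\\.[0-9]+$",
    colnames(mqScpData),
    value = TRUE
)

## Run identifier column. Assumption: MaxQuant stores the raw file name
## in a column called "Raw.file".
runCol <- "Raw.file"
stopifnot(runCol %in% colnames(mqScpData))

## All remaining columns are treated as feature annotations
annotCols <- setdiff(colnames(mqScpData), c(quantCols, runCol))

head(mqScpData[, quantCols])   # quantification values of the first PSMs
unique(mqScpData[[runCol]])    # MS runs present in the table
head(annotCols)                # a few of the feature annotation fields
```

Anchoring the pattern with `^` and `$` avoids picking up related MaxQuant fields that only share the same prefix.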