1 The scp data framework

Our data structure relies on two curated data classes: QFeatures (Gatto (2020)) and SingleCellExperiment (Amezquita et al. (2019)). QFeatures is dedicated to the manipulation and processing of MS-based quantitative data. It explicitly records the successive processing steps so that users can navigate up and down the different MS levels. SingleCellExperiment is a class designed as an efficient data container that serves as an interface to state-of-the-art methods and algorithms for single-cell data. Our framework combines the two classes to benefit from their respective advantages.

Because mass spectrometry (MS)-based single-cell proteomics (SCP) only captures the proteome of between one and a few tens of single cells in a single run, the data are usually acquired across many MS batches. Therefore, the data for each run should conceptually be stored in its own container, which we here call an assay. The expected input for working with the scp package is quantification data of peptide-to-spectrum matches (PSMs). These data can then be processed to reconstruct peptide and protein data. The links between related features across different assays are stored to facilitate manipulation and visualization of PSM, peptide and protein data. This is conceptually shown below.

Figure 1: The scp framework relies on SingleCellExperiment and QFeatures objects
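
The idea of keeping links between PSM and peptide data can be illustrated with a minimal base R sketch. It assumes a made-up table `toyPsms` and uses the median as summary statistic purely for illustration; it is not the code that scp or QFeatures run internally, and all object names are hypothetical.

```r
# Toy PSM-level data (all names and values are made up for illustration).
toyPsms <- data.frame(
    psm = c("PSM1", "PSM2", "PSM3"),
    Sequence = c("ATNFLAHEK", "ATNFLAHEK", "SLVIPEK"),
    intensity = c(100, 300, 50),
    stringsAsFactors = FALSE
)

# Summarise PSM intensities into one value per peptide sequence.
# The median is an arbitrary choice for this sketch.
peptideIntensity <- tapply(toyPsms$intensity, toyPsms$Sequence, median)

# Record which PSMs contributed to each peptide. This is the "link"
# that allows moving between the PSM and the peptide level.
psmToPeptide <- split(toyPsms$psm, toyPsms$Sequence)

print(peptideIntensity)
print(psmToPeptide)
```

With such a mapping, one can start from a peptide and retrieve the PSMs it was built from, which is the kind of navigation the framework is designed to support.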

There are two input tables required for starting an analysis with scp:

  1. The input table
  2. The sample table

2 Input table

The input table is generated after the identification and quantification of the MS spectra by pre-processing software such as MaxQuant, ProteomeDiscoverer or MSFragger (many other tools are available). As an example, we will use a data table generated by MaxQuant. The table is available from the scp package and is called mqScpData (for MaxQuant generated SCP data).

library(scp)
data("mqScpData")
dim(mqScpData)
#> [1] 1361  149

In this toy example, there are 1361 rows corresponding to features (quantified PSMs) and 149 columns corresponding to different data fields recorded by MaxQuant during the processing of the MS spectra. There are three types of columns:

  • Feature quantification: 1 to n columns (depending on technology)
  • Feature annotations: e.g. peptide sequence, ion charge, protein name
  • Acquisition annotations: e.g. file name

Figure 2: Conceptual representation of the input table
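
The three column types can be told apart programmatically. The following is a minimal sketch on a made-up table `toyInput` that mimics the MaxQuant layout; the table, its values and the helper objects are hypothetical, and the rule used to classify columns is an assumption that only fits this naming scheme.

```r
# Toy input table mimicking the layout described above (values are made up).
toyInput <- data.frame(
    Raw.file = c("runA", "runA", "runB"),
    Sequence = c("ATNFLAHEK", "SLVIPEK", "SHTILLVQPTK"),
    Charge = c(2L, 2L, 3L),
    Reporter.intensity.1 = c(61251, 58648, 150350),
    Reporter.intensity.2 = c(501.71, 1099.8, 3705),
    stringsAsFactors = FALSE
)

# Quantification columns follow the "Reporter.intensity.<number>" pattern.
isQuant <- grepl("^Reporter\\.intensity\\.\\d+$", colnames(toyInput))
# Here the only acquisition annotation is the raw file name.
isAcquisition <- colnames(toyInput) == "Raw.file"

# Everything else is treated as a feature annotation.
columnType <- ifelse(isQuant, "quantification",
                     ifelse(isAcquisition, "acquisition annotation",
                            "feature annotation"))
names(columnType) <- colnames(toyInput)
print(columnType)
```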

2.0.1 Feature quantifications

The quantification data can consist of a single column (in case of label-free acquisition) or of up to 16 columns (in case of TMT-16 multiplexing). In this table, the names of the columns holding the quantification start with Reporter.intensity. followed by a number.

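# Select the quantification columns: "Reporter.intensity." followed by a number.
# The dots are escaped so that they match a literal "." only.
(quantCols <- grep("Reporter\\.intensity\\.\\d", colnames(mqScpData),
                  value = TRUE))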
#>  [1] "Reporter.intensity.1"  "Reporter.intensity.2"  "Reporter.intensity.3" 
#>  [4] "Reporter.intensity.4"  "Reporter.intensity.5"  "Reporter.intensity.6" 
#>  [7] "Reporter.intensity.7"  "Reporter.intensity.8"  "Reporter.intensity.9" 
#> [10] "Reporter.intensity.10" "Reporter.intensity.11" "Reporter.intensity.12"
#> [13] "Reporter.intensity.13" "Reporter.intensity.14" "Reporter.intensity.15"
#> [16] "Reporter.intensity.16"

As you may notice, we retrieve 16 quantification columns, as expected for data acquired using a TMT-16 protocol. However, some runs were acquired using a TMT-11 protocol (11 labels), which is why columns 12 to 16 contain NA for most of the rows shown below. We will come back to this later.

head(mqScpData[, quantCols])
#>   Reporter.intensity.1 Reporter.intensity.2 Reporter.intensity.3
#> 1                61251               501.71               3731.3
#> 2                58648              1099.80               2837.7
#> 3               150350              3705.00               9361.0
#> 4                27347               405.90               1525.2
#> 5                84035               583.09               4092.3
#> 6                44895               700.23               2283.0
#>   Reporter.intensity.4 Reporter.intensity.5 Reporter.intensity.6
#> 1              1643.30               871.84               981.87
#> 2               494.32               349.26              1030.50
#> 3                 0.00              1945.40              1188.60
#> 4                 0.00                 0.00               318.74
#> 5               530.13               718.13              2204.50
#> 6              1109.60                 0.00               675.79
#>   Reporter.intensity.7 Reporter.intensity.8 Reporter.intensity.9
#> 1              1200.10               939.06              1457.50
#> 2                 0.00              1214.10               800.58
#> 3              1574.00              2302.10              2176.10
#> 4                 0.00               519.81                 0.00
#> 5               960.51               453.77              1188.40
#> 6                 0.00               809.38               668.88
#>   Reporter.intensity.10 Reporter.intensity.11 Reporter.intensity.12
#> 1               1329.80                981.83                    NA
#> 2                807.79                391.38                    NA
#> 3               1399.50               1307.50                2192.4
#> 4                507.23                370.79                    NA
#> 5                740.99                  0.00                    NA
#> 6               1467.50                901.38                    NA
#>   Reporter.intensity.13 Reporter.intensity.14 Reporter.intensity.15
#> 1                    NA                    NA                    NA
#> 2                    NA                    NA                    NA
#> 3                1791.4                1727.5                2157.3
#> 4                    NA                    NA                    NA
#> 5                    NA                    NA                    NA
#> 6                    NA                    NA                    NA
#>   Reporter.intensity.16
#> 1                    NA
#> 2                    NA
#> 3                  1398
#> 4                    NA
#> 5                    NA
#> 6                    NA
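
The NA pattern can be used to find out how many channels were actually used in each run. Below is a minimal sketch on a made-up table `toyPsms` with three channels instead of 16; the run names and values are hypothetical, and the approach assumes that unused channels are reported as NA, as in the output above.

```r
# Toy PSM table: "runA" uses two channels, "runB" uses three (values made up).
toyPsms <- data.frame(
    Raw.file = c("runA", "runA", "runB"),
    Reporter.intensity.1 = c(61251, 58648, 150350),
    Reporter.intensity.2 = c(501.71, 1099.8, 3705),
    Reporter.intensity.3 = c(NA, NA, 2192.4),
    stringsAsFactors = FALSE
)
quantCols <- grep("^Reporter\\.intensity\\.\\d+$", colnames(toyPsms),
                  value = TRUE)

# For each run, count the channels holding at least one non-missing value.
channelsPerRun <- sapply(
    split(toyPsms[, quantCols], toyPsms$Raw.file),
    function(x) sum(colSums(!is.na(x)) > 0)
)
print(channelsPerRun)
```

Note that a zero intensity is not counted as missing here: only NA marks a channel as unused.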

2.0.2 Feature annotations

Most columns in the mqScpData table contain information used or generated during the identification of the MS spectra. For instance, you may find the charge of the parent ion, the score and probability of a correct match between the MS spectrum and a peptide sequence, the sequence of the best matching peptide, its length, its modifications, the retention time of the peptide on the LC, the protein(s) the peptide originates from and much more.

head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
                   "Retention.time", "Proteins")])
#>   Charge  Score        PEP    Sequence Length Retention.time
#> 1      2 41.029 5.2636e-04   ATNFLAHEK      9         65.781
#> 2      2 44.349 5.8789e-04   ATNFLAHEK      9         63.787
#> 3      2 51.066 4.0315e-24 SHTILLVQPTK     11         71.884
#> 4      2 63.816 4.7622e-06 SHTILLVQPTK     11         68.633
#> 5      2 74.464 6.8709e-09 SHTILLVQPTK     11         71.946
#> 6      2 41.502 5.3705e-02     SLVIPEK      7         76.204
#>               Proteins
#> 1 sp|P29692|EF1D_HUMAN
#> 2 sp|P29692|EF1D_HUMAN
#> 3  sp|P84090|ERH_HUMAN
#> 4  sp|P84090|ERH_HUMAN
#> 5  sp|P84090|ERH_HUMAN
#> 6 sp|P62269|RS18_HUMAN

2.0.3 Acquisition annotations

This type of annotation is related to the MS instrument. In MaxQuant, only the file name generated by the MS instrument is stored. There is one file for each MS run, hence the file name can be used as a batch identifier.

unique(mqScpData$Raw.file)
#> [1] "190321S_LCA10_X_FP97AG"        "190222S_LCA9_X_FP94BM"        
#> [3] "190914S_LCB3_X_16plex_Set_21"  "190321S_LCA10_X_FP97_blank_01"
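
Because the file name identifies the batch, it can be used to count or split the PSMs by run. The following is a minimal sketch in base R on a made-up table `toyPsms`; the run names and peptide sequences are placeholders.

```r
# Toy PSM table with a batch identifier (values are made up).
toyPsms <- data.frame(
    Raw.file = c("runA", "runA", "runB", "runC"),
    Sequence = c("ATNFLAHEK", "SLVIPEK", "SHTILLVQPTK", "SLVIPEK"),
    stringsAsFactors = FALSE
)

# Number of PSMs recorded in each MS run.
psmsPerRun <- table(toyPsms$Raw.file)
print(psmsPerRun)

# One data.frame per MS run, named after the run.
psmsByRun <- split(toyPsms, toyPsms$Raw.file)
print(names(psmsByRun))
```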

3 Sample table

The sample table contains the experimental design generated by the researcher. Each row of the sample table corresponds to a sample in the experiment and each column corresponds to an available annotation about the samples. We will here use the second example table, sampleAnnotation:

data("sampleAnnotation")
head(sampleAnnotation)
#>                Raw.file              Channel SampleType lcbatch sortday digest
#> 1 190222S_LCA9_X_FP94BM Reporter.intensity.1    Carrier    LCA9      s8      N
#> 2 190222S_LCA9_X_FP94BM Reporter.intensity.2  Reference    LCA9      s8      N
#> 3 190222S_LCA9_X_FP94BM Reporter.intensity.3     Unused    LCA9      s8      N
#> 4 190222S_LCA9_X_FP94BM Reporter.intensity.4   Monocyte    LCA9      s8      N
#> 5 190222S_LCA9_X_FP94BM Reporter.intensity.5      Blank    LCA9      s8      N
#> 6 190222S_LCA9_X_FP94BM Reporter.intensity.6   Monocyte    LCA9      s8      N

This table may contain any information about the samples. For example, useful information could be the type of sample that is analysed, a phenotype known from the experimental design, the MS batch, the acquisition date, the MS settings used to acquire the sample, the LC batch, the sample preparation batch, etc. However, scp requires two specific fields in the sample annotations:

  1. One column that tells scp which column of the input table holds the quantification of the corresponding sample (Channel in this case).
  2. One column containing the MS run names (Raw.file in this case). It must have the same name as the column containing the MS run names in the input table.

These two columns allow scp to correctly split and match data that were acquired across multiple acquisition runs.
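
To make the role of these two columns concrete, here is a minimal base R sketch that checks both requirements and then splits and matches the data by run. The tables `toyInput` and `toySamples`, their values and the naming of the resulting columns are assumptions made for illustration; this is not the code that scp runs internally.

```r
# Toy input table and sample table (all values are made up).
toyInput <- data.frame(
    Raw.file = c("runA", "runA", "runB"),
    Sequence = c("ATNFLAHEK", "SLVIPEK", "SHTILLVQPTK"),
    Reporter.intensity.1 = c(61251, 58648, 150350),
    Reporter.intensity.2 = c(501.71, 1099.8, 3705),
    stringsAsFactors = FALSE
)
toySamples <- data.frame(
    Raw.file = rep(c("runA", "runB"), each = 2),
    Channel = rep(c("Reporter.intensity.1", "Reporter.intensity.2"), times = 2),
    SampleType = c("Carrier", "Monocyte", "Carrier", "Blank"),
    stringsAsFactors = FALSE
)

# Requirement 1: every channel named in the sample table must be a column
# of the input table.
stopifnot(all(toySamples$Channel %in% colnames(toyInput)))
# Requirement 2: the run column has the same name in both tables, and every
# run found in the input table is annotated in the sample table.
stopifnot("Raw.file" %in% colnames(toyInput),
          "Raw.file" %in% colnames(toySamples),
          all(toyInput$Raw.file %in% toySamples$Raw.file))

# Split the input table by run and keep only the channels annotated for
# that run.
quantByRun <- lapply(split(toyInput, toyInput$Raw.file), function(runData) {
    run <- runData$Raw.file[1]
    channels <- toySamples$Channel[toySamples$Raw.file == run]
    m <- as.matrix(runData[, channels, drop = FALSE])
    # Prefix the channel with the run name so that sample names stay unique
    # across runs (a naming choice made for this sketch).
    colnames(m) <- paste0(run, "_", channels)
    m
})
print(quantByRun)
```

Each element of `quantByRun` is a quantification matrix for one run, whose columns can be matched back to rows of the sample table through the run name and the channel.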