readSCP
scp 1.10.0
scp data framework
Our data structure relies on two curated data classes: QFeatures
(Gatto (2020)) and SingleCellExperiment (Amezquita et al. (2019)).
QFeatures is dedicated to the manipulation and processing of
MS-based quantitative data. It explicitly records the successive steps
to allow users to navigate up and down the different MS levels.
SingleCellExperiment is another class designed as an efficient data
container that serves as an interface to state-of-the-art methods and
algorithms for single-cell data. Our framework combines the two
classes to inherit from their respective advantages.
Because mass spectrometry (MS)-based single-cell proteomics (SCP) only
captures the proteome of between one and a few tens of single cells in
a single run, the data are usually acquired across many MS batches.
Therefore, the data for each run should conceptually be stored in its
own container, which we here call an assay. The expected input for
working with the scp package is quantification data of
peptide-to-spectrum matches (PSMs). These data can then be processed to
reconstruct peptide and protein data. The links between related
features across different assays are stored to facilitate manipulation
and visualization of PSM, peptide and protein data. This is
conceptually shown below.
Figure 1: The scp framework relies on SingleCellExperiment and QFeatures objects
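There are two input tables required for starting an analysis with
scp: the input table, which holds the quantification data and the
feature annotations, and the sample table, which holds the
experimental design.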
The input table is generated after the identification and
quantification of the MS spectra by pre-processing software such as
MaxQuant, ProteomeDiscoverer or MSFragger (the list of available
software is actually much longer). We will here use as an example a
data table that has been generated by MaxQuant. The table is
available from the scp package and is called mqScpData (for
MaxQuant generated SCP data).
library(scp)
data("mqScpData")
dim(mqScpData)
#> [1] 1361 149
In this toy example, there are 1361 rows corresponding to features (quantified PSMs) and 149 columns corresponding to different data fields recorded by MaxQuant during the processing of the MS spectra. There are three types of columns: quantification columns, feature annotation columns, and a column identifying the MS run.
Figure 2: Conceptual representation of the input table
The quantification data can be composed of one column (label-free
acquisition) up to 16 columns (TMT-16 multiplexing). The names of the
columns holding the quantification start with Reporter.intensity.
followed by a number.
(quantCols <- grep("Reporter.intensity.\\d", colnames(mqScpData),
                   value = TRUE))
#> [1] "Reporter.intensity.1" "Reporter.intensity.2" "Reporter.intensity.3"
#> [4] "Reporter.intensity.4" "Reporter.intensity.5" "Reporter.intensity.6"
#> [7] "Reporter.intensity.7" "Reporter.intensity.8" "Reporter.intensity.9"
#> [10] "Reporter.intensity.10" "Reporter.intensity.11" "Reporter.intensity.12"
#> [13] "Reporter.intensity.13" "Reporter.intensity.14" "Reporter.intensity.15"
#> [16] "Reporter.intensity.16"
The 16 quantification columns indicate that the example data were acquired using a TMT-16 protocol. In fact, some runs were acquired using a TMT-11 protocol (11 labels), but we will come back to this later.
head(mqScpData[, quantCols])
#> Reporter.intensity.1 Reporter.intensity.2 Reporter.intensity.3
#> 1 61251 501.71 3731.3
#> 2 58648 1099.80 2837.7
#> 3 150350 3705.00 9361.0
#> 4 27347 405.90 1525.2
#> 5 84035 583.09 4092.3
#> 6 44895 700.23 2283.0
#> Reporter.intensity.4 Reporter.intensity.5 Reporter.intensity.6
#> 1 1643.30 871.84 981.87
#> 2 494.32 349.26 1030.50
#> 3 0.00 1945.40 1188.60
#> 4 0.00 0.00 318.74
#> 5 530.13 718.13 2204.50
#> 6 1109.60 0.00 675.79
#> Reporter.intensity.7 Reporter.intensity.8 Reporter.intensity.9
#> 1 1200.10 939.06 1457.50
#> 2 0.00 1214.10 800.58
#> 3 1574.00 2302.10 2176.10
#> 4 0.00 519.81 0.00
#> 5 960.51 453.77 1188.40
#> 6 0.00 809.38 668.88
#> Reporter.intensity.10 Reporter.intensity.11 Reporter.intensity.12
#> 1 1329.80 981.83 NA
#> 2 807.79 391.38 NA
#> 3 1399.50 1307.50 2192.4
#> 4 507.23 370.79 NA
#> 5 740.99 0.00 NA
#> 6 1467.50 901.38 NA
#> Reporter.intensity.13 Reporter.intensity.14 Reporter.intensity.15
#> 1 NA NA NA
#> 2 NA NA NA
#> 3 1791.4 1727.5 2157.3
#> 4 NA NA NA
#> 5 NA NA NA
#> 6 NA NA NA
#> Reporter.intensity.16
#> 1 NA
#> 2 NA
#> 3 1398
#> 4 NA
#> 5 NA
#> 6 NA
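The NA values in the later columns of the output above are consistent with runs that used fewer labels than the 16 available columns. The following is a minimal sketch, in base R, of how one could check which quantification columns are entirely missing within each run. It uses a small toy table with hypothetical values (not mqScpData) so that it runs without any package; the names toy and emptyByRun are introduced here for illustration only.

```r
# Toy stand-in for a PSM table (hypothetical values, not taken from mqScpData)
toy <- data.frame(
    Raw.file = c("runA", "runA", "runB", "runB"),
    Reporter.intensity.1 = c(10, 20, 30, 40),
    Reporter.intensity.2 = c(1, 0, 3, 4),
    Reporter.intensity.3 = c(NA, NA, 5, 6)
)

# Select the quantification columns; the dots are escaped so that they
# match literal dots only
quantCols <- grep("^Reporter\\.intensity\\.\\d+$", colnames(toy),
                  value = TRUE)

# Split the quantification columns by run, then flag, within each run,
# the columns that contain no observed value at all
emptyByRun <- sapply(split(toy[, quantCols], toy$Raw.file),
                     function(x) colSums(!is.na(x)) == 0)
emptyByRun
```

The result is a logical matrix with one row per quantification column and one column per run. Applied to a real table, a column flagged TRUE for a run would suggest that the corresponding label was not part of that run's protocol.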
Most columns in the mqScpData table contain information used or
generated during the identification of the MS spectra. For instance,
you may find the charge of the parent ion, the score and probability
of a correct match between the MS spectrum and a peptide sequence, the
sequence of the best matching peptide, its length, its modifications,
the retention time of the peptide on the LC, the protein(s) the peptide
originates from and much more.
head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
                   "Retention.time", "Proteins")])
#> Charge Score PEP Sequence Length Retention.time
#> 1 2 41.029 5.2636e-04 ATNFLAHEK 9 65.781
#> 2 2 44.349 5.8789e-04 ATNFLAHEK 9 63.787
#> 3 2 51.066 4.0315e-24 SHTILLVQPTK 11 71.884
#> 4 2 63.816 4.7622e-06 SHTILLVQPTK 11 68.633
#> 5 2 74.464 6.8709e-09 SHTILLVQPTK 11 71.946
#> 6 2 41.502 5.3705e-02 SLVIPEK 7 76.204
#> Proteins
#> 1 sp|P29692|EF1D_HUMAN
#> 2 sp|P29692|EF1D_HUMAN
#> 3 sp|P84090|ERH_HUMAN
#> 4 sp|P84090|ERH_HUMAN
#> 5 sp|P84090|ERH_HUMAN
#> 6 sp|P62269|RS18_HUMAN
The run annotation is related to the MS instrument. In MaxQuant, only the file name generated by the MS instrument is stored (here in the Raw.file column). There is one file for each MS run, hence the file name can be used as a batch identifier.
unique(mqScpData$Raw.file)
#> [1] "190321S_LCA10_X_FP97AG" "190222S_LCA9_X_FP94BM"
#> [3] "190914S_LCB3_X_16plex_Set_21" "190321S_LCA10_X_FP97_blank_01"
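To illustrate how a file name can act as a batch identifier, here is a minimal base R sketch that splits a table into one data.frame per run. The toy table and its values are hypothetical, and this is only an illustration of the "one container per run" idea, not a description of how scp stores the data internally.

```r
# Toy stand-in for a PSM table (hypothetical values)
toy <- data.frame(
    Raw.file = c("runA", "runB", "runA", "runC"),
    Sequence = c("PEPTIDEK", "PEPTIDEK", "SAMPLER", "SAMPLER"),
    Reporter.intensity.1 = c(100, 200, 300, 400)
)

# Number of PSMs recorded in each run
table(toy$Raw.file)

# One data.frame per run, using the file name as the batch identifier
perRun <- split(toy, toy$Raw.file)
names(perRun)
sapply(perRun, nrow)
```

Note that the same peptide sequence can appear in several runs, which is why keeping track of the links between related features across containers is useful.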
The sample table contains the experimental design generated by the researcher. Each row of the sample table corresponds to a sample in the experiment and each column to an available annotation about that sample. We will here use the second example table, sampleAnnotation:
data("sampleAnnotation")
head(sampleAnnotation)
#> Raw.file Channel SampleType lcbatch sortday digest
#> 1 190222S_LCA9_X_FP94BM Reporter.intensity.1 Carrier LCA9 s8 N
#> 2 190222S_LCA9_X_FP94BM Reporter.intensity.2 Reference LCA9 s8 N
#> 3 190222S_LCA9_X_FP94BM Reporter.intensity.3 Unused LCA9 s8 N
#> 4 190222S_LCA9_X_FP94BM Reporter.intensity.4 Monocyte LCA9 s8 N
#> 5 190222S_LCA9_X_FP94BM Reporter.intensity.5 Blank LCA9 s8 N
#> 6 190222S_LCA9_X_FP94BM Reporter.intensity.6 Monocyte LCA9 s8 N
This table may contain any information about the samples. For example,
useful information could be the type of sample that is analysed, a
phenotype known from the experimental design, the MS batch, the
acquisition date, MS settings used to acquire the sample, the LC
batch, the sample preparation batch, etc. However, scp requires 2
specific fields in the sample annotations. The first is a column that
tells scp the names of the columns in the feature data that hold the
quantification of the corresponding sample (Channel in this case). The
second is a column that contains the MS run names (Raw.file in this
case). It must have the same name as the column containing the MS run
names in the quantification table. These two columns allow scp to
correctly split and match data that were acquired across multiple
acquisition runs.
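Before loading the data, it can help to verify that the two tables agree on these two fields. The following base R sketch shows one possible set of checks on toy tables. All values and object names (inputTable, sampleTable, runCol, and the result variables) are hypothetical and chosen for illustration; they are not part of the scp API.

```r
# Toy stand-ins for the two tables (hypothetical values)
inputTable <- data.frame(
    Raw.file = c("runA", "runA", "runB"),
    Reporter.intensity.1 = c(10, 20, 30),
    Reporter.intensity.2 = c(1, 2, 3)
)
sampleTable <- data.frame(
    Raw.file = rep(c("runA", "runB"), each = 2),
    Channel = rep(c("Reporter.intensity.1", "Reporter.intensity.2"), 2),
    SampleType = c("Carrier", "Monocyte", "Carrier", "Blank")
)

# The run column must exist under the same name in both tables
runCol <- "Raw.file"
stopifnot(runCol %in% colnames(inputTable),
          runCol %in% colnames(sampleTable))

# Every annotated channel should be a column of the input table
missingChannels <- setdiff(sampleTable$Channel, colnames(inputTable))

# Runs present in one table but absent from the other
runsNotAnnotated <- setdiff(inputTable[[runCol]], sampleTable[[runCol]])
runsNotAcquired <- setdiff(sampleTable[[runCol]], inputTable[[runCol]])

# Each run/channel pair should describe exactly one sample
duplicatedSamples <- anyDuplicated(sampleTable[, c(runCol, "Channel")]) > 0
```

With consistent tables, the three setdiff() results are empty and duplicatedSamples is FALSE. Any non-empty result points to a run or channel that could not be matched between the two tables.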