# 1 Introduction

Label-Free Quantitative mass spectrometry based workflows for differential expression (DE) analysis of proteins is often challenging due to peptide-specific effects and context-sensitive missingness of peptide intensities.

• msqrob2 provides peptide-based workflows that can assess for DE directly from peptide intensities and outperform summarisation methods which first aggregate MS1 peptide intensities to protein intensities before DE analysis. However, they are computationally expensive, often hard to understand for the non-specialised end-user, and they do not provide protein summaries, which are important for visualisation or downstream processing.

• msqrob2 therefore also proposes a novel summarisation strategy, which estimates MSqRob’s model parameters in a two-stage procedure circumventing the drawbacks of peptide-based workflows.

• the summarisation based workflow in msqrob2 maintains MSqRob’s superior performance, while providing useful protein expression summaries for plotting and downstream analysis. Summarising peptide to protein intensities considerably reduces the computational complexity, the memory footprint and the model complexity. Moreover, it renders the analysis framework to become modular, providing users the flexibility to develop workflows tailored towards specific applications.

In this vignette we will demonstrate how to perform msqrob’s summarisation based workflow starting from a Maxquant search on a subset of the cptac spike-in study.

Examples on our peptide-based workflows and on the analysis of more complex designs can be found on our companion website msqrob2Examples.

Technical details on our methods can be found in (Goeminne, Gevaert, and Clement 2016), (Goeminne et al. 2020) and (Sticker et al. 2020).

# 2 Background

This case-study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/$$\mu$$L Saccharomyces cerevisiae strain BY4741. Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/$$\mu$$L) and 6B (0.74 fmol UPS1 proteins/$$\mu$$L) [5]. We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016) [1]. Three replicates are available for each concentration.

# 3 Data

We first import the data from peptideRaws.txt file. This is the file containing your peptideRaw-level intensities. For a MaxQuant search [6], this peptideRaws.txt file can be found by default in the “path_to_raw_files/combined/txt/” folder from the MaxQuant output, with “path_to_raw_files” the folder where the raw files were saved. In this vignette, we use a MaxQuant peptideRaws file which is a subset of the cptac study. This data is available in the msdata package. To import the data we use the QFeatures package.

We generate the object peptideRawFile with the path to the peptideRaws.txt file. We find the columns that contain the expression data of the peptideRaws in the peptideRaws.txt file using the grep function and we search for the pattern Intensity\\., which will return the position of the columns that include Intensity. in their column name, which is the convention by MaxQuant.

library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
library(gridExtra)

peptidesFile <- msdata::quant(pattern = "cptac_a_b_peptides", full.names = TRUE)

table = peptidesFile, fnames = 1, ecol = ecols,
name = "peptideRaw", sep = "\t"
)

In the following code chunk, we can extract the spikein condition from the raw file name.

cond <- which(strsplit(colnames(pe)[[1]][1], split = "")[[1]] == "A") # find where condition is stored
colData(pe)$condition <- substr(colnames(pe), cond, cond) %>% unlist() %>% as.factor() We calculate how many non zero intensities we have per peptide and this will be useful for filtering. rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)

Peptides with zero intensities are missing peptides and should be represent with a NA value rather than 0.

pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA

## 3.1 Data exploration

We can inspect the missingness in our data with the plotNA() function provided with MSnbase. 45% of all peptide intensities are missing and for some peptides we do not even measure a signal in any sample. The missingness is similar across samples.

MSnbase::plotNA(assay(pe[["peptideRaw"]])) +
xlab("Peptide index (ordered by data completeness)")

# 4 Preprocessing

This section preforms standard preprocessing for the peptide data. This include log transformation, filtering and summarisation of the data.

## 4.1 Log transform the data

pe <- logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog")
limma::plotDensities(assay(pe[["peptideLog"]]))