Package: PSMatch
Authors: Laurent Gatto [aut, cre] (https://orcid.org/0000-0002-1520-2268), Johannes Rainer [aut] (https://orcid.org/0000-0002-6977-7147), Sebastian Gibb [aut] (https://orcid.org/0000-0001-7406-4443), Samuel Wieczorek [ctb], Thomas Burger [ctb]
Last modified: 2024-10-29 16:29:16.610597
Compiled: Tue Oct 29 22:15:22 2024

1 Introduction

This vignette is one among several illustrating how to use the PSMatch package, focusing on the modelling peptide-protein relations using adjacency matrices and connected componencts. For a general overview of the package, see the PSMatch package manual page (?PSMatch) and references therein.

2 Peptide-protein relation

Let’s start by loading and filter PSM data as illustrated in the Working with PSM data vignette.

library("PSMatch")
id <- msdata::ident(full.names = TRUE, pattern = "TMT") |>
PSM() |>
filterPsmDecoy() |>
filterPsmRank()
## Loading required namespace: mzR
## Removed 2896 decoy hits.
## Removed 155 PSMs with rank > 1.
id
## PSM with 2751 rows and 35 columns.
## names(35): sequence spectrumID ... subReplacementResidue subLocation

When identification data is stored as a table, the relation between peptides is typically encode in two columns, once containing the peptide sequences and the second the protein identifiers these peptides stem from. Below are the 10 first observations of our identification data table.

data.frame(id[1:10, c("sequence", "DatabaseAccess")])
##                             sequence DatabaseAccess
## 1                       RQCRTDFLNYLR        ECA2006
## 2             ESVALADQVTCVDWRNRKATKK        ECA1676
## 3             QRMARTSDKQQSIRFLERLCGR        ECA3009
## 4          DGGPAIYGHERVGRNAKKFKCLKFR        ECA1420
## 5             QRMARTSDKQQSIRFLERLCGR        ECA3009
## 6  CIDRARHVEVQIFGDGKGRVVALGERDCSLQRR        ECA2142
## 7  CIDRARHVEVQIFGDGKGRVVALGERDCSLQRR        ECA2142
## 8               VGRCRPIINYLASPGGTSER        ECA0331
## 9                QRLDEHCVGVGQNALLLGR        ECA3680
## 10  VDYQGKKVVIIGLGLTGLSCVDFFLARGVVPR        ECA3817

This information can however also be encoded as an adjacency matrix with peptides along the rows and proteins along the columns, and a 1 (or more generally a value > 0) indicating that a peptides belongs to the corresponding proteins. Such a matrix is created below for our identification data.

adj <- makeAdjacencyMatrix(id)
dim(adj)
## [1] 2357 1504
adj[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                                   ECA2006 ECA1676 ECA3009 ECA1420 ECA2142
## RQCRTDFLNYLR                            1       .       .       .       .
## ESVALADQVTCVDWRNRKATKK                  .       1       .       .       .
## QRMARTSDKQQSIRFLERLCGR                  .       .       2       .       .
## DGGPAIYGHERVGRNAKKFKCLKFR               .       .       .       1       .
## CIDRARHVEVQIFGDGKGRVVALGERDCSLQRR       .       .       .       .       2

This matrix models the relation between the 2357 peptides and the 1504 is our identification data. These numbers can be verified by checking the number of unique peptides sequences and database accession numbers.

length(unique(id$sequence))
## [1] 2357
length(unique(id$DatabaseAccess))
## [1] 1504

Some values are > 1 because some peptide sequences are observed more than oncce, for example carrying different modification or the same one at different sites. The adjacency matrix can be made binary by setting madeAdjacencyMatrix(id, binary = TRUE).

This large matrix is too large to be explored manually and is anyway not interesting on its own. Subsets of this matrix that define proteins defines by a set of peptides (whether shared or unique) is relevant. These are represented by subsets of this large matrix named connected component. We can easily compute all these connected components to produce the multiple smaller and relevant adjacency matrices.

cc <- ConnectedComponents(adj)
length(cc)
## [1] 1476
cc
## An instance of class ConnectedComponents 
##  Number of proteins: 1504 
##  Number of components: 1476 
##  Number of components [peptide x proteins]:
##   954[1 x 1] 7[1 x n] 501[n x 1] 14[n x n]

Among the 2357 and the 1504 proteins, we have 1476 connected components.

954 thereof, such as the one shown below, correspond to single proteins identified by a single peptide:

connectedComponents(cc, 1)
## 1 x 1 sparse Matrix of class "dgCMatrix"
##                        ECA0003
## KTLGAYDFSFGEGIYTHMKALR       1

7 thereof represent protein groups identified by a single shared peptide:

connectedComponents(cc, 527)
## 1 x 2 sparse Matrix of class "dgCMatrix"
##            ECA1637 ECA2914
## KEIILNKNEK       1       1

501 represent single proteins identified by multiple unique peptides:

connectedComponents(cc, 38)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                               ECA0130
## GKIECNLRFELDPSAQSALILNEKLAK         1
## ERLQSKLEDAQVQLENNRLEQELVLMAQR       1
## GNWGSAAWELRSVNQR                    1
## ILKKEEAVGRR                         1
## ERIRARLTR                           1

Finally, arguable those that warrant additional exploration are those that are composed of multiple peptides and multiple proteins. There are 14 thereof in this identification; here’s an example:

connectedComponents(cc, 920)
## 4 x 2 sparse Matrix of class "dgCMatrix"
##                    ECA2869 ECA4278
## QKTRCATRAFKANKGRAR       1       1
## IDFLRDPKGYLR             1       .
## RFKQKTR                  .       1
## LIRKQVVQPGYR             1       1

3 Visualising adjacency matrices

Let’s identify the connected components that have at least 3 peptides (i.e. rows in the adjacency matrix) and 3 proteins (i.e. columns in the adjacency matrix).

(i <- which(nrows(cc) > 2 & ncols(cc) > 2))
## [1] 1079 1082
dims(cc)[i, ]
##      nrow ncol
## [1,]    3    4
## [2,]    7    4

We will use the second adjacency matrix, with index 1082 to learn about the plotAdjacencyMatrix() function and explore how to inform our peptides filtering beyond the filterPsm*() functions.

cx <- connectedComponents(cc, 1082)
cx
## 7 x 4 sparse Matrix of class "dgCMatrix"
##                     ECA3406 ECA3415 ECA3389 ECA3399
## THPAERKPRRRKKR            1       1       .       .
## KPTARRRKRK                .       .       2       .
## PLAQGGQLNRLSAIRGLFR       1       1       .       .
## RRKRKPDSLKK               .       .       1       .
## KPRRRK                    1       1       .       .
## VVPVGLRALVWVQR            1       1       1       1
## KLKPRRR                   .       .       .       1

We can now visualise the the cx adjacency matrix with the plotAdjacencyMatrix() function. The nodes of the graph represent proteins and petides - by default, proteins are shown as blue squares and peptides as white circles. Edge connect peptides/circles to proteins/squares, indicating that a peptide belongs to a protein.

plotAdjacencyMatrix(cx)

We can immediately observe that peptide VVPVGLRALVWVQR is associated to all four proteins; it holds that protein group together, defines that connected component formed by these four proteins. If we were to drop that peptides, we would obtain two single proteins, ECA3399 (defined by KLKPRRR), ECA3398 (defined by RRKRKPDSLKK and KPTARRRKRK) and a protein group formed of ECA3415 and ECA3406 (defined by three shared peptides).

3.1 Colouring the graph nodes

To help with the interpretation of the graph and the potential benefits of additional manual peptide filtering, it is possible to customise the node colours. Protein and peptide node colours can be controlled with the protColors and pepColors arguments respectively. Let’s start with the former.

3.1.1 Colouring protein nodes

protColors can either be a numeric or a character. The default value is 0, which produces the figure above. Any value > 0 will lead to more proteins being highlighted using different colours. Internally, string distances between protein names are computed and define if proteins should be coded with the same colours (if they are separated by small distances, i.e. they have similar names) or different colours (large distance, dissimilar names).

By setting the argument to 1, we see that proteins starting with ECA33 and those starting with ECA34 are represented with different colours.

plotAdjacencyMatrix(cx, 1)