Foreword

This document walks users through a typical pipeline for adding annotation information to spatial proteomics data. For a general practical introduction to pRoloc and spatial proteomics data analysis, readers are referred to the tutorial, available using vignette("pRoloc-tutorial", package = "pRoloc").

1 Introduction

Exploring protein annotations and defining sub-cellular localisation markers (i.e. known residents of a specific sub-cellular niche in a species, under a condition of interest) play important roles in the analysis of spatial proteomics data. The latter is essential for downstream supervised machine learning (ML) classification for protein localisation prediction (see vignette("pRoloc-tutorial", package = "pRoloc") and vignette("pRoloc-ml", package = "pRoloc") for information on available ML methods) and the former is interesting for initial biological interpretation through matching annotations to the data structure.

Robust protein-localisation prediction relies on markers that reflect the true sub-cellular diversity of the multivariate data. The validity of markers is generally assured by expert curation, which can be time-consuming and difficult owing to the limited number of marker proteins that exist in databases and elsewhere. The Gene Ontology (GO) database, and in particular the cellular component (CC) namespace, provides a good starting point for protein annotation and marker definition. Nevertheless, automatic extraction from databases, and in particular GO CC, is only a first step in sub-cellular localisation analysis and requires additional curation to counter unreliable annotation based on data that are inaccurate or out of context for the biological question under investigation.

To facilitate the above, we have developed an annotation retrieval and management system that provides a flexible framework for the exploration of sub-cellular proteomics data. We have also developed a method to correlate annotation information with the multivariate data space, in order to identify densely annotated regions and assess cluster tightness. Given a set of proteins that share some property, e.g. a specified GO term, k-means clustering is used to fit the data (testing k = 1:5). For each value of k tested, all pairwise Euclidean distances are calculated within each component and then normalised. The minimum mean normalised distance is extracted and used as a measure of cluster tightness. This is repeated for all protein/annotation sets. The sets are then ranked according to their minimum mean normalised distance and can be displayed and explored using the pRolocGUI package.
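The tightness score described above can be sketched in base R on simulated data. This is a minimal illustration and not the pRoloc implementation: the helper names norm_mean_dist and tightness are hypothetical, the two protein sets are simulated, and the normalisation (dividing the mean distance by the component size raised to the power p = 1/3) is an assumption made for this example.

```r
## Toy sketch of the cluster-tightness score (base R only, simulated data).
set.seed(1)

## Hypothetical helper: mean pairwise Euclidean distance within one
## component, divided by n^p. The choice of normalisation is an assumption.
norm_mean_dist <- function(x, p = 1/3) {
  n <- nrow(x)
  if (n < 2) return(NA_real_)  # a single point has no pairwise distance
  mean(as.vector(dist(x))) / n^p
}

## Hypothetical helper: fit k-means for k = 1..max_k, score every
## component, and return the minimum score over all k and components.
tightness <- function(x, max_k = 5) {
  ks <- seq_len(min(max_k, nrow(x) - 1))
  scores <- unlist(lapply(ks, function(k) {
    cl <- kmeans(x, centers = k, nstart = 10)$cluster
    vapply(split(seq_len(nrow(x)), cl),
           function(idx) norm_mean_dist(x[idx, , drop = FALSE]),
           numeric(1))
  }))
  min(scores, na.rm = TRUE)
}

## Two simulated annotation sets of 20 proteins each, in a 2-D data space:
## one tightly clustered, one spread across the space.
sets <- list(
  tight_term   = matrix(rnorm(40, sd = 0.05), ncol = 2),
  diffuse_term = matrix(runif(40, min = -3, max = 3), ncol = 2)
)

## Score every set, then rank from tightest to most diffuse.
scores <- vapply(sets, tightness, numeric(1))
ranked <- sort(scores)
print(ranked)
```

Sets with the smallest score come first in ranked, so annotation terms whose proteins sit close together in the data space are listed before terms whose proteins are scattered.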

In this vignette we present a step-by-step guide showing users how to (1) add protein annotations, using the GO database as an example, and (2) rank and order information (e.g. GO terms) according to its correlation with the data structure, for the extraction of optimal data-specific annotated clusters.

2 Loading the data

We will demonstrate our pipeline for adding and ranking annotation information using a LOPIT experiment on pluripotent mouse embryonic stem cells (Christoforou et al. 2016), available and documented in the pRolocdata data package as hyperLOPIT2015.

library("pRoloc")
library("pRolocdata")

## For this example, subset the data to keep only the marker proteins
data("hyperLOPIT2015")
hyperLOPIT2015 <- markerMSnSet(hyperLOPIT2015)

3 Adding sub-cellular localisation information

All GO terms associated with proteins that appear in the dataset are retrieved and used to create a binary matrix where a 1 (0) at position \((i,j)\) indicates that term \(j\) has (not) been used to annotate feature \(i\). This matrix is appended to and stored in the feature data slot of the MSnSet dataset using the addGoAnnotations function. First, however, we need to prepare the annotation parameters that will enable us to query the Biomart repository using the biomaRt package, from which we are able to retrieve GO terms. The specific Biomart repository and query depend on the species under study and the type of features. These can be set using the setAnnotationParams function.
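To make the structure of this binary matrix concrete, the following base R sketch builds one from a small, hypothetical protein-to-term mapping. The accessions P1 to P4 and their annotations are toy values, and the data frame fd is only a stand-in for the feature data of an MSnSet. The sketch does not query Biomart and is not how addGoAnnotations is implemented.

```r
## Hypothetical mapping from features (proteins) to GO CC terms.
## GO:0005739 = mitochondrion, GO:0005743 = mitochondrial inner membrane,
## GO:0005634 = nucleus. P3 has no annotation at all.
annot <- list(
  P1 = c("GO:0005739", "GO:0005743"),
  P2 = c("GO:0005739"),
  P3 = character(0),
  P4 = c("GO:0005634")
)

## All terms seen in the dataset become the columns of the matrix.
terms <- sort(unique(unlist(annot)))

## Entry (i, j) is 1 if term j annotates feature i, and 0 otherwise.
go_mat <- t(vapply(annot,
                   function(x) as.integer(terms %in% x),
                   integer(length(terms))))
colnames(go_mat) <- terms
print(go_mat)

## Stand-in for the feature data: the whole matrix is stored as one
## column, here named GOAnnotations after the default described in the text.
fd <- data.frame(markers = c("Mitochondrion", "Mitochondrion",
                             "unknown", "Nucleus"),
                 row.names = names(annot))
fd$GOAnnotations <- go_mat
```

Row sums of go_mat give the number of terms per protein and column sums give the number of proteins per term, which is a quick way to spot unannotated features such as P3.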

In the code chunk below we set the annotation parameters for the hyperLOPIT2015 dataset. As the species used was mouse and the featureNames of the hyperLOPIT2015 dataset are UniProt accession numbers, the input to the function is defined as inputs = c("Mus musculus", "UniProtKB/Swiss-Prot ID"). See ?setAnnotationParams for details.

params <- setAnnotationParams(inputs = c("Mus musculus", 
                                         "UniProtKB/Swiss-Prot ID"))
## Using species Mus musculus genes (GRCm38.p4)
## Using feature type UniProtKB/Swiss-Prot ID(s) [e.g. A0A0A6YXX9]
## Connecting to Biomart...

Now that the parameters for the search have been defined, we can use the addGoAnnotations function to add a GO information matrix to the featureData slot of the dataset. The addGoAnnotations function takes an MSnSet instance as input (from which the featureNames are extracted) and downloads the CC terms found for each protein in the dataset. CC is the default namespace; the biological process and molecular function namespaces are also supported. The output MSnSet has the CC term binary matrix appended to the fData. By default this matrix is called GOAnnotations; the name can be changed using the fcol argument.

cc <- addGoAnnotations(hyperLOPIT2015, params, 
                       namespace = "cellular_component")
fvarLabels(cc)
##  [1] "entry.name"                "protein.description"      
##  [3] "peptides.rep1"             "peptides.rep2"            
##  [5] "psms.rep1"                 "psms.rep2"                
##  [7] "phenodisco.input"          "phenodisco.output"        
##  [9] "curated.phenodisco.output" "markers"                  
## [11] "svm.classification"        "svm.score"                
## [13] "svm.top.quartile"          "final.assignment"         
## [15] "first.evidence"            "curated.organelles"       
## [17] "cytoskeletal.components"   "trafficking.proteins"     
## [19] "protein.complexes"         "signalling.cascades"      
## [21] "oct4.interactome"          "nanog.interactome"        
## [23] "sox2.interactome"          "cell.surface.proteins"    
## [25] "markers2015"