1 Introduction

For a general practical introduction to pRoloc, readers are referred to the tutorial, available using vignette("pRoloc-tutorial", package = "pRoloc"). This document provides an overview of the algorithms available in the package. The respective sections describe unsupervised machine learning (USML), supervised machine learning (SML), semi-supervised machine learning (SSML) as implemented in the novelty detection algorithm, and transfer learning.

2 Data sets

We provide 62 test data sets in the pRolocdata package that can be readily used with pRoloc. The data sets can be listed with pRolocdata() and loaded with the data function. Each data set, including its origin, is individually documented.

The data sets are distributed as MSnSet instances. Briefly, these are dedicated containers for quantitation data as well as feature and sample meta-data. More details about MSnSets are available in the pRoloc tutorial and in the MSnbase package, which defines the class.
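
As a minimal sketch of the workflow described above, the snippet below lists the available data sets, loads one of them and inspects its three components. It assumes that the pRolocdata and MSnbase packages are installed from Bioconductor. The choice of tan2009r1 is an assumption made for illustration: the text does not name a data set, although this one appears to match the summary printed below (888 features, 4 samples).

```r
## Assumes MSnbase and pRolocdata are installed from Bioconductor
library("MSnbase")
library("pRolocdata")

## List the data sets shipped with the package
pRolocdata()

## Load one data set by name (tan2009r1 is an illustrative choice)
data(tan2009r1)

## Printing the object displays a summary of the MSnSet
tan2009r1

## Accessors for the three parts of the container
head(exprs(tan2009r1))         ## quantitation data (features x samples)
head(fData(tan2009r1)[, 1:3])  ## feature meta-data (first three columns)
pData(tan2009r1)               ## sample meta-data
```

Each data set has its own manual page, so ?tan2009r1 gives the origin of this particular experiment.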

## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: X114 X115 X116 X117
##   varLabels: Fractions
##   varMetadata: labelDescription
## featureData
##   featureNames: P20353 P53501 ... P07909 (888 total)
##   fvarLabels: FBgn Protein.ID ... markers.tl (16 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 19317464 
## Annotation:  
## - - - Processing information - - -
## Added markers from  'mrk' marker vector. Thu Jul 16 22:53:44 2015 
##  MSnbase version: 1.17.12

Other omics data

While our primary biological domain is quantitative proteomics, with special emphasis on spatial proteomics, the underlying class infrastructure on which pRoloc relies, implemented in the Bioconductor MSnbase package, enables conversion from/to transcriptomics data, in particular microarray data available as ExpressionSet objects, using the as coercion methods (see the MSnSet section in the MSnbase-development vignette). As a result, it is straightforward to apply the methods summarised here and detailed in the other pRoloc vignettes to these other data structures.
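
The coercion mentioned above can be sketched as follows. This is a minimal example that assumes MSnbase and pRolocdata are installed; the tan2009r1 data set and the variable names eset and msn are illustrative choices, not taken from the text.

```r
## Assumes MSnbase and pRolocdata are installed from Bioconductor
library("MSnbase")
library("pRolocdata")
data(tan2009r1)  ## illustrative data set

## MSnSet -> ExpressionSet
eset <- as(tan2009r1, "ExpressionSet")
class(eset)

## ExpressionSet -> MSnSet, e.g. for microarray data to be used with pRoloc
msn <- as(eset, "MSnSet")
class(msn)
```

The quantitation matrix keeps its dimensions across the round trip; slots that exist only in one of the two classes, such as the MSnSet processing log, should not be expected to survive unchanged.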

3 Unsupervised machine learning

Unsupervised machine learning refers to clustering, i.e. finding structure in a quantitative, generally multi-dimensional data set of unlabelled data.

Currently, unsupervised clustering facilities are available through the plot2D function and the MLInterfaces package (Carey et al., n.d.). The former takes an MSnSet instance and represents the data on a scatter plot along the first two principal components. Arbitrary feature meta-data can be represented using different colours and point characters. The reader is referred to the manual page available through ?plot2D for more details and examples.
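
A short illustration of plot2D, assuming the pRoloc and pRolocdata packages are installed. The tan2009r1 data set and the use of its "markers" feature variable for colouring are assumptions made for this sketch; any other feature variable name could be passed to fcol.

```r
## Assumes pRoloc and pRolocdata are installed from Bioconductor
library("pRoloc")
library("pRolocdata")
data(tan2009r1)  ## illustrative data set

## Scatter plot along the first two principal components,
## coloured according to the "markers" feature variable
pcs <- plot2D(tan2009r1, fcol = "markers")
addLegend(tan2009r1, fcol = "markers", where = "topleft", cex = 0.7)

## plot2D invisibly returns the plotted coordinates
head(pcs)
```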

pRoloc also implements an MLearn method for MSnSet instances, which makes the MLInterfaces infrastructure usable within the organelle proteomics framework. Although MLearn provides a common interface to unsupervised and numerous supervised algorithms, we refer to the pRoloc tutorial for its application to several clustering algorithms.
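
For orientation, a k-means clustering through this interface might look like the sketch below. It assumes that pRoloc, pRolocdata and MLInterfaces are installed; the data set, the number of centres (5) and the random seed are arbitrary choices made for illustration, and the pRoloc tutorial remains the reference for actual usage.

```r
## Assumes MLInterfaces, pRoloc and pRolocdata are installed from Bioconductor
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
data(tan2009r1)  ## illustrative data set

## k-means depends on random starting centres; fix the seed for reproducibility
set.seed(1)

## Cluster all features using the k-means schema from MLInterfaces;
## centers = 5 is an arbitrary choice for this sketch
kcl <- MLearn(~ ., tan2009r1, kmeansI, centers = 5)
kcl

## Visualise the clustering result
plot(kcl, exprs(tan2009r1))
```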

Note: Current development efforts in terms of clustering are described on the Clustering infrastructure wiki page (https://github.com/lgatto/pRoloc/wiki/Clustering-infrastructure) and will be incorporated in a future version of the package.

4 Supervised machine learning

Supervised machine learning refers to a broad family of classification algorithms. The algorithms learn from a modest set of labell