Contents

1 Introduction

This tutorial requires the MSnbase (Gatto and Lilley (2011)), MSqRob (Goeminne et al. (2016)), limma (Ritchie et al. (2015)) and msdata packages, as well as some of the tidyverse packages for data tidying and visualisation.

library("MSnbase")
library("MSqRob")
library("msdata")
library("limma")
library("tidyverse")

Sections 2 to 4 deal with raw mass spectrometry and identification data stored in one of the open community-developed data formats (see below). Section 5 describes how to produce or import quantitative data and the sections thereafter focus on the processing and analysis of differentially expressed proteins.

Data files in mass spectrometry

Most community-driven formats described in the table are supported in R. We will see how to read and access these data in the following sections.

Type Format Package
raw mzML, mzXML, netCDF, mzData MSnbase (read and write in version >= 2.3.13) via mzR
identification mzIdentML mzID (read) and MSnbase (read, via mzR)
quantitation mzQuantML
peak lists mgf MSnbase (read)
quantitation (with id) mzTab, spreadsheets MSnbase (read, read/write)
protein db fasta Biostrings

In the next sections, we will use raw data (in mzML and mzXML format), identification data (in mzIdentML format) and quantitative data (from a tab-separated spreadsheet).

2 Raw MS data

Raw data files (in any of the above formats) is read into R using readMSData function from the MSnbase package, that will return an list-like object of class MSnExp. Below, we first extract the full path to the MS3TMT11.mzML file from the msdata package1 The proteomics, ident and quant msdata functions return example files for raw, identification and quantitative data respectively. before reading it in.

basename(fl3 <- msdata::proteomics(full.name = TRUE, pattern = "MS3TMT11"))
## [1] "MS3TMT11.mzML"
(rw3 <- readMSData(fl3, mode = "onDisk"))
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.53 Mb
## - - - Spectra data - - -
##  MS level(s): 1 2 3 
##  Number of spectra: 994 
##  MSn retention times: 45:27 - 47:6 minutes
## - - - Processing information - - -
## Data loaded [Thu Jul 25 07:51:22 2019] 
##  MSnbase version: 2.10.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: MS3TMT11.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   MS3TMT11.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
##   fvarLabels: fileIdx spIdx ... spectrum (33 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

Note: above, we specify to use the on disk mode (as opposed to in memory) to avoid loading the whole raw data file(s) into memory. With on disk, the raw data will be accessed on demand. Here, we only read a single MS file into R, but with on disk mode, we could load 100s thereof.

An MSnExp object contains

Exercise: Extract the first, second and tenth spectrum. What are their MS levels, precursor m/z and retention times?

rw3[[1]]
## Object of class "Spectrum1"
##  Retention time: 45:27 
##  MSn level: 1 
##  Total ion count: 10768 
##  Polarity: 1

We can also access specific data for the whole experiment using accessors:

See ?MSnExp for details.

Exercise: Using msLevel, extract the MS level of the 994 spectra of the rw3 file. What levels are available in these data. Using centroided, check what spectra and centroided of in profile mode (see figure below). Which are the levels that are centroided or in profile mode.

Peak picking. Raw profile mode data (left) as measured by the mass spectrometre, and processed, centroided data (right). See also this [`MSnbase` vignette](http://lgatto.github.io/MSnbase/articles/v03-MSnbase-centroiding.html).

Figure 1: Peak picking
Raw profile mode data (left) as measured by the mass spectrometre, and processed, centroided data (right). See also this MSnbase vignette.

table(msLevel(rw3), centroided(rw3))
##    
##     FALSE TRUE
##   1    30    0
##   2     0  482
##   3     0  482

3 Identification results

Identification data in mzIdentML is parsed and loaded as a data.frame using the readMzIdData function.

basename(idf <- msdata::ident(full.name = TRUE))
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
iddf <- readMzIdData(idf)

Among the identification variables available, you will find MS.GF.RawScore, the identification score computated by MSGF+ (the search engine used for the search), and isDecoy.

names(iddf)

Exercise: How many PSMs from the read and decoy database are recoded in the identification data? Calculate summary statistics for MSGF+ raw scores.

table(iddf$isDecoy)
## 
## FALSE  TRUE 
##  2906  2896
summary(iddf$MS.GF.RawScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -141.00  -31.00  -13.00  -14.24    3.00  214.00

Exercise: Reproduce and interpret the plot showing the identification score distribution for the decoy and real peptides in this file.

ggplot(iddf, aes(x = MS.GF.RawScore, colour = isDecoy)) +
    geom_density()

4 Combining raw and identification data

When working with raw and identification data, these two can be merged by adding the identification results to the raw feature meta-data slot. Below, we use small data that is shipped with the MSnbase package. The raw data being very small, we can afford to read it into memory without specifying mode = "onDisk".

basename(quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzXML$"))
## [1] "dummyiTRAQ.mzXML"
basename(identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                          full.name = TRUE, pattern = "dummyiTRAQ.mzid"))
## [1] "dummyiTRAQ.mzid"
msexp <- readMSData(quantFile)
fvarLabels(msexp)
## [1] "spectrum"

Add this stage, there’s only one spectrum feature variable with the spectra indices. Below we add the identification data with addIdentificationData to gain 33 new feature variables.

msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)
##  [1] "spectrum"                 "acquisition.number"      
##  [3] "sequence"                 "chargeState"             
##  [5] "rank"                     "passThreshold"           
##  [7] "experimentalMassToCharge" "calculatedMassToCharge"  
##  [9] "modNum"                   "isDecoy"                 
## [11] "post"                     "pre"                     
## [13] "start"                    "end"                     
## [15] "DatabaseAccess"           "DBseqLength"             
## [17] "DatabaseSeq"              "DatabaseDescription"     
## [19] "scan.number.s."           "idFile"                  
## [21] "MS.GF.RawScore"           "MS.GF.DeNovoScore"       
## [23] "MS.GF.SpecEValue"         "MS.GF.EValue"            
## [25] "modName"                  "modMass"                 
## [27] "modLocation"              "subOriginalResidue"      
## [29] "subReplacementResidue"    "subLocation"             
## [31] "nprot"                    "npep.prot"               
## [33] "npsm.prot"                "npsm.pep"

Note that prior to addition, the identification data is filtered as documented in the filterIdentificationDataFrame function:

  1. only PSMs matching the regular (non-decoy) database are retained;
  2. PSMs of rank greater than 1 are discarded;
  3. only proteotypic peptides are kept.

5 Quantitative data

Quantitative data is stored a objects of class MSnSet, that are composed of an expression matrix (accessed with exprs()), a feature meta-data dataframe (accessed with fData()) and a meta-data dataframe (accessed with pData()).

The MSnSet structure

Figure 2: The MSnSet structure

MSnSet objects can be:

  1. Created from an MSnExp object with the quantify function.
  2. Created from a spreadsheet produced by a third party software using the readMSnSet2 function.

5.1 Quantifying raw data

data(itraqdata)
itraqdata
## MSn experiment data ("MSnExp")
## Object size in memory: 1.9 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 55 
##  MSn retention times: 19:9 - 50:18 minutes
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
##  MSnbase version: 1.1.22 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1
##   varLabels: sampleNames sampleNumbers
##   varMetadata: labelDescription
## Loaded from:
##   dummyiTRAQ.mzXML 
## protocolData: none
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ProteinDescription
##     PeptideSequence
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msnset <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
msnset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 55 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
##   varLabels: mz reporters
##   varMetadata: labelDescription
## featureData
##   featureNames: X1 X10 ... X9 (55 total)
##   fvarLabels: spectrum ProteinAccession ... collision.energy (15
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: No annotation 
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011 
## Updated from version 0.3.0 to 0.3.1 [Fri Jul  8 20:23:25 2016] 
## iTRAQ4 quantification by trapezoidation: Thu Jul 25 07:51:26 2019 
##  MSnbase version: 1.1.22

Below, we plot the first MS2 spectrum from the itraqdata test data, highlighting the four iTRAQ reporter ions and extract the quantiation values and feature metadata of that same first record.

plot(itraqdata[[1]], reporters = iTRAQ4, full = TRUE)