Modified: 2017-06-15 07:02:28
Compiled: Thu Jun 15 07:02:43 2017
Abstract In this workflow, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The workflow is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.
This material available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
CSAMA 2017 Some part of this workshop require online access. The most bandwidth expensive operations are files downloads. Please copy the files from the local network and place them in your working directory1 If you are not sure what your working directory is, type
getwd()
or ask a helper.. This will stop them from being downloaded again.
Before we start:
If you identify typos, if there are parts that you would like to see expended or clarified, please let me know by telling me directly (during workshops), opening a github issue or by emailing me directly. Please do also briefly specify your background/familiarity with mass spectrometry and/or proteomics (beginner, intermediate or expert) so that I can update accordingly.
In recent years, there we have seen an increase in the number of packages to analyse mass spectrometry and proteomics data for R and Bioconductor, as well as an increase in total number of downloads.
It is also good to highlight that several of these package have become a group efforts, supported by several developers in the community. This post illustrates the various contributions to MSnbase; r Biocpkg("mzR")
has benefited by a similar wide range of successful contributions. Both packages, and in particular mzR, are used by many others, and will be described in some detail in this workflow.
This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:
Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.
This workflow provides a general introduction to Bioconductor software for mass spectrometry and proteomics. If you are interested in
vignette("pRoloc-tutorial", package = "pRoloc")
or online.vignette("msnid_vignette", package = "MSnID")
or online. In addition, the vignettes of the msmsTest package describe how to analyse spectral counting data using packages dedicated for the analysis of high throughput sequencing data.vignette("MALDIquant-intro", package = "MALDIquant")
and available online.vignette("Cardinal-walkthrough", package = "Cardinal")
and online.CSAMA 2017 This section is not relevant, as you should already have installed all required packages.
The follow packages will be used throughout this documents. R version 3.3.1
or higher is required to install all the packages using BiocInstaller::biocLite
.
library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
The most convenient way to install all the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.
library("BiocInstaller")
biocLite("RforProteomics", dependencies = TRUE)
Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.
In Bioconductor version 3.6, there are respectively 92 proteomics, 62 mass spectrometry software packages and 17 mass spectrometry experiment packages. These respective packages can be extracted with the proteomicsPackages()
, massSpectrometryPackages()
and massSpectrometryDataPackages()
and explored interactively, or looked at by exploring the respective biocViews
on the Bioconductor web page.
library("RforProteomics")
pp <- proteomicsPackages()
display(pp)
Most community-driven formats described in the table are supported in R
. We will see how to read and access these data in the following sections.
Type | Format | Package |
---|---|---|
raw | mzML, mzXML, netCDF, mzData | mzR (read) |
identification | mzIdentML | mzR (read) and mzID (read) |
quantitation | mzQuantML | |
peak lists | mgf | MSnbase (read) |
other | mzTab | MSnbase (read) |
MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRoteomics IDEntifications (PRIDE) database at the EBI for mass spectrometry-based experiments (including quantitative data, as opposed as the name suggests), PASSEL at the ISB for Selected Reaction Monitoring (SRM, i.e. targeted) data and the MassIVE resource. These data can be downloaded within R using the rpx package.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD004192 2017-06-14 12:23:35 New
## 2 PXD004917 2017-06-14 12:18:17 New
## 3 PXD004950 2017-06-14 12:09:15 New
## 4 PXD004972 2017-06-14 12:04:02 New
## 5 PXD004979 2017-06-14 11:58:10 New
## 6 PXD005725 2017-06-14 11:55:09 New
## 7 PXD006182 2017-06-14 11:38:47 New
## 8 PXD003423 2017-06-14 11:19:48 New
## 9 PXD005730 2017-06-14 09:38:08 New
## 10 PXD004027 2017-06-14 07:26:48 New
## 11 PXD005035 2017-06-14 07:05:21 New
## 12 PXD003983 2017-06-13 15:32:39 Updated information
## 13 PXD004096 2017-06-13 15:16:20 Updated information
## 14 PXD005976 2017-06-13 15:06:09 Updated information
## 15 PXD006232 2017-06-13 15:05:36 Updated information
Using the unique PXD000001
identifier, we can retrieve the relevant metadata that will be stored in a PXDataset
object. The names of the files available in this data can be retrieved with the pxfiles
accessor function.
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 12 files
## [1] 'F063721.dat' ... [12] 'generated'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
## [12] "generated"
Other metadata for the px
data set:
pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2014 Jan;1844(1 Pt A):42-51. Review"
Data files can then be downloaded with the pxget
function. Below, we retrieve the raw data file. The file is downloaded3 If the file is already available, it is not downloaded a second time. in the working directory and the name of the file is return by the function and stored in the mzf
variable for later use.
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Downloading 1 file
## /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML already present.
mzf
## [1] "/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
CSAMA 2017 Executing this section will cause raw data file to be downloaded and stored as part of the AnnotationHub’s local resource. The
mzML
file that is available on the local network is identical to the one accessed here.
AnnotationHub is a cloud resource set up and managed by the Bioconductor project that serves various omics datasets. It is possible to contribute and access (albeit currently only a limited number of) proteomics data.
library("AnnotationHub")
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
ah <- AnnotationHub()
## snapshotDate(): 2017-06-08
query(ah, "proteomics")
## AnnotationHub with 4 records
## # snapshotDate(): 2017-06-08
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH49006"]]'
##
## title
## AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
## AH49007 | PXD000001: Peptide-level quantitation data
## AH49008 | PXD000001: raw mass spectrometry data
## AH49009 | PXD000001: MS-GF+ identiciation data
ms <- ah[["AH49008"]]
ms
## Mass Spectrometry file handle.
## Filename: 55314
## Number of scans: 7534
The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The file name, 55314, is not very descriptive because the data originates from the AnnotationHub
cloud repository. If the data was read from a local file, is would be named as the mzML
(or mzXML
) file (see below).
The mzR
package provides an interface to the proteowizard C/C++ code base to access various raw data files, such as mzML
, mzXML
, netCDF
, and mzData
. The data is accessed on-disk, i.e it is not loaded entirely in memory, and only when explicitly requested. The three main functions are openMSfile
to create a file handle to a raw data file, header
to extract metadata about the spectra contained in the file and peaks
to extract one or multiple spectra of interest. Other functions such as instrumentInfo
, or runInfo
can be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.
library("mzR")
basename(mzf)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
The object loaded from AnnotationHub in the previous section is of the same type, and was also created by the openMSfile
function. All operations below can equally be applied to it.
The header
function returns the metadata of all available peaks:
hd <- header(ms)
dim(hd)
## [1] 7534 22
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
We can extract metadata and scan data for scan 1000 as follows:
hd[1000, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000 1000 1000 2 1 274 1048554
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000 1106.916 136.061 164464 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 1000 0 104.5467 1370.758 992 683.0817
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000 2 689443.7 0 0
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000 0 0 0.05521463
head(peaks(ms, 1000))
## [,1] [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h", xlab = "M/Z", ylab = "Intensity")
Exercise Identify the index of the MS2 spectrum with the highest precursor intensity (see the
precursorIntensity
feature variable) and plot it as illustrated above.
i <- which.max(hd$precursorIntensity)
sp <- spectra(ms, i)
plot(sp, type = "h", xlab = "M/Z", ylab = "Intensity")
Exercise Are the MS1 and MS2 spectra in profile mode or centroided? Choose any spectra and plot the region around a peak to find out.
par(mfrow = c(1, 2))
bpmz <- hd[i, "basePeakMZ"]
plot(sp, type = "h", xlim = c(bpmz - 0.5, bpmz + 0.5),
xlab = "M/Z", ylab = "Intensity",
main = "Centroided")
## Spectrum 100 is MS1
sp1 <- peaks(ms, 100)
plot(sp1, type = "l", xlim = c(536,540),
xlab = "M/Z", ylab = "Intensity",
main = "Profile mode")
The RforProteomics
package distributes a small identification result file that we load and parse using infrastructure from the mzR package.
f <- dir(system.file("extdata", package = "RforProteomics"),
pattern = "mzid", full.names=TRUE)
basename(f)
## [1] "TMT_Erwinia.mzid.gz"
Along the lines of what is available for raw data, it provides support fasta parsing mzIdentML
files with the openIDfile
function. As for raw data, the underlying C/C++ code comes from the proteowizard.
library("mzR")
id1 <- openIDfile(f)
id1
## Identification file handle.
## Filename: TMT_Erwinia.mzid.gz
## Number of psms: 5655
Various data can be extracted from the identification object. The peptide spectrum matches (PSMs) and the identification scores can be accessed as a data.frame with psms
and score
respectively.
softwareInfo(id1)
## [1] "MS-GF+ Beta (v10072) "
## [2] "ProteoWizard MzIdentML 3.0.501 ProteoWizard"
enzymes(id1)
## name nTermGain cTermGain minDistance missedCleavages
## 1 Trypsin 0 1000
fid1 <- mzR::psms(id1)
head(fid1)
## spectrumID chargeState rank passThreshold experimentalMassToCharge
## 1 scan=5782 3 1 TRUE 1080.2325
## 2 scan=6037 3 1 TRUE 1002.2089
## 3 scan=5235 3 1 TRUE 1189.2836
## 4 scan=5397 3 1 TRUE 960.5365
## 5 scan=6075 3 1 TRUE 1264.3409
## calculatedMassToCharge sequence modNum
## 1 1080.2321 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR 0
## 2 1002.2115 TQVLDGLINANDIEVPVALIDGEIDVLR 0
## 3 1189.2800 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR 0
## 4 960.5365 SQILQQAGTSVLSQANQVPQTVLSLLR 0
## 5 1264.3419 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR 0
## isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
## 1 FALSE S R 50 84 ECA1932 155
## 2 FALSE R K 288 315 ECA1147 434
## 3 FALSE A R 192 224 ECA0013 295
## 4 FALSE - R 264 290 ECA1731 290
## 5 FALSE F R 119 153 ECA1443 298
## DatabaseDescription acquisitionNum
## 1 ECA1932 outer membrane lipoprotein 5782
## 2 ECA1147 trigger factor 6037
## 3 ECA0013 ribose-binding periplasmic protein 5235
## 4 ECA1731 flagellin 5397
## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
## [ reached getOption("max.print") -- omitted 1 row ]
sc1 <- mzR::score(id1)
head(sc1)
## spectrumID MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue
## 1 scan=5782 147 174 3.764831e-27
## 2 scan=6037 214 245 6.902626e-26
## 3 scan=5235 211 264 1.778789e-25
## 4 scan=5397 154 178 1.792541e-24
## 5 scan=6075 188 252 1.510364e-23
## 6 scan=5761 123 138 1.618941e-23
## MS.GF.EValue
## 1 5.430080e-21
## 2 9.943751e-20
## 3 2.564787e-19
## 4 2.581753e-18
## 5 2.178423e-17
## 6 2.329453e-17
The mzID package, has similar functionality to parse identification files, and was the first one to provide such capabilities in R. The main difference with mzR is that is parses the files using the XMLpackage and reads the whole data into memory rather than relying on proteowizard, and is slower.
Exercise Data wrangling with identification data; the standard tidyverse tools are fit for purpose here. Extract and combine the PSMs and their scores as described above and combine them. From the available data, calculate the length of each peptide (you can use
nchar
with the peptide sequencesequence
) and the number of peptides for each protein (DatabaseDescription
). Plot the length of the proteins for their respective number of peptides. Optionally, stratify the plot by the peptide e-value score (MS.GF.EValue
) using for examplecut
to define bins.
suppressPackageStartupMessages(library("dplyr"))
fid1$peplen <- nchar(as.character(fid1$sequence))
ids <- as_tibble(full_join(fid1, sc1))
## Joining, by = "spectrumID"
npeps <- ids %>% group_by(DatabaseDescription) %>% tally
ids <- full_join(ids, npeps)
## Joining, by = "DatabaseDescription"
library("ggplot2")
ggplot(ids, aes(x = n, y = DBseqLength)) + geom_point()
ids$evalBins <- cut(ids$MS.GF.EValue, summary(ids$MS.GF.EValue))
ggplot(ids, aes(x = n, y = DBseqLength, color = peplen)) +
geom_point() +
facet_wrap(~ evalBins)
While searches are generally performed using third-party software independently of R or can be started from R using a system
call, the MSGFplus package enables to perform a search using the MSGF+ engine, as illustrated below.
We search the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML file against the fasta file from PXD000001
using MSGFplus
.
We first download the fasta files from ProteomeXchange
fas <- pxget(px, "erwinia_carotovora.fasta")
## Downloading 1 file
basename(fas)
## [1] "erwinia_carotovora.fasta"
Below, we setup and run the search4 In the runMSGF
call, the memory allocated to the java virtual machine is limited to 1GB. In general, there is no need to specify this argument, unless you experience an error regarding the maximum heap size..
library("MSGFplus")
msgfpar <- msgfPar(database = fas,
instrument = 'HighRes',
tda = TRUE,
enzyme = 'Trypsin',
protocol = 'iTRAQ')
idres <- runMSGF(msgfpar, mzf, memory=1000)
## "/usr/bin/java" -Xmx1000M -jar '/home/lg390/R/x86_64-pc-linux-gnu-library/3.4/MSGFplus/MSGFPlus/MSGFPlus.jar' -s '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML' -o '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid' -d '/home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta' -tda 1 -inst 1 -e 1 -protocol 2
##
## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid... DONE!
idres
## An mzID object
##
## Software used: MS-GF+ (version: Beta (v10072))
##
## Rawfile: /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
##
## Database: /home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta
##
## Number of scans: 5343
## Number of PSM's: 5656
## identification file (needed below)
basename(mzID::files(idres)$id)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
A graphical interface to perform the search the data and explore the results is also available:
library("MSGFgui")
MSGFgui()
The rTANDEM package can be used to perform a search with XTandem software.
The MSnID package can be used for post-search filtering of MS/MS identifications. One starts with the construction of an MSnID
object that is populated with identification results that can be imported from a data.frame
or from mzIdenML
files. Here, we will use the example identification data provided with the package.
mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
basename(mzids)
## [1] "c_elegans.mzid.gz"
We start by loading the package, initialising the MSnID
object, and add the identification result from our mzid
file (there could of course be more that one).
library("MSnID")
msnid <- MSnID(".")
## Note, the anticipated/suggested columns in the
## peptide-to-spectrum matching results are:
## -----------------------------------------------
## accession
## calculatedMassToCharge
## chargeState
## experimentalMassToCharge
## isDecoy
## peptide
## spectrumFile
## spectrumID
msnid <- read_mzIDs(msnid, mzids)
## Loaded cached data
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 12263 at 36 % FDR
## #peptides: 9489 at 44 % FDR
## #accessions: 7414 at 76 % FDR
Printing the MSnID
object returns some basic information such as
The package then enables to define, optimise and apply filtering based for example on missed cleavages, identification scores, precursor mass errors, etc. and assess PSM, peptide and protein FDR levels. To properly function, it expects to have access to the following data
## [1] "accession" "calculatedMassToCharge"
## [3] "chargeState" "experimentalMassToCharge"
## [5] "isDecoy" "peptide"
## [7] "spectrumFile" "spectrumID"
which are indeed present in our data:
names(msnid)
## [1] "spectrumID" "scan number(s)"
## [3] "acquisitionNum" "passThreshold"
## [5] "rank" "calculatedMassToCharge"
## [7] "experimentalMassToCharge" "chargeState"
## [9] "MS-GF:DeNovoScore" "MS-GF:EValue"
## [11] "MS-GF:PepQValue" "MS-GF:QValue"
## [13] "MS-GF:RawScore" "MS-GF:SpecEValue"
## [15] "AssumedDissociationMethod" "IsotopeError"
## [17] "isDecoy" "post"
## [19] "pre" "end"
## [21] "start" "accession"
## [23] "length" "description"
## [25] "pepSeq" "modified"
## [27] "modification" "idFile"
## [29] "spectrumFile" "databaseFile"
## [31] "peptide"
Here, we summarise a few steps and redirect the reader to the package’s vignette for more details:
Cleaning irregular cleavages at the termini of the peptides and missing cleavage site within the peptide sequences. The following two function call create the new numMisCleavages
and numIrrCleabages
columns in the MSnID
object
msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
Now, we can use the apply_filter
function to effectively apply filters. The strings passed to the function represent expressions that will be evaluated, this keeping only PSMs that have 0 irregular cleavages and 2 or less missed cleavages.
msnid <- apply_filter(msnid, "numIrregCleavages == 0")
msnid <- apply_filter(msnid, "numMissCleavages <= 2")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 7838 at 17 % FDR
## #peptides: 5598 at 23 % FDR
## #accessions: 3759 at 53 % FDR
Using "calculatedMassToCharge"
and "experimentalMassToCharge"
, the mass_measurement_error
function calculates the parent ion mass measurement error in parts per million.
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2184.0640 -0.6992 0.0000 17.6146 0.7512 2012.5178
We then filter any matches that do not fit the +/- 20 ppm tolerance
msnid <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -19.7797 -0.5866 0.0000 -0.2970 0.5713 19.6758
Filtering of the identification data will rely on
msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
MS2 filters are handled by a special MSnIDFilter
class objects, where individual filters are set by name (that is present in names(msnid)
) and comparison operator (>, <, = , …) defining if we should retain hits with higher or lower given the threshold and finally the threshold value itself.
filtObj <- MSnIDFilter(msnid)
filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
filtObj$msmsScore <- list(comparison=">", threshold=10.0)
show(filtObj)
## MSnIDFilter object
## (absParentMassErrorPPM < 10) & (msmsScore > 10)
We can then evaluate the filter on the identification data object, which return the false discovery rate and number of retained identifications for the filtering criteria at hand.
evaluate_filter(msnid, filtObj)
## fdr n
## PSM 0 3807
## peptide 0 2455
## accession 0 1009
Rather than setting filtering values by hand, as shown above, these can be set automativally to meet a specific false discovery rate.
filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
method="Grid", level="peptide",
n.iter=500)
show(filtObj.grid)
## MSnIDFilter object
## (absParentMassErrorPPM < 3) & (msmsScore > 7.4)
evaluate_filter(msnid, filtObj.grid)
## fdr n
## PSM 0.004097561 5146
## peptide 0.006447651 3278
## accession 0.021996616 1208
Filters can eventually be applied (rather than just evaluated) using the apply_filter
function.
msnid <- apply_filter(msnid, filtObj.grid)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 5146 at 0.41 % FDR
## #peptides: 3278 at 0.64 % FDR
## #accessions: 1208 at 2.2 % FDR
And finally, identifications that matched decoy and contaminant protein sequences are removed
msnid <- apply_filter(msnid, "isDecoy == FALSE")
msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 5117 at 0 % FDR
## #peptides: 3251 at 0 % FDR
## #accessions: 1179 at 0 % FDR
The resulting filtered identification data can be exported to a data.frame
or to a dedicated MSnSet
data structure for quantitative MS data, described below, and further processed and analyses using appropriate statistical tests.
The above sections introduced low-level interfaces to raw and identification results. The MSnbase package provides abstractions for raw data through the MSnExp
class and containers for quantification data via the MSnSet
class. Both store
spectra
(or the [
, [[
operators) or exprs
;data.frame
with pData
;data.frame
with fData
.Another useful slot is processingData
, accessed with processingData(.)
, that records all the processing that objects have undergone since their creation (see examples below).
The readMSData2
(there is also a readMSData
variant - see the benchmarking
vignette[^Open it with vignette("benchmarking", package = "MSnbase")
or read it online] in MSnbase for their differences) will parse the raw data, extract all MS spectra (by default, but individual MS levels can selected, as illustrated below) and construct an MS experiment object of class MSnExp
.
library("MSnbase")
rawFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$")
basename(rawFile)
## [1] "dummyiTRAQ.mzXML"
msexp <- readMSData2(rawFile, msLevel = 2L)
msexp
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.03 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 5
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded [Thu Jun 15 07:03:12 2017]
## Filter: select MS level(s) 2 [Thu Jun 15 07:03:12 2017]
## MSnbase version: 2.3.3
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1.1 X2.1 ... X5.1 (5 total)
## fvarLabels: fileIdx spIdx ... spectrum (27 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
MS2 spectra can be extracted as a list of Spectrum2
objects with the spectra
accessor or as a subset of the original MSnExp
data with the [
operator. Individual spectra can be accessed with [[
.
length(msexp)
## [1] 5
msexp[1:2]
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.03 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 2
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded [Thu Jun 15 07:03:12 2017]
## Filter: select MS level(s) 2 [Thu Jun 15 07:03:12 2017]
## MSnbase version: 2.3.3
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1.1 X2.1
## fvarLabels: fileIdx spIdx ... spectrum (27 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msexp[[2]]
## Object of class "Spectrum2"
## Precursor: 546.9586
## Retention time: 25:2
## Charge: 3
## MSn level: 2
## Peaks count: 1012
## Total ion count: 56758067
The identification results stemming from the same raw data file can then be used to add PSM matches.
fData(msexp)
## fileIdx spIdx centroided smoothed seqNum acquisitionNum msLevel
## X1.1 1 1 NA NA 1 1 2
## X2.1 1 2 NA NA 2 2 2
## X3.1 1 3 NA NA 3 3 2
## polarity originalPeaksCount totIonCurrent retentionTime basePeakMZ
## X1.1 1 2921 182542000 1501.35 645.374
## X2.1 1 1012 16488100 1501.59 703.338
## X3.1 1 2125 41969700 1501.85 117.115
## basePeakIntensity collisionEnergy ionisationEnergy lowMZ highMZ
## X1.1 16292100 40 0 100 2016.660
## X2.1 2150600 40 0 100 1707.060
## X3.1 2332700 40 0 100 1701.750
## precursorScanNum precursorMZ precursorCharge precursorIntensity
## X1.1 0 645.3741 3 47659400
## X2.1 0 546.9586 3 26356100
## X3.1 0 645.3741 2 23432400
## mergedScan mergedResultScanNum mergedResultStartScanNum
## X1.1 0 0 0
## X2.1 0 0 0
## X3.1 0 0 0
## mergedResultEndScanNum injectionTime spectrum
## X1.1 0 0 1
## X2.1 0 0 2
## X3.1 0 0 3
## [ reached getOption("max.print") -- omitted 2 rows ]
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "dummyiTRAQ.mzid")
basename(identFile)
## [1] "dummyiTRAQ.mzid"
msexp <- addIdentificationData(msexp, identFile)
fvarLabels(msexp)
## [1] "fileIdx" "spIdx"
## [3] "centroided" "smoothed"
## [5] "seqNum" "acquisitionNum"
## [7] "msLevel" "polarity"
## [9] "originalPeaksCount" "totIonCurrent"
## [11] "retentionTime" "basePeakMZ"
## [13] "basePeakIntensity" "collisionEnergy"
## [15] "ionisationEnergy" "lowMZ"
## [17] "highMZ" "precursorScanNum"
## [19] "precursorMZ" "precursorCharge"
## [21] "precursorIntensity" "mergedScan"
## [23] "mergedResultScanNum" "mergedResultStartScanNum"
## [25] "mergedResultEndScanNum" "injectionTime"
## [27] "spectrum" "scan number(s)"
## [29] "passthreshold" "rank"
## [31] "calculatedmasstocharge" "experimentalmasstocharge"
## [33] "chargestate" "ms-gf:denovoscore"
## [35] "ms-gf:evalue" "ms-gf:rawscore"
## [37] "ms-gf:specevalue" "assumeddissociationmethod"
## [39] "isotopeerror" "isdecoy"
## [41] "post" "pre"
## [43] "end" "start"
## [45] "accession" "length"
## [47] "description" "pepseq"
## [49] "modified" "modification"
## [51] "idFile" "databaseFile"
## [53] "nprot" "npep.prot"
## [55] "npsm.prot" "npsm.pep"
The readMSData2
(readMSdata
) and addIdentificationData
make use of the mzR
and mzID
packages to access the raw and identification data.
Exercise Are the spectra in the
msexp
object centroided? This time, instead of looking at the spectra, we can query (and set) this information directly using thecentroided
function (andcentroided<-
replacement function). TheisCentroided
function can be used to assess the mode from the data.
centroided(msexp)
## X1.1 X2.1 X3.1 X4.1 X5.1
## NA NA NA NA NA
isCentroided(msexp)
## [1] FALSE FALSE FALSE FALSE FALSE
centroided(msexp) <- isCentroided(msexp)
Spectra and (parts of) experiments can be extracted and plotted.
msexp[[1]]
## Object of class "Spectrum2"
## Precursor: 645.3741
## Retention time: 25:1
## Charge: 3
## MSn level: 2
## Peaks count: 2921
## Total ion count: 668170086
plot(msexp[[1]], full=TRUE)
As this data was labeled with iTRAQ4 isobaric tags, we can highlight these four peaks of interest with
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
msexp[1:3]
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.03 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 3
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded [Thu Jun 15 07:03:12 2017]
## Filter: select MS level(s) 2 [Thu Jun 15 07:03:12 2017]
## MSnbase version: 2.3.3
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1.1 X2.1 X3.1
## fvarLabels: fileIdx spIdx ... npsm.pep (56 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
plot(msexp[1:3], full=TRUE)
Exercise Repeat the previous combination of raw and identification data with the TMT_Erwinia
mzML
andmzid
files that were imported from ProteomeXchange and generated using MSGFplus respectively. Here, we know that the MS1 and MS2 level are in profile mode and centroided respectively. Read thereadMSData2
documentation to set thecentroided.
argument to set these attributes appropriately.
tmterw <- readMSData2(mzf, centroided = c(FALSE, TRUE))
table(centroided(tmterw), msLevel(tmterw))
##
## 1 2
## FALSE 1431 0
## TRUE 0 6103
## Let's filter MS2 spectra after reading (optional) and add the
## identification data using pipes
library("magrittr")
tmterw <- tmterw %>%
filterMsLevel(2L) %>%
addIdentificationData(idres)
tmterw
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 3.06 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 6103
## MSn retention times: 3:22 - 60:2 minutes
## - - - Processing information - - -
## Data loaded [Wed Jun 14 08:30:08 2017]
## Filter: select MS level(s) 2 [Wed Jun 14 08:30:08 2017]
## MSnbase version: 2.3.3
## - - - Meta data - - -
## phenoData
## rowNames:
## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## protocolData: none
## featureData
## featureNames: X0227.1 X0228.1 ... X7534.1 (6103 total)
## fvarLabels: fileIdx spIdx ... npsm.pep (58 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
The importance of flexible access to specialised data becomes visible in the figure below5 Figure and code taken from the RforProteomics visualisation vignette. Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.
In this code chunks we start by selecting relevant spectra of interest. We will focus on the first MS1 spectrum acquired after 30 minutes of retention time.
## (1) Open raw data file
ms <- openMSfile(mzf)
## (2) Extract the header information
hd <- header(ms)
## (3) MS1 spectra indices
ms1 <- which(hd$msLevel == 1)
## (4) Select MS1 spectra with retention time between 30 and 35 minutes
rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35
## (5) Indices of the 1st and 2nd MS1 spectra after 30 minutes
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
## (6) Interleaved MS2 spectra
ms2 <- (i+1):(j-1)
Now now extract and plot all relevant information:
chromatogram
function.plot(chromatogram(ms)[[1]], type = "l")
plot(chromatogram(ms)[[1]], type = "l")
abline(v = hd[i, "retentionTime"]/60, col = "red")
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5))
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
length(ms2)
MS2 spectra highlighted above.par(mfrow = c(5, 2), mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Putting it all together:
layout(lout)
par(mar=c(4,2,1,1))
plot(chromatogram(ms)[[1]], type = "l")
abline(v = hd[i, "retentionTime"]/60, col = "red")
par(mar = c(3, 2, 1, 0))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
par(mar = c(3, 0.5, 1, 1))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5),
yaxt = "n")
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
par(mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Below, we illustrate some additional visualisation and animations of raw MS data, also taken from the RforProteomics visualisation vignette. On the left, we have a heatmap visualisation of a MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.
## (1) MS space heaptmap
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
## 1
ff <- colorRampPalette(c("yellow", "steelblue"))
trellis.par.set(regions=list(col=ff(100)))
m1 <- plot(M, aspect = 1, allTicks = FALSE)
## (2) Same data as (1), in 3 dimenstion
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)
## (3) The 2 MS1 and 10 interleaved MS2 spectra from above
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
## 1
m3 <- plot3D(M2)
grid.arrange(m1, m2, m3, ncol = 3)
Below, we have animations build from extracting successive slices as above.
Annotated spectra and comparing spectra.
par(mfrow = c(1, 2))
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE)
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))
The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:
calculateFragments("SIGFEGDSIGR")
## mz ion type pos z seq
## 1 88.03931 b1 b 1 1 S
## 2 201.12337 b2 b 2 1 SI
## 3 258.14483 b3 b 3 1 SIG
## 4 405.21324 b4 b 4 1 SIGF
## 5 534.25583 b5 b 5 1 SIGFE
## 6 591.27729 b6 b 6 1 SIGFEG
## 7 706.30423 b7 b 7 1 SIGFEGD
## 8 793.33626 b8 b 8 1 SIGFEGDS
## 9 906.42032 b9 b 9 1 SIGFEGDSI
## 10 963.44178 b10 b 10 1 SIGFEGDSIG
## 11 1119.54289 b11 b 11 1 SIGFEGDSIGR
## 12 175.11895 y1 y 1 1 R
## 13 232.14041 y2 y 2 1 GR
## 14 345.22447 y3 y 3 1 IGR
## 15 432.25650 y4 y 4 1 SIGR
## 16 547.28344 y5 y 5 1 DSIGR
## [ reached getOption("max.print") -- omitted 20 rows ]
Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipulate them and perform computations. The two spectra corresponding to the IMIDLDGTENK
peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra
for details).
There are a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending whether the features are labelled prior the MS acquisition and the MS level at which quantitation is inferred, namely MS1 or MS2.
Label-free | Labelled | |
---|---|---|
MS1 | XIC | SILAC, 15N |
MS2 | Counting | iTRAQ, TMT |
In terms of raw data quantitation, most efforts have been devoted to MS2-level quantitation. Label-free XIC quantitation has however been addressed in the frame of metabolomics data processing by the xcms infrastructure.
Below is a list of suggested packages for some common proteomics quantitation technologies:
An MSnExp
is converted to an MSnSet
by the quantitation
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy (defined by the iTRAQ4
parameter; other tags are available: see ?ReporterIons
) and the max
method to calculate the use the maximum of the reporter peak for quantitation.
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
msset <- quantify(msexp, method = "max", reporters = iTRAQ4)
The figure below give a schematics of an MSnSet
instance and the relation between the assay data and the respective feature and sample metadata, accessible respectively with the exprs
, fData
and pData
functions.
New columns can be added to the metadata slots.
exprs(msset)
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## X1.1 706555.7 685055.1 929016.1 668245.2
## X2.1 260663.7 212745.0 163782.8 239142.7
## X3.1 2213566.0 2069209.6 2204032.2 2331846.8
## X4.1 616043.4 705976.6 671828.8 666845.6
## X5.1 1736128.2 1787622.5 1795311.8 1825523.0
pData(msset)
## mz reporters
## iTRAQ4.114 114.1112 iTRAQ4
## iTRAQ4.115 115.1083 iTRAQ4
## iTRAQ4.116 116.1116 iTRAQ4
## iTRAQ4.117 117.1150 iTRAQ4
pData(msset)$groups <- rep(c("Treat", "Cond"), each = 2)
pData(msset)
## mz reporters groups
## iTRAQ4.114 114.1112 iTRAQ4 Treat
## iTRAQ4.115 115.1083 iTRAQ4 Treat
## iTRAQ4.116 116.1116 iTRAQ4 Cond
## iTRAQ4.117 117.1150 iTRAQ4 Cond
Another useful slot is processingData
, accessed with processingData(.)
, that records all the processing that objects have undergone since their creation.
processingData(msset)
## - - - Processing information - - -
## Fast iTRAQ4 quantitation by max [Thu Jun 15 07:03:36 2017]
## MSnbase version: 2.3.3
See also The isobar package supports quantitation from centroided mgf
peak lists or its own tab-separated files that can be generated from Mascot and Phenyx vendor files.
Other MS2 quantitation methods available in quantify
include the (normalised) spectral index SI
and (normalised) spectral abundance factor SAF
or simply a simple count method6 The code below is for illustration only - it doesn’t make much sense to perform any of these quantitations on such a multiplexed data.
exprs(si <- quantify(msexp, method = "SIn"))
## dummyiTRAQ.mzXML
## ECA0510 0.003588641
## ECA1028 0.001470129
exprs(saf <- quantify(msexp, method = "NSAF"))
## dummyiTRAQ.mzXML
## ECA0510 0.6235828
## ECA1028 0.3764172
Note that spectra that have not been assigned any peptide (NA
) or that match non-unique peptides (npsm > 1
) are discarded in the counting process.
As shown above, the MSnID package enables to explore and assess the confidence of identification data using mzid
files. A subset of all peptide-spectrum matches, that pass a specific false discovery rate threshold can them be converted to an MSnSet
, where the number of peptide occurrences are used to populate the assay data.
The Proteomics Standard Initiative (PSI) mzTab
file format is aimed at providing a simpler (than XML formats) and more accessible file format to the wider community. It is composed of a key-value metadata section and peptide/protein/small molecule tabular sections. These data can be imported with the readMzTabData
function7 We specify version 0.9 (which generates the warning) to fit with the version of that file. For recent files, the version
argument should be ignored to use the importer for the current file version 1.0..
mztf <- pxget(px, "F063721.dat-mztab.txt")
## Downloading 1 file
(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))
## Warning: Version 0.9 is deprecated. Please see '?readMzTabData' and '?
## MzTab' for details.
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: sub[1] sub[2] ... sub[6] (6 total)
## varLabels: abundance
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 1528 (1528 total)
## fvarLabels: sequence accession ... uri (14 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## mzTab read: Tue Jun 13 17:55:11 2017
## MSnbase version: 2.3.3
It is also possible to import arbitrary spreadsheets as MSnSet
objects into R with the readMSnSet2
function. The main 2 arguments of the function are (1) a text-based spreadsheet and (2) column names of indices that identify the quantitation data. The latter can be queried with the getEcols
function.
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
## [1] "\"Protein ID\"" "\"FBgn\""
## [3] "\"Flybase Symbol\"" "\"No. peptide IDs\""
## [5] "\"Mascot score\"" "\"No. peptides quantified\""
## [7] "\"area 114\"" "\"area 115\""
## [9] "\"area 116\"" "\"area 117\""
## [11] "\"PLS-DA classification\"" "\"Peptide sequence\""
## [13] "\"Precursor ion mass\"" "\"Precursor ion charge\""
## [15] "\"pd.2013\"" "\"pd.markers\""
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
## area.114 area.115 area.116 area.117
## 1 0.379000 0.281000 0.225000 0.114000
## 2 0.420000 0.209667 0.206111 0.163889
## 3 0.187333 0.167333 0.169667 0.476000
## 4 0.247500 0.253000 0.320000 0.179000
## 5 0.216000 0.183000 0.342000 0.259000
## 6 0.072000 0.212333 0.573000 0.142667
head(fData(res))
## Protein.ID FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1 CG10060 FBgn0001104 G-ialpha65A 3 179.86
## 2 CG10067 FBgn0000044 Act57B 5 222.40
## 3 CG10077 FBgn0035720 CG10077 5 219.65
## 4 CG10079 FBgn0003731 Egfr 2 86.39
## 5 CG10106 FBgn0029506 Tsp42Ee 1 52.10
## 6 CG10130 FBgn0010638 Sec61beta 2 79.90
## No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1 1 PM
## 2 9 PM
## 3 3
## 4 2 PM
## 5 1 GGVFDTIQK
## 6 3 ER/Golgi
## Precursor.ion.mass Precursor.ion.charge pd.2013 pd.markers
## 1 PM unknown
## 2 PM unknown
## 3 unknown unknown
## 4 PM unknown
## 5 626.887 2 Phenotype 1 unknown
## 6 ER/Golgi ER
For raw data processing look at MSnbase
’s clean
, smooth
, pickPeaks
, removePeaks
and trimMz
for MSnExp
and spectra processing methods.
As an illustration, we show the pickPeaks
function on the itraqdata
data.
library("ggplot2") ## for coord_cartesian
data(itraqdata)
plot(itraqdata[[10]], full = TRUE) +
coord_cartesian(xlim = c(915, 925))
itraqdata2 <- pickPeaks(itraqdata)
plot(itraqdata2[[10]], full = TRUE, w1 = 0.05) +
coord_cartesian(xlim = c(915, 925))
The MALDIquant and xcms packages also features a wide range of raw data processing methods on their own ad hoc data instance types.
Each different types of quantitative data will require their own pre-processing and normalisation steps. Both isobar
and MSnbase
allow to correct for isobaric tag impurities normalise the quantitative data.
data(itraqdata)
qnt <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
impurities <- matrix(c(0.929, 0.059, 0.002, 0.000,
0.020, 0.923, 0.056, 0.001,
0.000, 0.030, 0.924, 0.045,
0.000, 0.001, 0.040, 0.923),
nrow = 4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt <- purityCorrect(qnt, impurities)
processingData(qnt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jun 15 07:03:41 2017
## Purity corrected: Thu Jun 15 07:03:41 2017
## MSnbase version: 1.1.22
Various normalisation methods can be applied the MSnSet
instances using the normalise
method: variance stabilisation (vsn
), quantile (quantiles
), median or mean centring (center.media
or center.mean
), …
qnt <- normalise(qnt, "quantiles")
processingData(qnt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jun 15 07:03:41 2017
## Purity corrected: Thu Jun 15 07:03:41 2017
## Normalised (quantiles): Thu Jun 15 07:03:41 2017
## MSnbase version: 1.1.22
The combineFeatures
method combines spectra/peptides quantitation values into protein data. The grouping is defined by the groupBy
parameter, which is generally taken from the feature metadata (protein accessions, for example).
gb <- fData(qnt)$ProteinDescription
prt <- combineFeatures(qnt, groupBy = gb, fun = "median")
processingData(prt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Thu Jun 15 07:03:41 2017
## Purity corrected: Thu Jun 15 07:03:41 2017
## Normalised (quantiles): Thu Jun 15 07:03:41 2017
## Combined 55 features into 39 using median: Thu Jun 15 07:03:41 2017
## MSnbase version: 2.3.3
Finally, proteomics data analysis is generally hampered by missing values. Missing data imputation is a sensitive operation whose success will be guided by many factors, such as degree and (non-)random nature of the missingness.
Below, we load an MSnSet
with missing values, count the number missing and non-missing values.
data(naset)
table(is.na(naset))
##
## FALSE TRUE
## 10254 770
The naplot
figure will reorder cells within the data matrix so that the experiments and features with many missing values will be grouped towards the top and right of the heatmap, and barplots at the top and right summarise the number of missing values in the respective samples (column) and rows (rows).
naplot(naset)
The importance of missing values in a dataset will depend on the quantitation technology employed. Label-free quantitation in particular can suffer from a very high number of missing values.
Missing value in MSnSet
instances can be filtered out with the filterNA
functions. By default, it removes features that contain at least NA
value.
## remove features with missing values
tmp <- filterNA(naset)
processingData(tmp)
## - - - Processing information - - -
## Subset [689,16][301,16] Thu Jun 15 07:03:41 2017
## Removed features with more than 0 NAs: Thu Jun 15 07:03:41 2017
## Dropped featureData's levels Thu Jun 15 07:03:41 2017
## MSnbase version: 1.15.6
It is of course possible to impute missing values (?impute
). This is however not a straightforward thing, as is likely to dramatically fail when a high proportion of data is missing (10s of %)8 Note that when using limma for instance, downstream analyses can handle missing values. Still, it is recommended to explore missingness as part of the exploratory data analysis.. But also, there are two types of mechanisms resulting in missing values in LC/MSMS experiments.
Missing values resulting from absence of detection of a feature, despite ions being present at detectable concentrations. For example in the case of ion suppression or as a result from the stochastic, data-dependent nature of the MS acquisition method. These missing value are expected to be randomly distributed in the data and are defined as missing at random (MAR) or missing completely at random (MCAR).
Biologically relevant missing values, resulting from the absence or the low abundance of ions (below the limit of detection of the instrument). These missing values are not expected to be randomly distributed in the data and are defined as missing not at random (MNAR).
Different imputation methods are more appropriate to different classes of missing values (as documented in this paper). Values missing at random, and those missing not at random should be imputed with different methods.
Generally, it is recommended to use hot deck methods (nearest neighbour (left), maximum likelihood, …) when data are missing at random.Conversely, MNAR features should ideally be imputed with a left-censor (minimum value (right), but not zero, …) method.
## impute missing values using knn imputation
tmp <- impute(naset, method = "knn")
## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
## mean imputation used for these rows
processingData(tmp)
## - - - Processing information - - -
## Data imputation using knn Thu Jun 15 07:03:42 2017
## Using default parameters
## MSnbase version: 1.15.6
There are various methods to perform data imputation, as described in ?impute
.
R in general and Bioconductor in particular are well suited for the statistical analysis of data of quantitative proteomics data. Several packages provide dedicated resources for proteomics data:
MSstats: A set of tools for statistical relative protein significance analysis in Data dependent (DDA), SRM and Data independent acquisition (DIA) experiments. Data stored in data.frame
or MSnSet
objects can be used as input.
msmsTests: Statistical tests for label-free LC-MS/MS data by spectral counts, to discover differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. All can be readily applied on MSnSet
instances produced, for example by MSnID
.
isobar also provides dedicated infrastructure for the statistical analysis of isobaric data.
The MLInterfaces package provides a unified interface to a wide range of machine learning algorithms. Initially developed for microarray and ExpressionSet
instances, the pRoloc package enables application of these algorithms to MSnSet
data.
Dimensionality reduction is very frequently used to summarise high-dimensional data. Below we will use principal component analysis (PCA), but other methods can be applied. Below, we will use the plot2D
function from the pRoloc package9 While originally developed for the analysis of spatial/organelle proteomics data in mind, it is applicable many use cases., that will extract the expression values in the assay data, perform dimensionality reduction, an produce the scatter plot.
Let’s first use plot2D
to visualise the pattern in 20 protein quantitation values (initial 20 dimensional data). Here, we use an example from spatial proteomics, where the quantitative protein profiles reflect the proteins sub-cellular localisation (from Christoforou et al, 2016, see also Breckels et al, 2016 for more data analysis background). We will use the known localisation of some proteins (marker proteins) to annotate the plot (using the fcol
argument).
library("pRoloc")
library("pRolocdata")
data(hyperLOPIT2015)
plot2D(hyperLOPIT2015, fcol = "markers")
addLegend(hyperLOPIT2015, fcol = "markers", cex = .7)
In other cases, we want to visualise the relation of samples. plot2D
uses the rows of the data to perform dimensionality reduction. To use the columns, we just need to transpose the MSnSet
. By doing so, the pData
becomes the fData
and vice versa.
Let’s use a time-course experiment on stem cells (Mulvey et al. 2015). Below, we use the times
(time points) variable to set colours and rep
(replicate numbers) to set the plotting characters.
data(mulvey2015)
head(pData(mulvey2015))
## rep times cond
## rep1_0hr 1 1 1
## rep1_16hr 1 2 1
## rep1_24hr 1 3 1
## rep1_48hr 1 4 1
## rep1_72hr 1 5 1
## rep1_XEN 1 6 1
plot2D(t(mulvey2015), fcol = "times", fpch = "rep", cex = 2)
addLegend(t(mulvey2015), fcol = "times")
## Classification
The example below uses knn
with the 5 closest neighbours and the MLInterfaces package as an illustration to classify proteins of unknown sub-cellular localisation to one of 9 possible organelles.
library("MLInterfaces")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans
## MLInterfaces classification output container
## The call was:
## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5),
## trainInd = traininds)
## Predicted outcome distribution for test set:
##
## ER lumen ER membrane Golgi Mitochondrion Plastid
## 5 140 67 51 29
## PM Ribosome TGN vacuole
## 89 31 6 10
## Summary of scores on test set (use testScores() method for details):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 1.0000 1.0000 0.9332 1.0000 1.0000
A wide range of classification and clustering algorithms are also available, as described in the ?MLearn
documentation page, used below.
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl
## clusteringOutput: partition table
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 55 28 29 18 60 30 44 145 78 116 33 53
## The call that created this object was:
## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12)
plot(kcl, exprs(dunkley2006))
All the Bioconductor annotation infrastructure, such as biomaRt, GO.db, organism specific annotations, .. are directly relevant to the analysis of proteomics data. A total of 191 ontologies, including some proteomics-centred annotations such as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications are available through the rols
library("rols")
res <- OlsSearch(q = "ESI", ontology = "MS", exact = TRUE)
res
## Object of class 'OlsSearch':
## ontolgy: MS
## query: ESI
## requested: 20 (out of 1)
## response(s): 0
There is a single exact match (default is to retrieve 20 results), that can be retrieved and coerced to a Terms
or data.frame
object with
res <- olsSearch(res)
as(res, "Terms")
## Object of class 'Terms' with 1 entries
## From the MS ontology
## MS:1000073
as(res, "data.frame")
## id
## 1 ms:http://purl.obolibrary.org/obo/MS_1000073
## iri short_form obo_id
## 1 http://purl.obolibrary.org/obo/MS_1000073 MS_1000073 MS:1000073
## label
## 1 electrospray ionization
## description
## 1 A process in which ionized species in the gas phase are produced from an analyte-containing solution via highly charged fine droplets, by means of spraying the solution from a narrow-bore needle tip at atmospheric pressure in the presence of a high electric field. When a pressurized gas is used to aid in the formation of a stable spray, the term pneumatically assisted electrospray ionization is used. The term ion spray is not recommended.
## ontology_name ontology_prefix type is_defining_ontology
## 1 ms MS class TRUE
Data from the Human Protein Atlas is available via the hpar package.
sessionInfo()
## R Under development (unstable) (2017-02-25 r72256)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/atlas-base/libf77blas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] ggplot2_2.2.1 dplyr_0.7.0 AnnotationHub_2.9.5
## [4] knitr_1.16 hpar_1.19.0 rols_2.5.0
## [7] MSGFplus_1.11.0 pRolocdata_1.15.0 rpx_1.13.1
## [10] MSnID_1.11.0 mzID_1.15.0 pRoloc_1.17.3
## [13] MLInterfaces_1.57.0 cluster_2.0.6 annotate_1.55.0
## [16] XML_3.98-1.7 AnnotationDbi_1.39.1 IRanges_2.11.5
## [19] S4Vectors_0.15.4 gridExtra_2.2.1 lattice_0.20-35
## [22] RforProteomics_1.15.0 BiocInstaller_1.27.2 MSnbase_2.3.3
## [25] ProtGenerics_1.9.0 BiocParallel_1.11.2 mzR_2.11.3
## [28] Rcpp_0.12.11 Biobase_2.37.2 BiocGenerics_0.23.0
## [31] BiocStyle_2.5.2
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.0 plyr_1.8.4
## [3] lazyeval_0.2.0 GSEABase_1.39.0
## [5] splines_3.4.0 ggvis_0.4.3
## [7] digest_0.6.12 foreach_1.4.3
## [9] htmltools_0.3.6 viridis_0.4.0
## [11] gdata_2.18.0 magrittr_1.5
## [13] memoise_1.1.0 doParallel_1.0.10
## [15] sfsmisc_1.1-1 limma_3.33.2
## [17] rda_1.0.2-2 R.utils_2.5.0
## [19] lpSolve_5.6.13 prettyunits_1.0.2
## [21] colorspace_1.3-2 RCurl_1.95-4.8
## [23] jsonlite_1.5 hexbin_1.27.1
## [25] graph_1.55.0 genefilter_1.59.0
## [27] lme4_1.1-13 impute_1.51.0
## [29] survival_2.41-3 iterators_1.0.8
## [31] glue_1.0.0 gtable_0.2.0
## [33] zlibbioc_1.23.0 MatrixModels_0.4-1
## [35] R.cache_0.12.0 car_2.1-4
## [37] kernlab_0.9-25 prabclus_2.2-6
## [39] DEoptimR_1.0-8 SparseM_1.77
## [41] scales_0.4.1 vsn_3.45.0
## [43] mvtnorm_1.0-6 DBI_0.6-1
## [45] viridisLite_0.2.0 xtable_1.8-2
## [47] progress_1.1.2 proxy_0.4-17
## [49] mclust_5.3 preprocessCore_1.39.0
## [51] httr_1.2.1 sampling_2.8
## [53] htmlwidgets_0.8 threejs_0.2.2
## [55] gplots_3.0.1 FNN_1.1
## [57] RColorBrewer_1.1-2 fpc_2.1-10
## [59] modeltools_0.2-21 pkgconfig_2.0.1
## [61] R.methodsS3_1.7.1 flexmix_2.3-14
## [63] nnet_7.3-12 caret_6.0-76
## [65] labeling_0.3 rlang_0.1.1.9000
## [67] reshape2_1.4.2 munsell_0.4.3
## [69] mlbench_2.1-1 biocViews_1.45.0
## [71] tools_3.4.0 msdata_0.17.0
## [73] RSQLite_1.1-2 pls_2.6-0
## [75] evaluate_0.10 stringr_1.2.0
## [77] yaml_2.1.14 ModelMetrics_1.1.0
## [79] robustbase_0.92-7 caTools_1.17.1
## [81] randomForest_4.6-12 dendextend_1.5.2
## [83] RBGL_1.53.0 nlme_3.1-131
## [85] quantreg_5.33 whisker_0.3-2
## [87] mime_0.5 R.oo_1.21.0
## [89] xml2_1.1.1 biomaRt_2.33.2
## [91] compiler_3.4.0 pbkrtest_0.4-7
## [93] curl_2.6 interactiveDisplayBase_1.15.0
## [95] e1071_1.6-8 affyio_1.47.0
## [97] tibble_1.3.3 stringi_1.1.5
## [99] highr_0.6 trimcluster_0.1-2
## [101] Matrix_1.2-10 nloptr_1.0.4
## [103] gbm_2.1.3 RUnit_0.4.31
## [105] MALDIquant_1.16.2 data.table_1.10.4
## [107] bitops_1.0-6 httpuv_1.3.3
## [109] R6_2.2.1 pcaMethods_1.69.0
## [111] affy_1.55.0 hwriter_1.3.2
## [113] KernSmooth_2.23-15 gridSVG_1.5-1
## [115] codetools_0.2-15 MASS_7.3-47
## [117] gtools_3.5.0 assertthat_0.2.0
## [119] interactiveDisplay_1.15.0 Category_2.43.0
## [121] rprojroot_1.2 diptest_0.75-7
## [123] mgcv_1.8-17 grid_3.4.0
## [125] rpart_4.1-11 class_7.3-14
## [127] minqa_1.2.4 rmarkdown_1.5
## [129] Rtsne_0.13 shiny_1.0.3
## [131] base64enc_0.1-3