Targeted data extraction methods are attractive ways to obtain quantitative peptide information from a proteomics experiment. Sequential Window Acquisition of all Theoretical Spectra (SWATH) and Data Independent Acquisition (DIA) methods increase reproducibility of acquired data because the classical precursor selection is omitted and all present precursors are fragmented. However, especially for targeted data extraction, MS coordinates (retention time information precursor and fragment masses) are required for the particular entities (peptide ions). These coordinates are usually generated in a so-called discovery experiment earlier on in the project if not available in public spectral library repositories. The quality of the assay panel is crucial to ensure appropriate downstream analysis. For that, a method is needed to create spectral libraries and to export customizable assay panels.
specL 1.38.0
Targeted proteomics is a fast evolving field in proteomics science and was even elected as the method of the year in 2012 . Especially targeted methods like SWATH (Gillet et al. 2012) open promising perspectives for for identifying and quantifying of peptides and proteins. All targeted methods have in common the need of precise MS coordinates composed of precursor mass, fragment masses, and retention time. The combination of this information is kept in so-called assays or spectra libraries. Here we present an R package able to produce such libraries out of peptide identification results (Mascot (dat), TPP (pep.xml and mzXMLs), ProteinPilot (group), Omssa (omx)). specL (Panse et al. 2015) is an easy-to-use, versatile, and flexible function, which can be integrated into already existing commercial or non-commercial analysis pipelines for targeted proteomics data analysis. Some examples of today’s pipelines are ProteinPilot combined with Peakview (AB Sciex), Spectronaut (Biognosys) or OpenSwath (Rost et al. 2014).
In the following vignette it is described how the specL package
can be used for the included data sets peptideStd
and
peptideStd.redundant
.
Since peptide identification (using, e.g., Mascot, Sequest, xTandem!,
Omssa, ProteinPilot)
usually creates result files which are
heavily redundant and therefore unsuited for spectral library building,
the search results must first be filtered. To create non-redundant
input files, we use the BiblioSpec (Frewen and MacCoss 2007) algorithm
implemented in Skyline (MacLean et al. 2010). A given search result (e.g.Â
Mascot result file) is loaded into the software Skyline and is redundancy
filtered. The ‘Skyline workflow step’ provides two sqlite readable
files
as output named *.blib
and *.redundant.blib
.
These files are used as ideal input for this packages.
Note here, that Skyline is very flexible when it comes to peptide
identification results. It means with Skyline you can build the spectrum
library files for almost all search engines (even from other spectrum
library files such as spectraST (Lam et al. 2008)).
The first step which has to be performed on the R shell is loading specL library.
library(specL)
packageVersion('specL')
## [1] '1.38.0'
for demonstration, specL contains the two data sets, namely peptideStd
and
peptideStd.redundant
. This data set
comes from two standard-run experiments routinely
used to check if the liquid chromatographic system is still working
appropriately. The sample consists of a digest of the Fetuin protein
(Bos taurus, uniprot id: P12763). 40 femtomole are loaded on the column.
Mascot was used to search and identify the respective peptides.
summary(peptideStd)
## Summary of a "psmSet" object.
## Number of precursor:
## 137
## Number of precursors in Filename(s)
## 0140910_01_fetuin_400amol_1.raw 21
## 0140910_07_fetuin_400amol_2.raw 116
## Number of annotated precursor:
## 0
For both peptideStd
, peptideStd.redandant
data sets the
Skyline software was used to generate the bibliospec files which
contain the peptide sequences with the respective peptide spectrum
match (PSM). The specL::read.bibliospec
function was used
to read the blib files into R.
The from read.bibliospec
generated object has its own plot functions.
The LC-MS map graphs peptide mass versus retention time.
# plot(peptideStd)
plot(0,0, main='MISSING')
The individual peptide spectrum match (psm) is displayed by using the
protViz peakplot
function.
demoIdx <- 40
# str(peptideStd[[demoIdx]])
#res <- plot(peptideStd[[demoIdx]], ion.axes=TRUE)
plot(0,0, main='MISSING')
Alternatively, Mascot search result files (dat) can be used by applying
protViz perl script
protViz\-\_mascotDat2RData.pl
.
The Perl script can be found in the exec directory of the protViz package. The mascot mod_file can be found in the configurations of the mascot server. An example on our Linux shell looks as follows:
$ /usr/local/lib/R/site-library/protViz/exec/protViz_mascotDat2RData.pl \
-d=/usr/local/mascot/data/20130116/F178287.dat \
-m=mod_file
mascotDat2RData.pl
requires the Mascot server mod\_file
keeping
all the configured modification.
Once the {erl script is finished, the resulting RData file can be read into the R session using load
.
Next, the variable modifications, and the S3 psmSet object has to be generated. This can be done by using specL:::.mascot2psmSet
specL:::.mascot2psmSet
## function (dat, mod, mascotScoreCutOff = 40)
## {
## res <- lapply(dat, function(x) {
## x$MonoisotopicAAmass <- protViz::aa2mass(x$peptideSequence)[[1]]
## modString <- as.numeric(strsplit(x$modification, "")[[1]])
## modIdx <- which(modString > 0) - 1
## modString.length <- length(modString)
## x$varModification <- mod[modString[c(-1, -modString.length)] +
## 1]
## if (length(modIdx) > 0) {
## warning("modified varModification caused.")
## x$varModification[modIdx] <- x$varModification[modIdx] -
## x$MonoisotopicAAmass[modIdx]
## }
## rt <- x$rtinseconds
## x <- c(x, rt = rt, fileName = "mascot")
## class(x) <- "psm"
## return(x)
## })
## res <- res[which(unlist(lapply(dat, function(x) {
## x$mascotScore > mascotScoreCutOff && length(x$mZ) > 10
## })))]
## class(res) <- "psmSet"
## return(res)
## }
## <bytecode: 0x56227d4e6e48>
## <environment: namespace:specL>
If you are processing Mascot result files, you can continue reading in the section genSwathIonLib
.
However, please note due do the high potential redundancy of peptide spectrum matches in a database search approach, it might not result in useful ion library for targeted data extraction unless redundancy filtering is handled. However, in a future release, a redundancy filter algorithm might be proposed to resolve this problem.
The information to which protein a peptide-spectrum-match belongs (PSM)
is not stored by BiblioSpec. Therefore specL provides the annotate.protein\_id
function which uses R’s internal grep
to ‘reassign’ the protein information. Therefore a fasta
object has
to be loaded into the R system using read.fasta
of the
seqinr package. For this, not necessarily, the same fasta
file needs to be provided as in the original database
search.
The following lines demonstrate a simple sanity check with a single FASTA style formatted protein entry. Also it demonstrates the use case how to identify entries in the R-object which are from one or a few proteins of interest.
irtFASTAseq <- paste(">zz|ZZ_FGCZCont0260|",
"iRT_Protein_with_AAAAK_spacers concatenated Biognosys\n",
"LGGNEQVTRAAAAKGAGSSEPVTGLDAKAAAAKVEATFGVDESNAKAAAAKYILAGVENS",
"KAAAAKTPVISGGPYEYRAAAAKTPVITGAPYEYRAAAAKDGLDAASYYAPVRAAAAKAD",
"VTPADFSEWSKAAAAKGTFIIDPGGVIRAAAAKGTFIIDPAAVIRAAAAKLFLQFGAQGS",
"PFLK\n")
Tfile <- file(); cat(irtFASTAseq, file = Tfile);
fasta.irtFASTAseq <- read.fasta(Tfile, as.string=TRUE, seqtype="AA")
close(Tfile)
As expected, the peptideStd
data, e.g., our demo object, does not contain any protein information yet.
peptideStd[[demoIdx]]$proteinInformation
## [1] ""
The protein information can be added as follow:
peptideStd <- annotate.protein_id(peptideStd,
fasta=fasta.irtFASTAseq)
## start protein annotation ...
## time taken: 0.000534836451212565 minutes
The following lines now show the object indices of those entries which do have protein information now.
(idx <- which(unlist(lapply(peptideStd,
function(x){nchar(x$proteinInformation)>0}))))
## [1] 1 2 3 4 5 6
As expected, there are now a number of peptide sequences annotated with the protein ID.
peptideStd[[demoIdx]]$proteinInformation
## [1] "zz|ZZ_FGCZCont0260|"
Of note, that the default digest pattern is defined as
digestPattern = "(([RK])|(^)|(^M))"
for tryptic peptides. For other enzymes, the pattern has to
be adapted. For example, for semi-tryptic identifications, use
digestPattern = ""
.
genSwathIonLib
is the main contribution of the
specL package. It generates the spectral library used in a targeted data extraction workflow from a mass spectrometric
measurement. Generating the ion library using iRT peptides is highly recommended as described. However if you have no iRT peptide, continue reading in section noiRT.
Generation of the spec Library with default (see Table) settings.
res.genSwathIonLib <- genSwathIonLib(data = peptideStd,
data.fit = peptideStd.redundant)
## normalizing RT ...
## found 7 iRT peptide(s) in s:\p1239\Proteomics\QEXACTIVE_3\ctrachse_20140910_Nuclei_diff_extraction_methods\20140910_01_fetuin_400amol_1.raw
## found 7 iRT peptide(s) in s:\p1239\Proteomics\QEXACTIVE_3\ctrachse_20140910_Nuclei_diff_extraction_methods\20140910_07_fetuin_400amol_2.raw
## building model ...
## generating ion library ...
## start generating specLSet object ...
## length of findNN idx 137
## length of genSwathIonLibSpecL 137
## time taken: 0.242818832397461 secs
## length of genSwathIonLibSpecL after fragmentIonRange filtering 137
genSwathIonLib default settings
parameter | description | value |
---|---|---|
max.mZ.Da.error | max ms2 tolerance | 0.1 |
topN | the n most intense fragment ion | 10 |
fragmentIonMzRange | mZ range filter of fragment ion | c(300, 1800) |
fragmentIonRange | min/max number of fragment ions | c(5,100) |
fragmentIonFUN} | desired fragment ion types | b1+,y1+,b2+,y2+,b3+,y3+ |
summary(res.genSwathIonLib)
## Summary of a "specLSet" object.
##
## Parameter:
##
## Number of precursor (q1 and peptideModSeq) = 137
## Number of unique precursor
## (q1.in-silico and peptideModSeq) = 126
## Number of iRT peptide(s) = 8
## Which std peptides (iRTs) where found in which raw files:
## 0140910_01_fetuin_400amol_1.raw GAGSSEPVTGLDAK
## 0140910_01_fetuin_400amol_1.raw TPVITGAPYEYR
## 0140910_01_fetuin_400amol_1.raw VEATFGVDESNAK
## 0140910_07_fetuin_400amol_2.raw ADVTPADFSEWSK
## 0140910_07_fetuin_400amol_2.raw DGLDAASYYAPVR
## 0140910_07_fetuin_400amol_2.raw GTFIIDPGGVIR
## 0140910_07_fetuin_400amol_2.raw LFLQFGAQGSPFLK
## 0140910_07_fetuin_400amol_2.raw TPVISGGPYEYR
##
## Number of transitions frequency:
## 4 1
## 5 5
## 6 10
## 7 7
## 8 18
## 9 32
## 10 64
##
## Number of annotated precursor = 6
## Number of file(s)
## 2
##
## Number of precursors in Filename(s)
## 0140910_01_fetuin_400amol_1.raw 21
## 0140910_07_fetuin_400amol_2.raw 116
##
## Misc:
##
## Memory usage = 676976 bytes
The determined mass spec coordinates of the selected tandem mass spectrum
demoIdx
look like this:
res.genSwathIonLib@ionlibrary[[demoIdx]]
## An "specL" object.
##
##
## content:
## group_id = GAGSSEPVTGLDAK.2
## peptide_sequence = GAGSSEPVTGLDAK
## proteinInformation = zz|ZZ_FGCZCont0260|
## q1 = 644.8219
## q1.in_silico = 1288.638
## q3 = 800.4497 604.3285 1016.522 503.2805 929.4925 400.7282
## 333.176 1160.581 703.3948 343.1235
## q3.in_silico = 800.4512 604.3301 1016.526 503.2824 929.4938
## 400.7295 333.1769 1160.579 703.3985 343.1615
## prec_z = 2
## frg_type = y y y y y y y y y b
## frg_nr = 8 6 10 5 9 8 3 12 7 8
## frg_z = 1 1 1 1 1 2 1 1 1 2
## relativeFragmentIntensity = 100 21 19 12 10 9 9 8 8 6
## irt = -0.95
## peptideModSeq = GAGSSEPVTGLDAK
## mZ.error = 0.001514 0.00156 0.003685 0.001914 0.001318
## 0.001313 0.000856 0.001846 0.003686 0.0380015
## uclei_diff_extraction_methods\20140910_01_fetuin_400amol_1.raw
## score = 41.54902
##
## size:
## Memory usage: 4776 bytes
It can be displayed using the function.
plot(res.genSwathIonLib@ionlibrary[[demoIdx]])