
1 Timeline Plots

The package contains only a subset of the most important data generated over a period of five years. To give an impression of the full data set, the timeline plots show an overview of all annotated samples (S) and workunits (W) in the B-Fabric system (Türker et al. 2010).

The NGS data (project p1644)

The mass spec data (project p1875)

2 Make Data (replaces make-data.R)

2.1 NL42_100K.fastq.gz

The sample NGS data contain 100K merged MiSeq reads in FASTQ format that demonstrate the linkage between nanobodies (NB) and flycodes (FC).

NL42_100K <- NestLink:::.getReadsFromFastq("inst/extdata/NL42_100K.fastq.gz")
save(NL42_100K, file="inst/extdata/NestLink_NL42_100K.RData")
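
For readers unfamiliar with the format, the following base R sketch shows how FASTQ records are laid out (four lines per read: identifier, sequence, separator, quality string). It is only an illustration and not the reader used internally by NestLink; the function name parseFastq and the two reads are made up for this example.

```r
# Two made-up reads in FASTQ format (four lines per record).
fastqLines <- c(
  "@read1", "ACGTACGT", "+", "IIIIIIII",
  "@read2", "GGCCTTAA", "+", "HHHHHHHH"
)

# Hypothetical helper: split FASTQ lines into id, sequence and quality.
parseFastq <- function(lines) {
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  idx <- seq(1, length(lines), by = 4)      # first line of each record
  data.frame(id = sub("^@", "", lines[idx]),
             sequence = lines[idx + 1],
             quality = lines[idx + 3],
             stringsAsFactors = FALSE)
}

reads <- parseFastq(fastqLines)
nchar(reads$sequence)                        # read lengths
```

For a gzip-compressed file, the lines could be obtained with readLines(gzfile(path)) before calling the helper.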

2.2 knownNB.txt

An optional part of the NestLink workflow is the use of known nanobodies in the sequencing experiment to estimate sensitivity and specificity levels. This example file contains nucleotide sequences of nanobodies that should be detectable in this experiment. Later in the workflow, these nanobodies are highlighted and labeled as known NB.
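
How such an estimate might be computed can be sketched in a few lines of base R. This is an assumption about the general idea, not the calculation performed by NestLink: all identifiers are invented, and the set of nanobodies known to be absent (needed for specificity) is hypothetical.

```r
# Invented identifiers; not taken from the NestLink data.
knownNB    <- c("NB1", "NB2", "NB3", "NB4")  # expected to be detectable
absentNB   <- c("NB5", "NB6")                # assumed not to be in the sample
detectedNB <- c("NB1", "NB2", "NB4", "NB6")  # reported by the analysis

tp <- sum(knownNB %in% detectedNB)           # known and detected
fn <- sum(!(knownNB %in% detectedNB))        # known but missed
fp <- sum(absentNB %in% detectedNB)          # absent but reported
tn <- sum(!(absentNB %in% detectedNB))       # absent and not reported

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
```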

2.3 nanobodyFlycodeLinkage.RData

This file holds the NGS ground truth, derived by applying the function runNGSAnalysis to the two previous files.

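library(ExperimentHub)
library(Biostrings)
library(testthat)
library(NestLink)

# Locate the sample FASTQ file in ExperimentHub
eh <- ExperimentHub()
expFile <- query(eh, c("NestLink", "NL42_100K.fastq.gz"))[[1]]
expect_true(file.exists(expFile))

# Work in a scratch directory
scratchFolder <- tempdir()
setwd(scratchFolder)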

knownNB_File <- query(eh, c("NestLink", "knownNB.txt"))[[1]]
knownNB_data <- read.table(knownNB_File,
                           sep='\t',
                           header = TRUE,
                           row.names = 1,
                           stringsAsFactors = FALSE)

knownNB <- Biostrings::translate(DNAStringSet(knownNB_data$Sequence))
names(knownNB) <- rownames(knownNB_data)
knownNB <- sapply(knownNB, toString)

param <- list()
param[['NB_Linker1']] <- "GGCCggcggGGCC"
param[['NB_Linker2']] <- "GCAGGAGGA"
param[['ProteaseSite']] <- "TTAGTCCCAAGA"
param[['FC_Linker']] <- "GGCCaaggaggcCGG"
param[['knownNB']] <- knownNB
param[['nReads']] <- 100
param[['minRelBestHitFreq']] <- 0.8 
param[['minConsensusScore']] <- 0.9
param[['maxMismatch']] <- 1
param[['minNanobodyLength']] <- 348
param[['minFlycodeLength']] <- 33
param[['FCminFreq']] <- 1

nanobodyFlycodeLinkage.RData <- runNGSAnalysis(file = expFile[1], param)
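
The linker parameters above suggest that the regions of interest are located by their flanking sequences. The base R sketch below illustrates that idea only; it assumes the nanobody lies between NB_Linker1 and NB_Linker2, which is not confirmed here, and it does not reproduce runNGSAnalysis. The helper extractBetween and the toy read are hypothetical.

```r
# Hypothetical helper: return the sequence between two flanking linkers,
# matching case-insensitively because the linkers are given in mixed case.
extractBetween <- function(read, left, right) {
  start <- regexpr(left, read, ignore.case = TRUE)
  if (start < 0) return(NA_character_)
  from <- as.integer(start) + attr(start, "match.length")
  rest <- substring(read, from)
  end <- regexpr(right, rest, ignore.case = TRUE)
  if (end < 0) return(NA_character_)
  substring(rest, 1, as.integer(end) - 1L)
}

# Toy read: padding, left linker, a short insert, right linker, padding.
toyRead <- paste0("TTTT", "GGCCGGCGGGGCC", "ATGGCTCAG", "GCAGGAGGA", "CCCC")
nb <- extractBetween(toyRead, "GGCCggcggGGCC", "GCAGGAGGA")

# Length thresholds are given in nucleotides: 348 nt correspond to 116 codons.
minNanobodyLength <- 348
minNanobodyLength / 3
```

In a real run, inserts shorter than the configured minimum length would presumably be discarded; the toy insert here is far shorter than 348 nucleotides.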

2.4 NB.tryptic and FC.tryptic

Both files are the output of the previous NGS step generating the linkage between NBs and FCs.

The files are used to demonstrate the detectability of the AA sequences.

The wrapper functions getNB and getFC extend the tables with the SSRC prediction (column ssrc) and the parent ion mass (column pim), both determined by using protViz.
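
To make the pim column more tangible, here is a minimal base R sketch of a parent ion mass calculation. It assumes singly protonated monoisotopic masses with carbamidomethylated cysteine; this is inferred from the pim values listed in the tables of this section and is not a statement about protViz internals. The helper name sketchParentIonMass is hypothetical.

```r
# Monoisotopic residue masses in Da (standard values).
residueMass <- c(
  G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276, V = 99.06841,
  T = 101.04768, C = 103.00919, L = 113.08406, I = 113.08406, N = 114.04293,
  D = 115.02694, Q = 128.05858, K = 128.09496, E = 129.04259, M = 131.04049,
  H = 137.05891, F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931
)
# Assumption: cysteine is carbamidomethylated (+57.02146 Da).
residueMass["C"] <- residueMass["C"] + 57.02146

# Hypothetical helper: [M+H]+ = sum of residues + water + proton.
sketchParentIonMass <- function(peptide) {
  aa <- strsplit(peptide, "")[[1]]
  stopifnot(all(aa %in% names(residueMass)))
  sum(residueMass[aa]) + 18.01056 + 1.00728
}

# Compare with the pim values reported for two NB peptides.
abs(sketchParentIonMass("AAAGITYYADSVK") - 1329.6685) < 1e-3
abs(sketchParentIonMass("AACCPVAR") - 904.4127) < 1e-3
```

Under these assumptions the sketch agrees with the tabulated values to within about 0.001 Da.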

The column ESP_Prediction was generated by using the service from https://genepattern.broadinstitute.org, see also Fusaro et al. (2009).

library(NestLink)
NB <- getNB()
FC <- getFC()

The first six rows of each table are shown below:

peptide               ESP_Prediction  cond  pim        ssrc      peptideLength
AAAGITYYADSVK         0.82378         NB    1329.6685  21.93845  13
AACCPVAR              0.39342         NB    904.4127   5.56465   8
AADPGSWGQGTPVTVSSELK  0.64844         NB    1986.9767  26.10345  20
AADYYYGMNHWGK         0.15954         NB    1575.6685  24.80345  13
AANPFGLVQGFGSWGK      0.44514         NB    1635.8278  40.19691  16
AAPDYWGQGTPVTVSSELK   0.39622         NB    2005.9865  31.76845  19

     peptide          ESP_Prediction  cond  pim       ssrc      peptideLength
120  GSAAAAADSWLTVR   0.75450         FC    1375.696  27.80445  14
121  GSAAAAATDWLTVR   0.76422         FC    1389.712  29.00445  14
122  GSAAAAATGWLTVR   0.65522         FC    1331.707  28.60445  14
123  GSAAAAATVWLR     0.65496         FC    1173.637  29.10445  12
124  GSAAAAAYEWLTVR   0.72754         FC    1465.743  33.10445  14
125  GSAAAADAAWQEGGR  0.53588         FC    1417.645  11.70445  15

2.5 F255744.RData and WU160118.RData

2.5.1 Mass spec data

The mass spec files are available through ProteomeXchange (PXD009301).

2.5.2 Compute the peptide spectrum matches

The mass spectra were assigned to peptide sequences using Matrix Science's Mascot Server (Perkins et al. 1999), version 2.5. The most important search parameters are listed in the table below.

Parameter   Value
COM         170819_MS1708116_NL5idx4to5_Competition2BG_db8_db10_swissprot_d_merge
FASTA 1     p1875_db8_20160704.fasta
FASTA 2     p1875_db10_20170817.fasta
TOL         10
TOLU        ppm
ITOL        0.6
ITOLU       Da
USERNAME    egloffp
CHARGE      2+
IT_MODS     Deamidated (NQ),Oxidation (M)
INSTRUMENT  ESI-TRAP
release     fgcz_swissprot_d_20140403.fasta

The results were exported as XML. The XML was parsed and converted to a data.frame using the protViz (Panse and Grossmann 2019) function protViz:::as.data.frame.mascot.
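
The precursor tolerance in the parameter table above (TOL 10, TOLU ppm) is relative to the mass being matched. The following sketch only illustrates what 10 ppm means numerically for one of the parent ion masses in this document; it makes no claim about how Mascot applies the tolerance internally. The helper ppmWindow is hypothetical.

```r
# Hypothetical helper: absolute mass window for a relative tolerance in ppm.
ppmWindow <- function(mass, tol = 10) {
  delta <- mass * tol / 1e6                  # ppm = parts per million
  c(lower = mass - delta, upper = mass + delta)
}

# Window around the parent ion mass of AAAGITYYADSVK (1329.6685).
w <- ppmWindow(1329.6685, tol = 10)
w
unname(diff(w)) / 2                          # half width in Da, about 0.0133
```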

2.5.3 Workflow available through B-Fabric

The above-described results and workflows are available for registered users in B-Fabric. However, it is not necessary to access B-Fabric in order to use this package.

2.6 PGexport2_normalizedAgainstSBstandards_Peptides.csv

This file contains mass spectrometry-based label-free quantitative (LFQ) results of nanobodies expressed in SMEG and COLI species.

  • Workunit: 158716 - QEXACTIVEHF_1
    • 20170919_16_62465_nl5idx1-3_6titratecoli.raw
    • 20170919_05_62465_nl5idx1-3_6titratecoli.raw
  • Workunit: 158717 - QEXACTIVEHF_1
    • 20170919_14_62466_nl5idx1-3_7titratesmeg.raw
    • 20170919_09_62466_nl5idx1-3_7titratesmeg.raw

Two LC-MS/MS runs were aligned in Progenesis QI (Nonlinear Dynamics) with an alignment score of 93.1 %, followed by peak picking with an allowed ion charge of +2 to +5.

3 Uploading to S3

#!/bin/bash

aws --profile AnnotationContributor s3 cp NestLink/F255744.RData s3://annotation-contributor/NestLink/F255744.RData --acl public-read

aws --profile AnnotationContributor s3 cp NestLink/WU160118.RData s3://annotation-contributor/NestLink/WU160118.RData --acl public-read

aws --profile AnnotationContributor s3 cp NestLink s3://annotation-contributor/NestLink --recursive --acl public-read

4 Overview/Getting started using Bioconductor ExperimentHub

Load the metadata:

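library(knitr)  # provides kable()

# Read the ExperimentHub metadata shipped with the package and render it
fl <- system.file("extdata", "metadata.csv", package='NestLink')
kable(metadata <- read.csv(fl, stringsAsFactors=FALSE))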
Title Description BiocVersion Genome SourceType SourceUrl SourceVersion Species TaxonomyId Coordinate_1_based DataProvider Maintainer RDataClass DispatchClass RDataPath Tags Notes
Sample NGS NB FC linkage data Sample NGS demonstratig the linkage between nanobodies (NB) and flycodes (FC). data in FASTQ 3.9 NA FASTQ https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644 Nov 28 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Lennart Opitz lopitz@fgcz.ethz.ch DNAStringSet FilePath NestLink/NL42_100K.fastq.gz NA md5=4a13c5c61a5b29f4fd8830c1c15419b6;
Flycodes tryptic digested Flycodes tryptic digested amino acid sequences with ESP_Prediction score. 3.9 NA TXT https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875 Nov 28 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch data.frame FilePath NestLink/FC.tryptic NA md5=f6faa7458350ce1805bec30e9ffdeaae;
Nanobodies tryptic digested Nanobodies tryptic digested amino acid sequences with ESP_Prediction score. 3.9 NA TXT https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875 Nov 28 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch data.frame FilePath NestLink/NB.tryptic NA md5=db85a806c5151113536b710d566d9cf3;
FASTA as ground-truth for unit testing FASTA data as ground-truth for unit testing. 3.9 NA RData https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644 Nov 28 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Lennart Opitz lopitz@fgcz.ethz.ch data.frame FilePath NestLink/nanobodyFlycodeLinkage.RData NA md5=57b2756fb0ebcf73d4036846580cb5b2;
Known nanobodies Known nanobodies as nucleic acid sequences. 3.9 NA TXT https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644 Nov 28 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Lennart Opitz lopitz@fgcz.ethz.ch data.frame FilePath NestLink/knownNB.txt NA md5=003bf82c58f0a96a2bd945d171dc907c;
Quantitaive results for SMEG and COLI Mass spectrometry based label free quantitative results of nanobodies expressed in SMEG and COLI species. 3.9 NA CSV https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875 Nov 28 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch data.frame FilePath NestLink/PGexport2_normalizedAgainstSBstandards_Peptides.csv NA md5=0ca525d0a65d4938f0cbc785b7e0d2d3; bfabric WU158716, WU158717
F255744 Mascot Search result F255744 peptide spectrum matches (PSMs) of Flycodes. 3.9 NA TXT https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-resource.html?id=409912 Dec 13 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch data.frame FilePath NestLink/F255744.RData NA md5=d5e4d13e9ecba4231d1808c6bb0bb454; R409912
WU160118 Mascot Search results WU160118 peptide spectrum matches (PSMs) Flycodes. 3.9 NA TXT https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-workunit.html?id=160118 Dec 13 2018 NA NA NA Functional Genomics Center Zurich (FGCZ) Markus Seeger m.seeger@imm.uzh.ch, Pascal Egloff p.egloff@imm.uzh.ch, Christian Panse cp@fgcz.ethz.ch data.frame FilePath NestLink/WU160118.RData NA md5=a17f4505e322d440bc0e9edf8e5277bb; bfabric WU160118
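
The Notes column of the metadata table records an md5 digest for each resource. The base R sketch below shows one way such a digest could be extracted and compared against a local file with tools::md5sum; the helper extractMd5 is hypothetical, and a temporary file stands in for a downloaded resource.

```r
# Hypothetical helper: pull the 32-character hex digest out of a Notes entry.
extractMd5 <- function(notes) {
  m <- regmatches(notes, regexpr("md5=[0-9a-f]{32}", notes))
  if (length(m) == 0) return(NA_character_)
  sub("^md5=", "", m)
}

# Digest recorded for PGexport2_normalizedAgainstSBstandards_Peptides.csv.
expected <- extractMd5(
  "md5=0ca525d0a65d4938f0cbc785b7e0d2d3; bfabric WU158716, WU158717")

# Demonstration on a temporary file standing in for a downloaded resource.
tmp <- tempfile()
writeLines("demo", tmp)
actual <- unname(tools::md5sum(tmp))
notesDemo <- paste0("md5=", actual, ";")
identical(extractMd5(notesDemo), actual)     # TRUE when the digests agree
```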

Query and load the NestLink package data from AWS S3:

library(ExperimentHub)

eh <- ExperimentHub()
query(eh, "NestLink")
## ExperimentHub with 8 records
## # snapshotDate(): 2019-05-07 
## # $dataprovider: Functional Genomics Center Zurich (FGCZ)
## # $species: NA
## # $rdataclass: data.frame, DNAStringSet
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH2063"]]' 
## 
##            title                                 
##   EH2063 | Sample NGS NB FC linkage data         
##   EH2064 | Flycodes tryptic digested             
##   EH2065 | Nanobodies tryptic digested           
##   EH2066 | FASTA as ground-truth for unit testing
##   EH2067 | Known nanobodies                      
##   EH2068 | Quantitaive results for SMEG and COLI 
##   EH2069 | F255744 Mascot Search result          
##   EH2070 | WU160118 Mascot Search results
load(query(eh, c("NestLink", "F255744.RData"))[[1]])
dim(F255744)
## [1] 15655    21
load(query(eh, c("NestLink", "WU160118.RData"))[[1]])
dim(WU160118)
## [1] 128390     22
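
Because load() restores objects under their saved names, it can silently overwrite objects in the workspace. As a hedged alternative to the calls above, the sketch below loads an .RData file into a dedicated environment and inspects what it contains; a small temporary file with a made-up object F_demo stands in for the ExperimentHub resources.

```r
# Create a small stand-in .RData file (F_demo is a made-up object).
tmp <- tempfile(fileext = ".RData")
F_demo <- data.frame(x = 1:3, y = c("a", "b", "c"))
save(F_demo, file = tmp)
rm(F_demo)

# Load into a separate environment instead of the global workspace.
e <- new.env()
loaded <- load(tmp, envir = e)   # load() returns the restored object names
loaded
dim(e[[loaded[1]]])
```

The same pattern would apply to a file path returned by query(eh, ...)[[1]].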

5 Session info

Here is the compiled output of sessionInfo():

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] knitr_1.22                  scales_1.0.0               
##  [3] ggplot2_3.1.1               NestLink_1.1.0             
##  [5] ShortRead_1.43.0            GenomicAlignments_1.21.1   
##  [7] SummarizedExperiment_1.15.0 DelayedArray_0.11.0        
##  [9] matrixStats_0.54.0          Biobase_2.45.0             
## [11] Rsamtools_2.1.2             GenomicRanges_1.37.0       
## [13] GenomeInfoDb_1.21.0         BiocParallel_1.19.0        
## [15] protViz_0.4.0               gplots_3.0.1.1             
## [17] Biostrings_2.53.0           XVector_0.25.0             
## [19] IRanges_2.19.0              S4Vectors_0.23.0           
## [21] ExperimentHub_1.11.1        AnnotationHub_2.17.2       
## [23] BiocFileCache_1.9.0         dbplyr_1.4.0               
## [25] BiocGenerics_0.31.0         BiocStyle_2.13.0           
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0                    bit64_0.9-7                  
##  [3] gtools_3.8.1                  shiny_1.3.2                  
##  [5] assertthat_0.2.1              interactiveDisplayBase_1.23.0
##  [7] highr_0.8                     BiocManager_1.30.4           
##  [9] latticeExtra_0.6-28           blob_1.1.1                   
## [11] GenomeInfoDbData_1.2.1        yaml_2.2.0                   
## [13] pillar_1.3.1                  RSQLite_2.1.1                
## [15] lattice_0.20-38               glue_1.3.1                   
## [17] digest_0.6.18                 RColorBrewer_1.1-2           
## [19] promises_1.0.1                colorspace_1.4-1             
## [21] plyr_1.8.4                    htmltools_0.3.6              
## [23] httpuv_1.5.1                  Matrix_1.2-17                
## [25] pkgconfig_2.0.2               bookdown_0.9                 
## [27] zlibbioc_1.31.0               purrr_0.3.2                  
## [29] xtable_1.8-4                  gdata_2.18.0                 
## [31] later_0.8.0                   tibble_2.1.1                 
## [33] withr_2.1.2                   lazyeval_0.2.2               
## [35] magrittr_1.5                  crayon_1.3.4                 
## [37] mime_0.6                      memoise_1.1.0                
## [39] evaluate_0.13                 hwriter_1.3.2                
## [41] tools_3.6.0                   stringr_1.4.0                
## [43] munsell_0.5.0                 AnnotationDbi_1.47.0         
## [45] compiler_3.6.0                caTools_1.17.1.2             
## [47] rlang_0.3.4                   grid_3.6.0                   
## [49] RCurl_1.95-4.12               rappdirs_0.3.1               
## [51] labeling_0.3                  bitops_1.0-6                 
## [53] rmarkdown_1.12                gtable_0.3.0                 
## [55] codetools_0.2-16              DBI_1.0.0                    
## [57] curl_3.3                      R6_2.4.0                     
## [59] dplyr_0.8.0.1                 bit_1.1-14                   
## [61] KernSmooth_2.23-15            stringi_1.4.3                
## [63] Rcpp_1.0.1                    tidyselect_0.2.5             
## [65] xfun_0.6

References

Fusaro, V. A., D. R. Mani, J. P. Mesirov, and S. A. Carr. 2009. “Prediction of high-responding peptides for targeted protein assays by mass spectrometry.” Nat. Biotechnol. 27 (2):190–98.

Panse, Christian, and Jonas Grossmann. 2019. protViz: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Perkins, David N., Darryl J. C. Pappin, David M. Creasy, and John S. Cottrell. 1999. “Probability-Based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data.” Electrophoresis 20 (18):3551–67. https://doi.org/10.1002/(sici)1522-2683(19991201)20:18<3551::aid-elps3551>3.0.co;2-2.

Türker, Can, Fuat Akal, Dieter Joho, Christian Panse, Simon Barkow-Oesterreicher, Hubert Rehrauer, and Ralph Schlapbach. 2010. “B-Fabric: The Swiss Army Knife for Life Sciences.” In Proceedings of the 13th International Conference on Extending Database Technology - EDBT 10. ACM Press. https://doi.org/10.1145/1739041.1739135.