# 1 Introduction

biodb can handle in-house compound databases stored either in a CSV file, using the comp.csv.file connector, or in an SQLite file, using the comp.sqlite connector.

Both connectors accept the creation of new entries through biodb methods. The CSV file connector can read any CSV file as input, while the SQLite connector can only interact with a database file created by biodb.

In the vignette Manipulating entry objects you can learn how to create a new, empty SQLite or CSV file connector in order to copy entries into it from another connector, and thus create your own database.

To start we create an instance of the BiodbMain class:

mybiodb <- biodb::newInst()
## INFO  [08:58:47.312] Loading definitions from package biodb version 1.0.4.

# 2 CSV File connector

To make loading the file easier, you should use the tabulation character as the column separator and name the columns of your file with biodb standard field names. However, if your CSV file does not follow the biodb standard, you can also declare a custom column separator character and a mapping between the column names of your file and biodb field names just before loading the file.

Once the connection to the file is defined, you can use the connector to your in-house file as any other compound database connector.

## 2.1 Creating a connector

In order to create a connector to a CSV database, you have to provide the path to your CSV file. This is done with the url parameter of the createConn() method.

If your CSV file respects the biodb defaults (see below), then no further information is required.

If your CSV file does not respect the biodb standard, then you will have to modify the defaults on the connector instance, before the CSV file is loaded. Thus it has to be done immediately after the connector creation.

In the following sub-sections we are going to see how to load a biodb standard CSV file and a custom CSV file.

Here is a biodb standard CSV file containing an extract of the ChEBI database:

csvUrl <- system.file("extdata", "chebi_extract.tsv", package='biodb')

See table 1 for the content of this file.

Table 1: Excerpt from compound database TSV file.
| accession | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles | description |
|---|---|---|---|---|---|---|---|
| 1018 | C2H8AsNO3 | 168.97201 | 169.012 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | |
| 1390 | C8H8O2 | 136.05243 | 136.148 | C06224 | 3,4-Dihydroxystyrene | Oc1ccc(C=C)cc1O | |
| 1456 | C3H9NO2 | 91.06333 | 91.109 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | |
| 1549 | C3H5O3R | 89.02387 | 89.070 | C03834 | 3-hydroxymonocarboxylic acid | OC([*])CC(O)=O | |
| 1894 | C5H11NO | 101.08406 | 101.147 | C10974 | 4-Methylaminobutanal | CNCCCC=O | |
| 1932 | C6H6NR | 92.05002 | 92.119 | C03084 | 4-Substituted aniline | Nc1ccc([*])cc1 | |

This CSV file respects the biodb defaults: the column separator is the tabulation character, column names use biodb standard entry field names, and string values may be quoted with double quotes (").
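As a minimal sketch of what such a file looks like, the base R snippet below writes two rows of Table 1 to a temporary tab-separated file and reads them back. The temporary file and the reduced set of columns are illustrative choices, not part of biodb; only the field names and values come from the table.

```r
# Two entries taken from Table 1, with biodb standard field names as column names.
compounds <- data.frame(
    accession = c("1018", "1390"),
    formula = c("C2H8AsNO3", "C8H8O2"),
    monoisotopic.mass = c(168.97201, 136.05243),
    name = c("2-Aminoethylarsonate", "3,4-Dihydroxystyrene"),
    stringsAsFactors = FALSE
)

# Write with a tabulation separator; character values are double-quoted.
tsvFile <- tempfile(fileext = ".tsv")
write.table(compounds, tsvFile, sep = "\t", quote = TRUE, row.names = FALSE)

# Read the file back with the same conventions, keeping accessions as text.
readBack <- read.table(tsvFile, header = TRUE, sep = "\t", quote = "\"",
                       colClasses = c(accession = "character"),
                       stringsAsFactors = FALSE)
print(readBack)
```

Reading the accession column as character avoids any numeric reformatting of identifiers.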

We instantiate the connector by passing the URL to the factory:

conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl)

We will later use this connector to run the examples of this vignette.

Here is a custom CSV file containing the same extract of the ChEBI database as the biodb standard CSV file:

csvUrl2 <- system.file("extdata", "chebi_extract_custom.csv", package='biodb')

Only the column separator character and some column names have been changed.

We now create a connector for this custom CSV file:

conn2 <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl2)

At this step the file has not yet been loaded. We can thus customize the connector in order for the CSV file parsing to proceed correctly. The effective loading of the CSV file will happen when you run a method of the connector that requires the data.
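The deferred loading described here can be pictured with a small closure in base R. This is only a hypothetical illustration, not biodb's implementation: `makeLazyReader()`, its methods, and the error raised on late changes are all assumptions made for the sketch.

```r
# Hypothetical lazy reader: NOT biodb code, only an illustration of deferred loading.
makeLazyReader <- function(path) {
    sep <- "\t"      # default separator
    data <- NULL     # nothing is read when the reader is created
    list(
        setSep = function(newSep) {
            # Design choice of this sketch: refuse changes once the file is parsed.
            if (!is.null(data)) stop("File already loaded; set the separator earlier.")
            sep <<- newSep
        },
        isLoaded = function() !is.null(data),
        getData = function() {
            # The file is parsed on first access only.
            if (is.null(data))
                data <<- read.table(path, header = TRUE, sep = sep, quote = "",
                                    stringsAsFactors = FALSE)
            data
        }
    )
}

# Small semicolon-separated file used for the demonstration.
csvFile <- tempfile(fileext = ".csv")
writeLines(c("ID;mass", "1018;168.97201"), csvFile)

reader <- makeLazyReader(csvFile)
reader$setSep(";")                  # allowed: the file has not been read yet
loadedBefore <- reader$isLoaded()
firstRow <- reader$getData()        # triggers the actual parsing
loadedAfter <- reader$isLoaded()
```

The point of the sketch is the ordering: parsing options have to be in place before the first call that needs the data.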

The first step to customize your connector is to set the separator character:

conn2$setCsvSep(';')

Then you may change the quote characters:

conn2$setCsvQuote('')

Here we specify with an empty string that this CSV file does not use quotes for character values.

Finally, you have to map each custom column name to the name of a biodb entry field. To do this, call the setField() method once for each column name, giving the biodb field name as first argument and the column name as second argument. In our case this gives:

conn2$setField('accession', 'ID')
## INFO [08:58:47.702] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract_custom.csv".
conn2$setField('kegg.compound.id', 'kegg')
conn2$setField('monoisotopic.mass', 'mass')
conn2$setField('molecular.mass', 'molmass')

You will notice that with the first call to setField() an information message tells you that the CSV file has been loaded.
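To make the effect of these settings concrete, here is a base R sketch that parses a few rows in a custom layout and renames the mapped columns. The inline rows and the column order are assumptions for illustration (the actual content of chebi_extract_custom.csv is not shown in this vignette); only the column names ID, kegg, mass and molmass come from the calls above.

```r
# Assumed custom layout: semicolon separator, no quoting, custom column names.
customText <- c(
    "ID;formula;mass;molmass;kegg",
    "1018;C2H8AsNO3;168.97201;169.012;C07279",
    "1390;C8H8O2;136.05243;136.148;C06224"
)
custom <- read.table(text = customText, header = TRUE, sep = ";", quote = "",
                     colClasses = c(ID = "character"), stringsAsFactors = FALSE)

# Mapping from biodb field name to custom column name, as passed to setField().
fieldMap <- c(accession = "ID", kegg.compound.id = "kegg",
              monoisotopic.mass = "mass", molecular.mass = "molmass")

# Rename the mapped columns; unmapped columns (here formula) keep their name.
idx <- match(fieldMap, names(custom))
names(custom)[idx] <- names(fieldMap)
print(names(custom))
```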

It is possible to associate several column names to a single biodb field, in which case you have to provide a character vector containing your column names. The values of the resulting biodb field will be the concatenation of the values of your selected columns, in the order specified. Because of the concatenation of your values, the type of the targeted biodb field must be character. This is particularly useful for the accession field, which must correspond to a unique entry inside your CSV file. Depending on your CSV file, you may need to associate several columns to create a valid accession value that identifies a unique entry.
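The idea of building an accession from several columns can be sketched in base R as follows. The table and its column names `batch` and `number` are hypothetical, and the sketch joins values without a separator because this vignette does not say whether biodb inserts one. With such a file, the corresponding connector call would presumably take the form conn2$setField('accession', c('batch', 'number')).

```r
# Hypothetical table in which neither column alone identifies a row.
inHouse <- data.frame(
    batch = c("A", "A", "B"),
    number = c("1", "2", "1"),
    stringsAsFactors = FALSE
)

# Concatenate the columns, in the chosen order, to build the accession.
inHouse$accession <- paste0(inHouse$batch, inHouse$number)

# Check that the combined value identifies each row uniquely.
isUnique <- anyDuplicated(inHouse$accession) == 0
print(inHouse)
print(isUnique)
```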

## 2.2 Retrieving entries

Retrieving entries is done as with any other connector in biodb, using their accession numbers. The returned value is a list of BiodbEntry objects:

entries <- conn$getEntry(c('1018', '1456', '16750', '64679'))
## INFO [08:58:47.732] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract.tsv".
entries
## [[1]]
## Biodb Compound CSV File entry instance 1018.
##
## [[2]]
## Biodb Compound CSV File entry instance 1456.
##
## [[3]]
## Biodb Compound CSV File entry instance 16750.
##
## [[4]]
## Biodb Compound CSV File entry instance 64679.

From a list of entries, you can obtain a data frame with their values:

entriesDf <- mybiodb$entriesToDataframe(entries)

See table 2 for the content of this data frame.

Table 2: Some entries from the compound database.
| accession | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles | description | comp.csv.file.id |
|---|---|---|---|---|---|---|---|---|
| 1018 | C2H8AsNO3 | 168.97201 | 169.0120 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | | 1018 |
| 1456 | C3H9NO2 | 91.06333 | 91.1090 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | | 1456 |
| 16750 | C10H13N5O5 | 283.09170 | 283.2409 | C00387 | guanosine | Nc1nc2n(cnc2c(=O)[nH]1)[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O | A purine nucleoside in which guanine is attached to ribofuranose via a beta-N(9)-glycosidic bond. | 16750 |
| 64679 | C9H18NO11P | 347.06180 | 347.2131 | NA | O-(alpha-D-mannose-1-phosphoryl)-L-serine | N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O | A mannose phosphate in which in which the phosphate group of alpha-D-mannose 1-phosphate is esterified by the alcoholic hydroxy group of L-serine. | 64679 |

See the vignette Manipulating entry objects to learn what you can do with biodb entry objects, and see also the help page of the class with ?biodb::BiodbEntry.

## 2.3 Searching for entries

It is possible to search for entries by mass inside a compound database:

conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))
## [1] 16750 35485 40304

The function returns a list of accession numbers that you can use with the getEntry() method to retrieve full entry objects.

The tolerance can also be expressed in PPM:

conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, ppm=10)))
## [1] 16750 35485 40304

or with a range:

conn$searchForEntries(list(monoisotopic.mass=list(min=283.091, max=283.093)))
## [1] 16750 35485 40304

You can set a maximum to the number of entries returned with the max.results parameter:

conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)), max.results=2)
## [1] 16750 35485
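The three ways of expressing a mass tolerance can be compared by converting each one to an explicit interval. The base R sketch below does this for the query values used in this section; it assumes the conventional definition of PPM (parts per million of the query mass) and symmetric windows, and biodb's exact boundary handling may differ.

```r
# Query mass used in the searches of this section.
mass <- 283.0917

# Absolute tolerance: delta is taken as a half-width in mass units.
deltaWindow <- c(min = mass - 0.1, max = mass + 0.1)

# Relative tolerance: ppm is parts per million of the query mass
# (assumed conventional definition).
ppm <- 10
ppmHalfWidth <- mass * ppm * 1e-6
ppmWindow <- c(min = mass - ppmHalfWidth, max = mass + ppmHalfWidth)

# Explicit range, as passed with min and max.
rangeWindow <- c(min = 283.091, max = 283.093)

# One row per tolerance style, for side-by-side comparison.
windows <- rbind(delta = deltaWindow, ppm = ppmWindow, range = rangeWindow)
print(windows)
```

Under these assumptions the 10 PPM window is far narrower than the 0.1 delta window, which is worth keeping in mind when a search returns more hits than expected.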

To get a list of all possible mass fields in biodb, run:

mybiodb$getEntryFields()$getFieldNames(type='mass')
## [1] "average.mass"      "molecular.mass"    "monoisotopic.mass"
## [4] "nominal.mass"

To get information on these fields run:

mybiodb$getEntryFields()$get(c('monoisotopic.mass', 'nominal.mass'))
## $monoisotopic.mass
## Entry field "monoisotopic.mass".
##   Description: Monoisotopic mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass of the primary isotope of the elements including the mass defect (mass difference between neutron and proton, and nuclear binding energy). Used with high resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
##   Class: double.
##   Type: mass.
##   Cardinality: one.
##   Aliases: exact.mass.
##
## $nominal.mass
## Entry field "nominal.mass".
##   Description: Nominal mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass number of the most abundant isotope of each atom. Typically used with low resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
##   Class: integer.
##   Type: mass.
##   Cardinality: one.
##   Aliases: NA.

To check if a connector is searchable by a field, use the following method:

conn$isSearchableByField('monoisotopic.mass')
## [1] TRUE

To get the list of searchable fields for a connector, run:

conn$getSearchableFields()
## [1] "name"              "monoisotopic.mass" "molecular.mass"

Entries are also searchable by name:

conn$searchForEntries(list(name='deoxyguanosine'))
## [1] 40304

And it is possible to combine a search by mass with a search by name:

conn$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1)))
## [1] 16750 40304

## 2.4 Annotation of an MS file

Your in-house chemical database can be used to annotate a mass spectrum, using a data frame or a vector as input. Annotation is done using the annotateMzValues() method, which is a generic method. It is thus available for all compound databases that allow search on masses. You will obtain a new data frame with appended columns taken from the chemical database.

Here is an input data frame example with M/Z values in a column:

msTsv <- system.file("extdata", "ms.tsv", package='biodb')
mzDf <- read.table(msTsv, header=TRUE, sep="\t")

See table 3 for the content of the input.

Table 3: Input M/Z values.
| mz | rt |
|---|---|
| 282.0839 | 334 |
| 283.0623 | 872 |
| 346.0546 | 536 |
| 821.3964 | 740 |
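The matching that an annotation performs can be pictured with a few lines of base R. This sketch is not biodb's algorithm: it assumes that negative mode ions are deprotonated molecules ([M-H]-), so that the neutral mass is the M/Z plus one proton mass, and it uses a plain tolerance of 0.001. The database masses are the monoisotopic masses of entries 16750 and 64679 from Table 2.

```r
# M/Z values from Table 3.
mz <- c(282.0839, 283.0623, 346.0546, 821.3964)

# ASSUMPTION: negative mode ions are [M-H]-, so the neutral mass is the
# M/Z plus the mass of a proton (about 1.007276 u).
protonMass <- 1.007276
neutralMass <- mz + protonMass

# Monoisotopic masses of two entries from Table 2.
dbMass <- c("16750" = 283.0917, "64679" = 347.0618)

# A peak matches an entry when the masses differ by at most the tolerance.
tol <- 1e-3
matches <- outer(neutralMass, dbMass, function(m, d) abs(m - d) <= tol)
rownames(matches) <- mz
print(matches)
```

Each row of the resulting logical matrix is an input peak and each column a database entry, so a row with no TRUE value corresponds to a peak left without annotation.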

We run the annotation with the annotateMzValues() method:

annotDf <- conn$annotateMzValues(mzDf, mz.tol=1e-3, ms.mode='neg',
    mz.tol.unit='plain',
    fields=c('accession', 'name', 'formula', 'molecular.mass', 'monoisotopic.mass'),
    prefix='mydb.', fieldsLimit=1)

See table 4 for the results. Inside this table, the values coming from the database entry fields have been prefixed with the value provided in the prefix parameter. The default value of this parameter would be the name of the database, but you can set it to any value you like.

The first parameter is the input, as a data frame or a numeric vector. In the case of a data frame, the column containing the M/Z values must be named mz, or you have to specify its name using the mz.col parameter.

The mz.tol and mz.tol.unit parameters are used to set the tolerance; see the manual of the class BiodbCompounddbConn.

You can set the mass field to use in the database with the mass.field parameter (default is monoisotopic.mass).

By default all entry fields from the database will be copied into the output data frame, but you can restrict the output to a custom set of fields using the fields parameter.

The fieldsLimit parameter is used to limit the number of values output for fields that may contain more than one value. Here it is used for the 'name' field, which may contain more than one name for each entry. By setting the parameter to 1 we select only the first name for each entry.

You will find a complete description of this method and other compound methods by running ?biodb::BiodbCompounddbConn.

Table 4: The annotated mass spectrum. Columns prefixed with “mydb.” come from the compound database.
| mz | rt | mydb.accession | mydb.formula | mydb.molecular.mass | mydb.monoisotopic.mass | mydb.name |
|---|---|---|---|---|---|---|
| 282.0839 | 334 | 16750 | C10H13N5O5 | 283.2409 | 283.0917 | guanosine |
| 282.0839 | 334 | 35485 | C10H13N5O5 | 283.2409 | 283.0917 | adenosine 1-oxide |
| 282.0839 | 334 | 40304 | C10H13N5O5 | 283.2407 | 283.0917 | 8-hydroxy-2’-deoxyguanosine |
| 283.0623 | 872 | NA | NA | NA | NA | NA |
| 346.0546 | 536 | 64679 | C9H18NO11P | 347.2131 | 347.0618 | O-(alpha-D-mannose-1-phosphoryl)-L-serine |
| 821.3964 | 740 | 15939 | C42H62O16 | 822.9321 | 822.4038 | glycyrrhizinic acid |

See also the vignette In-house mass spectra database for annotation using a mass spectra database.

# 3 SQLite connector

The SQLite connector operates in the same way as the CSV file connector, except for the instantiation step, for which it needs an SQLite file as input instead of a CSV file. Moreover, the SQLite file needs to be in biodb format (the relations need to be created by biodb).

Here is a biodb SQLite compounds file, already filled with entries:

sqliteFile <- system.file("extdata", "generated", "chebi_extract.sqlite", package='biodb')

We create a connector from this file:

sqliteConn <- mybiodb$getFactory()$createConn('comp.sqlite', url=sqliteFile)

We can search inside this database the same way we have been searching inside the CSV file database:

sqliteConn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))
## [1] "16750" "35485" "40304"

# 4 Closing biodb instance

Do not forget to terminate your biodb instance once you are done with it:

mybiodb$terminate()
## INFO  [08:58:49.076] Closing BiodbMain instance...
## INFO  [08:58:49.079] Connector "comp.csv.file" deleted.
## INFO  [08:58:49.081] Connector "comp.csv.file.1" deleted.
## INFO  [08:58:49.083] Connector "comp.sqlite" deleted.