Making Organism packages

by Marc Carlson


Making Organism Packages is a straightforward process using the helper functions makeOrgPackageFromNCBI() and makeOrgPackage(). If your package is available at NCBI with an identifiable NCBI Taxonomy ID you can try makeOrgPackageFromNCBI(). However, even if this fails, the second and more general makeOrgPackage() function will allow you to make a database package using only data.frames of data.

Making use of makeOrgPackageFromNCBI()

The makeOrgPackageFromNCBI() function, will take several different arguments to help declare who made the package and what species it is for. But the most important arguement is the tax_id arguement. That arguement is uses to search for gene records from NCBI that go with that species. So when calling this function it is important to choose the correct NCBI taxonomny ID for this argument. You can look up information about the taconomy ID at NCBIs website. Here is an example of calling this function to make a package for zebrafinch.

makeOrgPackageFromNCBI(version = "0.1",
                       author = "Some One <>",
                       maintainer = "Some One <>",
                       outputDir = ".",
                       tax_id = "59729",
                       genus = "Taeniopygia",
                       species = "guttata")

Making use of makeOrgPackage()

Sometimes you may not find what you need at NCBI, most commonly this is because they may just not have enough data about the organism you are interested in. But often other resources will have annotation data that you want to make into an organism package. When this happens you can use the much more general makeOrgPackage() function. This function takes more arguments, but it does not rely on NCBI in order to run. We do however still ask for you to provice a tax ID for the metadata (even though we are not using it to look up data from NCBI). Many of the other arguments are also the same as the makeOrgPackageFromNCBI() function. But a key difference is that the 1st argument for this function is (…). For that argument, we want you to provide named arguments corresponding to data.frames of data. Each named argument will become a table name in the resulting database, and each field name (for the data.frames) will become the field names of the database as well as the names looked up by the columns() and keytypes() methods. With the exception of any table that is named by the goTable argument (more on this below), there are not too many restrictions on what kind of data you can put into the data.frame. But one rule you must follow is that the 1st collumn of each data.frame has to correspond to a central gene ID and be labeled “GID”.

Finally, the goTable method is also new. That argument indicates when one of the data.frames contains GO information. If you choose to use this argument, makeOrgPackage() will post-process your GO data to 1) remove IDs that are too new and 2) create a second table to also represent the GOALL, EVIDENCEALL and ONTOLOGYALL fields for the select method etc. However to use the goTable argument, you have to follow a strict convention with the data. Such a data.frame must have three columns only and these must correspond to the gene id, GO id and evidence codes. These columns also have to be named as “GID”, “GO” and “EVIDENCE” Below is an example that parses an example file into three data.frame and that makes use of the goTable argument.

## Makes an organism package for Zebra Finch data.frames:
finchFile <- system.file("extdata","finch_info.txt",
finch <- read.table(finchFile,sep="\t")

## Now prepare some data.frames
fSym <- finch[,c(2,3,9)]
fSym <- fSym[fSym[,2]!="-",]
fSym <- fSym[fSym[,3]!="-",]
colnames(fSym) <- c("GID","SYMBOL","GENENAME")

fChr <- finch[,c(2,7)]
fChr <- fChr[fChr[,2]!="-",]
colnames(fChr) <- c("GID","CHROMOSOME")

finchGOFile <- system.file("extdata","GO_finch.txt",
fGO <- read.table(finchGOFile,sep="\t")
fGO <- fGO[fGO[,2]!="",]
fGO <- fGO[fGO[,3]!="",]
colnames(fGO) <- c("GID","GO","EVIDENCE")

## Then call the function
makeOrgPackage(gene_info=fSym, chromosome=fChr, go=fGO,
               maintainer="Some One <>",
               author="Some One <>",
               outputDir = ".",

## then you can call install.packages based on the return value
install.packages("./", repos=NULL)