# automatic RNA-Seq present/absent gene expression calls generation

#### 2021-10-27

BgeeCall is a collection of functions that uses Bgee expertise to create RNA-Seq gene expression present/absent calls.

The BgeeCall package allows to:

• Generate calls of presence/absence of expression at the gene level. You can generate these calls for any RNA-Seq samples as long as reference intergenic sequences have been generated by the Bgee team or by the Bgee community.
• Paralellize calls generation on a cluster with slurm queuing system.
• Collect stats across libraries, provided from different calls approaches.
• Merging calls from different libraries.

If you find a bug or have any issues with BgeeCall please write a bug report in our GitHub issues manager.

## How present/absent calls are generated

In Bgee RNA-Seq calls are generated using a threshold specific to each RNA-Seq library, calculated using reads mapped to reference intergenic regions. This is unlike the more usual use of an arbitrary threshold below which a gene is not considered as expressed (e.g log2(TPM) = 1).

### Bgee database

Bgee is a database to retrieve and compare gene expression patterns in multiple animal species and produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It notably integrates RNA-Seq libraries for 29 species.

### Reference intergenic regions

Reference intergenic regions are defined in the Bgee RNA-Seq pipeline. Candidate intergenic regions are defined using gene annotation data. For each species, over all available libraries, reads are mapped to these intergenic regions with kallisto, as well as to genes. This “intergenic expression” is deconvoluted to distinguish reference intergenic from non annotated genes, which have higher expression. Reference intergenic regions are then defined as intergenic regions with low expression level over all RNA-Seq libraries, relative to genes. This step allows not to consider regions wrongly considered as intergenic because of potential gene annotation quality problem as intergenic. For more information please refer to the Bgee RNA-Seq pipeline.

### Threshold of present/absent

BgeeCall pipeline allows to download reference intergenic regions resulting from the expertise of the Bgee team. Moreover BgeeCall allows to use these reference intergenic regions to automatically generate calls for your own RNA-Seq libraries as long as the species is integrated to Bgee.

By default BgeeCall calculate a pValue to define calls. By default genes are consider present if the pValue is lower or equal to 0.05. More information on this pValue and potential other approaches to generate the calls are available here

## Installation

In R:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BgeeCall")

## How to use the BgeeCall package

BgeeCall is highly tunable. Do not hesitate to have a look at the reference manual to have a precise descripton of all slots of the 4 main S4 classes (AbundanceMetadata, KallistoMetadata, BgeeMetadata and UserMetadata) or of all available functions. BgeeCall needs kallisto to run. If you do not have kallisto installed you will find more information how to install it here

library(BgeeCall)

### Quick start

With the BgeeCall package it is easy to generate present/absent gene expression calls. Generation of the kallisto transcriptome index can require a lot of time. As the time needed for this step depend on the size of the transcriptome we choose, as an example, the smallest transcriptome file among all species available on Bgee (C. elegans). To generate these calls you will need :

• a transcriptome
• gene annotations

For this vignette we created a toy fastq file example based on the SRX099901 RNA-Seq library using the ShortRead R package

library("ShortRead")
sampler <- FastqSampler(file.path("absolute_path","/SRX099901/SRR350955.fastq.gz"), 48000)
set.seed(1); SRR350955 <- yield(sampler)
writeFastq(object = SRR350955,
file =file.path( "absolute_path","SRX099901_subset", "SRR350955_subset.fastq.gz"),
mode = "w", full = FALSE, compress = TRUE)

In this example we used the Bioconductor AnnotationHub to load transcriptome and gene annotations but you can load them from wherever you want.

ah <- AnnotationHub::AnnotationHub()
ah_resources <- AnnotationHub::query(ah, c("Ensembl", "Caenorhabditis elegans", "84"))
annotation_object <- ah_resources[["AH50789"]]
# remove MtDNA not tag as C. elegans genome
annotation_object <- dropSeqlevels(annotation_object, "MtDNA", "coarse")
transcriptome_object <- rtracklayer::import.2bit(ah_resources[["AH50453"]])

Once you have access to transcriptome, gene annotations and your RNA-Seq library, an object of class UserMetadata has to be created.

# create an object of class UserMetadata and specify the species ID
user_BgeeCall <- new("UserMetadata", species_id = "6239")
# import annotation and transcriptome in the user_BgeeCall object
# it is possible to import them using an S4 object (GRanges, DNAStringSet) or a file (gtf, fasta)
user_BgeeCall <- setAnnotationFromObject(user_BgeeCall, annotation_object, "WBcel235_84")
user_BgeeCall <- setTranscriptomeFromObject(user_BgeeCall, transcriptome_object, "WBcel235")
# provide path to the directory of your RNA-Seq library
user_BgeeCall <- setRNASeqLibPath(user_BgeeCall,
system.file("extdata",
"SRX099901_subset",
package = "BgeeCall"))

Note that in BgeeCall is possible to specify the annotation source. The default source is ensembl, but gencode files can also be used, by specifying that in the attribute gtf_source in the class UserMetadata.

And that’s it… You can run the generation of your present/absent gene expression calls

calls_output <- generate_calls_workflow(userMetadata = user_BgeeCall)
#> Querying Bgee to get intergenic release information...
#> Note: importing abundance.h5 is typically faster than abundance.tsv
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Note: importing abundance.h5 is typically faster than abundance.tsv
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Generate present/absent expression calls using pValuecutoff

Each analyze generates 5 files and return path to each one of them.

• calls_tsv_path : path to main tsv file with TPM, count, length, biotype, type, and presence/absence of expression summarized at gene level (or at the transcript level if it was requested)
head(read.table(calls_output$calls_tsv_path, header = TRUE), n = 5) #> id abundance counts length biotype type zScore #> 1 WBGene00000001 27.10422 2 1556.20 protein_coding genic 1.6422092 #> 2 WBGene00000002 0.00000 0 1761.00 protein_coding genic NA #> 3 WBGene00000003 0.00000 0 1549.00 protein_coding genic NA #> 4 WBGene00000004 13.87559 1 1519.92 protein_coding genic 0.9143951 #> 5 WBGene00000005 0.00000 0 1487.00 protein_coding genic NA #> pValue call #> 1 0.05027333 absent #> 2 NA absent #> 3 NA absent #> 4 0.18025463 absent #> 5 NA absent • cutoff_info_file_path : path to tsv summary of the analyze containing the proportion of gene, protein coding and intergenic defined as expressed. It also contains the library ID and the present/absent TPM threshold read.table(calls_output$cutoff_info_file_path)
#>                             V1                 V2
#> 1                    libraryId   SRX099901_subset
#> 2                    cutoffTPM            27.1747
#> 3       proportionGenicPresent   17.8245677723828
#> 4           numberGenicPresent               3928
#> 5                  numberGenic              22037
#> 6      proportionCodingPresent   19.1128282877684
#> 7          numberPresentCoding               3908
#> 8                 numberCoding              20447
#> 9  proportionIntergenicPresent 0.0233808744447042
#> 10     numberIntergenicPresent                  1
#> 11            numberIntergenic               4277
#> 12                pValueCutoff               0.05
#> 13              meanIntergenic   5.98186903584813
#> 14                sdIntergenic   2.50920239407257
• abundance_tsv : path to tsv kallisto quant output file
head(read.table(calls_output$abundance_tsv, header = TRUE), n = 5) #> target_id length eff_length est_counts tpm #> 1 Y110A7A.10 1787 1556.20 2 27.0993 #> 2 F27C8.1 1940 1761.00 0 0.0000 #> 3 F07C3.7 1728 1549.00 0 0.0000 #> 4 F52H2.2 1739 1519.92 1 13.8730 #> 5 T13A10.10a 1734 1555.00 0 0.0000 • TPM_distribution_path : path to plot in pdf reprensting density distribution of TPM values for all sequences, protein coding sequences, and intergenic sequences. The grey line corresponds to TPM threshold used to generate present/absent calls. openPDF(calls_output$TPM_distribution_path)
• S4_slots_summary : path to tsv file containing a summary of values used for the most important slots of the three S4 classes (UserMetadata, KallistoMetadata, and BgeeMetadata).
read.table(calls_output\$S4_slots_summary, header = TRUE, sep = "\t")
#>                                     Slot.name
#> 13                                 output_dir
#>                                                                                         Slot.value
#> 1                                                                                         kallisto
#> 2                                                                                            FALSE
#> 3                                                                                            FALSE
#> 4                                                                                             0.05
#> 5                                                                                               50
#> 6                                                                                              0.1
#> 7                                                                                             6239
#> 8                                                                                               51
#> 9                             /tmp/RtmpRSl5bv/Rinste546954fcf0c8/BgeeCall/extdata/SRX099901_subset
#> 10                                                                                        WBcel235
#> 11                                                                                     WBcel235_84
#> 12                                                                                            TRUE
#> 13 /tmp/RtmpRSl5bv/Rinste546954fcf0c8/BgeeCall/extdata/intergenic_0.1/all_results/SRX099901_subset

Note that in the pValue approach (default method to generate present/absent gene expression calls) the calls file as well as the cutoff_info_file and the TPM_distribution files not contain extension name of the approach used. On the other hand, all the other approaches contain the approach name in the file names.

### Generate present/absent calls for more than one RNA-Seq library

You will potentialy be also interested to generate present/absent calls on different RNA-Seq libraries, potentially on different species, or using The main function generate_presence_absence() allows to generate present/absent calls from a UserMetadata object but also from a data frame or a tsv file depending on the arguments of the function you use. Please choose one of the three following arguments : - userMetadata : Allows to generate present/absent calls for one RNA-Seq library using one object of the class UserMetadata.
- userDataFrame : Provide a dataframe where each row correspond to one present/absent call generation. It allows to generate present/absent calls on different libraries, species, transcriptome, genome annotations, etc. - userFile : Similar to userDataFrame except that the information are stored in a tsv file. A template of this file called userMetadataTemplate.tsv is available at the root of the package.

Columns of the dataframe or the tsv file are :

• species_id : The NCBI ID of the species.
• run_ids : The runs of the RNA-Seq library you want to use for the generation of the calls. Allows to generate expression calls for a subset of the runs of one RNA-Seq library as described in Generate calls for a subset of RNA-Seq runs. If not interested by this option, leave the column empty.
• rnaseq_lib_path : Path to the directory containing all fastq files generated for this library. This directory can only contains single-end runs or paired-end runs.
• transcriptome_path : path to the transcriptome file.
• annotation_path : path to the genome annotation file. Works with GTF of GFF3 files.
• working_path : path to the working directory where results will be stored. Using the same working directory for different RNA-Seq libraries of the same species will allow to reuse previously generated data like the custom transcriptome index (generated from both transcriptome and reference intergenic sequences). By default the working path is defined by the getwd() function and correspond to the working directory of your R session. If not interested by this option, leave the column empty.
• output_directory : Both species results and RNA-Seq libraries results are by default stored at the same place using the value of the working_path column. However, this column allows you to define a different output_directory for RNA-Seq results. For instance it allows you to save calls information directly in the RNA-Seq directory. If not interested by this option, leave the column empty.

Once the file has been fill in expression calls can be generated with :

calls_output <- generate_calls_workflow(userFile = "path_to_your_file.tsv")

### Parallized generation of present/absent calls on a cluster

BgeeCall already implement everything you need to generate calls on a cluster if it uses slurm as queuing system. The same TSV file as described in the previous section will be used as input. In addition to tuning options available when running BgeeCall on your computer, It is possible to modify how slurm jobs are submitted. More information available in section Modify slurm options In order to optimize parallelization calls will be generated in 2 steps.
\itemize { Generate data at species level (e.g trancriptome with intergenic sequences, kallisto indexes)

Generate expression calls for each RNA-Seq libraries

# generate kallisto indexes
generate_slurm_indexes(userFile = "path_to_your_file.tsv")
# generate expression calls
generate_slurm_calls(userFile = "path_to_your_file.tsv")

### Reference intergenic sequences

#### Releases of reference intergenic sequences

Different releases of reference intergenic sequences are available. It is possible to list all these releases :

list_intergenic_release()
#>     release releaseDate                             FTPURL
#> 1       0.1  2018-12-21 ftp://ftp.bgee.org/intergenic/0.1/
#> 2       1.0  2021-06-11 ftp://ftp.bgee.org/intergenic/1.0/
#> 3       0.2  2019-02-07 ftp://ftp.bgee.org/intergenic/0.2/
#> 4 community  2019-07-22
#> 5    custom  2019-07-22
#>                                                    referenceIntergenicFastaURL
#> 1 ftp://ftp.bgee.org/intergenic/0.1/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 2 ftp://ftp.bgee.org/intergenic/1.0/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 3 ftp://ftp.bgee.org/intergenic/0.2/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 4
#> 5
#>   minimumVersionBgeeCall
#> 1                  0.9.9
#> 2                  0.9.9
#> 3                  0.9.9
#> 4                  1.1.0
#> 5                  1.1.0
#>                                                                                                                                                                                                     description
#> 1                                                                                                                                                                  intergenic regions used to generate Bgee 14.
#> 2                                                                                                                                                                  intergenic regions used to generate Bgee 15.
#> 3                                                                          cleaned intergenic sequences based on release 0.1 (remove blocks of Ns longer than 100 and sequences containing more than 5% of Ns).
#> 4                                                                           Release allowing to access to all reference intergenic sequences generated by the community and not present in Bgee for the moment.
#> 5 Release allowing to use your own FASTA reference intergenic sequences. When this release is selected BgeeCall will use the sequences at UserMetadata@custom_intergenic_path to generate present/absent calls.
#>                                                                            messageToUsers
#> 1
#> 2
#> 3                         be careful, this intergenic release has not been tested by Bgee
#> 4 These reference intergenic sequences have not been generated by Bgee. Use with caution.
#> 5                              You decided to use your own reference intergenic sequences

It is then possible to choose one specific release to create a BgeeMetadata object. Always use the setter method setIntergenicRelease() when changing the release of an already existing BgeeMetadata object.

# create BgeeMetadata object and define one reference intergenic release
bgee <- new("BgeeMetadata", intergenic_release = "0.1")
#> Querying Bgee to get intergenic release information...
bgee <- setIntergenicRelease(bgee, "0.2")
#> IMPORTANT : be careful, this intergenic release has not been tested by Bgee

By default the reference intergenic release used when a BgeeMetadata object is created is the last stable one created by the Bgee team.

#### Core reference intergenic from Bgee

Core reference intergenic releases are created by the Bgee team when a lot of new RNA-Seq libraries have been manually curated for already existing species and/or for new species. These releases are the only ones with a release number (e.g “0.1”). Each of these releases contains reference intergenic sequences for a list of species. Bgee reference intergenic sequences have been generated using Bgee team expertise. The RNA-Seq libraries were manually curated as healthy and wild type. Quality Control have been done along all steps of generation of these sequences. Reference intergenic sequences have been selected from all potential intergenic regions (see Bgee pipeline). BgeeCall allows to generate gene expression call from Bgee reference intergenic sequences for any RNA-Seq libraries as long as these sequences have been generated by the Bgee team. A tsv file containing all species available for current release of reference intergenic is available here. This file also contains a column describing the number of RNA-Seq libraries used to generated the reference intergenic sequences of each species. It is also possible to list in R all species for which Bgee reference intergenic sequences have been created :

list_bgee_ref_intergenic_species(myBgeeMetadata = bgee)
#>    speciesId              speciesName numberOfLibraries   genomeVersion
#> 1       9606             Homo sapiens              5026       GRCh38.p5
#> 2      10090             Mus musculus               133       GRCm38.p4
#> 3       9544           Macaca mulatta                90         MMUL1.0
#> 4       7955              Danio rerio                67          GRCz10
#> 5       8364       Xenopus tropicalis                66          JGI4.2
#> 6       6239   Caenorhabditis elegans                50        WBcel235
#> 7       9031            Gallus gallus                45         Galgal4
#> 8      10116        Rattus norvegicus                36        Rnor_6.0
#> 9       9913               Bos taurus                33          UMD3.1
#> 10     13616    Monodelphis domestica                19         monDom5
#> 11      9258 Ornithorhynchus anatinus                17           OANA5
#> 12      7240      Drosophila simulans                17 GCA_000259055.1
#> 13      9598          Pan troglodytes                15      CHIMP2.1.4
#> 14      7237 Drosophila pseudoobscura                14 GCA_000001765.2
#> 15      7227  Drosophila melanogaster                14           BDGP6
#> 16      9593          Gorilla gorilla                13       gorGor3.1
#> 17      9597             Pan paniscus                12      CHIMP2.1.4
#> 18      9823               Sus scrofa                10     Sscrofa10.2
#> 19     10141          Cavia porcellus                 9 Felis_catus_6.2
#> 20      9685              Felis catus                 9         cavPor3
#> 21      7230    Drosophila mojavensis                 8         EquCab2
#> 22      9796           Equus caballus                 8 GCA_000005175.1
#> 23      9986    Oryctolagus cuniculus                 6         eriEur1
#> 24      9615   Canis lupus familiaris                 6       CanFam3.1
#> 25      9365      Erinaceus europaeus                 6       OryCun2.0
#> 26      7244       Drosophila virilis                 4 GCA_000005245.1
#> 27     28377      Anolis carolinensis                 4       AnoCar2.0
#> 28      7217     Drosophila ananassae                 4 GCA_000005975.1
#> 29      7245        Drosophila yakuba                 4 GCA_000005115.1

#### Community reference intergenic

If you want to use BgeeCall on a species for which Bgee does not provide reference intergenic sequences you have the possibility to create them by yourself and share them with the Bgee community by following all steps of this tutorial. Do not forget that the number of RNA-Seq libraries is a key point to the generation of precise reference intergenic sequences. It is possible to list in R all species for which reference intergenic sequences have been created by the community using the following code

list_community_ref_intergenic_species()
#>   speciesId numberOfLibraries annotationVersion genomeVersion kallistoVersion
#> 1     10036                15         MesAur1.0     MesAur1.0          0.46.0
#> 2     13686               243            Si_gnG        Si_gnG          0.44.0
#>                                                                                      url
#> 1 https://zenodo.org/api/files/f46c7de0-d9a5-4ffd-a30e-4b08121ba446/ref_intergenic.fa.gz
#> 2 https://zenodo.org/api/files/5492ff2f-91a3-4101-8d67-78b8f8625cc6/ref_intergenic.fa.gz

If reference intergenic sequences of the species you are interested in are available only from the community release it is then possible to use this release to generate your present/absent calls

# create a BgeeMetadata object using the community release
bgee <- new("BgeeMetadata", intergenic_release = "community")
calls_output <- generate_calls_workflow(bgeeMetadata = bgee, userMetadata = user_BgeeCall)

If you generated your own reference intergenic sequences follwowing this tuorial but did not share them for the moment (do not forget to do it…), it is also possible to use BgeeCall with a file containing the sequences. In this case you need to select the custom release and provide the path to the file containing reference intergenic sequences :

bgee <- new("BgeeMetadata", intergenic_release = "custom")
user_BgeeCall@custom_intergenic_path = "path/to/custom/ref_intergenic.fa.gz"
calls_output <- generate_calls_workflow(bgeeMetadata = bgee, userMetadata = user_BgeeCall)

### Generate present/absent calls at transcript level (beta version)

kallisto generates TPMs at the transcript level. In the Bgee pipeline we summarize this expression at the gene level to calculate our present/absent calls. In BgeeCall it is now possible to generate present/absent calls at the transcript level. Be careful when using this feature as it has not been tested for the moment. To generate such calls you only have to create one object of the class KallistoMetadata and edit the value of one attribute

kallisto <- new("KallistoMetadata", txOut = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

### Tune how to use kallisto

By default BgeeCall suppose that kallisto is installed. If kallisto is not installed on your computer you can either :

• let BgeeCall automatically download the version 0.45 of kallisto. BgeeCall will use it to quantify abundance of transcripts. It will only be used by this package and will have no impact on your potential already existing version of kallisto.
kallisto <- new("KallistoMetadata", download_kallisto = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

#### Edit kallisto quant attributes

By default kallisto is run with the same parameters that we use in the RNA-Seq Bgee pipeline:

• single end : “-t 1 –single -l 180 -s 30 –bias”
• paired end : “-t 1 –bias”

It is possible to modify them and use your favourite kallisto parameters

kallisto <- new("KallistoMetadata", single_end_parameters = "-t 3 --single -l 150 -s 30", pair_end_parameters = "-t 2 -b --seed 36")
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

#### Choose between two kmer size

By default 2 indexes with 2 different kmer sizes can be used by BgeeCall The default kmer size of kallisto (31) is used for libraries with reads length equal or larger than 50 bp. A kmer size of 15 is used for libraries with reads length smaller than 50 bp. We decided not to allow to tune kmers size because the generation of the index is time consuming and index generation takes even more time with small kmers size (< 15bp). However it is possible to modify the threshold of read length allowing to choose between default and small kmer size.

# libraries with reads smaller than 70bp will use the index with kmer size = 15
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)

Note that for libraries with read length unspecified the default kmer size(31) will be used.

### Generate calls for a subset of RNA-Seq runs

By default gene expression calls are generated using all runs of the RNA-Seq library. It is possible to select only a subset of these runs.

# RNA-Seq run SRR350955_subsetof from the RNA-Seq library will be used to generate the calls
user_BgeeCall <- setRunIds(user_BgeeCall, c("SRR350955_subset"))
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)

When run IDs are selected, the name output directory combine the library ID and all selected run IDs. In our example the expression calls will be stored in the directory SRX099901_SRR350955_subset.

### Modify present/absent threshold

#### Default pValue approach

By default BgeeCall generate calls using a pValue approach. In order to generate the pValue per gene id we compute the numerical measure, Z-score, that describes the value relationship to the mean. The Z-score measure is done in terms of standards deviations from the referent mean, in this case using a set of reference intergenic regions, as specified in the formula below:

$zScore = \frac {log2(tpmValue) - mean(log2(IntergenicTpmValues))}{sd(log2(IntergenicTpmValues))}$ From the Z-score value, for each gene id, we compute a distribution of pValues using the following formula:

$pValue = pnorm(zscore, lower.tail = FALSE)$ Genes are consider present if the pValue is lower or equal to 0.05.

By default all genes with an abundance higher than 0 (i.e having at least one read mapped to a transcript) and with a pValue lower or equal to 0.05 are considered as present. Other genes are called absent. It is possible to modify the pValue cutoff. Be careful when editing this value as it has a big impact on your present absent.

kallisto <- new("KallistoMetadata", cutoff = 0.1)

#### Intergenic threshold approach

Expression calls can also be generated using a threshold of intergenic sequences called present. This approach was used to generate Bgee expression calls until release 14 with the formula :

$\frac {proportion\ of\ reference\ intergenic\ present}{proportion\ of\ protein\ coding\ present} = 0.05$ It is possible to change the cutoff value of this ratio.

# use intergenic approach with default cutoff ratio 0.05
kallisto <- new("KallistoMetadata", cutoff_type = "intergenic")
# use intergenic approach with cutoff ratio 0.01
kallisto <- new("KallistoMetadata", cutoff_type = "intergenic", cutoff = 0.01)

#### qValue threshold approach

The expression calls can also be generated using the qValue approach. In this approach we perform a linear interpolation for each density region, genic and reference intergenic, followed by the numerical integration. Then for each unique abundance value (TPM) we integrate and scale across the region. After that, the correspondent qValue for each gene id is computed following the formula:

$qValue = \frac {intergenic}{intergenic + genic}$ ### Collect statistics

In BgeeCall the user are able to collect all the statistics provided by the different approaches to call expressed genes at individual library. To provide this informative table, all cutoff_info_file_ from each individual library and from each individual approach are used to retrieve the correspondent information.

summaryStats <- get_summary_stats(userFile = "path/to/file.tsv", outDir = "path/to/output_directory")

### Run BgeeCall in quiet mode

By default BgeeCall write output messages for all parts of the workflow. It is possible not to write any messages by changing the value of the slot verbose in the UserMetadata object. By default this value is set to true but it is possible to change it to false with this line :

user_BgeeCall@verbose <- FALSE

### Do not rerun parts of the pipeline

Generation of present/absent expression calls is done in several steps. It is possible to force overwritting already existing intermediary files : - overwrite_index : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the index generation step is skiped if one index already exists. If TRUE, the kallisto index will be generated even if one index already exists. - overwrite_quant : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the kallisto quantification step is skiped if a quantification file already exists. If TRUE, the kallisto quantification step will be run even if a quantification file already exists. - overwrite_calls : slot of the object KallistoMetadata. The value has to be a logical. If FALSE, the generation of present/absent calls is skiped if an index already exists. If TRUE (default), the generation of present/absent calls will be run even if calls were already generated.

### Ignore transcript version

It can happen that calls are generated using transcriptome or annotations containing transcript version (number after a dot in transcript IDs e.g ENSMUST00000082908.2) and annotations or transcriptome without transcript version. This is problematic when using tximport to transform abundance at transcript level to abundance at gene level and result in an error.

Error in .local(object, ...) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
Example IDs (file): [, ...]
Example IDs (tx2gene): [ENSMUST00000193812, ENSMUST00000082908, ENSMUST00000192857, ...]
This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
Calls: do.call ... tximport -> summarizeToGene -> summarizeToGene -> .local
Execution halted

To solve this error tximport implemented an option called ignoreTxVersion that remove the transcript version from both transcriptome and annotations. It is possible to use this option by modifying the value of the slot ignoreTxVersion (default FALSE) of the S4 class KallistoMetadata.

kallisto <- new("KallistoMetadata", ignoreTxVersion = TRUE)

### Generate calls with a simple arborescence of directories

By default the arborescence of directories created by BgeeCall is as simple as possible. the results will be created using the path working_path/intergenic_release/all_results/libraryId. Generating present/absent gene expression calls for the same RNA-Seq library using different transcriptome or annotation versions using this arborescence will overwrite previous results. The UserMetadata class has an attribute simple_arborescence that is TRUE by default. If FALSE, a complexe arborescence of directories containing the name of the annotation and transcriptome files will be created. This complex arborescence will then allow to generate present/absent calls for the same library using different version of transcriptome or annotaiton.

user_BgeeCall <- setSimpleArborescence(userObject = user_BgeeCall, simpleArborescence = FALSE)
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)

### Change directory where calls are saved

By default directories used to save present/absent calls are subdirectories of UserMetadata@working_path. However it is possible to select the directory where you want the calls to be generated.

user_BgeeCall <- setOutputDir(user_BgeeCall, "path/to/calls/for/this/library/")

This output directory will only contains results generated at the RNA-Seq library level. All data generated at species level are still stored using the UserMetadata@working_path. They can then still be reused to generate calls from other libraries of the same species.

### Modify slurm options

Two functions are available to run BgeeCall on a slurm queuing system. Parameters described below are available for both of them.

#### Number of jobs

The full idea of using a cluster is to parallelize your jobs. By default 10 jobs are run at the same time. It is possible to modify this number with the parameter nodes.

# run 50 jobs in parallel
generate_slurm_indexes(userFile = "path/to/file.tsv", nodes = 50)

#### Do not submit the jobs

In order to be able to check files automatically generated to run the jobs it is possible to generate these files without submiting your jobs. More information on created files are available on the vignette of the (https://cran.r-project.org/web/packages/rslurm/vignettes/rslurm.html)[rslurm package].

# create temporary files but do not submit the jobs
generate_slurm_indexes(userFile = "path/to/file.tsv", submit = FALSE)

#### Modify slurm options

A bash scirpt is automatically created to run the jobs. This script contains default slurm options (array, cpus-per-task, job-name, output). All other slurm options recognized by the sbatch command can be updated b creating a named list where name correspond to long name of options (e.g do not use ‘p’ but ‘partition’).

# add slurm options to the sbatch script
slurm_options_index <- list(account = "account", time = "2:00:00", partition = "partition", mem = "30G")
generate_slurm_indexes(userFile = "path/to/file.tsv", slurm_options = slurm_options_index)

In some cluster programs are not loaded by default. The modules parameter allows to load them by adding one line in the sbatch script. This option has been implemented to add modules but could potentially be used to add any custom line of code in the sbatch script.

# load R 3.6.1 and kallisto in a cluster environment where software has to loaded manually
generate_slurm_indexes(userFile = "path/to/file.tsv", modules = modules)

#### Modify BgeeCall objects

By default except for columns present in the tsv file all other slots of the 3 BgeeCall classes will use default values. In order to tune these parameters it is possible to create the objects and pass them to the slurm functions. When generating these objects it is mandatory to keep the same name as in the example below.

# create BgeeCall objects and use them to generate indexes
generate_slurm_indexes(userFile = "path/to/file.tsv", kallistoMetadata = kallistoMetadata, bgeeMetadata = bgeeMetadata, userMetadata = userMetadata)

## Merging multiple libraries

In the BgeeCall package we have the possibility to perform calls of expression by merging/combine multiple libraries that belongs to a particular condition. In order to use this functionality, the calls at individual library need to be done by using the p-value or q-value approach, since this approaches provide for each $$gene_i$$ a quantitative metric.

If the calls of expression genes, at individual library, was done using the pValue approach, then the merging will be done by using the Benjamini & Hochberg method.

For this method a set of pValues are collected, for each $$gene_i$$ across $$n$$ libraries, and then the referent p.adjusted values are computed. To call expressed genes in the merging, we verify if one of the p.adjusted values is lower then the cutoff (threshold value provided by the user), if yes, the $$gene_i$$ is classified as present for the condition, otherwise is absent.

mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")

On the other side, if the approach used to call expressed genes, at individual library, was the qValue, then the merging will be done by using the inverse fdr correction, this means by applying the correspondent formula:

$FDR_{inverse} = (1-((1-q)^{(\frac {1}{n})}))$ where $$q$$ is the desired cut-off (threshold value provided by the user) and $$n$$ is the number of libraries.

# merging libraries
mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")

To call expressed genes in the merging condition, we verify if one of the libraries have a q-Value lower then the FDR_inverse, if yes, the $$gene_i$$ is classified as present, otherwise is classified as absent.

### Arguments and user file to perform the merging

In the merging_libraries() function, different parameters need to be passed to the referent arguments: userFile, approach, condition, cutoff or outDir.

In the userFile argument a path to a text file containing all information about the target libraries to be merged. This file should contain a fundamental column called: specied_id in order to run the merging_libraries() function. This column should not be empty and also an important request is that you should have at least 2 libraries to be processed, since this is a merging process.

If you want to merge/combine libraries for a more detailed condition you need to add the correspondent extra columns to your user file. Note that the argument condition recognize the following parameters: specied_id, anatEntity, devStage, sex and strain. This means that your table columns also should have the same ID. The order of the columns are not relevant when the user file is structured, as well as, the order of the parameters passed to the condition argument.

Is always recommended to use “-” instead of empty rows when some information is missing, as represented in the BgeeCall/inst/userMetadataTemplate_merging.tsv

The argument approach allow to specify the method that will be used to make the calls of expression in the merging libraries, and as described before, the condition argument will allow to specify the particular condition of interest, the condition argument allow to target the correspondent libraries where the merging will occur.

More restricted cut-off can be applied to call expressed genes in the merging libraries for a target condition, for this the cutoff argument can be modified, otherwise by default is 0.05.


# merging libraries where p-values were calculated
mergingLibraries_pValue <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = c("species_id", "anatEntity", "strain"), cutoff = 0.005, outDir = "path/to/output_directory/")

# merging libraries where q-values were calculated
mergingLibraries_qValue <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = c("species_id", "devStage" ,"anatEntity", "strain"), cutoff = 0.001, outDir = "path/to/output_directory/")

#Session Info

sessionInfo()
#> R Under development (unstable) (2021-10-19 r81077)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_GB              LC_COLLATE=C
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods
#> [8] base
#>
#> other attached packages:
#> [1] rtracklayer_1.55.0   GenomicRanges_1.47.1 GenomeInfoDb_1.31.0
#> [4] IRanges_2.29.0       S4Vectors_0.33.0     BiocGenerics_0.41.0
#> [7] BgeeCall_1.11.0
#>
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-7                  matrixStats_0.61.0
#>   [3] bit64_4.0.5                   insight_0.14.5
#>   [5] filelock_1.0.2                progress_1.2.2
#>   [7] httr_1.4.2                    tools_4.2.0
#>   [9] bslib_0.3.1                   utf8_1.2.2
#>  [11] R6_2.5.1                      sjlabelled_1.1.8
#>  [13] DBI_1.1.1                     rhdf5filters_1.7.0
#>  [15] withr_2.4.2                   tidyselect_1.1.1
#>  [17] prettyunits_1.1.1             bit_4.0.4
#>  [19] curl_4.3.2                    compiler_4.2.0
#>  [21] Biobase_2.55.0                xml2_1.3.2
#>  [23] DelayedArray_0.21.0           sass_0.4.0
#>  [27] stringr_1.4.0                 digest_0.6.28
#>  [29] Rsamtools_2.11.0              rmarkdown_2.11
#>  [31] XVector_0.35.0                pkgconfig_2.0.3
#>  [33] htmltools_0.5.2               MatrixGenerics_1.7.0
#>  [35] BSgenome_1.63.0               dbplyr_2.1.1
#>  [37] fastmap_1.1.0                 rlang_0.4.12
#>  [39] RSQLite_2.2.8                 shiny_1.7.1
#>  [41] jquerylib_0.1.4               BiocIO_1.5.0
#>  [43] generics_0.1.1                jsonlite_1.7.2
#>  [45] vroom_1.5.5                   BiocParallel_1.29.0
#>  [47] dplyr_1.0.7                   RCurl_1.98-1.5
#>  [49] magrittr_2.0.1                GenomeInfoDbData_1.2.7
#>  [51] Matrix_1.3-4                  Rcpp_1.0.7
#>  [53] Rhdf5lib_1.17.0               fansi_0.5.0
#>  [55] lifecycle_1.0.1               stringi_1.7.5
#>  [57] yaml_2.2.1                    SummarizedExperiment_1.25.0
#>  [59] zlibbioc_1.41.0               rhdf5_2.39.0
#>  [61] BiocFileCache_2.3.0           AnnotationHub_3.3.0
#>  [63] grid_4.2.0                    rslurm_0.6.0
#>  [65] blob_1.2.2                    promises_1.2.0.1
#>  [67] parallel_4.2.0                sjmisc_2.8.7
#>  [69] crayon_1.4.1                  lattice_0.20-45
#>  [71] Biostrings_2.63.0             GenomicFeatures_1.47.1
#>  [73] hms_1.1.1                     KEGGREST_1.35.0
#>  [75] knitr_1.36                    pillar_1.6.4
#>  [77] rjson_0.2.20                  biomaRt_2.51.0
#>  [79] XML_3.99-0.8                  glue_1.4.2
#>  [81] BiocVersion_3.15.0            evaluate_0.14
#>  [83] data.table_1.14.2             BiocManager_1.30.16
#>  [85] tzdb_0.2.0                    httpuv_1.6.3
#>  [87] png_0.1-7                     vctrs_0.3.8
#>  [89] purrr_0.3.4                   assertthat_0.2.1
#>  [91] cachem_1.0.6                  xfun_0.27
#>  [93] mime_0.12                     xtable_1.8-4
#>  [95] restfulr_0.0.13               later_1.3.0
#>  [97] tibble_3.1.5                  GenomicAlignments_1.31.0
#>  [99] AnnotationDbi_1.57.0          memoise_2.0.0
#> [101] tximport_1.23.0               ellipsis_0.3.2
#> [103] interactiveDisplayBase_1.33.0