# Automatic RNA-Seq present/absent gene expression calls generation

#### 2021-05-19

BgeeCall is a collection of functions that use Bgee expertise to create RNA-Seq gene expression present/absent calls.

The BgeeCall package allows you to:

• Generate calls of presence/absence of expression at the gene level. You can generate these calls for any RNA-Seq sample as long as reference intergenic sequences have been generated by the Bgee team or by the Bgee community.
• Parallelize calls generation on a cluster with the slurm queuing system.
• Collect statistics across libraries, for the different calls approaches.
• Merge calls from different libraries.

If you find a bug or have any issues with BgeeCall please write a bug report in our GitHub issues manager.

## How present/absent calls are generated

In Bgee, RNA-Seq calls are generated using a threshold specific to each RNA-Seq library, calculated using reads mapped to reference intergenic regions. This is unlike the more usual use of an arbitrary threshold below which a gene is not considered as expressed (e.g. log2(TPM) = 1).

### Bgee database

Bgee is a database to retrieve and compare gene expression patterns in multiple animal species and produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It notably integrates RNA-Seq libraries for 29 species.

### Reference intergenic regions

Reference intergenic regions are defined in the Bgee RNA-Seq pipeline. Candidate intergenic regions are defined using gene annotation data. For each species, over all available libraries, reads are mapped to these intergenic regions with kallisto, as well as to genes. This “intergenic expression” is deconvoluted to distinguish reference intergenic regions from non-annotated genes, which have higher expression. Reference intergenic regions are then defined as intergenic regions with a low expression level over all RNA-Seq libraries, relative to genes. This step excludes regions that were wrongly considered as intergenic because of potential gene annotation quality problems. For more information please refer to the Bgee RNA-Seq pipeline.

### Threshold of present/absent

The BgeeCall pipeline allows you to download reference intergenic regions resulting from the expertise of the Bgee team. Moreover, BgeeCall allows you to use these reference intergenic regions to automatically generate calls for your own RNA-Seq libraries, as long as the species is integrated into Bgee.

By default BgeeCall calculates a pValue to define calls, and genes are considered present if the pValue is lower than or equal to 0.05. More information on this pValue and on other approaches to generate the calls is available in the section Modify present/absent threshold.

## Installation

In R:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BgeeCall")

## How to use the BgeeCall package

BgeeCall is highly tunable. Do not hesitate to have a look at the reference manual for a precise description of all slots of the 4 main S4 classes (AbundanceMetadata, KallistoMetadata, BgeeMetadata and UserMetadata) and of all available functions. BgeeCall needs kallisto to run. If you do not have kallisto installed, you will find installation instructions on the kallisto website.

### Quick start

With the BgeeCall package it is easy to generate present/absent gene expression calls. Generation of the kallisto transcriptome index can require a lot of time. As the time needed for this step depends on the size of the transcriptome, we chose, as an example, the smallest transcriptome file among all species available on Bgee (C. elegans). To generate these calls you will need:

• a transcriptome
• gene annotations

For this vignette we created a toy fastq file example based on the SRX099901 RNA-Seq library using the ShortRead R package.

In this example we used the Bioconductor AnnotationHub to load the transcriptome and gene annotations, but you can load them from wherever you want.

Note that in BgeeCall it is possible to specify the annotation source. The default source is ensembl, but gencode files can also be used by specifying this in the attribute gtf_source of the class UserMetadata.

And that’s it… You can run the generation of your present/absent gene expression calls

#> Querying Bgee to get intergenic release information...
#> Note: importing abundance.h5 is typically faster than abundance.tsv
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Note: importing abundance.h5 is typically faster than abundance.tsv
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Generate present/absent expression calls using pValuecutoff

Each analysis generates 5 files and returns the path to each of them.

• calls_tsv_path : path to the main tsv file with TPM, count, length, biotype, type, and presence/absence of expression summarized at the gene level (or at the transcript level if it was requested)
• cutoff_info_file_path : path to the tsv summary of the analysis containing the proportion of genes, protein coding genes and intergenic regions defined as expressed. It also contains the library ID and the present/absent TPM threshold
• abundance_tsv : path to the tsv kallisto quant output file
• TPM_distribution_path : path to a pdf plot representing the density distribution of TPM values for all sequences, protein coding sequences, and intergenic sequences. The grey line corresponds to the TPM threshold used to generate present/absent calls.
• S4_slots_summary : path to the tsv file containing a summary of the values used for the most important slots of the three S4 classes (UserMetadata, KallistoMetadata, and BgeeMetadata).

Note that with the pValue approach (the default method to generate present/absent gene expression calls) the names of the calls file, the cutoff_info_file and the TPM_distribution file do not contain the name of the approach used. For all other approaches, the approach name is included in the file names.

### Generate present/absent calls for more than one RNA-Seq library

You may also be interested in generating present/absent calls for different RNA-Seq libraries, potentially from different species, or using different transcriptomes or genome annotations. The main function generate_presence_absence() allows you to generate present/absent calls from a UserMetadata object, but also from a data frame or a tsv file, depending on the arguments of the function you use. Please choose one of the three following arguments:

• userMetadata : Allows you to generate present/absent calls for one RNA-Seq library using one object of the class UserMetadata.
• userDataFrame : Provide a data frame where each row corresponds to one present/absent call generation. It allows you to generate present/absent calls on different libraries, species, transcriptomes, genome annotations, etc.
• userFile : Similar to userDataFrame except that the information is stored in a tsv file. A template of this file called userMetadataTemplate.tsv is available at the root of the package.

Columns of the dataframe or the tsv file are :

• species_id : The NCBI ID of the species.
• run_ids : The runs of the RNA-Seq library you want to use for the generation of the calls. Allows you to generate expression calls for a subset of the runs of one RNA-Seq library, as described in Generate calls for a subset of RNA-Seq runs. If you are not interested in this option, leave the column empty.
• rnaseq_lib_path : Path to the directory containing all fastq files generated for this library. This directory must contain either only single-end runs or only paired-end runs.
• transcriptome_path : path to the transcriptome file.
• annotation_path : path to the genome annotation file. Works with GTF or GFF3 files.
• working_path : path to the working directory where results will be stored. Using the same working directory for different RNA-Seq libraries of the same species allows you to reuse previously generated data such as the custom transcriptome index (generated from both the transcriptome and the reference intergenic sequences). By default the working path is defined by the getwd() function and corresponds to the working directory of your R session. If you are not interested in this option, leave the column empty.
• output_directory : Both species results and RNA-Seq library results are by default stored at the same place, using the value of the working_path column. However, this column allows you to define a different output_directory for RNA-Seq results. For instance, it allows you to save calls information directly in the RNA-Seq directory. If you are not interested in this option, leave the column empty.
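As an illustration of the columns listed above, the following sketch builds a one-row table in base R and writes it as a tab-separated file. Only the column names come from the list above; all paths are hypothetical placeholders, the species ID 6239 is the NCBI ID of C. elegans, and the variable names are chosen for this example only.

```r
# Sketch: build the input table described above with base R only.
# All paths below are hypothetical placeholders.
user_df <- data.frame(
    species_id = "6239",                 # NCBI ID of C. elegans
    run_ids = "",                        # empty: use all runs of the library
    rnaseq_lib_path = "path/to/SRX099901",
    transcriptome_path = "path/to/transcriptome.fa",
    annotation_path = "path/to/annotation.gtf",
    working_path = "",                   # empty: default working directory
    output_directory = "",               # empty: results stored under working_path
    stringsAsFactors = FALSE
)

# Write the table as a tsv file (one row per calls generation).
user_file <- file.path(tempdir(), "userMetadata.tsv")
write.table(user_df, file = user_file, sep = "\t", quote = FALSE,
            row.names = FALSE)

# Read it back to check that the layout survived the round trip.
check_df <- read.delim(user_file, colClasses = "character")
print(colnames(check_df))
```

Adding one row per library (with the same working_path for libraries of the same species) keeps the species-level data reusable, as described for the working_path column.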

Once the file has been filled in, expression calls can be generated with :

### Parallelized generation of present/absent calls on a cluster

BgeeCall already implements everything you need to generate calls on a cluster if it uses slurm as queuing system. The same TSV file as described in the previous section is used as input. In addition to the tuning options available when running BgeeCall on your computer, it is possible to modify how slurm jobs are submitted. More information is available in the section Modify slurm options. In order to optimize parallelization, calls are generated in 2 steps:

• Generate data at species level (e.g. transcriptome with intergenic sequences, kallisto indexes)
• Generate expression calls for each RNA-Seq library

### Reference intergenic sequences

#### Releases of reference intergenic sequences

Different releases of reference intergenic sequences are available. It is possible to list all these releases :

It is then possible to choose one specific release to create a BgeeMetadata object. Always use the setter method setIntergenicRelease() when changing the release of an already existing BgeeMetadata object.

By default the reference intergenic release used when a BgeeMetadata object is created is the last stable one created by the Bgee team.

#### Core reference intergenic from Bgee

Core reference intergenic releases are created by the Bgee team when many new RNA-Seq libraries have been manually curated for already existing species and/or for new species. These releases are the only ones with a release number (e.g. “0.1”). Each of these releases contains reference intergenic sequences for a list of species. Bgee reference intergenic sequences have been generated using Bgee team expertise: the RNA-Seq libraries were manually curated as healthy and wild type, and quality control was performed at every step of the generation of these sequences. Reference intergenic sequences have been selected from all potential intergenic regions (see Bgee pipeline). BgeeCall allows you to generate gene expression calls from Bgee reference intergenic sequences for any RNA-Seq library, as long as these sequences have been generated by the Bgee team. A tsv file containing all species available for the current release of reference intergenic sequences is available here. This file also contains a column describing the number of RNA-Seq libraries used to generate the reference intergenic sequences of each species. It is also possible to list in R all species for which Bgee reference intergenic sequences have been created :

#### Community reference intergenic

If you want to use BgeeCall on a species for which Bgee does not provide reference intergenic sequences, you can create them yourself and share them with the Bgee community by following all steps of this tutorial. Do not forget that the number of RNA-Seq libraries is key to the generation of precise reference intergenic sequences. It is possible to list in R all species for which reference intergenic sequences have been created by the community using the following code :

If reference intergenic sequences of the species you are interested in are available only from the community release, it is possible to use this release to generate your present/absent calls :

If you generated your own reference intergenic sequences following this tutorial but have not shared them yet (do not forget to do it…), it is also possible to use BgeeCall with a file containing the sequences. In this case you need to select the custom release and provide the path to the file containing the reference intergenic sequences :

### Generate present/absent calls at transcript level (beta version)

kallisto generates TPMs at the transcript level. In the Bgee pipeline we summarize this expression at the gene level to calculate our present/absent calls. In BgeeCall it is now possible to generate present/absent calls at the transcript level. Be careful when using this feature, as it has not been tested yet. To generate such calls you only have to create one object of the class KallistoMetadata and edit the value of one attribute :

### Tune how to use kallisto

By default BgeeCall assumes that kallisto is installed. If kallisto is not installed on your computer you can :

• let BgeeCall automatically download version 0.45 of kallisto. BgeeCall will use it to quantify the abundance of transcripts. It will only be used by this package and will have no impact on any already existing version of kallisto.

#### Edit kallisto quant attributes

By default kallisto is run with the same parameters that we use in the RNA-Seq Bgee pipeline:

• single end : "-t 1 --single -l 180 -s 30 --bias"
• paired end : "-t 1 --bias"

It is possible to modify them and use your favourite kallisto parameters.

#### Choose between two kmer sizes

By default 2 indexes with 2 different kmer sizes can be used by BgeeCall. The default kmer size of kallisto (31) is used for libraries with a read length equal to or larger than 50 bp. A kmer size of 15 is used for libraries with a read length smaller than 50 bp. We decided not to allow tuning of the kmer sizes because index generation is time consuming, and takes even more time with small kmer sizes (< 15 bp). However, it is possible to modify the read length threshold used to choose between the default and the small kmer size.

Note that for libraries with an unspecified read length the default kmer size (31) will be used.
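The selection rule described above can be sketched in a few lines of base R. The function choose_kmer_size() and its threshold argument are hypothetical names used for this illustration; they are not part of the BgeeCall API.

```r
# Hypothetical helper (not a BgeeCall function): pick the kmer size from the
# read length, following the rule described in the text.
choose_kmer_size <- function(read_length, threshold = 50) {
    # Unspecified read length: fall back to the kallisto default kmer size.
    if (is.na(read_length)) {
        return(31)
    }
    if (read_length >= threshold) 31 else 15
}

# Read lengths of 36, 50 and 100 bp, plus one unspecified read length.
kmer_sizes <- vapply(c(36, 50, 100, NA), choose_kmer_size, numeric(1))
print(kmer_sizes)
```

Raising the threshold (for instance to 75) would make libraries with 60 bp reads use the small kmer size.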

### Generate calls for a subset of RNA-Seq runs

By default gene expression calls are generated using all runs of the RNA-Seq library. It is possible to select only a subset of these runs.

When run IDs are selected, the name of the output directory combines the library ID and all selected run IDs. In our example the expression calls will be stored in the directory SRX099901_SRR350955_subset.

### Modify present/absent threshold

#### Default pValue approach

By default BgeeCall generates calls using a pValue approach. To generate the pValue of each gene ID we first compute a Z-score, a numerical measure that describes the relationship of a value to a mean. The Z-score is expressed in standard deviations from the reference mean, here computed over the set of reference intergenic regions, as specified in the formula below:

$zScore = \frac {log2(tpmValue) - mean(log2(IntergenicTpmValues))}{sd(log2(IntergenicTpmValues))}$

From the Z-score, we compute a pValue for each gene ID using the following formula:

$pValue = pnorm(zScore, lower.tail = FALSE)$

By default all genes with an abundance higher than 0 (i.e. having at least one read mapped to a transcript) and with a pValue lower than or equal to 0.05 are considered as present. Other genes are called absent. It is possible to modify the pValue cutoff. Be careful when editing this value, as it has a big impact on your present/absent calls.
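The two formulas above translate directly into base R. The following is a minimal sketch on invented toy TPM values, not the code BgeeCall runs internally; all variable names are chosen for the example.

```r
# Toy TPM values; these numbers are invented for illustration only.
intergenic_tpm <- c(0.5, 1, 2, 4, 8)
gene_tpm <- c(geneA = 0, geneB = 2, geneC = 64)

# Mean and standard deviation of log2(TPM) over reference intergenic regions.
intergenic_mean <- mean(log2(intergenic_tpm))
intergenic_sd <- sd(log2(intergenic_tpm))

# Z-score of each gene, then upper-tail pValue from the normal distribution.
z_score <- (log2(gene_tpm) - intergenic_mean) / intergenic_sd
p_value <- pnorm(z_score, lower.tail = FALSE)

# A gene is present if its abundance is above 0 and its pValue is <= cutoff.
cutoff <- 0.05
calls <- ifelse(gene_tpm > 0 & p_value <= cutoff, "present", "absent")
print(data.frame(tpm = gene_tpm, pValue = signif(p_value, 3), call = calls))
```

In this toy example geneB sits exactly at the intergenic mean (pValue 0.5) and is called absent, while geneC lies more than three standard deviations above it and is called present.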

#### Intergenic threshold approach

Expression calls can also be generated using a threshold of intergenic sequences called present. This approach was used to generate Bgee expression calls until release 14 with the formula :

$\frac {proportion\ of\ reference\ intergenic\ present}{proportion\ of\ protein\ coding\ present} = 0.05$

It is possible to change the cutoff value of this ratio.
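One way to read this ratio is as a search for the smallest TPM threshold at which the ratio drops to the requested value. The sketch below illustrates that idea in base R on invented values. It assumes a sequence is called present when its TPM is greater than or equal to the threshold; find_tpm_threshold() is a hypothetical helper, not BgeeCall's implementation.

```r
# Hypothetical helper (not BgeeCall code): smallest TPM threshold at which
# proportion of intergenic present / proportion of protein coding present
# is lower than or equal to the requested ratio.
find_tpm_threshold <- function(coding_tpm, intergenic_tpm, ratio_cutoff = 0.05) {
    candidates <- sort(unique(c(coding_tpm, intergenic_tpm)))
    candidates <- candidates[candidates > 0]
    for (threshold in candidates) {
        coding_present <- mean(coding_tpm >= threshold)
        intergenic_present <- mean(intergenic_tpm >= threshold)
        if (coding_present > 0 &&
            intergenic_present / coding_present <= ratio_cutoff) {
            return(threshold)
        }
    }
    NA_real_  # no threshold satisfies the requested ratio
}

# Toy abundances, invented for illustration only.
coding_tpm <- c(0, 1, 4, 8, 16, 32, 64, 128, 256, 512)
intergenic_tpm <- c(0, 0, 0.5, 0.5, 1, 1, 2, 2, 4, 8)
tpm_threshold <- find_tpm_threshold(coding_tpm, intergenic_tpm)
print(tpm_threshold)
```

A larger ratio_cutoff lowers the TPM threshold and therefore calls more genes present, at the cost of also calling more intergenic regions present.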

#### qValue threshold approach

The expression calls can also be generated using the qValue approach. In this approach we perform a linear interpolation for each density region, genic and reference intergenic, followed by numerical integration. Then, for each unique abundance value (TPM), we integrate and scale across the region. After that, the corresponding qValue of each gene ID is computed with the formula:

$qValue = \frac {intergenic}{intergenic + genic}$

### Collect statistics

In BgeeCall you can collect all the statistics provided by the different approaches used to call expressed genes for individual libraries. To build this summary table, the cutoff_info_file of each individual library and of each approach is used to retrieve the corresponding information.

### Run BgeeCall in quiet mode

By default BgeeCall writes output messages for all parts of the workflow. It is possible to suppress these messages by changing the value of the slot verbose in the UserMetadata object. By default this value is set to TRUE, but it is possible to change it to FALSE with this line :

### Do not rerun parts of the pipeline

Generation of present/absent expression calls is done in several steps. It is possible to force overwriting of already existing intermediary files :

• overwrite_index : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the index generation step is skipped if an index already exists. If TRUE, the kallisto index will be generated even if an index already exists.
• overwrite_quant : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the kallisto quantification step is skipped if a quantification file already exists. If TRUE, the kallisto quantification step will be run even if a quantification file already exists.
• overwrite_calls : slot of the object KallistoMetadata. The value has to be a logical. If FALSE, the generation of present/absent calls is skipped if calls were already generated. If TRUE (default), the generation of present/absent calls will be run even if calls were already generated.

### Ignore transcript version

It can happen that calls are generated using a transcriptome or annotations containing transcript versions (the number after a dot in transcript IDs, e.g. ENSMUST00000082908.2) together with annotations or a transcriptome without transcript versions. This is problematic when using tximport to transform abundance at the transcript level into abundance at the gene level, and results in an error :

Error in .local(object, ...) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
Example IDs (file): [, ...]
Example IDs (tx2gene): [ENSMUST00000193812, ENSMUST00000082908, ENSMUST00000192857, ...]
This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
Calls: do.call ... tximport -> summarizeToGene -> summarizeToGene -> .local
Execution halted

To solve this error tximport implements an option called ignoreTxVersion that removes the transcript version from both transcriptome and annotations. It is possible to use this option by modifying the value of the slot ignoreTxVersion (default FALSE) of the S4 class KallistoMetadata.
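To make concrete what ignoring the transcript version means, the following base R sketch strips a trailing version suffix from transcript IDs. It is a rough equivalent for illustration only, not tximport's own code; the version suffix of the first ID is invented for the example.

```r
# Transcript IDs as they may appear in a transcriptome with versions.
# The ".1" suffix of the first ID is invented for this example.
versioned_ids <- c("ENSMUST00000193812.1", "ENSMUST00000082908.2")

# Remove a trailing ".<digits>" version suffix, leaving the stable ID.
unversioned_ids <- sub("\\.[0-9]+$", "", versioned_ids)
print(unversioned_ids)
```

Once both the transcriptome and the annotation use IDs without versions, the transcript-to-gene mapping can match them again.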

### Generate calls with a simple arborescence of directories

By default the arborescence of directories created by BgeeCall is as simple as possible. The results are created using the path working_path/intergenic_release/all_results/libraryId. With this arborescence, generating present/absent gene expression calls for the same RNA-Seq library using different transcriptome or annotation versions will overwrite previous results. The UserMetadata class has an attribute simple_arborescence that is TRUE by default. If FALSE, a complex arborescence of directories containing the names of the annotation and transcriptome files is created. This complex arborescence then allows you to generate present/absent calls for the same library using different versions of the transcriptome or annotation.

### Change directory where calls are saved

By default the directories used to save present/absent calls are subdirectories of UserMetadata@working_path. However, it is possible to select the directory where you want the calls to be generated.

This output directory will only contain results generated at the RNA-Seq library level. All data generated at the species level are still stored using UserMetadata@working_path. They can therefore still be reused to generate calls for other libraries of the same species.

### Modify slurm options

Two functions are available to run BgeeCall on a slurm queuing system. Parameters described below are available for both of them.

#### Number of jobs

The whole point of using a cluster is to parallelize your jobs. By default 10 jobs are run at the same time. It is possible to modify this number with the parameter nodes.

#### Do not submit the jobs

In order to check the files automatically generated to run the jobs, it is possible to generate these files without submitting your jobs. More information on the created files is available in the vignette of the [rslurm package](https://cran.r-project.org/web/packages/rslurm/vignettes/rslurm.html).

#### Modify slurm options

A bash script is automatically created to run the jobs. This script contains default slurm options (array, cpus-per-task, job-name, output). All other slurm options recognized by the sbatch command can be updated by creating a named list where each name corresponds to the long name of an option (e.g. do not use ‘p’ but ‘partition’).
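As a small illustration, such a named list can be written as follows. The option values are hypothetical and depend on your cluster; partition, time and mem are long names of existing sbatch options. How the list is passed to the slurm functions is not shown here.

```r
# Named list of extra sbatch options; names are the long option names.
# The values are hypothetical and depend on your cluster.
slurm_options <- list(partition = "normal", time = "2:00:00", mem = "30G")

# Simple sanity check: every option is named and uses a long name
# (more than one character), e.g. "partition" rather than "p".
has_long_names <- !is.null(names(slurm_options)) &&
    all(nchar(names(slurm_options)) > 1)
print(has_long_names)
```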

On some clusters programs are not loaded by default. The modules parameter allows you to load them by adding one line to the sbatch script. This option has been implemented to add modules, but could potentially be used to add any custom line of code to the sbatch script.

#### Modify BgeeCall objects

By default, except for the values provided as columns of the tsv file, all slots of the 3 BgeeCall classes use default values. In order to tune these parameters it is possible to create the objects and pass them to the slurm functions. When generating these objects it is mandatory to keep the same names as in the example below.

## Merging multiple libraries

In the BgeeCall package it is possible to generate expression calls by merging multiple libraries that belong to a particular condition. To use this functionality, the calls for each individual library need to be generated using the p-value or q-value approach, since these approaches provide a quantitative metric for each $$gene_i$$.

If the calls for individual libraries were generated using the pValue approach, the merging is done using the Benjamini & Hochberg method.

For this method the pValues of each $$gene_i$$ are collected across the $$n$$ libraries, and the corresponding adjusted pValues (p.adjusted) are computed. To call expressed genes in the merging, we check whether at least one of the p.adjusted values is lower than the cutoff (threshold value provided by the user). If so, $$gene_i$$ is classified as present for the condition; otherwise it is classified as absent.

mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")
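The per-gene rule described above can be sketched with base R's p.adjust(). This is an illustration on invented pValues for a single gene, not the internal code of merging_libraries(); the variable names are chosen for the example.

```r
# pValues of one gene in n = 4 libraries of the same condition (toy values).
gene_p_values <- c(lib1 = 0.001, lib2 = 0.20, lib3 = 0.04, lib4 = 0.60)

# Benjamini & Hochberg adjustment across the libraries.
adjusted <- p.adjust(gene_p_values, method = "BH")

# The gene is present for the condition if at least one adjusted value
# is lower than the cutoff.
cutoff <- 0.05
merged_call <- if (any(adjusted < cutoff)) "present" else "absent"
print(adjusted)
print(merged_call)
```

Note that lib3 has a raw pValue below 0.05 but an adjusted value above it; here the call is driven by lib1 only.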

On the other hand, if the approach used to call expressed genes for individual libraries was the qValue approach, the merging is done using the inverse FDR correction, i.e. by applying the following formula:

$FDR_{inverse} = 1 - (1 - q)^{\frac {1}{n}}$

where $$q$$ is the desired cut-off (threshold value provided by the user) and $$n$$ is the number of libraries.

# merging libraries
mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")

To call expressed genes for the merged condition, we check whether at least one of the libraries has a qValue lower than FDR_inverse. If so, $$gene_i$$ is classified as present; otherwise it is classified as absent.
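As a worked example of the inverse FDR correction, the sketch below computes the per-library cutoff for a toy gene and applies the rule above. fdr_inverse() is a hypothetical helper written for this illustration, and the qValues are invented.

```r
# Hypothetical helper implementing the inverse FDR formula.
fdr_inverse <- function(q, n) {
    1 - (1 - q)^(1 / n)
}

# qValues of one gene in n = 3 libraries of the same condition (toy values).
gene_q_values <- c(0.2, 0.004, 0.03)

# Per-library cutoff for a desired cutoff q = 0.05.
per_library_cutoff <- fdr_inverse(q = 0.05, n = length(gene_q_values))
print(per_library_cutoff)

# The gene is present if at least one library has a qValue below the cutoff.
merged_call <- if (any(gene_q_values < per_library_cutoff)) "present" else "absent"
print(merged_call)
```

The per-library cutoff becomes stricter as the number of libraries grows, and equals the desired cutoff when only one library is used.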

### Arguments and user file to perform the merging

The merging_libraries() function takes the following arguments: userFile, approach, condition, cutoff and outDir.

The userFile argument is the path to a text file containing all information about the target libraries to be merged. This file must contain a column called species_id in order to run the merging_libraries() function. This column must not be empty, and the file must describe at least 2 libraries, since this is a merging process.

If you want to merge libraries for a more detailed condition, you need to add the corresponding extra columns to your user file. Note that the argument condition recognizes the following parameters: species_id, anatEntity, devStage, sex and strain. This means that the columns of your table must use the same names. The order of the columns in the user file is not relevant, nor is the order of the parameters passed to the condition argument.

It is always recommended to use “-” instead of empty values when some information is missing, as shown in BgeeCall/inst/userMetadataTemplate_merging.tsv.
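The recommendation about missing information can be applied with a single base R assignment. The sketch below shows only the condition columns named above, with invented toy values; the other columns of a real user file are left out.

```r
# Toy condition columns for three libraries; NA marks missing information.
# All values except the column names are invented for this example.
conditions <- data.frame(
    species_id = c("6239", "6239", "6239"),
    anatEntity = c("whole_body", NA, "whole_body"),
    devStage = c(NA, NA, "adult"),
    sex = c("hermaphrodite", "hermaphrodite", NA),
    strain = c("N2", "N2", "N2"),
    stringsAsFactors = FALSE
)

# Replace every missing value by "-" before writing the user file.
conditions[is.na(conditions)] <- "-"
print(conditions)
```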

The approach argument specifies the method used to generate the expression calls when merging libraries. As described before, the condition argument specifies the condition of interest and therefore determines which libraries are merged together.

A stricter cut-off can be applied to call expressed genes in the merged libraries of a target condition by modifying the cutoff argument, which is 0.05 by default.

# Session Info

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_GB              LC_COLLATE=C
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
#> [8] methods   base
#>
#> other attached packages:
#> [1] rtracklayer_1.52.0   GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
#> [4] IRanges_2.26.0       S4Vectors_0.30.0     BiocGenerics_0.38.0
#> [7] BgeeCall_1.8.0
#>
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7                  matrixStats_0.58.0
#>  [3] bit64_4.0.5                   insight_0.14.0
#>  [5] filelock_1.0.2                progress_1.2.2
#>  [7] httr_1.4.2                    tools_4.1.0
#>  [9] bslib_0.2.5.1                 utf8_1.2.1
#> [11] R6_2.5.0                      sjlabelled_1.1.8
#> [13] DBI_1.1.1                     rhdf5filters_1.4.0
#> [15] withr_2.4.2                   tidyselect_1.1.1
#> [17] prettyunits_1.1.1             bit_4.0.4
#> [19] curl_4.3.1                    compiler_4.1.0
#> [21] Biobase_2.52.0                DelayedArray_0.18.0
#> [25] rappdirs_0.3.3                stringr_1.4.0
#> [27] digest_0.6.27                 Rsamtools_2.8.0
#> [29] rmarkdown_2.8                 XVector_0.32.0
#> [31] pkgconfig_2.0.3               htmltools_0.5.1.1
#> [33] MatrixGenerics_1.4.0          BSgenome_1.60.0
#> [35] dbplyr_2.1.1                  fastmap_1.1.0
#> [37] rlang_0.4.11                  RSQLite_2.2.7
#> [39] shiny_1.6.0                   jquerylib_0.1.4
#> [41] BiocIO_1.2.0                  generics_0.1.0
#> [43] jsonlite_1.7.2                BiocParallel_1.26.0
#> [45] dplyr_1.0.6                   RCurl_1.98-1.3
#> [47] magrittr_2.0.1                GenomeInfoDbData_1.2.6
#> [49] Matrix_1.3-3                  Rcpp_1.0.6
#> [51] Rhdf5lib_1.14.0               fansi_0.4.2
#> [53] lifecycle_1.0.0               stringi_1.6.2
#> [55] yaml_2.2.1                    SummarizedExperiment_1.22.0
#> [57] zlibbioc_1.38.0               rhdf5_2.36.0
#> [59] BiocFileCache_2.0.0           AnnotationHub_3.0.0
#> [61] grid_4.1.0                    rslurm_0.6.0
#> [63] blob_1.2.1                    promises_1.2.0.1
#> [65] sjmisc_2.8.7                  crayon_1.4.1
#> [67] lattice_0.20-44               Biostrings_2.60.0
#> [69] GenomicFeatures_1.44.0        hms_1.1.0
#> [71] KEGGREST_1.32.0               knitr_1.33
#> [73] pillar_1.6.1                  rjson_0.2.20
#> [75] biomaRt_2.48.0                XML_3.99-0.6
#> [77] glue_1.4.2                    BiocVersion_3.13.1
#> [79] evaluate_0.14                 data.table_1.14.0
#> [81] BiocManager_1.30.15           httpuv_1.6.1
#> [83] png_0.1-7                     vctrs_0.3.8
#> [85] purrr_0.3.4                   assertthat_0.2.1
#> [87] cachem_1.0.5                  xfun_0.23
#> [89] mime_0.10                     xtable_1.8-4
#> [91] restfulr_0.0.13               later_1.2.0
#> [93] tibble_3.1.2                  GenomicAlignments_1.28.0
#> [95] AnnotationDbi_1.54.0          memoise_2.0.0
#> [97] tximport_1.20.0               ellipsis_0.3.2
#> [99] interactiveDisplayBase_1.30.0