# 1 Introduction

DifferentialRegulation is a method for detecting differentially regulated genes between two groups of samples (e.g., healthy vs. disease, or treated vs. untreated samples), by targeting differences in the balance of spliced (S) and unspliced (U) mRNA abundances, obtained from single-cell RNA-sequencing (scRNA-seq) data.

Below, we briefly illustrate the main conceptual and mathematical aspects. For more details, a pre-print will follow shortly (~summer 2022).

## 1.1 Conceptual idea

DifferentialRegulation is based on a similar rationale to RNA velocity tools, notably velocyto (La Manno et al. 2018) and scVelo (Bergen et al. 2020), which compare spliced and unspliced abundances to their equilibrium levels.

Intuitively, if a large fraction of U is present for a gene, this will be spliced and increase the relative abundance of S. Conversely, if a small fraction of U is present for a gene, the newly spliced mRNA will not compensate the degradation of the (already) spliced mRNA, and the proportion of S will decrease in the short term. Therefore, in the two examples above, the gene is currently being up- and down-regulated, respectively; i.e., gene expression is going to increase and decrease, respectively.

We extend this argument to compare the relative abundances of S and U reads across groups of samples. In particular, a higher proportion of unspliced (spliced) mRNA in a condition suggests that a gene is currently being up- (down-) regulated compared to the other condition.

While canonical differential gene expression focuses on changes in overall gene abundance, DifferentialRegulation discovers differences (between conditions) in the near future changes of (spliced) gene expression (conceptually similar to the derivative of S respect to time).

Similarly to RNA velocity tools, DifferentialRegulation is an instrument to facilitate discoveries in the context of development.

## 1.2 Mathematical details

From a mathematical point of view, DifferentialRegulation accounts for the sample-to-sample variability, and embeds multiple samples in a Bayesian hierarchical model. Our method also deals with two major sources of mapping uncertainty: i) ‘ambiguous’ reads, compatible with both spliced and unspliced versions of a gene, and ii) reads mapping to multiple genes. In particular, ambiguous reads are treated separately from spliced and unsplced reads; furthermore, when providing equivalence classes data (via EC_list), reads that are compatible with multiple genes are allocated to the gene of origin.

DifferentialRegulation uses two nested models:

• a Dirichlet-multinomial ($$DM$$) for the proportions of unspliced, spliced and ambiguous (USA) reads in each gene: $$DM(\pi_U, \pi_S, \pi_A, \delta)$$, where $$\pi_U, \pi_S, \pi_A$$ indicate the (group-level) relative abundances of U, S and A counts, and $$\delta$$ represents the precision parameter, modelling the degree of over-dispersion between samples;
• a multinomial ($$MN$$) for the (sample-specific) relative abundance of genes in each sample: $$MN(\pi^i_1, ..., \pi^i_{n_g})$$, where $$\pi^i_g$$ is the relative abundance of the $$g$$-th gene in the $$i$$-th sample.

The $$DM$$ model is the main focus here, and the one which is used for differential testing between conditions, while the $$MN$$ model is necessary for allocating reads across genes.

Parameters are inferred via Markov chain Monte Carlo (MCMC) techniques (Metropolis-within-Gibbs), and differential testing is performed by comparing $$(\pi_U, \pi_S, \pi_A)$$ between conditions.

## 1.3 Bioconductor installation

DifferentialRegulation is available on Bioconductor and can be installed with the command:

if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("DifferentialRegulation")

To access the R code used in the vignettes, type:

browseVignettes("DifferentialRegulation")

## 1.4 Questions and issues

Questions relative to DifferentialRegulation should be reported as a new issue at BugReports.

To cite DifferentialRegulation, type:

citation("DifferentialRegulation")
##
## To cite package 'DifferentialRegulation' in publications use:
##
##   Tiberi S (2022). _DifferentialRegulation: Differentially regulated
##   genes from scRNA-seq data_. R package version 1.3.0,
##   <https://github.com/SimoneTiberi/DifferentialRegulation>.
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {DifferentialRegulation: Differentially regulated genes from scRNA-seq data},
##     author = {Simone Tiberi},
##     year = {2022},
##     note = {R package version 1.3.0},
##     url = {https://github.com/SimoneTiberi/DifferentialRegulation},
##   }

# 2 Input data: alignment and quantification with alevin-fry

DifferentialRegulation inputs scRNA-seq data, aligned via alevin-fry (He et al. 2022).

NOTE: when using alevin-fry, set options:

• --d (or --dump-eqclasses), to obtain the equivalence classes;
• --use-mtx, to store counts in a quants_mat.mtx file (as expected by our load_USA function).

We also recommend using the --CR-like-EM option, which also allows equivalence classes of reads that map to multiple genes.

# 3 Pipeline

library(DifferentialRegulation)

We use a real droplet scRNA-seq dataset from Velasco et al. (2019). In particular, we compare two groups of three samples, consisting of human brain organoids, cultured for 3 and 6 months. For computational reasons, we stored a subset of this dataset, in our package, consisting of 100 genes and 3,493 cells, belonging to two cell-types. Cell-type assignment was done in the original styudy (Velasco et al. 2019). For more information about the data, refer to the original study here.

We specify the directory of the data (internal in the package).

data_dir = system.file("extdata", package = "DifferentialRegulation")

Specify the directory of the USA (unspliced, spliced and ambiguous) estimated counts, inferred via alevin-fry.

# specify samples ids:
sample_ids = paste0("organoid", c(1:3, 16:18))
# set directories of each sample input data (obtained via alevin-fry):
base_dir = file.path(data_dir, "alevin-fry", sample_ids)

# Note that alevin-fry needs to be run with --use-mtx option to store counts in a quants_mat.mtx file.
path_to_counts = file.path(base_dir,"/alevin/quants_mat.mtx")
path_to_cell_id = file.path(base_dir,"/alevin/quants_mat_rows.txt")
path_to_gene_id = file.path(base_dir,"/alevin/quants_mat_cols.txt")

Specify the directory of the ECs and respective counts, inferred via alevin-fry.

path_to_EC_counts = file.path(base_dir,"/alevin/geqc_counts.mtx")
path_to_EC = file.path(base_dir,"/alevin/gene_eqclass.txt.gz")

Load the unspliced, spliced and ambiguous (USA) counts, quantified by alevin-fry, in a SingleCellExperiment. By default, counts (stored in assays(sce)$counts) are defined as summation of spliced read and 50% of ambiguous reads (i.e., reads compatible with both spliced and unspliced versions of a gene): counts = spliced + 0.5 * ambiguous. sce = load_USA(path_to_counts, path_to_cell_id, path_to_gene_id, sample_ids) sce ## class: SingleCellExperiment ## dim: 100 3493 ## metadata(0): ## assays(4): spliced unspliced ambiguous counts ## rownames(100): ENSG00000223972.5 ENSG00000243485.5 ... ## ENSG00000162592.10 ENSG00000235169.11 ## rowData names(0): ## colnames(3493): organoid1.GCCTCTAGTCGGATCC organoid1.GAACATCGTCTGATCA ## ... organoid18.AGCGTCGAGCGCCTCA organoid18.GATCGCGTCGTTACGA ## colData names(1): sample_id ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Cell types should be assigned to each cell; here we load pre-computed cell types. path_to_DF = file.path(data_dir,"DF_cell_types.txt") DF_cell_types = read.csv(path_to_DF, sep = "\t", header = TRUE) matches = match(colnames(sce), DF_cell_types$cell_id)
sce$cell_type = DF_cell_types$cell_type[matches]
table(sce$cell_type) ## ## Cycling RG ## 2399 1094 ### 3.1.2 Load equivalence classes (EC) Load the equivalence classes and respective counts (only needed when performing differential testing on ECs). EC_list = load_EC(path_to_EC_counts, path_to_EC, path_to_cell_id, path_to_gene_id, sample_ids) ## The percentage of multi-gene mapping reads in sample 'organoid1' is: 11.09 ## The percentage of multi-gene mapping reads in sample 'organoid2' is: 13.53 ## The percentage of multi-gene mapping reads in sample 'organoid3' is: 7.19 ## The percentage of multi-gene mapping reads in sample 'organoid16' is: 8.59 ## The percentage of multi-gene mapping reads in sample 'organoid17' is: 3.75 ## The percentage of multi-gene mapping reads in sample 'organoid18' is: 6.76 For every sample, load_EC prints the percentage of reads compatible with multiple genes (i.e., multi-gene mapping reads). Here multi-gene reads are relatively low, because we are considering a subset of 100 genes only; however, in the full dataset we found that approximately 40% of reads map to multiple genes. Intuitively, the larger these numbers, the greater the benefits one may achieve by using ECs and modelling the variability of these uncertain gene allocations. ## 3.2 QC and filtering Quality control (QC) and filtering of low quality cells can be performed as usual on the sce object. The sce object computed via load_USA contains a counts assays, defined as the summation of spliced counts and 50% of ambiguous counts. For examples of QC, you can refer to the OSCA book (Amezquita et al. 2020). Importantly, cells only need to be filtered in the sce object, and NOT in the EC_list object: cells that are filtered in sce will also be removed from ECs by compute_PB_counts function. ## 3.3 Differential regulation testing Differential testing can be performed on USA estimated counts (faster) or on ECs (slower, but more accurate). Using EC counts allows to explicitly model the uncertainty from reads that map to multiple genes. First, we define the design of the study: in our case we have 2 groups, that we call “A” and “B” of 2 samples each. design = data.frame(sample = sample_ids, group = c( rep("3 mon", 3), rep("6 mon", 3) )) design ## sample group ## 1 organoid1 3 mon ## 2 organoid2 3 mon ## 3 organoid3 3 mon ## 4 organoid16 6 mon ## 5 organoid17 6 mon ## 6 organoid18 6 mon Compute pseudo-bulk (PB) onbject needed for differential testing. PB_counts = compute_PB_counts(sce = sce, EC_list = EC_list, design = design, sample_col_name = "sample", group_col_name = "group", sce_cluster_name = "cell_type") ## the following cell clusters (e.g., cell types) have more than 100 cells and will be analyzed: ## Cycling --- RG NB: to reduce memory usage, we can remove the EC_list object, which typically requires a large amount of memory, particularly in large datasets. If needed, the sce object can be removed as well, as neither is needed for differential testing. rm(EC_list) ### 3.3.1 USA testing To perform differential testing on USA esitmated counts, set EC to FALSE. # sce-based test: set.seed(169612) results_USA = DifferentialRegulation(PB_counts, EC = FALSE) ## 'EC' was set to 'FALSE': estimated counts will be used to perform differential testing (faster, but marginally less accurate). ## We recommend using equivalence classes counts (slower, but marginally more accurate). ## 'n_cores' was left 'NULL'. ## Since tasks are paralellized on cell clusters, we will set 'n_cores' to the number of clusters that will be analyzed. ## 'n_cores' set to: 2 ## 3.4 Visualizing results In both USA and ECs approaches, DifferentialRegulation function returns of a list of 4 data.frames: • Differential_results, which contains results from differential testing only; • US_results, that has results for the proportion of Spliced and Unspliced counts (where Ambiguous counts are allocated 50:50 to Spliced and Unspliced); • USA_results, which includes results for the proportion of Spliced, Unspliced and Ambiguous counts (Ambiguous counts are reported separately from Spliced and Unspliced counts); • Convergence_results, that contains information about convergence of posterior chains. names(results_EC) ## [1] "Differential_results" "US_results" "USA_results" ## [4] "Convergence_results" In Differential_results element, columns Gene_id and Cluster_id contain the gene and cell-cluster name, while p_val, p_adj.loc and p_adj.glb report the raw p-values, locally and globally adjusted p-values, via Benjamini and Hochberg (BH) correction. In locally adjusted p-values (p_adj.loc) BH correction is applied to each cluster separately, while in globally adjusted p-values (p_adj.glb) BH correction is performed to the results from all clusters. head(results_EC$Differential_results, 3)
##               Gene_id Cluster_id        p_val    p_adj.glb    p_adj.loc
## 7   ENSG00000176022.7    Cycling 8.706603e-06 0.0003482641 0.0002089585
## 29 ENSG00000157933.10         RG 1.994047e-04 0.0039818533 0.0023891120
## 30  ENSG00000176022.7         RG 2.986390e-04 0.0039818533 0.0023891120

In US_results and USA_results elements, pi and sd indicate the estimated proportion and standard deviation, respectively, S, U and A refer to Spliced, Unspliced and Ambiguous counts, respectively, while 3 mon and 6 mon refer to the groups, as named in the design. For instance, columns pi_S-3 mon and sd_S-3 mon indicate the estimate (posterior mean) and standard deviation (sd) for the proportion of Spliced (pi_S) and Unspliced (pi_U) counts in group 3 mon, respectively.

We visualize US results.

head(results_EC$US_results, 3) ## Gene_id Cluster_id p_val p_adj.glb p_adj.loc ## 7 ENSG00000176022.7 Cycling 8.706603e-06 0.0003482641 0.0002089585 ## 29 ENSG00000157933.10 RG 1.994047e-04 0.0039818533 0.0023891120 ## 30 ENSG00000176022.7 RG 2.986390e-04 0.0039818533 0.0023891120 ## pi_S-3 mon pi_U-3 mon pi_S-6 mon pi_U-6 mon sd_S-3 mon sd_U-3 mon sd_S-6 mon ## 7 0.5829504 0.4170496 0.8232916 0.1767084 0.05415442 0.05415442 0.11728021 ## 29 0.0274276 0.9725724 0.3850561 0.6149439 0.00781943 0.00781943 0.08704723 ## 30 0.8551297 0.1448703 0.5380567 0.4619433 0.07496266 0.07496266 0.08476816 ## sd_U-6 mon ## 7 0.11728021 ## 29 0.08704723 ## 30 0.08476816 We visualize USA results. head(results_EC$USA_results, 3)
##               Gene_id Cluster_id        p_val    p_adj.glb    p_adj.loc
## 7   ENSG00000176022.7    Cycling 8.706603e-06 0.0003482641 0.0002089585
## 29 ENSG00000157933.10         RG 1.994047e-04 0.0039818533 0.0023891120
## 30  ENSG00000176022.7         RG 2.986390e-04 0.0039818533 0.0023891120
##    pi_S-3 mon pi_U-3 mon pi_A-3 mon pi_S-6 mon pi_U-6 mon pi_A-6 mon
## 7  0.25312785 0.08722696 0.65964518  0.7520064  0.1054233 0.14257036
## 29 0.01508427 0.96022907 0.02468667  0.3666250  0.5965127 0.03686228
## 30 0.81666799 0.10640859 0.07692342  0.1948836  0.1187701 0.68634637
##     sd_S-3 mon sd_U-3 mon  sd_A-3 mon sd_S-6 mon sd_U-6 mon sd_A-6 mon
## 7  0.054735610 0.06534304 0.052920650 0.14623647 0.10292961 0.09455491
## 29 0.006582283 0.01005951 0.006668304 0.08965525 0.08800258 0.03544229
## 30 0.086695067 0.06750946 0.040859781 0.13002822 0.13888689 0.20892880

We can also visualize information about the convergence of the posterior chains.

results_EC$Convergence_results ## Cluster_id burn_in_one_cl N_MCMC_one_cl first_chain ## 1 Cycling 500 2000 converged ## 2 RG 500 2000 converged ## second_chain ## 1 NOT run (1st chain converged) ## 2 NOT run (1st chain converged) Finally, we can plot the estimated proportions of spliced and unspliced reads. If CI = TRUE (default option), for each estimate, we can also add the respective profile Wald type confidence interval, of level CI_level (0.95 by default). Similarly to above, we can plot the proportion of US or USA reads. Note that, although US reads are easier to interpret, USA reads more closely reflect what is being compared between conditions. plot_pi(results_EC, type = "US", gene_id = results_EC$Differential_results$Gene_id[1], cluster_id = results_EC$Differential_results$Cluster_id[1]) plot_pi(results_EC, type = "USA", gene_id = results_EC$Differential_results$Gene_id[1], cluster_id = results_EC$Differential_results\$Cluster_id[1])

# 4 Session info

sessionInfo()
## R Under development (unstable) (2022-10-25 r83175)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] DifferentialRegulation_1.3.0 BiocStyle_2.27.0
##
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.29.0 gtable_0.3.1
##  [3] xfun_0.34                   bslib_0.4.0
##  [5] ggplot2_3.3.6               Biobase_2.59.0
##  [7] lattice_0.20-45             vctrs_0.5.0
##  [9] tools_4.3.0                 BANDITS_1.15.0
## [11] bitops_1.0-7                generics_0.1.3
## [13] stats4_4.3.0                parallel_4.3.0
## [15] tibble_3.1.8                DRIMSeq_1.27.0
## [17] fansi_1.0.3                 highr_0.9
## [19] R.oo_1.25.0                 pkgconfig_2.0.3
## [21] Matrix_1.5-1                data.table_1.14.4
## [23] rngtools_1.5.2              S4Vectors_0.37.0
## [25] assertthat_0.2.1            lifecycle_1.0.3
## [27] GenomeInfoDbData_1.2.9      farver_2.1.1
## [29] compiler_4.3.0              stringr_1.4.1
## [31] munsell_0.5.0               codetools_0.2-18
## [33] GenomeInfoDb_1.35.0         htmltools_0.5.3
## [35] sass_0.4.2                  RCurl_1.98-1.9
## [37] yaml_2.3.6                  pillar_1.8.1
## [39] jquerylib_0.1.4             R.utils_2.12.1
## [41] MASS_7.3-58.1               SingleCellExperiment_1.21.0
## [43] BiocParallel_1.33.0         DelayedArray_0.25.0
## [45] cachem_1.0.6                limma_3.55.0
## [47] magick_2.7.3                doRNG_1.8.2
## [49] iterators_1.0.14            foreach_1.5.2
## [51] tidyselect_1.2.0            locfit_1.5-9.6
## [53] digest_0.6.30               stringi_1.7.8
## [55] dplyr_1.0.10                reshape2_1.4.4
## [57] bookdown_0.29               labeling_0.4.2
## [59] fastmap_1.1.0               grid_4.3.0
## [61] colorspace_2.0-3            cli_3.4.1
## [63] magrittr_2.0.3              utf8_1.2.2
## [65] edgeR_3.41.0                scales_1.2.1
## [67] rmarkdown_2.17              XVector_0.39.0
## [69] matrixStats_0.62.0          R.methodsS3_1.8.2
## [71] evaluate_0.17               knitr_1.40
## [73] GenomicRanges_1.51.0        IRanges_2.33.0
## [75] doParallel_1.0.17           rlang_1.0.6
## [77] Rcpp_1.0.9                  glue_1.6.2
## [79] DBI_1.1.3                   BiocManager_1.30.19
## [81] BiocGenerics_0.45.0         jsonlite_1.8.3
## [83] R6_2.5.1                    plyr_1.8.7
## [85] MatrixGenerics_1.11.0       zlibbioc_1.45.0

# References

Amezquita, Robert A, Aaron TL Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17 (2): 137–45.

Bergen, Volker, Marius Lange, Stefan Peidli, F Alexander Wolf, and Fabian J Theis. 2020. “Generalizing Rna Velocity to Transient Cell States Through Dynamical Modeling.” Nature Biotechnology 38 (12): 1408–14.

He, Dongze, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. 2022. “Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data.” Nature Methods 19 (3): 316–22.

La Manno, Gioele, Ruslan Soldatov, Amit Zeisel, Emelie Braun, Hannah Hochgerner, Viktor Petukhov, Katja Lidschreiber, et al. 2018. “RNA Velocity of Single Cells.” Nature 560 (7719): 494–98.

Velasco, Silvia, Amanda J Kedaigle, Sean K Simmons, Allison Nash, Marina Rocha, Giorgia Quadrato, Bruna Paulsen, et al. 2019. “Individual Brain Organoids Reproducibly Form Cell Diversity of the Human Cerebral Cortex.” Nature 570 (7762): 523–27.