# 1 Introduction

Over the past decade, advances in single-cell RNA-sequencing (scRNA-seq) technologies have significantly increased the sensitivity and specificity with which cellular transcriptional dynamics can be analyzed. Further, parallel increases in the number of cells that can be simultaneously sequenced have enabled novel analysis pipelines, including the description of transcriptional trajectories and the discovery of rare sub-populations of cells. The development of droplet-based, unique-molecular-identifier (UMI) protocols such as Drop-seq, inDrop, and the 10x Genomics Chromium platform has significantly contributed to these advances. In particular, the commercially available 10x Genomics platform has allowed the rapid and cost-effective gene expression profiling of hundreds to tens of thousands of cells across many studies to date. The use of UMIs in the 10x Genomics and related platforms has augmented these developments in sequencing technology by tagging individual mRNA transcripts with unique cell- and transcript-specific identifiers. In this way, biases due to transcript length and PCR amplification have been significantly reduced. However, technical variability in sequencing depth remains and, consequently, normalization to adjust for sequencing depth is required to ensure accurate downstream analyses. To address this, we introduce the Dino normalization method and its corresponding R package, Dino.

Dino uses a flexible mixture-of-Negative-Binomials model of gene expression to reconstruct full gene-specific expression distributions that are independent of sequencing depth. By giving exact zeros positive probability, the Negative Binomial components are applicable to shallow sequencing (high proportions of zeros). Additionally, the mixture component is robust to cell heterogeneity, as it accommodates multiple centers of gene expression in the distribution. By directly modeling (possibly heterogeneous) gene-specific expression distributions, Dino outperforms competing approaches, especially for datasets in which the proportion of zeros is high, as is typical for modern, UMI-based protocols.

Dino does not attempt to correct for batch or other sample-specific effects, and will only do so to the extent that they are correlated with sequencing depth. In situations where batch effects are expected, downstream analysis may benefit from a separate batch-correction step.

# 2 Quick Start

## 2.1 Installation

To install Dino from GitHub, run:

devtools::install_github('JBrownBiostat/Dino')

## 2.2 All-in-one function

The Dino function is an all-in-one function to normalize raw UMI count data from 10x Cell Ranger or similar protocols. Under default options, Dino outputs a sparse matrix of normalized expression. SeuratFromDino provides one-line functionality to return a Seurat object from raw UMI counts or from a previously normalized expression matrix.

library(Dino)

# Return a sparse matrix of normalized expression
Norm_Mat <- Dino(UMI_Mat)

# Return a Seurat object from already normalized expression
# Use un-transformed normalized expression
Norm_Seurat <- SeuratFromDino(Norm_Mat, doNorm = FALSE, doLog = FALSE)

# Return a Seurat object from UMI expression
# Transform normalized expression as log(x + 1) to improve
# some types of downstream analysis
Norm_Seurat <- SeuratFromDino(UMI_Mat)
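
The log(x + 1) transform mentioned in the comments above can be illustrated in base R with `log1p`. The sketch below uses a hypothetical toy matrix rather than Dino output, and it does not claim that SeuratFromDino calls `log1p` internally; it only shows that the transform maps zeros to zeros, so sparsity is preserved.

```r
# Hypothetical toy matrix standing in for normalized expression
# (genes on rows, cells on columns); values are illustrative only.
toy_norm <- matrix(c(0, 1.5, 0, 9, 0, 3), nrow = 2,
                   dimnames = list(c("geneA", "geneB"),
                                   c("cell1", "cell2", "cell3")))

# log(x + 1), computed stably for small x
toy_log <- log1p(toy_norm)

# Zeros stay exactly zero, so a sparse matrix would stay sparse
zeros_preserved <- all((toy_norm == 0) == (toy_log == 0))
print(zeros_preserved)
```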

# 3 Detailed steps

## 3.1 Read UMI data

To facilitate concrete examples, we demonstrate normalization on a small subset of sequencing data from about 3,000 peripheral blood mononuclear cells (PBMCs) published by 10x Genomics. This dataset, named pbmcSmall, contains 200 cells and 1,000 genes and is included with the Dino package.

set.seed(1)

# Bring pbmcSmall into R environment
library(Dino)
library(Seurat)
library(Matrix)
data("pbmcSmall")
print(dim(pbmcSmall))
## [1] 1000  200

While Dino was developed to normalize UMI count data, it will run on any matrix of non-negative expression data; user caution is advised if applying Dino to non-UMI sequencing protocols. Input formats may be sparse or dense matrices of expression with genes (features) on the rows and cells (samples) on the columns.
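
As a rough illustration of the input requirements just described, the following base-R sketch runs a few sanity checks on a toy matrix. The helper `check_umi_input` is hypothetical and not part of Dino, and the "more genes than cells" test is only a heuristic hint about orientation, not a rule.

```r
# Hypothetical helper (not part of Dino): basic sanity checks on a
# genes-by-cells count matrix before normalization.
check_umi_input <- function(counts) {
  stopifnot(length(dim(counts)) == 2)
  vals <- as.matrix(counts)  # densify; acceptable for a small check
  c(
    non_negative          = all(vals >= 0),
    whole_numbers         = all(vals == round(vals)),
    more_genes_than_cells = nrow(vals) > ncol(vals)
  )
}

# Toy data: 4 genes (rows) by 2 cells (columns), illustrative only
toy_counts <- matrix(c(0, 2, 5, 0, 1, 0, 3, 0), nrow = 4,
                     dimnames = list(paste0("gene", 1:4),
                                     paste0("cell", 1:2)))
print(check_umi_input(toy_counts))
```

A `FALSE` for `whole_numbers` would suggest the input is not raw UMI counts, which is the case where the caution above applies.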

## 3.2 Clean UMI data

While Dino can normalize the pbmcSmall dataset as it currently exists, the resulting normalized matrix and, in particular, downstream analysis are likely to be improved by cleaning the data. Of greatest use is removing genes that are not expected to contain useful information. This set of genes may be case dependent, but a good rule of thumb for UMI protocols is to remove genes lacking a minimum of non-zero expression prior to normalization and analysis.

By default, Dino will not perform the resampling algorithm on any genes without at least 10 non-zero samples, and will rather normalize such genes by scaling with sequencing depth. To demonstrate a stricter threshold, we remove genes lacking at least 20 non-zero samples prior to normalization.

# Filter genes for a minimum of non-zero expression
pbmcSmall <- pbmcSmall[rowSums(pbmcSmall != 0) >= 20, ]
print(dim(pbmcSmall))
## [1] 907 200
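
To make the filtering rule concrete, here is the same operation on a toy matrix small enough to check by hand. The data and the threshold of 2 are illustrative assumptions; `drop = FALSE` is added so that the result stays a matrix even if only one gene survives.

```r
# Toy genes-by-cells matrix; values are illustrative only
toy_counts <- matrix(c(0, 0, 3,
                       1, 2, 0,
                       0, 0, 0,
                       4, 1, 2),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:4),
                                     paste0("cell", 1:3)))

# Number of cells with non-zero expression, per gene
n_nonzero <- rowSums(toy_counts != 0)

# Keep genes expressed in at least `min_cells` cells
# (the vignette uses 20; 2 is used here for the toy data)
min_cells <- 2
toy_filtered <- toy_counts[n_nonzero >= min_cells, , drop = FALSE]
print(rownames(toy_filtered))
```

Here gene1 (one non-zero cell) and gene3 (none) are dropped, while gene2 and gene4 are kept.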

## 3.3 Normalize UMI data

Dino contains several options to tune output. One of particular interest is nCores, which allows for parallel computation of normalized expression. By default, Dino runs with two threads. Setting nCores = 0 uses all available cores; any other positive integer sets the number of parallel instances.

# Normalize data
pbmcSmall_Norm <- Dino(pbmcSmall)

## 3.4 Clustering with Seurat

After normalization, Dino makes it easy to perform data analysis. The default output is the normalized matrix in sparse format, and Dino additionally provides a function to transform normalized output into a Seurat object. We demonstrate this by running a quick clustering pipeline in Seurat. Much of the pipeline is modified from the tutorial at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html

# Reformat normalized expression as a Seurat object
pbmcSmall_Seurat <- SeuratFromDino(pbmcSmall_Norm, doNorm = FALSE)

# Cluster pbmcSmall_Seurat
pbmcSmall_Seurat <- FindVariableFeatures(pbmcSmall_Seurat,
    selection.method = "mvp")
pbmcSmall_Seurat <- ScaleData(pbmcSmall_Seurat,
    features = rownames(pbmcSmall_Norm))
pbmcSmall_Seurat <- RunPCA(pbmcSmall_Seurat,
    features = VariableFeatures(object = pbmcSmall_Seurat),
    verbose = FALSE)
pbmcSmall_Seurat <- FindNeighbors(pbmcSmall_Seurat, dims = 1:10)
pbmcSmall_Seurat <- FindClusters(pbmcSmall_Seurat, verbose = FALSE)
pbmcSmall_Seurat <- RunUMAP(pbmcSmall_Seurat, dims = 1:10)
DimPlot(pbmcSmall_Seurat, reduction = "umap")

## 3.5 Normalizing data formatted as SingleCellExperiment

Dino additionally supports the normalization of datasets formatted as SingleCellExperiment. As with the Seurat pipeline, this functionality is implemented through the use of a wrapper function. We demonstrate this by quickly converting the pbmcSmall dataset to a SingleCellExperiment object and then normalizing.

# Reformatting pbmcSmall as a SingleCellExperiment
library(SingleCellExperiment)
pbmc_SCE <- SingleCellExperiment(assays = list("counts" = pbmcSmall))

# Run Dino
pbmc_SCE <- Dino_SCE(pbmc_SCE)
str(normcounts(pbmc_SCE))
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:162894] 0 1 2 3 4 5 6 7 8 9 ...
##   ..@ p       : int [1:201] 0 808 1610 2429 3227 4031 4853 5662 6458 7272 ...
##   ..@ Dim     : int [1:2] 907 200
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:907] "ENSG00000087086" "ENSG00000167996" "ENSG00000251562" "ENSG00000205542" ...
##   .. ..$ : chr [1:200] "CCAACCTGACGTAC-1" "ATCTGGGATTCCGC-1" "TACTTTCTTTTGGG-1" "CAGGCCGAACACGT-1" ...
##   ..@ x       : num [1:162894] 105.38 29.15 13.05 31.54 9.54 ...
##   ..@ factors : list()

## 3.6 Alternate sequencing depth

By default, Dino computes sequencing depth, which is corrected for in the normalized data, as the sum of expression for a cell (sample) across genes. This sum is then scaled such that the median depth is 1. For some datasets, however, it may be beneficial to run Dino on an alternately computed set of sequencing depths. Note: it is generally recommended that the median depth not be far from 1 as this corresponds to recomputing expression as though all cells had been sequenced at the median depth.
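
The default depth described above can be sketched in a few lines of base R. This is an illustration of the stated rule (column sums scaled to a median of 1) on toy data, not a copy of Dino's internal code.

```r
# Toy genes-by-cells matrix; values are illustrative only
toy_counts <- matrix(c(1, 0, 3,
                       2, 4, 6,
                       1, 0, 3),
                     nrow = 3, byrow = TRUE)

# Depth as the per-cell (column) sum of counts
raw_depth <- colSums(toy_counts)

# Rescale so the median depth equals 1
depth <- raw_depth / median(raw_depth)
print(depth)
```

The example at the end of this section passes `log(scranSizes)` to the depth argument, which suggests that user-supplied depths are given on the log scale; under that reading, a median depth of 1 corresponds to a median log-depth of 0.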

A simple pipeline to compute alternate sequencing depths uses the scran method for computing normalization scale factors, and is demonstrated below.

library(scran)

# Compute scran size factors
scranSizes <- calculateSumFactors(pbmcSmall)

# Re-normalize data
pbmcSmall_SNorm <- Dino(pbmcSmall, nCores = 1, depth = log(scranSizes))

# 4 Method

## 4.1 Model

Dino models observed UMI counts as a mixture of Negative Binomial random variables. The Negative Binomial distribution can, however, be decomposed into a hierarchical Gamma-Poisson distribution, so for gene $g$ and cell $j$, the Dino model for UMI counts is:

$$
\begin{aligned}
y_{gj} &\sim f^{P}(\lambda_{gj}\delta_{j}) \\
\lambda_{gj} &\sim \sum_{k=1}^{K}\pi_{k}f^{G}\left(\frac{\mu_{gk}}{\theta_g},\theta_g\right)
\end{aligned}
$$

where $f^{P}$ is a Poisson distribution parameterized by mean $\lambda_{gj}\delta_{j}$ and $f^{G}$ is a Gamma distribution parameterized by shape $\mu_{gk}/\theta_g$ and scale $\theta_g$. Here, $\delta_{j}$ is the cell-specific sequencing depth, $\lambda_{gj}$ is the latent level of gene/cell-specific expression independent of depth, the component probabilities $\pi_{k}$ sum to 1, the Gamma distribution is parameterized such that $\mu_{gk}$ denotes the distribution mean, and the Gamma scale parameter, $\theta_g$, is constant across mixture components.
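
The generative model above can be simulated directly in base R, which may help in reading the parameterization. All parameter values below (two components, their means, the shared scale, and the depth distribution) are hypothetical choices for illustration and are not Dino defaults.

```r
set.seed(1)

# All parameter values below are hypothetical, chosen for illustration
n_cells <- 5000
pi_k  <- c(0.3, 0.7)   # mixture probabilities, sum to 1
mu_k  <- c(2, 10)      # component means
theta <- 0.5           # Gamma scale shared across components
delta <- exp(rnorm(n_cells, sd = 0.3))  # cell-specific depths
delta <- delta / median(delta)          # rescale to median depth 1

# Draw a mixture component for each cell
k <- sample(seq_along(pi_k), n_cells, replace = TRUE, prob = pi_k)

# Latent expression: Gamma with shape mu_k / theta and scale theta,
# so each component has mean mu_k
lambda <- rgamma(n_cells, shape = mu_k[k] / theta, scale = theta)

# Observed UMI counts: Poisson with mean lambda * delta
y <- rpois(n_cells, lambda * delta)

# Compare the empirical latent mean with sum(pi_k * mu_k)
print(c(empirical = mean(lambda), theoretical = sum(pi_k * mu_k)))
```

Marginalizing over the Gamma draw gives a Negative Binomial count within each component, which is the sense in which the two formulations are equivalent.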

Following model fitting for a fixed gene through an accelerated EM algorithm, Dino produces normalized expression values by resampling from the posterior distribution of the latent expression parameters, $\lambda_{gj}$. It can be shown that the distribution of the $\lambda_{j}$ (dropping the gene-specific subscript $g$, as calculations are repeated across genes) is a mixture of Gammas, specifically:

$$
\mathbb{P}(\lambda_{j}\mid y_{j},\delta_j)=\sum_{k=1}^{K}\tau_{kj}f^{G}\left(\frac{\mu_{k}}{\theta}+\gamma y_{j},\frac{1}{\frac{1}{\theta}+\gamma\delta_j}\right)
$$

where $\tau_{kj}$ denotes the conditional probability that $\lambda_{j}$ was sampled from mixture component $k$ and $\gamma$ is a global concentration parameter. The $\tau_{kj}$ are estimated as part of the implementation of the EM algorithm in Dino. The adjustment from the concentration parameter can be seen as a bias in the normalized values towards a scale-factor version of normalization, since, in the limit of large $\gamma$, the normalized expression for cell $j$ converges to $y_j/\delta_j$. The default value, $\gamma=15$, has proven successful.
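
The limiting behavior can be checked in one step from the mean and variance of a Gamma distribution (shape times scale, and shape times scale squared). The worked expressions below use only the symbols already defined, for a single mixture component:

```latex
\mathbb{E}\left[\lambda_{j}\mid y_{j},\delta_{j},k\right]
  =\frac{\mu_{k}/\theta+\gamma y_{j}}{1/\theta+\gamma\delta_{j}}
  \;\xrightarrow{\;\gamma\to\infty\;}\;\frac{y_{j}}{\delta_{j}},
\qquad
\operatorname{Var}\left[\lambda_{j}\mid y_{j},\delta_{j},k\right]
  =\frac{\mu_{k}/\theta+\gamma y_{j}}{\left(1/\theta+\gamma\delta_{j}\right)^{2}}
  \;\xrightarrow{\;\gamma\to\infty\;}\;0.
```

Because neither limit depends on the component, the whole mixture concentrates at the scale-factor value as the concentration parameter grows, while smaller values leave more weight on the fitted component means.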

## 4.2 Mixture components $K$

Approximating the flexibility of a non-parametric method, Dino uses a large number of mixture components, $K$, in order to capture the full heterogeneity of expression that may exist for a given gene. The gene-specific number of components is estimated as the square root of the number of strictly positive UMI counts for a given gene. By default, $K$ is limited to be no larger than 100. In simulation, large values of $K$ are shown to successfully reconstruct both unimodal and multimodal underlying distributions. For example, when UMI counts are simulated under a single negative binomial distribution, the Dino fitted prior distribution (black, right panel), which is used to sample normalized expression, closely matches the theoretical sampling distribution (red, right panel). Likewise, the fitted means ($\mu_k$ in the model, gray lines, left panel) span the range of the simulated data (heat map of counts, left panel), but concentrate around the theoretical mean of the sampling distribution (red, left panel).
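
The square-root rule and the cap of 100 described in this section can be written as a short base-R sketch. The helper `estimate_k` is hypothetical and is not Dino's internal code; in particular, rounding up with `ceiling` is an assumption, since the text does not say how a non-integer square root is handled.

```r
# Hypothetical helper (not Dino's internal code): number of mixture
# components for one gene. Rounding up is an assumption.
estimate_k <- function(gene_counts, max_k = 100) {
  n_positive <- sum(gene_counts > 0)
  min(ceiling(sqrt(n_positive)), max_k)
}

# 25 strictly positive counts among 200 cells: sqrt(25) = 5
toy_gene <- c(rep(0, 175), rep(2, 25))
print(estimate_k(toy_gene))

# 40,000 positive counts: sqrt is 200, which is capped at 100
print(estimate_k(rep(1, 40000)))
```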