1 Introduction


PsiNorm is a scalable between-sample normalization for single-cell RNA-seq count data based on the power-law Pareto type I distribution. It can be shown that the Pareto shape parameter is inversely proportional to the sequencing depth, that it is sample specific, and that its estimate can be obtained for each cell independently. PsiNorm computes the shape parameter for each cellular sample and then uses it as a multiplicative size factor to normalize the data. The goal of the transformation is to align the gene expression distributions across cells, especially for highly expressed genes. Note that, like other global scaling methods, PsiNorm does not remove batch effects, which can be handled by downstream tools.
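The idea can be illustrated with a minimal base R sketch. It is not the package's implementation: it assumes the standard maximum-likelihood estimator of the Pareto type I shape parameter, a pseudocount of 1, and the per-cell minimum as the location parameter. The function name `pareto_shape` and the toy matrix are hypothetical.

```r
# Toy count matrix: 5 genes (rows) x 2 cells (columns).
# Cell 2 has roughly three times the depth of cell 1.
counts <- matrix(c(0, 1, 3,  9, 30,
                   0, 3, 9, 27, 90), nrow = 5)

# Maximum-likelihood estimate of the Pareto type I shape parameter,
# computed independently for each cell (column).
# Assumptions: pseudocount of 1, location = per-cell minimum.
pareto_shape <- function(counts) {
  x <- counts + 1
  n <- nrow(x)
  x_min <- apply(x, 2, min)
  log_ratio <- sweep(log(x), 2, log(x_min))  # log(x / x_min), per cell
  n / colSums(log_ratio)
}

alpha <- pareto_shape(counts)

# Use the shape parameter as a multiplicative size factor.
normalized <- sweep(counts, 2, alpha, "*")
```

In this toy example the deeper cell receives the smaller shape estimate, so multiplying by it shrinks that cell's counts toward those of the shallower cell.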

To evaluate the ability of PsiNorm to remove technical bias and reveal the true cell similarity structure, we use both an unsupervised and a supervised approach. We first simulate a scRNA-seq experiment with four known clusters using the splatter Bioconductor package. Then, in the unsupervised approach, we i) reduce dimensionality using PCA, ii) identify clusters using the clara partitioning method, and iii) compute the Adjusted Rand Index (ARI) to compare the known and the estimated partitions.

In the supervised approach, we i) reduce dimensionality using PCA, and ii) compute the silhouette index of the known partition in the reduced-dimensional space.
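The three unsupervised steps can be sketched on toy data as follows. This is only an illustration under stated assumptions: the data are two simulated Gaussian groups rather than splatter output, clara comes from the cluster package, and `adjusted_rand_index` is a hypothetical helper written here from the standard ARI formula.

```r
library(cluster)  # clara()

set.seed(1)
# Toy data (hypothetical): 100 observations in 10 dimensions, two known groups.
truth <- rep(1:2, each = 50)
x <- matrix(rnorm(100 * 10), nrow = 100)
x[truth == 2, ] <- x[truth == 2, ] + 5  # shift group 2 away from group 1

# i) reduce dimensionality with PCA, keeping the first two components
pcs <- prcomp(x, scale. = TRUE)$x[, 1:2]

# ii) partition the observations with clara
fit <- clara(pcs, k = 2)

# iii) compare the known and the estimated partition with the ARI
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))          # pairs placed together in both partitions
  sum_a <- sum(choose(rowSums(tab), 2))  # pairs placed together in a
  sum_b <- sum(choose(colSums(tab), 2))  # pairs placed together in b
  expected <- sum_a * sum_b / choose(n, 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

ari <- adjusted_rand_index(truth, fit$clustering)
```

An ARI close to 1 indicates that the estimated clusters agree with the known groups, while values near 0 indicate agreement no better than chance.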
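The supervised evaluation can be sketched in the same spirit. The example below is a toy illustration, not the analysis reported in this document: it assumes two simulated Gaussian groups and uses `silhouette()` from the cluster package with Euclidean distances in the space of the first two principal components.

```r
library(cluster)  # silhouette()

set.seed(1)
# Toy data (hypothetical): 100 observations in 10 dimensions, two known groups.
truth <- rep(1:2, each = 50)
x <- matrix(rnorm(100 * 10), nrow = 100)
x[truth == 2, ] <- x[truth == 2, ] + 5  # shift group 2 away from group 1

# i) reduce dimensionality with PCA, keeping the first two components
pcs <- prcomp(x, scale. = TRUE)$x[, 1:2]

# ii) silhouette width of each observation with respect to the KNOWN partition
sil <- silhouette(truth, dist(pcs))
mean_sil <- mean(sil[, "sil_width"])
```

A mean silhouette width near 1 means the known groups are compact and well separated in the reduced space; no clustering step is involved, which is why the approach is called supervised.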

2 Citation

If you use PsiNorm in publications, please cite the following article:

Borella, M., Martello, G., Risso, D., & Romualdi, C. (2021). PsiNorm: a scalable normalization for single-cell RNA-seq data. bioRxiv.

3 Data Simulation

We simulate a matrix of counts with 2000 cellular samples and 10000 genes with splatter.

library(splatter)

params <- newSplatParams()
N <- 2000  # number of cells, as stated above
sce <- splatSimulateGroups(params, batchCells = N, lib.loc = 12,
                           group.prob = rep(0.25, 4),
                           de.prob = 0.2, de.facLoc = 0.06,
                           verbose = FALSE)

sce is a SingleCellExperiment object with a single batch and four different cellular groups.

To visualize the data, we use the first two principal components estimated from the log-transformed raw count matrix.

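library(scater)  # runPCA(), plotPCA(); also attaches SingleCellExperiment

# Store log(1 + count) as a new assay, then run PCA on it
assay(sce, "lograwcounts") <- log1p(counts(sce))
sce <- runPCA(sce, exprs_values = "lograwcounts", scale = TRUE, ncomponents = 2)
plotPCA(sce, colour_by = "Group")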