bayNorm 1.2.0
bayNorm has been submitted to Bioconductor, once it is accepted, it can be installed via:
library(BiocManager)
BiocManager::install("bayNorm")
Currently bayNorm can be installed via github:
library(devtools)
devtools::install_github("WT215/bayNorm")
The main function is bayNorm
which is a wrapper function of prior parameters estimation and normalized array or matrix generation.
Essential parameters for running bayNorm
are:
Data
: a SummarizedExperiment
object or matrix (rows: genes, columns: cells).BETA_vec
: a vector of probabilities which is of length equal to the number of cells.Conditions
: If Conditions
is provided, prior parameters will be estimated within each group of cells (we name this kind of procedure as “LL” procedure where “LL” stands for estimating both \(\mu\) and \(\phi\) locally). Otherwise, bayNorm applied “GG” procedure for estimating prior parameters (estimating both \(\mu\) and \(\phi\) globally).Prior_type
: Even if you have specified the Conditions
, you can still choose to estimate prior parameters across all the cells by setting Prior_type="GG"
.The input parameters BETA_vec
, Conditions
(if specified), UMI_sffl
(if specified), Prior_type
, FIX_MU
, BB_SIZE
and GR
are stored in the list input_params
which should be output from bayNorm
. Objects PRIORS
together with input_params
output from bayNorm
should be input in bayNorm_sup
for transforming between 3D array, mode or mean version’s output of normalized count matricies.
Apart from the raw data, another parameter which user may need to provide is the mean capture efficiency \(<\beta>\) and hence \(\beta\) can be further calculated using scaling factors estimated from any other methods. The default \(\beta\) is calculated to be total counts normalized to 6%. Or you can use BetaFun
in bayNorm to estimate \(\beta\):
data('EXAMPLE_DATA_list')
#Suppose the input data is a SummarizedExperiment object:
#Here we just used 30 cells for illustration
rse < SummarizedExperiment::SummarizedExperiment(assays=SimpleList(counts=EXAMPLE_DATA_list$inputdata[,seq(1,30)]))
#SingleCellExperiment object can also be input in bayNorm:
#rse < SingleCellExperiment::SingleCellExperiment(assays=list(counts=EXAMPLE_DATA_list$inputdata))
BETA_est<BetaFun(Data=rse,MeanBETA=0.06)
summary(BETA_est$BETA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02299 0.03859 0.05255 0.06000 0.07473 0.11168
The function BetaFun
selects a subset of genes such that outlier genes and genes with high proportion of zeros across cells are excluded. Then the total counts of the subset of genes in each cell are normalized to \(<\beta>\). Another way is to normalize scaling factors estimated from the R package scran
to \(<\beta>\).
BETA_vec
can be set to be NULL
, then the \(\beta\) are estimated to be total counts normalized to 6%.
#Return 3D array normalzied data, draw 20 samples from posterior distribution:
bayNorm_3D<bayNorm(
Data=rse,
BETA_vec = NULL,
mode_version=FALSE,
mean_version = FALSE,S=20
,verbose =FALSE,
parallel = TRUE)
##

  0%

=============  20%

===========================  40%

========================================  60%

======================================================  80%

=================================================================== 100%
##
##

  0%

===  5%

=======  10%

==========  15%

=============  20%

=================  25%

====================  30%

=======================  35%

===========================  40%

==============================  45%

==================================  50%

=====================================  55%

========================================  60%

============================================  65%

===============================================  70%

==================================================  75%

======================================================  80%

=========================================================  85%

============================================================  90%

================================================================  95%

=================================================================== 100%
#Return 2D matrix normalized data (MAP of posterior):
#Simply set mode_version=TRUE, but keep mean_version=FALSE
#Return 2D matrix normalized data (mean of posterior):
#Simply set mean_version=TRUE, but keep mode_version=FALSE
bayNorm’s mathematical model is suitable for UMI dataset. However it can be also applied on nonUMI dataset. In bayNorm
, you need to specify the following parameter:
* UMI_sffl
: bayNorm can also be applied on the nonUMI dataset. However, user need to provide a scaled number. Raw data will be divided by the scaled number and bayNorm will be applied on the rounded scaled data. By doing so, the Dropout vs Mean expression plots will be similar to that of UMI dataset.
If you have run bayNorm on a dataset before but want to output another kind of data (3D array or 2D matrix), you can use the function bayNorm_sup
. It is important to input the existing estimated parameters by specifying the following parameter in bayNorm_sup
:
BETA_vec
: If Conditions
has been specified previously, then input unlist(bayNorm_output$BETA)
PRIORS
: input bayNorm_output$PRIORS_LIST
Conditions
: make sure to specify the same Conditions
as before.
You can find these two objects from the previous output of bayNorm function, which is a list.#Now if you want to generate 2D matrix (MAP) using the same prior
#estimates as generated before:
bayNorm_2D<bayNorm_sup(
Data=rse,
PRIORS=bayNorm_3D$PRIORS,
input_params=bayNorm_3D$input_params,
mode_version=TRUE,
mean_version = FALSE,
verbose =FALSE)
#Or you may want to generate 2D matrix
#(mean of posterior) using the same prior
#estimates as generated before:
bayNorm_2D<bayNorm_sup(
Data=rse,
PRIORS=bayNorm_3D$PRIORS,
input_params=bayNorm_3D$input_params,
mode_version=FALSE,
mean_version = TRUE,
verbose =FALSE)
Prior_fun
: This is a wrapper function of EstPrior
, BB_Fun
and AdjustSIZE_fun
:
EstPrior
: Estimating priors using MME methods.BB_Fun
: Estimating priors by maximizing marginal likelihood function with respect to either \(\phi\) or both \(\mu\) and \(\phi\).AdjustSIZE_fun
: Adjusting \(\phi\) estimated from BB_Fun
.noisy_gene_detection
: This is a higher wrapper function over bayNorm
, bayNorm_sup
and SyntheticControl
. For details about the rationale behind this function, see the Methods part in the paper 2.
SyntheticControl
: Given a real scRNAseq data with estimated \(\beta\), it will estimate \(\mu\) and \(\phi\) from it and simulate control data using Poisson
distribution. See the Methods part in the paper 2 for more details.
#Downstream analysis: DE genes detection DE function used in bayNorm paper, which was kindly provided by the author of SCnorm.
library(MAST)
library(reshape2)
########SCnorm_runMAST3##########
SCnorm_runMAST3 < function(Data, NumCells) {
if(length(dim(Data))==2){
resultss<SCnorm_runMAST(Data, NumCells)
} else if(length(dim(Data))==3){
library(foreach)
resultss< foreach(sampleind=1:dim(Data)[3],.combine=cbind)%do%{
print(sampleind)
qq<SCnorm_runMAST(Data[,,sampleind], NumCells)
return(qq$adjpval)
}
}
return(resultss)
}
SCnorm_runMAST < function(Data, NumCells) {
NA_ind<which(rowSums(Data)==0)
Data = as.matrix(log2(Data+1))
G1<Data[,seq(1,NumCells[1])]
G2<Data[,seq(1,NumCells[1])]
qq_temp< rowMeans(G2)rowMeans(G1)
qq_temp[NA_ind]=NA
numGenes = dim(Data)[1]
datalong = melt(Data)
Cond = c(rep("c1", NumCells[1]*numGenes), rep("c2", NumCells[2]*numGenes))
dataL = cbind(datalong, Cond)
colnames(dataL) = c("gene","cell","value","Cond")
dataL$gene = factor(dataL$gene)
dataL$cell = factor(dataL$cell)
vdata = FromFlatDF(dataframe = dataL, idvars = "cell", primerid = "gene", measurement = "value", id = numeric(0), cellvars = "Cond", featurevars = NULL, phenovars = NULL)
zlm.output = zlm(~ Cond, vdata, method='glm', ebayes=TRUE)
zlm.lr = lrTest(zlm.output, 'Cond')
gpval = zlm.lr[,,'Pr(>Chisq)']
adjpval = p.adjust(gpval[,1],"BH") ## Use only pvalues from the continuous part.
adjpval = adjpval[rownames(Data)]
return(list(adjpval=adjpval, logFC_re=qq_temp))
}
#Now, detect DE genes between two groups of cells with 15 cells in each group respectively
#For 3D array
DE_out<SCnorm_runMAST3(Data=bayNorm_3D$Bay_out, NumCells=c(15,15))
med_pvals<apply(DE_out,1,median)
#DE genes are called with threshold 0.05:
DE_genes<names(which(med_pvals<0.05))
#For 2D array
DE_out<SCnorm_runMAST3(Data=bayNorm_2D$Bay_out, NumCells=c(15,15))
DE_genes<names(which(DE_out$adjpval<0.05))
A scRNAseq dataset is typically represented in a matrix of dimension \(P\times Q\), where P denotes the total number of genes observed and Q denotes the total number of cells studied. The element \(x_{ij}\) (\(i \in \{1,2,\ldots, P\}\) and \(j \in \{1,2,\ldots, Q\}\)) in the matrix represents the number of transcripts reported for the \(i^{\text{th}}\) gene in the \(j^{\text{th}}\) cell. This is equal to the total number of sequencing reads mapping to that gene in that cell for a nonUMI protocol. For UMI based protocols this is equal to the number of individual UMIs mapping to each gene. The matrix can include data from different groups or batches of cells, representing different biological conditions. This can be represented as a vector of labels for the cell groups or conditions (\(C_j\)). bayNorm generates for each gene (i) in each cell (j) a posterior distribution of original expression counts (\(x_{ij}^0\)), given the observed scRNAseq read count for that gene (\(x_{ij}\)).
A common approach for normalizing scRNAseq data is based on the use of a global scaling factor (\(s_j\)), ignoring any gene specific biases. The normalized data \(\tilde{x}_{ij}\) is obtained by dividing the raw data for each cell \(j\) by the its global scaling factor \(s_j\):
\[\begin{equation} \label{equ::scaling} \tilde{x}_{ij} = \frac{x_{ij}}{s_j} \end{equation}\]In bayNorm, we implement global scaling using a Bayesian approach. We assume given the original number of transcripts in the cell (\(x_{ij}^0\)), the number of transcripts observed (\(x_{ij}\)) follows a Binomial model with probability \(\beta_j\), which we refer to as capture effeiciency and it represents the probability of original transcripts in the cell to be observed. In addition, we assume that the original number or true count of the \(i^{\text{th}}\) gene in the \(j^{\text{th}}\) cell (\(x_{ij}^0\)) follows Negative Binomial distribution with parameters mean (\(\mu\)), size (or dispersion parameter, \(\phi\)), such that: \[\Pr(x^0_{ij}=n\phi_i,\mu_i)=\frac{\Gamma(n+\phi_i)}{\Gamma(\phi_i)n!}(\frac{\phi_i}{\mu_i+\phi_i})^{\phi_i}(\frac{\mu_i}{\mu_i+\phi_i})^{n}\]
So, overall we have the following model:
\[\begin{equation} \label{model:bay} \begin{split} &x_{ij}\sim \text{Binom}(x_{ij}^0,\text{prob}=\beta_j)\\ &x_{ij}^0 \sim \text{NB}(\text{mean}=\mu_i,\text{size}=\phi_i) \end{split} \end{equation}\]Using the Bayes rule, we have the following posterior distribution of original number of mRNAs for each gene in each cell:
\[\begin{equation} \underbrace{\Pr(x_{ij}^0x_{ij},\beta_j,\mu_i,\phi_i)}_\text{Posterior} = \dfrac{ \overbrace{\Pr(x_{ij}x_{ij}^0,\beta_j)}^\text{Likelihood} \times \overbrace{\Pr(x_{ij}^0\mu_i,\phi_i)}^\text{Prior}}{\underbrace{\Pr(x_{ij}\mu_i,\phi_i,\beta_j)}_\text{Marginal likelihood}} \end{equation}\]The prior parameters \(\mu\) and \(\phi\) of each gene were estimated using an empirical Bayesian method as discussed in detail in Supplementary Information of 1.
For more details about the rationale of bayNorm, please see 1.
Cell specific capture efficiency \(\beta_j\) and global scaling factor (\(s_j\)) are closely related. We can transform scaling factors estimated by different methods (see below) into \(\beta_j\) values with the following formula:
\[\begin{equation} \beta_j = (s_j/\bar{s})\bar{\beta} \end{equation}\]\(\bar{\beta}\) or \(<\beta>\), a scalar, is an estimate of global mean capture efficiency across all cells, which ranges between 0 and 1.
There are two different methods for estimating \(\bar{\beta}\) and \(\beta_j\):
We finally note, that estimates of capture efficiency discussed above will assume cells have simular original transcript content. Therefore, the bayNorm outputs estimates of original transcript counts for a typical cell, which is corrected for variation in cell size and transcript content. This is usually desirable for downstream analysis such as DE detection. However, if one is interested in absolute origianl count and has additional information such as cell size or total transcirpt content per cell, the capture efficiencies can be approporiatly rescaled for this purpose.
Using an emperical bayes approach, one can use the maximisation of marginal likelihood distribution of the observed counts across cells to estimate prior parameters. Let \(M_i\) denotes the marginal likelihood function for the \(i^\text{th}\) gene across cells. Assuming independence between cells, the logmarginal distribution for the \(i^\text{th}\) gene is
\[\begin{equation} \log M_i = \sum_{j=1}^Q \log \Pr(x_{ij}\mu_i,\phi_i,\beta_j), \end{equation}\]where \(\Pr(x_{ij}\mu_i,\phi_i,\beta_j)\) is the Negative Binomial (see Methods). Maximizing of this equation yields the pair \((\mu_i,\phi_i)\).
The above optimization needs to be done for each of the P genes. We refer to the \(\phi\) and/or \(\mu\) estimated by maximizing marginal distribution as BB estimates for convenience, because bayNorm utilizes spectral projected gradient method (spg) from the R package named ``BB’’. Optimizing the marginal distribution with respect to both \(\mu\) and \(\phi\) (2D optimization) is computationally intensive. If we had a good estimate \(\mu\), then we could optimize the marginal distribution with respect to \(\phi\) alone, which would be much more efficient.
A heuristic way of estimating \(\mu_i\) and \(\phi_i\) is through a variant of the Method of Moments. The first step is to do a simple normalization of the raw data, to scale expressions given the cell specific capture efficiencies (\(\beta_j\)). The simple normalized count \(x_{ij}^s\) is calculated as following:
\[\begin{equation} x_{ij}^s=x_{ij}\frac{\langle\sum_{i=1}^Px_{ij}/\beta_j\rangle_j}{\sum_{i=1}^P{x_{ij}}}, \end{equation}\]where the numerator of the scaling factor of \(x_{ij}\) is obtained by taking the average of scaled total counts across cells.
Based on simple normalized data, we are able to estimate prior parameters \(\mu\) and \(\phi\) of the Negative Binomial distribution using the Method of Moments Estimation (MME), which simply equates the theoretical and empirical moments. This estimation method is fast and simulations suggests it provides good estimates of \(\mu\) but the drawback is that the estimation of \(\phi\) show a systematic bias (see Supplementary Figure S24 ab in 1).
Based on simulation studies (Supplementary Figure S24 in 1), the most robust and efficient estimation of \(\mu\) and \(\phi\) can be obtained using the following combined approach, which is the default setting in bayNorm:
Cells are grouped together for prior estimation, based on cellspecific attributes (\(C_j\)). Prior estimation can be done over all cells irrespective of the experimental condition. We refer to this procedure as “global”. Alternatively, suppose that there are multiple groups of cells in the datasets and we have reasons to believe each group could behave differently. Then we can estimate the prior parameters “\(\mu\) and \(\phi\)” within each group respectively (within groups with the same \(C_j\) value). We refer to this procedure as “local”. Estimating prior parameters across a certain group of cells based on “global” procedure allow for removing potential batch effects. Multiple groups normalization based on “local” procedure allows for amplifying the intergroups’ differences while mitigating the intragroup’s variability, which is suitable for DE detection.
sessionInfo()
## R version 3.6.0 (20190426)
## Platform: x86_64pclinuxgnu (64bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs3.9bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs3.9bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
## [7] LC_PAPER=en_US.UTF8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SingleCellExperiment_1.6.0 SummarizedExperiment_1.14.0
## [3] DelayedArray_0.10.0 BiocParallel_1.18.0
## [5] matrixStats_0.54.0 Biobase_2.44.0
## [7] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [9] IRanges_2.18.0 S4Vectors_0.22.0
## [11] BiocGenerics_0.30.0 bayNorm_1.2.0
## [13] knitr_1.22 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 XVector_0.24.0 compiler_3.6.0
## [4] BiocManager_1.30.4 zlibbioc_1.30.0 bitops_1.06
## [7] iterators_1.0.10 tools_3.6.0 fitdistrplus_1.014
## [10] digest_0.6.18 evaluate_0.13 lattice_0.2038
## [13] doSNOW_1.0.16 Matrix_1.217 foreach_1.4.4
## [16] lsei_1.20 yaml_2.2.0 xfun_0.6
## [19] GenomeInfoDbData_1.2.1 stringr_1.4.0 locfit_1.59.1
## [22] grid_3.6.0 snow_0.43 survival_2.441.1
## [25] rmarkdown_1.12 bookdown_0.9 magrittr_1.5
## [28] codetools_0.216 htmltools_0.3.6 MASS_7.351.4
## [31] splines_3.6.0 npsurv_0.40 stringi_1.4.3
## [34] RCurl_1.954.12