The Normalize Single Cell Data algorithm

The algorithm is based on sctransform [Hafemeister and Satija, 2019]. Briefly, a negative binomial (NB) generalized linear model (GLM) is fit to 2000 genes, sampled uniformly across the range of expression levels. The model for each gene has the form:

$\displaystyle \log{\mathbb{E}(y_i)} = \beta_0 + \beta_1 \log_{10}{m_i},$

where $ y_i$ are the observed counts for the gene in a cell $ i$ that has $ m_i$ total counts. The dispersion parameter $ \gamma = 1/\theta$ of the NB distribution is estimated during fitting using the Cox-Reid adjusted profile likelihood [Robinson et al., 2010]. When $ \gamma = 0$ (i.e., $ \theta = \infty$), the NB distribution reduces to the Poisson distribution.
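A minimal numeric sketch of the per-gene model may help fix the notation. The function and coefficient values below are illustrative, not taken from any particular implementation; note that $ \beta_1 = \ln 10$ corresponds to expected counts scaling linearly with sequencing depth:

```python
import numpy as np

def expected_counts(beta0, beta1, total_counts):
    """Expected NB counts for one gene, given each cell's total counts m_i.

    Implements  E(y_i) = exp(beta0 + beta1 * log10(m_i)).
    """
    return np.exp(beta0 + beta1 * np.log10(total_counts))

def nb_variance(mu, gamma):
    """NB variance mu * (1 + gamma * mu); gamma = 0 recovers Poisson (var = mu)."""
    return mu * (1.0 + gamma * mu)

# Two cells whose sequencing depths differ by 10x (hypothetical coefficients)
m = np.array([2_000.0, 20_000.0])
mu = expected_counts(beta0=-6.0, beta1=np.log(10.0), total_counts=m)
# With beta1 = ln(10), the deeper cell has 10x the expected counts;
# with gamma = 0 the variance equals the mean (the Poisson limit).
var = nb_variance(mu, gamma=0.0)
```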

LOWESS regression is then used to estimate the intercept $ \beta_0$, the log-sequencing-depth coefficient $ \beta_1$, and the dispersion as functions of average expression. The regression serves as a form of regularization that avoids over-fitting the model, which is especially likely for genes with low expression.
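The regularization step can be sketched as smoothing each per-gene parameter against mean expression. In the sketch below a simple Gaussian-kernel smoother stands in for LOWESS, and all names and values are illustrative:

```python
import numpy as np

def smooth_parameter(log_mean_expr, param, bandwidth=0.5):
    """Regularize a per-gene parameter (e.g. beta0, beta1, or dispersion)
    by kernel-smoothing it along log10 mean expression.

    A Gaussian kernel stands in here for the LOWESS fit: each gene's
    regularized value is a weighted average of the fitted values of
    genes with similar mean expression.
    """
    x = np.asarray(log_mean_expr)
    y = np.asarray(param)
    # Pairwise weights between genes, based on distance in mean expression
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)

# Noisy per-gene intercept estimates around a smooth trend, then regularized
rng = np.random.default_rng(0)
log_mu = np.linspace(-2.0, 2.0, 200)
beta0_fit = 0.5 * log_mu + rng.normal(0.0, 0.5, size=log_mu.size)
beta0_reg = smooth_parameter(log_mu, beta0_fit)
```

The regularized values track the underlying trend far more closely than the noisy per-gene fits, which is the point of the smoothing.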

For batch correction, the procedure differs in a few respects from both the above and from sctransform:

Normalized and batch-corrected values are Pearson residuals. For each gene, these are defined as follows:

\begin{align*}
z_i &= \frac{y_i - \exp{(\beta_0 + \beta_1 \log_{10}{m_i})}}{\sigma} \\
    &= \frac{y_i - \hat{y}_i}{\sigma} \\
    &= \frac{y_i - \hat{y}_i}{\sqrt{\hat{y}_i(1 + \gamma\hat{y}_i)}}
\end{align*}
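The residual computation is straightforward to express directly from the definitions above; this sketch (names are illustrative) takes per-gene fitted coefficients and the dispersion and returns one residual per cell:

```python
import numpy as np

def pearson_residuals(counts, total_counts, beta0, beta1, gamma):
    """Pearson residuals for one gene under the regularized NB model.

    z_i = (y_i - mu_i) / sqrt(mu_i * (1 + gamma * mu_i)),
    where mu_i = exp(beta0 + beta1 * log10(m_i)).
    """
    mu = np.exp(beta0 + beta1 * np.log10(total_counts))
    return (counts - mu) / np.sqrt(mu * (1.0 + gamma * mu))
```

When $ \gamma = 0$ the denominator reduces to $ \sqrt{\hat{y}_i}$, the familiar Poisson standardization; a positive dispersion inflates the denominator and shrinks the residuals of highly expressed genes.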

Note that Pearson Residuals have several properties that may be unexpected. They are: