The Normalize Single Cell Data algorithm

The algorithm is based on sctransform v2 [Hafemeister and Satija, 2019, Choudhary and Satija, 2022]. Briefly:

For each gene, a negative binomial (NB) generalized linear model (GLM) is fit. The form of the GLM is:

$\displaystyle \ln{\mathbb{E}(y_i)} = \beta_0 + \ln{m_i},$

where $\beta_0$ is the intercept and $y_i$ is the observed expression of the gene in cell $i$, which has total expression $m_i$. The coefficient of $\ln{m_i}$ is fixed at 1, so the term acts as an offset.
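To make the role of the offset concrete, the following minimal Python sketch evaluates the model mean for one gene. The function names and toy counts are hypothetical. The closed-form intercept is the maximum-likelihood solution only under a Poisson simplification ($\gamma = 0$) and is shown for illustration; it is not claimed to be the estimator used by the tool.

```python
import math

def expected_count(beta0, m):
    """Model mean for a cell with total expression m: exp(beta0 + ln m)."""
    return math.exp(beta0 + math.log(m))

def poisson_intercept(y, m):
    """Closed-form intercept under a Poisson assumption (illustrative only).

    With a log link and offset ln(m_i), the Poisson score equation
    sum(y_i - m_i * exp(beta0)) = 0 gives beta0 = ln(sum(y) / sum(m)).
    Requires at least one non-zero count.
    """
    return math.log(sum(y) / sum(m))

# Toy data (hypothetical): counts of one gene in four cells, and cell totals.
y = [2, 0, 5, 1]
m = [1000, 800, 2500, 700]

beta0 = poisson_intercept(y, m)
mu = [expected_count(beta0, mi) for mi in m]  # expected count per cell
```

Because the offset has a fixed coefficient of 1, the expected count is simply $m_i \exp(\beta_0)$, so $\exp(\beta_0)$ can be read as the expected fraction of a cell's total expression contributed by the gene.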

The dispersion parameter $\gamma = 1/\theta$ of the NB distribution is estimated using the Cox-Reid penalized adjusted likelihood [Robinson et al., 2010]. The NB distribution reduces to the Poisson distribution when $\gamma = 0$ ($\theta = \infty$).
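The Poisson limit can be checked numerically. The sketch below writes the NB log-probability in terms of the mean and $\gamma$, using the standard mean/dispersion parameterization; the function names are hypothetical, and the Cox-Reid estimation step itself is not shown.

```python
import math

def nb_log_pmf(y, mu, gamma):
    """Log-probability of count y under an NB with mean mu and dispersion gamma = 1/theta."""
    if gamma == 0.0:
        # Poisson limit (theta = infinity)
        return y * math.log(mu) - mu - math.lgamma(y + 1)
    theta = 1.0 / gamma
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))

def nb_variance(mu, gamma):
    """NB variance mu * (1 + gamma * mu); equals the Poisson variance mu when gamma = 0."""
    return mu * (1.0 + gamma * mu)

poisson_lp = nb_log_pmf(3, 2.0, 0.0)        # exact Poisson log-probability
near_poisson_lp = nb_log_pmf(3, 2.0, 1e-6)  # NB with a very small dispersion
```

As $\gamma$ shrinks towards 0, `near_poisson_lp` approaches `poisson_lp`, and the variance approaches the mean.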

LOWESS regression is used to estimate $\beta_0$ and $\gamma$ as a function of the average expression. This acts as a form of regularization, preventing over-fitting, particularly for genes with low expression levels.
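The smoothing idea can be illustrated with a simplified LOWESS: local linear fits with tricube weights and no robustness iterations. This is a sketch under those assumptions, not the tool's implementation; the function name, the `frac` parameter, and the toy per-gene values are all hypothetical.

```python
def lowess(x, y, frac=0.5):
    """Simplified LOWESS: local linear fits with tricube weights, no robustness iterations."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours used per local fit
    fitted = []
    for x0 in x:
        # The distance to the k-th nearest point sets the local bandwidth.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        # Weighted least-squares line through the neighbourhood.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Toy data (hypothetical): noisy per-gene dispersion estimates versus
# log10 average expression.
log_mean = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
gamma_raw = [0.9, 0.2, 0.6, 0.3, 0.35, 0.2, 0.25, 0.15]
gamma_smooth = lowess(log_mean, gamma_raw, frac=0.6)
```

Each smoothed value borrows information from genes with similar average expression, which is what damps the noisy estimates at the low-expression end.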

The algorithm is adjusted as follows when batch correction is applied:

Normalized values are Pearson residuals, representing the portion of expression that is not explained by the model fit. For each gene, these are defined as follows:

$\displaystyle \begin{aligned}
z_i &= \frac{y_i - \exp{(\beta_0 + \ln{m_i})}}{\sigma} \\
    &= \frac{y_i - \hat{y_i}}{\sigma} \\
    &= \frac{y_i - \hat{y_i}}{\sqrt{\hat{y_i}(1+\gamma\hat{y_i})}}
\end{aligned}$
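The residual formula translates directly into code. The sketch below computes the residuals for one gene, given values of $\beta_0$ and $\gamma$ that are assumed to have been estimated already; the function name, the toy counts, and the parameter values are hypothetical.

```python
import math

def pearson_residuals(y, m, beta0, gamma):
    """Pearson residuals for one gene under the NB model with offset ln(m_i)."""
    residuals = []
    for yi, mi in zip(y, m):
        y_hat = math.exp(beta0 + math.log(mi))            # model mean for this cell
        sigma = math.sqrt(y_hat * (1.0 + gamma * y_hat))  # NB standard deviation
        residuals.append((yi - y_hat) / sigma)
    return residuals

# Toy data (hypothetical): counts of one gene in four cells, and cell totals.
y = [2, 0, 5, 1]
m = [1000, 800, 2500, 700]

# Assumed parameter values, chosen for illustration only.
z = pearson_residuals(y, m, beta0=math.log(0.0016), gamma=0.1)
```

In this toy example the second cell has a count of 0 but a positive expected count, so its residual is negative.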

Note that Pearson residuals have the following properties that may be unexpected. They are: