The Normalize Single Cell Data algorithm
The algorithm is based on sctransform [Hafemeister and Satija, 2019]. Briefly, a negative binomial (NB) generalized linear model (GLM) is fit to a subset of genes sampled uniformly across the range of expression levels. The form of the model for each gene $g$ is:

$$x_{gc} \sim \mathrm{NB}(\mu_{gc}, \theta_g), \qquad \log(\mu_{gc}) = \beta_{0g} + \beta_{1g}\,\log_{10}(m_c)$$

where $x_{gc}$ are the observed counts for gene $g$ in a cell $c$ that has total counts $m_c$. The dispersion parameter $\theta_g$ of the NB distribution is estimated during fitting using the Cox-Reid penalized adjusted likelihood [Robinson et al., 2010]. When $\theta_g \to \infty$ the NB distribution reduces to the Poisson distribution.
LOWESS regression is then used to estimate the intercept $\beta_{0g}$, the log-sequencing-depth coefficient $\beta_{1g}$, and the dispersion $\theta_g$ as smooth functions of each gene's average expression. The regression serves as a form of regularization that avoids over-fitting the model, a particular risk for genes with low expression.
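The regularization step can be sketched as follows, assuming Python with `statsmodels` and hypothetical per-gene estimates (the simulated "true" trend and noise levels are illustrative only). Each gene's noisy parameter estimate is replaced by a LOWESS-smoothed value that depends only on its average expression:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)

# Hypothetical per-gene GLM outputs: log mean expression and a noisy
# estimate of the log dispersion for each of 500 genes.
n_genes = 500
log_mean_expr = rng.uniform(-2, 2, size=n_genes)
log_theta_hat = 0.5 * log_mean_expr + rng.normal(0, 0.5, size=n_genes)

# LOWESS regularization: smooth the estimates as a function of average
# expression, discarding per-gene noise.
smoothed = lowess(log_theta_hat, log_mean_expr, frac=0.3,
                  return_sorted=False)

raw_err = np.std(log_theta_hat - 0.5 * log_mean_expr)
reg_err = np.std(smoothed - 0.5 * log_mean_expr)
print(raw_err, reg_err)
```

The smoothed estimates should sit much closer to the underlying trend than the raw per-gene fits.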
For batch correction there are some differences from both the above and from sctransform:
- A GLM is fit to every gene, because any gene might have a batch effect, though genes with very low expression counts are still ignored.
- Batch effect terms are added to the model; these cannot be regularized, because each gene's batch effect may differ greatly from those of genes with similar expression.
- LOWESS regression is only applied to the dispersion, because otherwise there is a mix of regularized and non-regularized terms that cannot be disentangled. The end result is, roughly speaking, that the price to pay for batch correction is a tendency to over-correct data. However, the more data there is, the less this will be a problem, and batch correction is typically performed on large amounts of data.
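A small sketch of how batch terms enter the model: the GLM design matrix gains one indicator column per extra batch, alongside the intercept and log-depth columns. This is a generic reference-level (one-hot, first level dropped) coding, assumed here for illustration; the metadata values are made up:

```python
import numpy as np

# Hypothetical per-cell metadata: log10 sequencing depth and batch label.
log_depth = np.array([3.1, 3.4, 3.9, 3.2, 3.7])
batch = np.array(["A", "A", "B", "B", "B"])

# One-hot encode batches, dropping the first level so it is absorbed by
# the intercept (standard reference-level coding for a GLM).
levels = np.unique(batch)
batch_dummies = (batch[:, None] == levels[None, 1:]).astype(float)

# Design matrix: intercept, log10 depth, one column per extra batch.
# The batch coefficients are fit per gene and, as noted above, are not
# regularized across genes.
X = np.column_stack([np.ones_like(log_depth), log_depth, batch_dummies])
print(X)
```

With two batches this yields a three-column design matrix; cells in batch "B" get a 1 in the final column.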
Normalized/batch-corrected values are Pearson residuals. For each gene $g$ and cell $c$, these are defined as:

$$z_{gc} = \frac{x_{gc} - \mu_{gc}}{\sigma_{gc}}, \qquad \sigma_{gc} = \sqrt{\mu_{gc} + \frac{\mu_{gc}^2}{\theta_g}}$$
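The residual computation itself is a one-liner; the following sketch (toy counts, means, and dispersion chosen for illustration) shows the NB Pearson residual applied to a few cells of one gene:

```python
import numpy as np

def pearson_residual(x, mu, theta):
    """NB Pearson residual: (x - mu) / sqrt(mu + mu^2 / theta)."""
    return (x - mu) / np.sqrt(mu + mu**2 / theta)

# Toy values: observed counts, model-predicted means, dispersion theta.
x = np.array([0.0, 3.0, 10.0])
mu = np.array([2.0, 3.0, 4.0])
r = pearson_residual(x, mu, theta=5.0)
print(r)
```

The first cell, with fewer counts than predicted, gets a negative residual; the second, matching its prediction exactly, gets zero.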
Note that Pearson Residuals have several properties that may be unexpected. They are:
- decimal numbers (e.g. 123.4) rather than integers (e.g. 123)
- negative when a gene in a cell has lower expression than predicted by the underlying GLM (though usually not very negative)
- zero for all cells in the unlikely event that expression can be perfectly predicted by the GLM from the provided combination of sequencing depth and batch factors
- only defined in the context of a data set; they cannot be compared across data sets. For example, if we have three data sets A, B, and C, then running the tool on (A + B) might assign a particular cell and gene in set A a normalized expression of 100, while running the same tool on (A + C) might assign the same cell and gene a value of 0.