Calculation of estimated biological variation

Genes that have been normalized by Normalize Single Cell Data have an expected variance of $ \sim1$ from random noise. In reality, many genes have a larger variance because they do not perfectly fit the model used in normalization. This is expected because the model only allows expression to vary due to sequencing depth and (optionally) batch effects; it does not account for expression differing across cell types or treatments.

We define the 'estimated biological variation' $ v_{\mathrm{bio}}$ in a normalized sample to be the fraction of the total variance, summed over all genes, that exceeds the variance expected from random noise:

$\displaystyle v_{\mathrm{bio}} = \frac{\sum_g \max(\mathrm{Var}(z_g) - 1, 0)}{\sum_g \mathrm{Var}(z_g)}.$

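Here, $ z_g$ denotes the normalized expression values of gene $ g$, and $ \mathrm{Var}(z_g)$ is their variance. Genes whose variance is below 1 contribute nothing to the numerator but still count towards the total in the denominator. Note that this estimate assumes that all variation remaining after normalization is of 'biological' origin. This is unlikely in practice, and the estimate will often be too high.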
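
The formula can be illustrated with a minimal NumPy sketch. This is not the tool's implementation: the function name `estimated_biological_variation`, the genes-by-cells matrix layout, and the `ddof` and `noise_variance` parameters are assumptions made for the example, and the choice of variance estimator is not specified above.

```python
import numpy as np


def estimated_biological_variation(z, noise_variance=1.0, ddof=1):
    """Fraction of total variance above the expected noise variance.

    z is assumed to be a (genes x cells) matrix of normalized expressions.
    ddof selects the variance estimator (an assumption, not from the manual).
    """
    z = np.asarray(z, dtype=float)
    # Per-gene variance across cells: Var(z_g)
    gene_var = z.var(axis=1, ddof=ddof)
    # Numerator: variance in excess of the noise level, floored at zero per gene
    excess = np.maximum(gene_var - noise_variance, 0.0)
    total = gene_var.sum()
    if total == 0.0:
        # No variance at all, so there is nothing to attribute
        return 0.0
    return float(excess.sum() / total)


# Toy data: three genes with population variances 1, 4 and 0.
z_demo = np.array([
    [-1, 1, -1, 1],
    [-2, 2, -2, 2],
    [0, 0, 0, 0],
])
v_demo = estimated_biological_variation(z_demo, ddof=0)
print(v_demo)  # (0 + 3 + 0) / (1 + 4 + 0) = 0.6
```

In the toy data only the second gene exceeds the noise level, so it alone contributes to the numerator, while all three genes contribute to the denominator.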