QIAGEN Bioinformatics Manuals

The differential accessibility algorithm

The Differential Accessibility for Single Cell tool performs different types of tests for the different data types.

Peaks

As peaks are either present or absent in a cell and their counts are not relevant, only peak presence / absence is used when performing the differential accessibility test.

The observed presence / absence is modeled using logistic regression. Let $Y$ be the presence / absence of the peak and $p = \mathbb{P}(Y = 1)$ , then the form of the model for each peak is:

$\displaystyle \operatorname{logit}(p) = \ln \frac{p}{1-p} = \beta_0 + \beta_1 g_i + \beta_2 \log_{10}{m_i} ,$

where, for cell $i$ , $g_i$ denotes the group it belongs to and $m_i$ its total peak count. The total peak count is used as a proxy for the total sequencing depth of the cell.

Note that the logistic regression is applied in a pairwise fashion, where $g_i$ is either 0 or 1.

The probability that the peak is present in a specific group $p_g = \mathbb{P}(Y_g = 1)$ is then estimated as

$\displaystyle \operatorname{logit}(p_g) = \beta_0 + \beta_1 \boldsymbol{1}_{g = 1} + \beta_2 \overline{M} ,$

where $\boldsymbol{1}$ is the indicator function and $\overline{M}$ is the average $\log_{10}{m_i}$ over all cells.

The following are reported:

Max group mean. The maximum of the two estimated probabilities.
Fold change. The ratio between the two estimated probabilities.
P-value. The p-value for the test of the null hypothesis $\beta_1 = 0$ against the alternative $\beta_1 \neq 0$ .
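The peak test described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the tool's implementation: the function names `fit_logistic` and `peak_statistics` are hypothetical, the fit uses plain Newton-Raphson, the p-value is assumed to come from a Wald test (the manual does not state which test is used), and the fold change is assumed to be group 1 over group 0. The data are synthetic.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson.

    Returns the coefficients and their estimated covariance matrix
    (inverse of the observed information at the final estimate).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, cov


def peak_statistics(present, group, total_counts):
    """Compute the three reported values for one peak and one pair of groups."""
    log_m = np.log10(np.asarray(total_counts, dtype=float))
    X = np.column_stack([np.ones_like(log_m), group, log_m])
    beta, cov = fit_logistic(X, np.asarray(present, dtype=float))

    # Estimated presence probability per group at the average log10 depth.
    m_bar = log_m.mean()
    p0 = 1.0 / (1.0 + math.exp(-(beta[0] + beta[2] * m_bar)))
    p1 = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] + beta[2] * m_bar)))

    # Assumption: two-sided Wald test of beta_1 = 0.
    z = beta[1] / math.sqrt(cov[1, 1])
    p_value = math.erfc(abs(z) / math.sqrt(2.0))

    return {
        "max_group_mean": max(p0, p1),
        "fold_change": p1 / p0,  # assumption: group 1 relative to group 0
        "p_value": p_value,
    }


# Synthetic example: the peak is more often present in group 1.
rng = np.random.default_rng(0)
n_cells = 400
group = np.repeat([0.0, 1.0], n_cells // 2)
total_counts = rng.integers(500, 5000, size=n_cells)
true_logit = -1.0 + 1.5 * group + 0.5 * (np.log10(total_counts) - 3.0)
present = (rng.random(n_cells) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

stats = peak_statistics(present, group, total_counts)
```

Evaluating both group probabilities at the same average depth $\overline{M}$ is what removes the depth effect from the comparison: the two estimates differ only through $\beta_1$.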

Nearby Genes and Transcription Factors

When comparing nearby genes or transcription factors, the count data is first normalized using a negative binomial (NB) generalized linear model.

The form of the model for each feature is:

$\displaystyle \log{\mathbb{E}(y_i)} = \beta_0 + \beta_1 \log_{10}{m_i} ,$

where $y_i$ is the observed count for the feature in cell $i$ . The dispersion parameter $\gamma = 1/\theta$ of the NB distribution is estimated during fitting using the Cox-Reid penalized adjusted likelihood [Robinson et al., 2010]. When $\gamma=0$ ( $\theta = \infty$ ), the NB distribution reduces to the Poisson distribution.
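For reference, under the standard NB parameterization with mean $\mu_i = \mathbb{E}(y_i)$ and dispersion $\gamma$ , the variance is

$\displaystyle \operatorname{Var}(y_i) = \mu_i + \gamma \mu_i^2 = \mu_i (1 + \gamma \mu_i) ,$

so $\gamma = 0$ gives the Poisson variance $\mu_i$ , and larger values of $\gamma$ correspond to stronger overdispersion.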

To obtain the normalized values, the Pearson residuals are calculated as follows:

$\displaystyle \begin{aligned} z_i &= \frac{y_i - \exp{(\beta_0 + \beta_1 \log_{10}{m_i})}}{\sigma} \\ &= \frac{y_i - \hat{y_i}}{\sigma} \\ &= \frac{y_i - \hat{y_i}}{\sqrt{\hat{y_i}(1+\gamma\hat{y_i})}} \end{aligned}$
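As an illustration of the formula above, the following sketch computes Pearson residuals for one feature. It assumes the coefficients and dispersion have already been estimated; the function name `pearson_residuals` and the numeric values are hypothetical and chosen only for the example.

```python
import numpy as np


def pearson_residuals(y, total_counts, beta0, beta1, gamma):
    """Pearson residuals of an NB model with log link and log10 depth covariate."""
    y = np.asarray(y, dtype=float)
    log_m = np.log10(np.asarray(total_counts, dtype=float))
    mu = np.exp(beta0 + beta1 * log_m)         # fitted mean, y-hat
    sigma = np.sqrt(mu * (1.0 + gamma * mu))   # NB standard deviation
    return (y - mu) / sigma


# Hypothetical fitted values, for illustration only.
counts = [0, 3, 10]
depths = [1000, 1000, 10000]
z_nb = pearson_residuals(counts, depths, beta0=0.0, beta1=0.5, gamma=0.2)
z_poisson = pearson_residuals(counts, depths, beta0=0.0, beta1=0.5, gamma=0.0)
```

With `gamma=0.0` the denominator reduces to $\sqrt{\hat{y_i}}$ , the Poisson case; a positive dispersion enlarges the denominator and shrinks the residuals towards zero.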

The Pearson residuals are, however, difficult to interpret, and therefore the following is used for calculating average counts for each group:

$\displaystyle \log{\tilde{y_i}} = \beta_0 + \beta_1 \overline{M} .$

The following are reported for pairwise comparisons:

Max group mean. The maximum of the average $\tilde{y_i}$ of the two groups.
Fold change. The ratio between the average $\tilde{y_i}$ of the two groups.
P-value. The p-value obtained from a Mann-Whitney U test (also known as Wilcoxon rank-sum test) on the Pearson residuals.
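The rank test on the residuals can be sketched as follows. This is a simplified version for illustration, not the tool's implementation: the function name `mann_whitney_u` is hypothetical, the p-value uses the normal approximation without tie or continuity correction, and the residual values are made up.

```python
import math
import numpy as np


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Assumption: no tie correction and no continuity correction are applied.
    Returns the U statistic for x and the approximate p-value.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)

    # U counts the pairs where x exceeds y; ties count as one half.
    diff = x[:, None] - y[None, :]
    u = float(np.sum(diff > 0) + 0.5 * np.sum(diff == 0))

    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


# Made-up Pearson residuals for the cells of two groups.
residuals_a = [1.2, 0.8, 2.5, 1.9, 3.1, 2.2, 1.5, 2.8]
residuals_b = [-0.5, 0.1, -1.2, 0.3, -0.8, 0.0, -0.3, 0.6]
u_stat, p_value = mann_whitney_u(residuals_a, residuals_b)
```

The pairwise comparison matrix needs memory proportional to the product of the two group sizes, so for large numbers of cells a rank-based routine such as `scipy.stats.mannwhitneyu` is the more practical choice.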

Note that when identifying markers, the reported 'Max group mean', 'Fold change' and 'P-value', regardless of the data type used for the test, are aggregated across all pairwise comparisons, as detailed in Differential Expression for Single Cell.

For more details on the outputs, see Interpreting the output of Differential Expression for Single Cell.
