The differential accessibility algorithm

The Differential Accessibility for Single Cell tool applies a different statistical test depending on the data type.

Peaks

Because a peak is either present or absent in a cell, and its count is not relevant, only peak presence / absence is used when performing the differential accessibility test.

The observed presence / absence is modeled using logistic regression. Let $ Y$ be the presence / absence of the peak and $ p = \mathbb{P}(Y = 1)$, then the form of the model for each peak is:

$\displaystyle \operatorname{logit}(p) = \ln \frac{p}{1-p} = \beta_0 + \beta_1 g_i + \beta_2 \log_{10}{m_i}   ,$

where for cell $ i$, $ g_i$ denotes the group it belongs to, and $ m_i$ its total peak count. The total peak count is used as a proxy for the total sequencing depth of the cell.

Note that the logistic regression is applied in a pairwise fashion, that is, to two groups at a time, so $ g_i$ is either $ 0$ or $ 1$.
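The model above can be illustrated with a minimal sketch. The source does not state how the tool fits the regression; the sketch below assumes plain Newton-Raphson on the log-likelihood, and the names `fit_peak_logistic` and `solve_linear`, as well as the simulated data, are hypothetical and chosen only for illustration.

```python
import math
import random


def solve_linear(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_peak_logistic(present, group, total_counts, iterations=25):
    """Fit logit(p) = b0 + b1 * g + b2 * log10(m) for one peak.

    present: 0/1 presence of the peak per cell
    group: 0/1 group label per cell (one pairwise comparison)
    total_counts: total peak count per cell (proxy for sequencing depth)
    """
    rows = [[1.0, float(g), math.log10(m)] for g, m in zip(group, total_counts)]
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3                       # score vector X'(y - p)
        info = [[0.0] * 3 for _ in range(3)]   # information matrix X'WX
        for x, y in zip(rows, present):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for j in range(3):
                grad[j] += (y - p) * x[j]
                for k in range(3):
                    info[j][k] += w * x[j] * x[k]
        step = solve_linear(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Simulated cells (assumed values, for illustration only).
random.seed(0)
true_beta = (-6.0, 1.0, 1.5)
group, total_counts, present = [], [], []
for k in range(2000):
    g = k % 2
    m = int(10 ** random.uniform(3.0, 5.0))
    eta = true_beta[0] + true_beta[1] * g + true_beta[2] * math.log10(m)
    p = 1.0 / (1.0 + math.exp(-eta))
    group.append(g)
    total_counts.append(m)
    present.append(1 if random.random() < p else 0)

beta = fit_peak_logistic(present, group, total_counts)
```

In this sketch `beta[1]` plays the role of $ \beta_1$, the group effect at fixed sequencing depth, and `beta[2]` the role of $ \beta_2$.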

The probability that the peak is present in a specific group $ p_g = \mathbb{P}(Y_g = 1)$ is then estimated as

$\displaystyle \operatorname{logit}(p_g) = \beta_0 + \beta_1 \boldsymbol{1}_{g = 1} + \beta_2 \overline{M}   ,$

where $ \boldsymbol{1}$ is the indicator function and $ \overline{M}$ is the average $ \log_{10}{m_i}$ over all cells.
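As a small sketch of this estimate, the function below evaluates the fitted model at the average $ \log_{10}{m_i}$ and inverts the logit. The function name `group_presence_probability` and the coefficient values in the usage lines are hypothetical, not taken from the tool.

```python
import math


def group_presence_probability(beta0, beta1, beta2, in_group_1, total_counts):
    """Estimate p_g from fitted coefficients.

    The depth covariate is fixed at the mean of log10(m_i) over all cells,
    so both groups are compared at the same sequencing depth.
    """
    mean_log_depth = sum(math.log10(m) for m in total_counts) / len(total_counts)
    indicator = 1.0 if in_group_1 else 0.0
    eta = beta0 + beta1 * indicator + beta2 * mean_log_depth
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit


# Assumed coefficients and depths, for illustration only.
counts = [100, 10000]  # mean log10 depth is 3
p_group_0 = group_presence_probability(-3.0, 1.0, 1.0, False, counts)
p_group_1 = group_presence_probability(-3.0, 1.0, 1.0, True, counts)
```

Because the two probabilities differ only through the $ \beta_1$ term, their odds ratio equals $ e^{\beta_1}$.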

The following are reported:

Nearby Genes and Transcription Factors

When comparing nearby genes or transcription factors, the count data is first normalized using a negative binomial (NB) generalized linear model.

The form of the model for each feature is:

$\displaystyle \log{\mathbb{E}(y_i)} = \beta_0 + \beta_1 \log_{10}{m_i}   ,$

where $ y_i$ is the observed count of the feature in cell $ i$. The dispersion parameter $ \gamma = 1/\theta$ of the NB distribution is estimated during fitting using the Cox-Reid penalized adjusted likelihood [Robinson et al., 2010]. When $ \gamma=0$ ( $ \theta = \infty$), the NB distribution reduces to the Poisson distribution.
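To make the role of the dispersion explicit, write $ \mu_i = \mathbb{E}(y_i)$ (a symbol introduced here only for illustration). Under the standard NB parameterization with dispersion $ \gamma$, the variance is

$\displaystyle \mathrm{Var}(y_i) = \mu_i + \gamma \mu_i^2 = \mu_i (1 + \gamma \mu_i)   ,$

so that $ \gamma = 0$ gives $ \mathrm{Var}(y_i) = \mu_i$, the Poisson variance, and any $ \gamma > 0$ gives a variance larger than the mean.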

To obtain the normalized values, the Pearson residuals are calculated as follows:

$\displaystyle \begin{aligned} z_i &= \frac{y_i - \exp{(\beta_0 + \beta_1 \log_{10}{m_i})}}{\sigma} \\ &= \frac{y_i - \hat{y_i}}{\sigma} \\ &= \frac{y_i - \hat{y_i}}{\sqrt{\hat{y_i}(1+\gamma\hat{y_i})}} \end{aligned}$
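The residual calculation above can be sketched in a few lines. This assumes the coefficients and the dispersion have already been estimated; the function name `pearson_residual` and the numeric values in the usage lines are hypothetical.

```python
import math


def pearson_residual(y, total_count, beta0, beta1, gamma):
    """Pearson residual of an observed count under the fitted NB model.

    The fitted mean is exp(beta0 + beta1 * log10(m_i)) and the standard
    deviation is sqrt(mean * (1 + gamma * mean)).
    """
    fitted = math.exp(beta0 + beta1 * math.log10(total_count))
    sigma = math.sqrt(fitted * (1.0 + gamma * fitted))
    return (y - fitted) / sigma


# Assumed values chosen so that the fitted mean is 4 at a depth of 1000.
b1 = 0.5
b0 = math.log(4.0) - 3.0 * b1
z_poisson = pearson_residual(6, 1000, b0, b1, gamma=0.0)   # sigma = 2
z_nb = pearson_residual(6, 1000, b0, b1, gamma=0.75)       # sigma = 4
```

For the same observed count, a larger dispersion $ \gamma$ gives a larger $ \sigma$ and therefore a residual closer to zero.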

Pearson residuals are, however, difficult to interpret. The average counts for each group are therefore calculated using:

$\displaystyle \log{\tilde{y_i}} = \beta_0 + \beta_1 \overline{M}   .$

The following are reported for pairwise comparisons:

Note that when identifying markers, the reported 'Max group mean', 'Fold change' and 'P-value', regardless of the data type used for the test, are aggregated across all pairwise comparisons, as detailed in Differential Expression for Single Cell.

For more details on the outputs, see Interpreting the output of Differential Expression for Single Cell.