Updating the choice of favored Multinomial model for each site

Given a set of error rates, we can find the maximum likelihood estimates of the underlying frequencies for each possible hypothesis for a given site. This also gives us the maximum likelihood value that can be obtained for the site under that hypothesis. Since the hypotheses with few free parameters are special cases of hypotheses with more free parameters, a hypothesis with more free parameters will always have a likelihood at least as high as any simpler hypothesis it contains.

We will only favor a hypothesis with many free parameters if it offers a significantly higher likelihood than a hypothesis with fewer parameters. Let us consider a simple case where we have a hypothesis, $ H_0$, with no free parameters and an alternative hypothesis, $ H_1$, which has one free parameter and contains $ H_0$ as a special case (the hypotheses are nested). We can calculate the likelihood ratio:

$\displaystyle \frac{L(H_1)}{L(H_0)} $

If this ratio is high, we tend to prefer hypothesis $ H_1$ and if it is low (i.e. close to 1), we prefer $ H_0$. When $ H_0$ is true, twice the log of this ratio is approximately $ \chi^2$ distributed, with degrees of freedom given by the difference between the numbers of free parameters in the two hypotheses, $ n$. In our example $ n=1$ so:

$\displaystyle 2 \log \frac{L(H_1)}{L(H_0)} \sim \chi^2(1) $

If we write $ c_n(p)$ for the inverse cumulative distribution function (the quantile function) of a $ \chi^2(n)$ distribution evaluated at $ 1-p$, we get a cutoff value: at the significance level given by $ p$, we prefer $ H_1$ over $ H_0$ when twice the log likelihood ratio exceeds $ c_n(p)$.
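As a minimal sketch of how such a cutoff could be computed and applied, the Python below evaluates the $ \chi^2$ quantile with the standard library only, using the series expansion of the regularized incomplete gamma function and bisection. The function names `chi2_cdf`, `chi2_cutoff` and `prefer_h1` are illustrative assumptions and are not taken from the variant caller. Where SciPy is available, `scipy.stats.chi2.ppf(1 - p, n)` returns the same quantile.

```python
import math


def chi2_cdf(x: float, df: int) -> float:
    """CDF of a chi-squared(df) distribution at x.

    Uses the power series of the regularized lower incomplete gamma
    function P(df/2, x/2).
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    # Sum z^k / (a (a+1) ... (a+k)) until the terms become negligible.
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_cutoff(df: int, p: float) -> float:
    """c_df(p): the chi-squared(df) quantile at probability 1 - p."""
    if df < 1:
        raise ValueError("df must be at least 1")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    target = 1.0 - p
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < target:  # grow the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):  # bisection
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def prefer_h1(loglik_h0: float, loglik_h1: float, p: float, n: int = 1) -> bool:
    """Likelihood ratio test for nested hypotheses differing by n free parameters."""
    statistic = 2.0 * (loglik_h1 - loglik_h0)
    return statistic > chi2_cutoff(n, p)


# Hypothetical log likelihoods, chosen only for illustration.
# The cutoff for n = 1 and p = 0.05 is about 3.84.
print(chi2_cutoff(1, 0.05))
print(prefer_h1(-105.0, -102.0, p=0.05))  # statistic 6.0 exceeds the cutoff
print(prefer_h1(-105.0, -104.0, p=0.05))  # statistic 2.0 does not
```

The bisection is deliberately simple and is only meant for moderate values of $ p$; extremely small values would need a more careful numerical treatment.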

We generalize this to apply to any two Multinomial model hypotheses $ H_x$ and $ H_y$. For these two, calculate the values (where $ df$ is the degrees of freedom in a hypothesis):

$\displaystyle v_x = 2 \log L(H_x) - c_{df_x}(p) $

$\displaystyle v_y = 2 \log L(H_y) - c_{df_y}(p) $

(use $ c_0(p) = 0$ for zero degrees of freedom). We now prefer the hypothesis with the highest value of $ v$. When comparing a hypothesis with zero free parameters to another hypothesis, we get exactly the same results as with the log likelihood ratio approach.
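To see why the two approaches agree in that case, take $ H_0$ with zero free parameters and $ H_1$ with $ n$ free parameters. Since $ c_0(p) = 0$, we have $ v_0 = 2 \log L(H_0)$ and $ v_1 = 2 \log L(H_1) - c_n(p)$, so:

$\displaystyle v_1 > v_0 \iff 2 \log L(H_1) - c_n(p) > 2 \log L(H_0) \iff 2 \log \frac{L(H_1)}{L(H_0)} > c_n(p) $

The right-hand condition is the likelihood ratio test at significance level $ p$.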

We use this approach when comparing the many hypotheses that are present in the low frequency variant caller. For each one we calculate a $ v$ value as twice the log likelihood and subtract a cutoff value $ c$ which is based on the $ p$ value and the degrees of freedom for that hypothesis. We then choose the hypothesis with the highest $ v$ as the one that best describes the site in question.
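The selection step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the variant caller's implementation: the names `cutoff` and `best_hypothesis`, the hypothesis labels and the log likelihood values are all made up, and the sketch only covers hypotheses with 0, 1 or 2 degrees of freedom, for which the $ \chi^2$ quantile has a closed form.

```python
import math
from statistics import NormalDist


def cutoff(df: int, p: float) -> float:
    """c_df(p) for df in {0, 1, 2}, using closed-form chi-squared quantiles."""
    if df == 0:
        return 0.0  # convention for zero degrees of freedom
    if df == 1:
        # chi-squared(1) is the square of a standard normal variable
        return NormalDist().inv_cdf(1.0 - p / 2.0) ** 2
    if df == 2:
        # chi-squared(2) is an exponential distribution with mean 2
        return -2.0 * math.log(p)
    raise ValueError("this sketch only covers 0, 1 or 2 degrees of freedom")


def best_hypothesis(hypotheses, p):
    """Return the name with the highest v, together with all v values.

    hypotheses: iterable of (name, maximum log likelihood, degrees of freedom).
    """
    scores = {name: 2.0 * loglik - cutoff(df, p) for name, loglik, df in hypotheses}
    return max(scores, key=scores.get), scores


# Hypothetical site: the log likelihoods are invented for illustration.
site = [
    ("A only", -120.0, 0),
    ("A+G", -117.0, 1),
    ("A+G+T", -116.5, 2),
]

print(best_hypothesis(site, p=0.05)[0])   # A+G
print(best_hypothesis(site, p=0.001)[0])  # A only
```

With the looser $ p$ value the extra parameter of the two-allele hypothesis pays for itself; with the more stringent one the cutoffs grow and the hypothesis with no free parameters wins.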

For stringent $ p$ values (i.e. values close to zero) the cutoff values are larger, so we tend to prefer hypotheses with few free parameters, which means that more sites tend to be called as homozygous.

The approach used here is similar to the Akaike Information Criterion except that we have introduced a way to use a $ p$ value with the comparisons.
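For comparison, in its standard form the Akaike Information Criterion for a hypothesis with $ k$ free parameters and maximum likelihood $ L$ is:

$\displaystyle AIC = 2k - 2 \log L $

The hypothesis with the lowest value is preferred, which is the same as preferring the highest value of $ 2 \log L - 2k$. This has the same shape as $ v$, with the fixed penalty $ 2k$ in place of the $ p$-dependent cutoff $ c_{df}(p)$.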