Updating the choice of favored Multinomial model for each site

Given a set of error rates, and, for each site, a set of Maximum Likelihood estimates of the nucleotide allele frequencies for each Multinomial model, we can calculate the Maximum loglikelihood values for each of the models at each site. Since the hypotheses with fewer free parameters are special cases of hypotheses with more free parameters, the hypotheses with the most free parameters will necessarily have the highest likelihoods. We wish to only favor a model with more parameters over one with fewer, if it offers a significantly better explanation of the data at the site. In the simple case where we have nested models (e.g. a hypothesis, $ H_0$, with no free parameters and an alternative hypothesis, $ H_1$, which has one free parameter and contains $ H_0$ as a special case) it is a well known result that twice the loglikehood ratio is approximately $ \chi^2$ distributed with a parameter that is equal to the difference between the number of free parameters in the hypotheses, $ n$:

$\displaystyle 2 \log \frac{L(H_1)}{L(H_0)} \sim \chi^2(n). $

If we write $ c_n(p)$ for the inverse cumulative probability density function for a $ \chi^2(n)$ distribution evaluated at $ 1-p$, we get a cutoff value for when we prefer $ H_1$ over $ H_0$ at the significance level given by $ p$.

We wish to compare all models together and some of these will not be nested. We therefore generalize this approach to apply to any set of Multinomial model hypothesis $ H_m$, $ m = 1 ...,M$. For each model we calculate the value:


$\displaystyle v_m$ $\displaystyle =$ $\displaystyle 2 \log L(H_m) - c_{df_m}(p)$ (21.19)

where $ df_m$ is the number of free parameters in hypothesis $ H_m$. We now prefer the hypothesis with the highest value of $ v$ (note that when comparing a hypothesis with zero free parameters to another hypothesis, we get exactly the same results as with the log likelihood ratio approach). If this hypothesis is that only the reference allele is present at the site, no variants are called. For other models, the variants corresponding to the alleles hypothesized by the models are called.

The approach applied is an extension of the Akaike approach: In the Akaike approach you subtract twice the number of parameters, whereas in the approach that we apply, we subtract $ c_{df_x}(p)$, whereby additional parameters are punished relative to the significance level, $ p$, rather than in a fashion that is linear in the number of parameters. This means that the more parameters that are present, the more reluctant we will be to add even more.