Calculation of the prior and error probabilities

The prior probabilities are estimated from the mapped reads alone, through four rounds of Expectation Maximization, and are calculated for each potential combination of alleles (site type). The prior probabilities thus reflect the likelihood of observing each combination of alleles in the genome studied. The reference sequence is not taken into account during this first part of the analysis. More about Maximum Likelihood estimation (MLE) can be found at http://en.wikipedia.org/wiki/Maximum_likelihood.

For a diploid organism, the initial values of the priors, which are then updated during estimation, are shown in Table 26.1. The probabilities of all site types always sum to 1.


Table 26.1: Site types for a diploid organism with example probabilities.
Site type    Prior probability
A/A          0.2475
A/C          0.001
A/G          0.001
A/T          0.001
T/C          0.001
T/G          0.001
T/T          0.2475
G/C          0.001
C/C          0.2475
G/G          0.2475
G/-          0.001
A/-          0.001
C/-          0.001
T/-          0.001
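
The example values in Table 26.1 can be reproduced and checked with a few lines of Python. This is only a minimal sketch: the names `ALLELES`, `HOM_PRIOR`, `OTHER_PRIOR` and `initial_priors` are hypothetical, and the site type labels are generated in alphabetical order (for example C/T rather than T/C), which differs from the ordering used in the table.

```python
from itertools import combinations

ALLELES = ["A", "C", "G", "T"]
HOM_PRIOR = 0.2475    # example value for homozygous site types (Table 26.1)
OTHER_PRIOR = 0.001   # example value for every other site type (Table 26.1)


def initial_priors():
    """Return the 14 example site types of Table 26.1 with their priors."""
    priors = {}
    # Homozygous site types: A/A, C/C, G/G, T/T
    for a in ALLELES:
        priors[f"{a}/{a}"] = HOM_PRIOR
    # Heterozygous site types: A/C, A/G, A/T, C/G, C/T, G/T
    for a, b in combinations(ALLELES, 2):
        priors[f"{a}/{b}"] = OTHER_PRIOR
    # One nucleotide paired with a gap: A/-, C/-, G/-, T/-
    for a in ALLELES:
        priors[f"{a}/-"] = OTHER_PRIOR
    return priors


priors = initial_priors()
total = sum(priors.values())
# The priors must form a probability distribution over the site types.
assert abs(total - 1.0) < 1e-9
print(len(priors), "site types")
```

The assertion mirrors the statement that the probabilities of all site types sum to 1: four homozygous types at 0.2475 plus ten other types at 0.001.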


Error probabilities are calculated alongside the priors for each combination of observed allele and assumed reference allele, before the reference sequence is incorporated into the analysis. Table 26.2 shows an example of an error probability matrix.


Table 26.2: Error probability matrix - observed allele versus assumed reference allele.
     A      C      G      T      -
A    0.90   0.025  0.025  0.025  0.025
C    0.025  0.90   0.025  0.025  0.025
G    0.025  0.025  0.90   0.025  0.025
T    0.025  0.025  0.025  0.90   0.025
-    0.025  0.025  0.025  0.025  0.90
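
To give a feel for how priors of this kind could be re-estimated from mapped reads, the Python sketch below runs four Expectation Maximization rounds over a handful of invented sites. It is not the tool's actual implementation. It assumes that each read samples one of the two alleles of a site type with equal probability, and it holds the error matrix fixed at the example values of Table 26.2 instead of estimating it alongside the priors. All function names and the read data are hypothetical.

```python
from itertools import combinations

SYMBOLS = ["A", "C", "G", "T", "-"]


def example_error_matrix(p_correct=0.90):
    """error[assumed][observed], filled with the example values of Table 26.2."""
    p_wrong = (1.0 - p_correct) / (len(SYMBOLS) - 1)
    return {a: {o: (p_correct if o == a else p_wrong) for o in SYMBOLS}
            for a in SYMBOLS}


def example_priors():
    """Initial priors keyed by allele pair: 4 homozygous + 10 mixed site types."""
    priors = {(a, a): 0.2475 for a in "ACGT"}
    priors.update({pair: 0.001 for pair in combinations(SYMBOLS, 2)})
    return priors


def site_likelihood(reads, site_type, error):
    """P(observed reads | site type); assumes each allele is sampled with p = 0.5."""
    a, b = site_type
    likelihood = 1.0
    for observed in reads:
        likelihood *= 0.5 * error[a][observed] + 0.5 * error[b][observed]
    return likelihood


def em_priors(sites, priors, error, rounds=4):
    """Re-estimate site type priors with a fixed number of EM rounds."""
    priors = dict(priors)
    for _ in range(rounds):
        totals = {site_type: 0.0 for site_type in priors}
        for reads in sites:
            # E-step: posterior probability of every site type at this site
            weights = {st: p * site_likelihood(reads, st, error)
                       for st, p in priors.items()}
            norm = sum(weights.values())
            for st, w in weights.items():
                totals[st] += w / norm
        # M-step: the new prior is the average posterior over all sites
        priors = {st: t / len(sites) for st, t in totals.items()}
    return priors


error = example_error_matrix()
start = example_priors()
# Invented data: each string holds the alleles observed in reads covering one site.
sites = ["AAAAAAAA"] * 6 + ["CCCCCCCC"] * 3 + ["ACACCAAC"]
updated = em_priors(sites, start, error)
for site_type, p in sorted(updated.items(), key=lambda kv: -kv[1])[:3]:
    print("/".join(site_type), round(p, 4))
```

With this toy data the priors move toward the site types that the reads actually support, while still summing to 1 after every round.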


If quality values are available, an error matrix is calculated for each quality value.
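
How the per-quality matrices are computed is not described here. As one possible illustration, the sketch below tallies (assumed allele, observed allele) pairs separately for each quality value and normalizes each row into probabilities. The function name, the input format and the pseudocount (used to avoid zero probabilities) are assumptions made for this example only.

```python
from collections import defaultdict

SYMBOLS = ["A", "C", "G", "T", "-"]


def error_matrices_by_quality(observations, pseudocount=1.0):
    """Build one error matrix per quality value.

    observations: iterable of (assumed_allele, observed_allele, quality).
    Returns {quality: {assumed: {observed: probability}}}.
    """
    # Every new quality value starts from a table filled with the pseudocount.
    counts = defaultdict(
        lambda: {a: {o: pseudocount for o in SYMBOLS} for a in SYMBOLS}
    )
    for assumed, observed, quality in observations:
        counts[quality][assumed][observed] += 1.0

    matrices = {}
    for quality, table in counts.items():
        matrix = {}
        for assumed, row in table.items():
            row_total = sum(row.values())
            # Normalize so that each row is a probability distribution.
            matrix[assumed] = {o: c / row_total for o, c in row.items()}
        matrices[quality] = matrix
    return matrices


# Invented observations: few mismatches at quality 30, more at quality 10.
observations = (
    [("A", "A", 30)] * 98 + [("A", "C", 30)] * 2
    + [("A", "A", 10)] * 8 + [("A", "G", 10)] * 2
)
matrices = error_matrices_by_quality(observations)
print(sorted(matrices))
```

In this toy input the probability of observing the assumed allele is higher in the quality 30 matrix than in the quality 10 matrix, which is the behavior one would expect from separate matrices per quality value.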