Calculation of the prior and error probabilities

The prior probabilities are estimated from the mapped reads alone, through four rounds of Expectation Maximization, and are calculated for each possible combination of alleles (site type). The priors thus reflect how likely each combination of alleles is to be observed in the genome studied. The reference sequence is not taken into account during this first part of the analysis. More about Maximum Likelihood Estimation (MLE) can be found at http://en.wikipedia.org/wiki/Maximum_likelihood.
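As a rough illustration of how priors can be refined by Expectation Maximization, the sketch below runs a generic EM update for mixture weights over four rounds. It is not the tool's implementation: the function name em_priors, the input site_likelihoods (one dictionary per site giving the probability of the reads at that site under each site type) and the toy numbers are all assumptions made for this example.

```python
def em_priors(site_likelihoods, priors, rounds=4):
    """Generic EM for site-type priors (illustrative sketch only).

    site_likelihoods: list of dicts, one per site, mapping
        site type -> P(reads at this site | site type).  Hypothetical input.
    priors: dict mapping site type -> initial prior probability.
    """
    priors = dict(priors)
    for _ in range(rounds):
        totals = {site_type: 0.0 for site_type in priors}
        for likelihood in site_likelihoods:
            # E-step: posterior probability of each site type at this site.
            weighted = {t: priors[t] * likelihood[t] for t in priors}
            norm = sum(weighted.values())
            if norm == 0.0:
                continue  # site carries no information; skip it
            for t in priors:
                totals[t] += weighted[t] / norm
        used = sum(totals.values())
        if used == 0.0:
            break  # nothing to update from
        # M-step: new prior is the average posterior over all sites.
        priors = {t: totals[t] / used for t in priors}
    return priors


# Toy data (made up): three sites favour A/A, one favours A/C.
toy_sites = [{"A/A": 0.9, "A/C": 0.1}] * 3 + [{"A/A": 0.2, "A/C": 0.8}]
updated = em_priors(toy_sites, {"A/A": 0.5, "A/C": 0.5})
print(updated)
```

Because each M-step renormalizes the accumulated posteriors, the updated priors still sum to 1 after every round.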

For a diploid organism, the initial values of the priors, which are subsequently updated, are shown in Table 35.1. The probabilities of all site types always sum to 1.


Table 35.1: Site Types for a diploid organism with example probabilities.
Site Type Prior probability
A/A 0.2475
A/C 0.001
A/G 0.001
A/T 0.001
T/C 0.001
T/G 0.001
T/T 0.2475
G/C 0.001
C/C 0.2475
G/G 0.2475
G/- 0.001
A/- 0.001
C/- 0.001
T/- 0.001
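The example values in Table 35.1 can be reproduced with a few lines of arithmetic. The sketch below assumes, based only on the numbers in the table, that each of the ten mixed site types receives 0.001 and that the remaining probability is split evenly among the four homozygous types; the variable names are illustrative and the rule itself is not stated by the source.

```python
from itertools import combinations
import math

NUCLEOTIDES = "ACGT"
RARE = 0.001  # example prior for each heterozygous or gap-containing site type

# Allele order inside a label is not significant (C/T is the same type as T/C).
heterozygous = [f"{a}/{b}" for a, b in combinations(NUCLEOTIDES, 2)]  # 6 types
with_gap = [f"{n}/-" for n in NUCLEOTIDES]                            # 4 types
homozygous = [f"{n}/{n}" for n in NUCLEOTIDES]                        # 4 types

# Assumed rule: the probability left over is shared equally by homozygous types.
hom_prior = (1.0 - RARE * (len(heterozygous) + len(with_gap))) / len(homozygous)

initial_priors = {t: hom_prior for t in homozygous}
initial_priors.update({t: RARE for t in heterozygous + with_gap})

print(hom_prior)
print(math.isclose(sum(initial_priors.values()), 1.0))
```

Under that assumption the homozygous value works out to (1 - 10 x 0.001) / 4 = 0.2475, matching the table.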


If the expected ploidy level is set to 1, values analogous to those in Table 35.1 are calculated, but only for the single-allele site types A, C, G, T and -.

If the expected ploidy is set to 3, analogous values are calculated for three-allele site types such as A|A|A, A|C|G, G|G|- and so on.
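One way to see how the set of site types grows with ploidy is to enumerate unordered allele combinations. The sketch below is only an illustration: the function name site_types is made up, and the enumeration includes all-gap types (such as -|-), which Table 35.1 does not list, so the tool's actual set of site types may differ.

```python
from itertools import combinations_with_replacement

ALLELES = "ACGT-"  # four nucleotides plus the gap symbol


def site_types(ploidy):
    """Return every unordered combination of `ploidy` alleles as a label."""
    return ["|".join(combo)
            for combo in combinations_with_replacement(ALLELES, ploidy)]


for ploidy in (1, 2, 3):
    types = site_types(ploidy)
    print(ploidy, len(types), types[:5])
```

With five symbols this gives 5 candidate site types for ploidy 1, 15 for ploidy 2 and 35 for ploidy 3.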

Error probabilities are calculated alongside the priors for each observed allele and assumed reference allele, before the reference sequence is incorporated into the analysis. Table 35.2 illustrates an example of the values calculated in an error probability matrix.


Table 35.2: Error probability matrix - observed sequenced nucleotide in read versus actual nucleotide at this position.
     A      C      G      T      -
A    0.90   0.025  0.025  0.025  0.025
C    0.025  0.90   0.025  0.025  0.025
G    0.025  0.025  0.90   0.025  0.025
T    0.025  0.025  0.025  0.90   0.025
-    0.025  0.025  0.025  0.025  0.90
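To make the role of the error matrix concrete, the sketch below builds the example matrix from Table 35.2 and uses it to score observed read bases against a candidate site type. The scoring model is an assumption, not something the source describes: it supposes that each read samples one of the site type's alleles with equal probability and that reads are independent. The names error, observation_probability and site_likelihood are hypothetical.

```python
import math

SYMBOLS = "ACGT-"
CORRECT, WRONG = 0.90, 0.025  # example values from Table 35.2

# error[actual][observed]; the example matrix is symmetric, so the
# orientation does not change the numbers here.
error = {
    actual: {obs: (CORRECT if actual == obs else WRONG) for obs in SYMBOLS}
    for actual in SYMBOLS
}


def observation_probability(observed, alleles):
    """P(observed base | site type), averaging over the site type's alleles."""
    return sum(error[a][observed] for a in alleles) / len(alleles)


def site_likelihood(observed_bases, alleles):
    """P(all observed bases | site type), assuming independent reads."""
    return math.prod(observation_probability(o, alleles) for o in observed_bases)


# Each row of the example matrix is a probability distribution.
print(all(math.isclose(sum(row.values()), 1.0) for row in error.values()))
print(site_likelihood("AAC", "AC"), site_likelihood("AAC", "AA"))
```

Likelihoods of this kind are what an Expectation Maximization step would weight by the priors, although the exact formulation used by the tool is not given in this section.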


If quality values are available, an error matrix is calculated for each quality value.