Deriving the posterior probabilities of the site types
We will call a variant at a site if the sum of the posterior probabilities of the non-homozygous reference site types is larger than the user-specified cut-off value. For this we need to be able to calculate the posterior site type probabilities. We here derive the formula for these.
Using the Bayesian approach we can write the posterior probability of a site type, $t$, as follows:

$$P(t \mid data) = \frac{P(data \mid t)\, P(t)}{\sum_{s \in S} P(data \mid s)\, P(s)} \qquad (31.1)$$

where $P(t)$ is the prior probability of site type $t$ (that is, one of the $f_s, s \in S$) and $P(data \mid t)$ is the likelihood of the data given site type $t$. If the site is covered by $k$ reads and read $i$ has the nucleotide $n_i$ at the site, the likelihood is

$$P(data \mid t) = P(n_1, \ldots, n_k \mid t).$$
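The normalisation in Bayes' rule can be sketched in a few lines of Python. This is only an illustration: the function name `posterior_probabilities`, the string labels for site types, and the numbers in the usage lines are assumptions made for the example, not values from the manual.

```python
def posterior_probabilities(priors, likelihoods):
    """Apply Bayes' rule over a finite set of site types.

    priors:      dict mapping site type -> prior probability P(t)
    likelihoods: dict mapping site type -> likelihood P(data | t)
    Returns a dict mapping site type -> posterior probability P(t | data).
    """
    # Numerator of Bayes' rule for every site type.
    joint = {t: priors[t] * likelihoods[t] for t in priors}
    # Denominator: the same quantity summed over all site types.
    total = sum(joint.values())
    if total == 0:
        raise ValueError("the data has zero probability under every site type")
    return {t: joint[t] / total for t in joint}


# Hypothetical priors and likelihoods for two diploid site types.
example_priors = {"AA": 0.9, "AC": 0.1}
example_likelihoods = {"AA": 0.001, "AC": 0.25}
example_posteriors = posterior_probabilities(example_priors, example_likelihoods)
```

With these made-up inputs the heterozygous type ends up with most of the posterior mass even though its prior is small, because its likelihood is much larger.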
To derive the likelihood of the data, $P(n_1, \ldots, n_k \mid t)$, we first need some notation: For a given site type, $t$, let $P_t(N)$ be the probability that an allele from this site type has the nucleotide $N$. The $P_t(N)$ probabilities are known and are determined by the ploidy: For a diploid organism, $P_t(N) = 1$ if $t$ is a homozygous site and $N$ is the allele in $t$, whereas it is 0.5 if $t$ is a heterozygous site and $N$ is one of the alleles in $t$, and it is 0 if $N$ is not one of the alleles in $t$. For a triploid organism, $P_t(N)$ will be either 0, 1/3, 2/3 or 1.
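The allele probabilities described above can be computed by counting how often a nucleotide occurs among the alleles of a site type. The sketch below assumes, purely for illustration, that a site type is written as a string with one character per allele (for example `"AC"` for a diploid heterozygote) and that the alphabet is `"ACGT"`; neither convention is taken from the manual.

```python
from collections import Counter
from fractions import Fraction


def allele_probabilities(site_type, alphabet="ACGT"):
    """Return {N: probability that an allele of site_type has nucleotide N}.

    site_type is a string with one character per allele, so its length
    is the ploidy (an assumption made for this example).
    """
    counts = Counter(site_type)   # missing nucleotides count as 0
    ploidy = len(site_type)
    return {n: Fraction(counts[n], ploidy) for n in alphabet}


diploid_homozygous = allele_probabilities("AA")    # A has probability 1
diploid_heterozygous = allele_probabilities("AC")  # A and C have probability 1/2
triploid = allele_probabilities("AAC")             # A has 2/3, C has 1/3
```

Using `Fraction` keeps the triploid values exact, which makes it easy to check that they are limited to 0, 1/3, 2/3 and 1.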
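With this definition, we can write the likelihood of the data $n_1, \ldots, n_k$ in a site of type $t$ as:

$$P(n_1, \ldots, n_k \mid t) = \prod_{i=1}^{k} \sum_{N} P_t(N)\, e_q(N \to n_i) \qquad (31.2)$$

where the sum runs over the possible allele nucleotides $N$ and $e_q(N \to n_i)$ is the error rate with which an allele nucleotide $N$ is read as $n_i$.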
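As a rough illustration of this likelihood, the sketch below multiplies, over the reads, the probability of observing each read nucleotide when the true allele is mixed according to the site type. The error model here is a deliberate simplification and an assumption of this example: a single symmetric error rate `eps` stands in for the error rates that the method estimates from the data, and the alphabet `"ACGT"` and the string encoding of site types are likewise hypothetical.

```python
ALPHABET = "ACGT"  # assumed alphabet for this sketch


def allele_prob(site_type, nucleotide):
    """Probability that an allele of site_type has the given nucleotide."""
    return site_type.count(nucleotide) / len(site_type)


def error_rate(true_nucleotide, observed_nucleotide, eps):
    """Hypothetical symmetric error model: a nucleotide is read correctly
    with probability 1 - eps, otherwise as one of the others uniformly."""
    if true_nucleotide == observed_nucleotide:
        return 1.0 - eps
    return eps / (len(ALPHABET) - 1)


def likelihood(reads, site_type, eps=0.01):
    """Likelihood of the observed read nucleotides given a site type."""
    result = 1.0
    for n_i in reads:
        # Mix over the possible true allele nucleotides N.
        result *= sum(
            allele_prob(site_type, n) * error_rate(n, n_i, eps)
            for n in ALPHABET
        )
    return result


# Two reads, one A and one C, evaluated under two diploid site types.
lik_het = likelihood("AC", "AC")
lik_hom = likelihood("AC", "AA")
```

Under this toy error model the heterozygous type explains a mixed column of reads far better than the homozygous type, which can only account for the C read through a sequencing error.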
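Inserting this expression for the likelihood, and the prior site type frequencies $f_s$ and $f_t$ for $P(s)$ and $P(t)$, in the expression for the posterior probability (31.1), we thus have the following equation for the posterior probabilities of the site types:

$$P(t \mid n_1, \ldots, n_k) = \frac{f_t \prod_{i=1}^{k} \sum_{N} P_t(N)\, e_q(N \to n_i)}{\sum_{s \in S} f_s \prod_{i=1}^{k} \sum_{N} P_s(N)\, e_q(N \to n_i)} \qquad (31.3)$$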
The unknowns in this equation are the prior site type probabilities, $f_s, s \in S$, and the error rates, $e_q(N \to M)$. Once these have been estimated, we can calculate the posterior site type probabilities using equation 31.3 for each site type, and hence, for each site, evaluate whether the sum of the posterior probabilities of the non-homozygous reference site types is larger than the cut-off. If so, we will set our current estimated site type to be that with the highest posterior probability.
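The calling rule can be sketched as follows. The function name `call_site`, the string encoding of site types, the choice to return `None` when no variant is called, and the posterior values in the usage lines are all assumptions made for this example.

```python
def call_site(posteriors, reference_base, cutoff):
    """Apply the cut-off rule to the posterior site type probabilities.

    posteriors:     dict mapping site type (e.g. "AA", "AC") -> posterior
    reference_base: the reference nucleotide at this site
    cutoff:         user-specified threshold on the summed posterior of
                    all site types other than homozygous reference
    Returns the site type with the highest posterior if the threshold is
    exceeded, otherwise None (no variant called).
    """
    # A site type is homozygous reference if every allele is the reference.
    variant_mass = sum(
        p for t, p in posteriors.items() if set(t) != {reference_base}
    )
    if variant_mass > cutoff:
        return max(posteriors, key=posteriors.get)
    return None


# Hypothetical posteriors at a site whose reference base is A.
called = call_site({"AA": 0.03, "AC": 0.96, "CC": 0.01}, "A", cutoff=0.9)
not_called = call_site({"AA": 0.95, "AC": 0.05}, "A", cutoff=0.9)
```

Note that the threshold is applied to the summed posterior of all non-reference types, while the reported site type is the single most probable one, as the text describes.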