Deriving the posterior probabilities of the site types

We will call a variant at a site if the sum of the posterior probabilities of the non-homozygous reference site types is larger than the user-specified cut-off value. For this we need to be able to calculate the posterior site type probabilities. We here derive the formula for these.

Using the Bayesian approach we can write the posterior probability of a site type, $ t$, as follows:

$\displaystyle P(t\vert data)$ $\displaystyle =$ $\displaystyle \frac{P(data\vert t)P(t)}{P(data)}$  
  $\displaystyle =$ $\displaystyle \frac{P(data\vert t)P(t)}{\sum_{s \in S}P(data\vert s)P(s)},$ (21.11)

where $ P(t)$ is the prior probability of site type $ t$ (that is, $ f_s, s \in S$, from above) and $ P(data\vert t)$ is the likelihood of the data, given the site type $ t$. The data consists of all the nucleotides in all the reads in the mapping. For a particular site, assume that we have $ k$ reads that cover this site, and let $ i$ be an index over the nucleotides observed, $ n_i$, in the reads at this site. We thus have:

$\displaystyle P(data\vert t) = P(n_1,...,n_k\vert t). $

To derive the likelihood of the data, $ P(n_1,...,n_k\vert t)$, we first need some notation: For a given site type, $ t$, let $ P_t(N)$ be the probability that an allele from this site type has the nucleotide $ N$. The $ P_t(N)$ probabilities are known and are determined by the ploidy: For a diploid organism, $ P_t(N)=1$ if $ t$ is a homozygous site and $ N$ is one of the alleles in $ t$, whereas it is 0.5 if $ t$ is a heterozygous and $ N$ is one of the alleles in $ t$, and it is 0, if $ N$ is not one of the alleles in $ t$. For a triploid organism, the $ P_t(N)$ will be either 0, 1/3, 2/3 or 1.

With this definition, we can write the likelihood of the data $ n_1,...,n_k$ in a site $ t$ as:

$\displaystyle P(n_1,...,n_k \vert t) = \prod_{i=1}^k \sum_{N \in \{A, C, G, T, -\}}P_t(N) \times e_q(N \rightarrow n_i).$ (21.12)

Inserting this expression for the likelihood, and the prior site type frequencies $ f_s$ and $ f_t$ for $ P(s)$ and $ P(t)$, in the expression for the posterior probability (21.11), we thus have the following equation for the posterior probabilities of the site types:


$\displaystyle P(t\vert n_1,...,n_k)$ $\displaystyle =$ $\displaystyle \frac{P(n_1,...,n_k\vert t)f_t}{\sum_{s \in S}P(n_1,...,n_k\vert s)f_s}$  
  $\displaystyle =$ $\displaystyle \frac{\prod_{i=1}^k \sum_{N \in \{A, C, G, T, -\}}P_t(N) \times e...
..._{i=1}^k \sum_{N \in \{A, C, G, T, -\}}P_s(N) \times e_q(N \rightarrow n_i)f_s}$ (21.13)

The unknowns in this equation are the prior site type probabilities, $ f_s, s \in S$, and the error rates $ \{e(N \rightarrow M) \vert N,M \in \{A, C, G, T, -\}\}$. Once these have been estimated, we can calculate the posterior site type probabilities using the equation 21.13 for each site type, and hence, for each site, evaluate whether the sum of the posterior probabilities of the non-homozygous reference site types is larger than the cut-off. If so, we will set out current estimated site type to be that with the highest posterior probability.