QIAGEN Bioinformatics Manuals

Deriving the posterior probabilities of the site types

We will call a variant at a site if the sum of the posterior probabilities of the non-homozygous reference site types is larger than the user-specified cut-off value. For this we need to be able to calculate the posterior site type probabilities. We here derive the formula for these.

Using the Bayesian approach we can write the posterior probability of a site type, $t$, as follows:

$\displaystyle \begin{aligned}
P(t\vert data) &= \frac{P(data\vert t)P(t)}{P(data)} \\
&= \frac{P(data\vert t)P(t)}{\sum_{s \in S}P(data\vert s)P(s)},
\end{aligned}$	(28.1)

where $P(t)$ is the prior probability of site type $t$ (that is, $f_s, s \in S$ , from above) and $P(data\vert t)$ is the likelihood of the data, given the site type $t$. The data consists of all the nucleotides in all the reads in the mapping. For a particular site, assume that we have $k$ reads that cover this site, and let $i$ be an index over the nucleotides observed, $n_i$, in the reads at this site. We thus have:

$\displaystyle P(data\vert t) = P(n_1,...,n_k\vert t).$

To derive the likelihood of the data, $P(n_1,...,n_k\vert t)$ , we first need some notation: For a given site type, $t$, let $P_t(N)$ be the probability that an allele from this site type has the nucleotide $N$. The probabilities $P_t(N)$ are known and are determined by the ploidy: For a diploid organism, $P_t(N)$ is 1 if $t$ is a homozygous site and $N$ is one of the alleles in $t$, whereas it is 0.5 if $t$ is a heterozygous site and $N$ is one of the alleles in $t$, and it is 0 if $N$ is not one of the alleles in $t$. For a triploid organism, the $P_t(N)$ will be either 0, 1/3, 2/3 or 1.
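The allele probabilities described above can be tabulated directly from the alleles of a site type. The following Python sketch is illustrative only: the function name `allele_probs` and the encoding of a site type as a tuple of alleles (one entry per chromosome copy) are assumptions made for this example, not part of the manual.

```python
from collections import Counter
from fractions import Fraction

# The five symbols a read can show at a site ("-" denotes a gap).
ALPHABET = ("A", "C", "G", "T", "-")

def allele_probs(site_type):
    """Return a dict mapping each symbol N to P_t(N).

    site_type is a tuple of alleles whose length equals the ploidy,
    e.g. ("A", "A") for a homozygous diploid site (hypothetical encoding).
    """
    counts = Counter(site_type)   # missing symbols count as 0
    ploidy = len(site_type)
    return {n: Fraction(counts[n], ploidy) for n in ALPHABET}
```

Under this encoding, `allele_probs(("A", "C"))` assigns 1/2 to both A and C and 0 to the remaining symbols, while a triploid site such as `("A", "A", "C")` assigns 2/3 to A and 1/3 to C.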

With this definition, we can write the likelihood of the data in a site as:

$\displaystyle P(n_1,...,n_k \vert t) = \prod_{i=1}^k \sum_{N \in \{A, C, G, T, -\}}P_t(N) \times e_q(N \rightarrow n_i).$	(28.2)
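As a rough illustration of equation 28.2, the Python sketch below multiplies, over the reads covering a site, the probability of each observed symbol given the site type. The function names, the dictionary layout, and the toy error model in `uniform_error_rates` are assumptions made for this example; in particular, the error rates are treated as already known and as independent of any per-base quality.

```python
ALPHABET = ("A", "C", "G", "T", "-")

def site_likelihood(observed, p_t, error_rate):
    """Compute P(n_1,...,n_k | t) as in equation 28.2.

    observed   : list of observed symbols n_1..n_k at the site
    p_t        : dict mapping N -> P_t(N) for the site type t
    error_rate : dict mapping (N, n) -> e(N -> n), the probability that
                 a true symbol N is read as n (assumed known here)
    """
    likelihood = 1.0
    for n_i in observed:
        # Sum over the possible true symbols N behind the observed n_i.
        likelihood *= sum(p_t[N] * error_rate[(N, n_i)] for N in ALPHABET)
    return likelihood

def uniform_error_rates(eps):
    """Toy error model (an assumption): a symbol is read correctly with
    probability 1 - eps, otherwise as one of the four other symbols with
    equal probability eps / 4."""
    return {(N, M): (1 - eps) if N == M else eps / 4
            for N in ALPHABET for M in ALPHABET}
```

For sites with very high coverage the product can underflow, so a practical implementation would more likely sum log-probabilities; the direct product is kept here to mirror the formula.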

Inserting this expression for the likelihood, and the prior site type frequencies $f_s$ and $f_t$ for $P(s)$ and $P(t)$, in the expression for the posterior probability (28.1), we thus have the following equation for the posterior probabilities of the site types:

$\displaystyle \begin{aligned}
P(t\vert n_1,...,n_k) &= \frac{P(n_1,...,n_k\vert t)f_t}{\sum_{s \in S}P(n_1,...,n_k\vert s)f_s} \\
&= \frac{\left[\prod_{i=1}^k \sum_{N \in \{A, C, G, T, -\}}P_t(N) \times e_q(N \rightarrow n_i)\right]f_t}{\sum_{s \in S}\left[\prod_{i=1}^k \sum_{N \in \{A, C, G, T, -\}}P_s(N) \times e_q(N \rightarrow n_i)\right]f_s}
\end{aligned}$	(28.3)

The unknowns in this equation are the prior site type probabilities, $f_s, s \in S$ , and the error rates $\{e(N \rightarrow M) \vert N,M \in \{A, C, G, T, -\}\}$ . Once these have been estimated, we can calculate the posterior site type probabilities using equation 28.3 for each site type, and hence, for each site, evaluate whether the sum of the posterior probabilities of the non-homozygous reference site types is larger than the cut-off. If so, we will set our current estimated site type to be that with the highest posterior probability.
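The final calling step can be sketched in Python as follows. This is a minimal illustration, assuming the per-site-type likelihoods and the prior frequencies have already been computed; the function names `posteriors` and `call_site`, and the use of plain dictionaries keyed by site type, are hypothetical choices for this example.

```python
def posteriors(likelihoods, priors):
    """Normalise P(data | t) * f_t over all site types (equation 28.3).

    likelihoods : dict mapping site type -> P(n_1,...,n_k | t)
    priors      : dict mapping site type -> f_t
    """
    joint = {t: likelihoods[t] * priors[t] for t in likelihoods}
    total = sum(joint.values())
    return {t: value / total for t, value in joint.items()}

def call_site(post, hom_ref_type, cutoff):
    """Return the site type with the highest posterior probability if the
    summed posterior of all site types other than the homozygous
    reference type exceeds the cutoff; otherwise return None."""
    non_ref_mass = sum(p for t, p in post.items() if t != hom_ref_type)
    if non_ref_mass > cutoff:
        return max(post, key=post.get)
    return None
```

For example, with likelihoods `{"AA": 0.2, "AC": 0.6}` and equal priors, the posterior mass outside the reference type `"AA"` is 0.75, so a cut-off of 0.5 yields the call `"AC"` while a cut-off of 0.9 yields no call.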
