CLC Manuals - clcsupport.com

Updating equations for the error rates

For the updating equations for the error probabilities we use the same procedure as for the Fixed Ploidy variant caller: we use equation 27.7. In the case of the Low Frequency variant caller, however, we do not have prior site type probabilities. Instead we have Maximum Likelihood estimates of the fractions of each nucleotide at the sites under the currently favored multinomial model for that site. Let $P_h(N)$ be the frequency of the nucleotide $N$ at site $h$, as determined by our current Maximum Likelihood estimates. Consider a read, $i$, at a given site, $h$. The joint probability of the true nucleotide in the read, $r_i^h$, at the site being $N$ and the data $n_1^h,...,n_{k_h}^h$ is:

$\displaystyle {P( r_i^h=N, n_1^h,...,n_{k_h}^h )}$
	$\displaystyle =$	$\displaystyle P_h(N) \prod_i P(r_i^h=N, n_i^h)$
	$\displaystyle =$	$\displaystyle P_h(N) P(r_i^h=N, n_i^h) \prod_{j \neq i} P( n_j^h)$
	$\displaystyle =$	$\displaystyle P_h(N) \times e_{q_{i^h}} (N \rightarrow n_i^h) \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_h(N^{\prime}) \times e_{q_j}(N^{\prime} \rightarrow n_j^h)$	(27.9)
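The last line of equation 27.9 can be evaluated directly once frequencies and error rates are available. The following is a minimal sketch, not the implementation used by the variant caller: the names `joint` and `uniform_error`, the dictionary layout, and the toy error model (a read is correct with probability 1 - eps, otherwise uniformly any of the four other symbols) are all assumptions made for illustration.

```python
from math import prod

ALPHABET = "ACGT-"

def uniform_error(eps):
    """Hypothetical error model e_q(N -> M): correct with probability
    1 - eps, otherwise uniform over the four other symbols."""
    return {N: {M: (1 - eps if M == N else eps / 4) for M in ALPHABET}
            for N in ALPHABET}

def joint(i, N, reads, freqs, err):
    """P(r_i = N, n_1, ..., n_k) at one site, last line of equation 27.9.

    reads: list of (observed symbol n_j, quality q_j) pairs at the site
    freqs: dict mapping N to the current frequency estimate P_h(N)
    err:   dict mapping quality q to the table e_q(N -> M)
    """
    n_i, q_i = reads[i]
    # Term for read i itself: P_h(N) * e_q(N -> n_i)
    own = freqs[N] * err[q_i][N][n_i]
    # Every other read j contributes its marginal sum over N'
    others = prod(
        sum(freqs[Np] * err[q_j][Np][n_j] for Np in ALPHABET)
        for j, (n_j, q_j) in enumerate(reads) if j != i
    )
    return own * others

# Toy site: two reads showing A at quality 30, one read showing C at quality 20.
reads = [("A", 30), ("A", 30), ("C", 20)]
freqs = {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0, "-": 0.0}
err = {30: uniform_error(0.001), 20: uniform_error(0.01)}

# Summing the joint over N gives the probability of the data,
# which must not depend on which read i was singled out.
data_prob_0 = sum(joint(0, N, reads, freqs, err) for N in ALPHABET)
data_prob_2 = sum(joint(2, N, reads, freqs, err) for N in ALPHABET)
```

Summing over N for any choice of read gives the same value, which is a quick sanity check on the factorization.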

As for the Fixed Ploidy variant caller, we use Bayes' formula to get:

$\displaystyle P( r_i^h=N \vert n_1^h,...,n_{k_h}^h ) = \frac{P( r_i^h=N, n_1^h,...,n_{k_h}^h)}{\sum_{N^{\prime} \in \{A, C, G, T, -\}} P( r_i^h=N^{\prime}, n_1^h,...,n_{k_h}^h)}$

and insert the expression from equation 27.9 to get:

$\displaystyle {P( r_i^h=N \vert n_1^h,...,n_{k_h}^h )}$
	$\displaystyle =$	$\displaystyle \frac{P_h(N) \times e_{q_{i^h}} (N \rightarrow n_i^h) \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_h(N^{\prime}) \times e_{q_j}(N^{\prime} \rightarrow n_j^h)}{\sum_{N^{\prime\prime} \in \{A, C, G, T, -\}} P_h(N^{\prime\prime}) \times e_{q_{i^h}} (N^{\prime\prime} \rightarrow n_i^h) \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_h(N^{\prime}) \times e_{q_j}(N^{\prime} \rightarrow n_j^h)}$	(27.10)
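When the joint probability of equation 27.9 is inserted into Bayes' formula, the product over the other reads $j \neq i$ appears in the numerator and in every term of the denominator and does not depend on the true nucleotide of read $i$, so it cancels. The posterior for a read then depends only on the frequencies at the site and on the error rates for that read's own observation. A minimal sketch under that reading follows; the function names, the dictionary layout, and the toy error model are assumptions for illustration and not part of the manual.

```python
ALPHABET = "ACGT-"

def uniform_error(eps):
    """Hypothetical error model e_q(N -> M): correct with probability
    1 - eps, otherwise uniform over the four other symbols."""
    return {N: {M: (1 - eps if M == N else eps / 4) for M in ALPHABET}
            for N in ALPHABET}

def posterior(n_i, q_i, freqs, err):
    """P(r_i = N | n_1, ..., n_k) for every N, after cancelling the
    product over the other reads in the Bayes ratio."""
    # Unnormalised weight for each candidate true nucleotide N
    weights = {N: freqs[N] * err[q_i][N][n_i] for N in ALPHABET}
    total = sum(weights.values())
    return {N: w / total for N, w in weights.items()}

# Toy site with 90% A and 10% C; one read shows C at quality 20.
freqs = {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0, "-": 0.0}
err = {20: uniform_error(0.01)}
post = posterior("C", 20, freqs, err)
```

In this toy case most of the posterior mass goes to a true C, with the remainder on a true A that was misread as C.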

We calculate the $P( r_i^h=N \vert n_1^h,...,n_{k_h}^h )$ values in equation 27.10, using our current error estimates and our current Maximum Likelihood values of the frequencies for our current multinomial models of choice at each of the sites. We then use these to get updated values for the error rates. In a manner similar to that of the Fixed Ploidy variant caller, we use the equation:

$\displaystyle e^*_q(N \rightarrow M) = \frac{\sum_h \sum_{i=1,...,{k_h}: n_i^h = M} P(r_i^h = N \vert n_1^h,...,n_{k_h}^h)}{\sum_h \sum_{i=1,...,{k_h}} P(r_i^h = N \vert n_1^h,...,n_{k_h}^h)}$
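The update can be read as a ratio of posterior mass: for each true nucleotide N, the share of its posterior mass that falls on reads observed as M. The sketch below illustrates one pass of this update and is not the variant caller's implementation. It assumes that the sums for $e^*_q$ run only over reads with quality $q$ (suggested by the subscript but not stated explicitly here), that the posterior is computed from the site frequencies and the read's own error term, and that a row with no posterior mass keeps its previous values. All names and the toy error model are hypothetical.

```python
from collections import defaultdict

ALPHABET = "ACGT-"

def uniform_error(eps):
    """Hypothetical starting error model e_q(N -> M)."""
    return {N: {M: (1 - eps if M == N else eps / 4) for M in ALPHABET}
            for N in ALPHABET}

def update_error_rates(sites, err):
    """One update of e_q(N -> M) from the read posteriors.

    sites: list of (freqs, reads); freqs maps N to P_h(N) and
           reads is a list of (observed symbol, quality) pairs
    err:   dict mapping quality q to the current table e_q(N -> M)
    """
    num = defaultdict(float)  # (q, N, M): posterior mass on reads observed as M
    den = defaultdict(float)  # (q, N): posterior mass on all reads
    for freqs, reads in sites:
        for n_i, q_i in reads:
            weights = {N: freqs[N] * err[q_i][N][n_i] for N in ALPHABET}
            total = sum(weights.values())
            for N in ALPHABET:
                p = weights[N] / total  # P(r_i = N | data at this site)
                num[(q_i, N, n_i)] += p
                den[(q_i, N)] += p
    new_err = {}
    for q in err:
        new_err[q] = {}
        for N in ALPHABET:
            d = den[(q, N)]
            # Assumption: keep the old row when N received no posterior mass
            new_err[q][N] = {
                M: (num[(q, N, M)] / d if d > 0 else err[q][N][M])
                for M in ALPHABET
            }
    return new_err

# Toy data: one site that is pure A, one site that is half G and half T.
sites = [
    ({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0, "-": 0.0},
     [("A", 30), ("A", 30), ("C", 20), ("A", 20)]),
    ({"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.5, "-": 0.0},
     [("G", 30), ("T", 30), ("T", 20)]),
]
err = {30: uniform_error(0.001), 20: uniform_error(0.01)}
new_err = update_error_rates(sites, err)
```

In this toy data the pure-A site has two quality-20 reads, one observed as A and one as C, so the updated rate for A being read as C at quality 20 is one half. With realistic coverage the sums run over many sites and the estimates are far less extreme.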
