QIAGEN Bioinformatics Manuals

Updating equations for the error rates

For the updating equations for the error probabilities, we consider a read, $i$, at a given site, $h$. The joint probability of the true nucleotide in the read, $r_i^h$, at the site being $N$ and the data $n_1^h,...,n_{k_h}^h$ is:

$$
\begin{aligned}
P( r_i^h=N, n_1^h,...,n_{k_h}^h ) &= \sum_{s \in S} f_s P(r_i^h=N, n_1^h,...,n_{k_h}^h \vert s) \\
&= \sum_{s \in S} f_s P(r_i^h=N, n_i^h \vert s) \prod_{j \neq i} P( n_j^h \vert s) \\
&= \sum_{s \in S} f_s \Big( P_s(N) \times e_{q_i^h}(N \rightarrow n_i^h) \times \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_s(N^{\prime}) \times e_{q_j^h}(N^{\prime} \rightarrow n_j^h) \Big) \qquad (29.7)
\end{aligned}
$$

Using Bayes' formula again, as we did above in equation 29.4, we get:

$$
\begin{aligned}
P( r_i^h=N \vert n_1^h,...,n_{k_h}^h ) &= \frac{P( r_i^h=N, n_1^h,...,n_{k_h}^h)}{P(n_1^h,...,n_{k_h}^h)} \\
&= \frac{P( r_i^h=N, n_1^h,...,n_{k_h}^h)}{\sum_{N^{\prime} \in \{A, C, G, T, -\}} P( r_i^h=N^{\prime}, n_1^h,...,n_{k_h}^h)}
\end{aligned}
$$

and inserting the expression from equation 29.7:

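$$
\begin{aligned}
P( r_i^h=N \vert n_1^h,...,n_{k_h}^h ) &= \frac{\sum_{s \in S} f_s \Big( P_s(N) \times e_{q_i^h}(N \rightarrow n_i^h) \times \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_s(N^{\prime}) \times e_{q_j^h}(N^{\prime} \rightarrow n_j^h) \Big)}{\sum_{N^{\prime\prime} \in \{A, C, G, T, -\}} \Big( \sum_{s \in S} f_s \Big( P_s(N^{\prime\prime}) \times e_{q_i^h}(N^{\prime\prime} \rightarrow n_i^h) \times \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_s(N^{\prime}) \times e_{q_j^h}(N^{\prime} \rightarrow n_j^h) \Big) \Big)} \\
&= \frac{\sum_{s \in S} f_s \Big( P_s(N) \, e_{q_i^h}(N \rightarrow n_i^h) \prod_{j \neq i} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_s(N^{\prime}) \times e_{q_j^h}(N^{\prime} \rightarrow n_j^h) \Big)}{\sum_{s \in S} f_s \prod_{j=1}^{k_h} \sum_{N^{\prime} \in \{A, C, G, T, -\}} P_s(N^{\prime}) \times e_{q_j^h}(N^{\prime} \rightarrow n_j^h)} \qquad (29.8)
\end{aligned}
$$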
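As an illustration only, the posterior in equation 29.8 could be computed along the following lines. This is a minimal sketch, not the software's implementation: the function names, the dictionary layout of `P` (site-type nucleotide probabilities $P_s$) and `e` (error rates indexed by quality, true nucleotide, observed nucleotide), and the `(nucleotide, quality)` tuples used for reads are all assumptions made for the example.

```python
ALPHABET = ("A", "C", "G", "T", "-")


def read_likelihood(P_s, e, observed, quality):
    """P(n_j | s): sum over true nucleotides N' of P_s(N') * e_q(N' -> n_j)."""
    return sum(P_s[N] * e[quality][N][observed] for N in ALPHABET)


def posterior_true_nucleotide(f, P, e, reads, i):
    """Posterior over the true nucleotide of read i at one site (cf. eq. 29.8).

    f     : list of site-type probabilities f_s (assumed layout)
    P     : list of dicts, P[s][N] = P_s(N) (assumed layout)
    e     : e[q][N][M] = error rate e_q(N -> M) (assumed layout)
    reads : list of (observed nucleotide, quality) tuples at the site
    i     : index of the read of interest
    """
    n_i, q_i = reads[i]
    joint = {}
    for N in ALPHABET:
        total = 0.0
        for f_s, P_s in zip(f, P):
            # Product over the other reads j != i of P(n_j | s).
            others = 1.0
            for j, (n_j, q_j) in enumerate(reads):
                if j != i:
                    others *= read_likelihood(P_s, e, n_j, q_j)
            # Term of the joint probability for site type s (cf. eq. 29.7).
            total += f_s * P_s[N] * e[q_i][N][n_i] * others
        joint[N] = total
    norm = sum(joint.values())  # equals P(n_1, ..., n_k) at this site
    if norm == 0.0:
        raise ValueError("data have zero probability under the current parameters")
    return {N: joint[N] / norm for N in ALPHABET}
```

The normalising constant is obtained by summing the joint over the five possible true nucleotides, which is the first form of the denominator in equation 29.8.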

Equation 29.8 gives us, for a given read, $i$, and site, $h$, the probability that the true nucleotide is $N$, $N \in \{A, C, G, T, -\}$, given the data $n_1^h,...,n_{k_h}^h$ and our current values of the error rates and site probabilities. Since we know the sequenced nucleotide in each read at each site, we can get new updated values for the error rate of producing an $M$ nucleotide when the true nucleotide is $N$, $e^*_q(N \rightarrow M)$, for $N,M \in \{A, C, G, T, -\}$, by summing the probabilities of the true nucleotide being $N$ over all reads across all sites for which the sequenced nucleotide is $M$, and dividing by the sum of the probabilities of the true nucleotide being $N$ across all reads and all sites:

$$
e^*_q(N \rightarrow M) = \frac{\sum_h \sum_{i=1,...,k_h:\, n_i^h = M} P(r_i^h = N \vert n_1^h,...,n_{k_h}^h)}{\sum_h \sum_{i=1,...,k_h} P(r_i^h = N \vert n_1^h,...,n_{k_h}^h)}
$$
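The update above amounts to a ratio of accumulated posterior probabilities, which can be sketched as below. The function name and input layout are hypothetical; the sketch also assumes that the reads passed in all share the quality $q$ whose error rates are being updated, and that a true nucleotide with zero total posterior mass simply gets a row of zeros.

```python
ALPHABET = ("A", "C", "G", "T", "-")


def update_error_rates(observations):
    """Return e*[N][M] from (observed nucleotide, posterior) pairs.

    observations : iterable of (M_obs, posterior) pairs, one per read and
                   site, where posterior[N] = P(r_i^h = N | data at site h).
                   Assumed to contain only reads of one quality q.
    """
    numer = {N: {M: 0.0 for M in ALPHABET} for N in ALPHABET}
    denom = {N: 0.0 for N in ALPHABET}
    for observed, posterior in observations:
        for N in ALPHABET:
            # Numerator: only reads whose sequenced nucleotide is `observed`.
            numer[N][observed] += posterior[N]
            # Denominator: all reads, regardless of sequenced nucleotide.
            denom[N] += posterior[N]
    return {
        N: {M: (numer[N][M] / denom[N] if denom[N] > 0 else 0.0) for M in ALPHABET}
        for N in ALPHABET
    }
```

By construction, each row of the result with non-zero posterior mass sums to one, so $e^*_q(N \rightarrow \cdot)$ is a probability distribution over sequenced nucleotides.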
