Updating equations for the error rates

Consider a site $h$ and a read $i$. The joint probability that the true nucleotide in the read at the site, $r_i^h$, is $N$ and that the observed nucleotide at the site is $n_i^h$ is:

$\displaystyle P(r_i^h=N, n_i^h) = P_h(N) \, e_{q_i^h}(N \rightarrow n_i^h).$ (22.24)

Using Bayes' theorem, the probability that the true nucleotide in the read at the site, $r_i^h$, is $N$, given that we observe $n_i^h$, is:

$\displaystyle P(r_i^h=N \vert n_i^h) = \frac{ P(r_i^h=N, n_i^h)}{ \sum_{N^{\prime} \in \{A,C,G,T,-\}} P(r_i^h=N^{\prime}, n_i^h)}.$ (22.25)

Inserting 22.24 in 22.25 we get:

$\displaystyle P(r_i^h=N \vert n_i^h) = \frac{ P_h(N) \, e_{q_i^h}(N \rightarrow n_i^h)}{ \sum_{N^{\prime} \in \{A,C,G,T,-\}} P_h(N^{\prime}) \, e_{q_i^h}(N^{\prime} \rightarrow n_i^h)}.$ (22.26)
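As an illustration, the posterior in 22.26 could be computed for a single read and site roughly as follows. This is a minimal sketch, not the software's actual implementation: the function name `posterior_true_nucleotide`, the dictionary layout of the inputs, and the handling of a zero denominator are all assumptions made for the example.

```python
ALPHABET = ("A", "C", "G", "T", "-")


def posterior_true_nucleotide(site_freq, error_rate, observed):
    """Return P(r = N | n = observed) for every N in ALPHABET (equation 22.26).

    site_freq[N]       : P_h(N), the current frequency of N at the site
    error_rate[(N, M)] : e_q(N -> M) for the quality score q of this read
    observed           : the sequenced nucleotide n_i^h
    (Hypothetical data layout, chosen for this sketch only.)
    """
    # Numerator of 22.26 for each candidate true nucleotide N.
    joint = {N: site_freq[N] * error_rate[(N, observed)] for N in ALPHABET}
    # Denominator of 22.26: sum of the joint probabilities over all N'.
    total = sum(joint.values())
    if total == 0.0:
        # Assumption of this sketch: treat an impossible observation as an error.
        raise ValueError("observed nucleotide has zero probability under the model")
    return {N: joint[N] / total for N in ALPHABET}
```

For example, with frequencies of 0.5 for A and C at the site, a 0.9 probability of reading a nucleotide correctly and 0.025 for each possible error, observing an A gives a posterior of 0.45 / 0.4625 for the true nucleotide being A.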

Equation 22.26 gives, for a given read $i$ and site $h$ with observed nucleotide $n_i^h$, the probability that the true nucleotide is $N$, $N \in \{A, C, G, T, -\}$, under our current values for the frequencies $f$ (inserted for $P_h(N)$) and the error rates. Since we know the sequenced nucleotide in each read at each site, we can compute updated values for the error rate of producing an $M$ nucleotide when the true nucleotide is $N$, $e^*_q(N \rightarrow M)$, for $N, M \in \{A, C, G, T, -\}$: sum the probabilities of the true nucleotide being $N$ over all reads and sites for which the sequenced nucleotide is $M$, and divide by the sum of the probabilities of the true nucleotide being $N$ over all reads and all sites:

$\displaystyle e^*_q(N \rightarrow M) = \frac{\sum_h \sum_{i=1,...,{k_h}: n_i^h = M} P(r_i^h = N \vert n_i^h)}{\sum_h \sum_{i=1,...,{k_h}} P(r_i^h = N \vert n_i^h)} $
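The update of the error rates can be sketched in code as below. This is only an illustrative sketch under stated assumptions, not the software's actual implementation: the function name `update_error_rates` and the input layout are hypothetical, the observations are assumed to be already restricted to reads with one quality score $q$, and keeping the previous rate when a true nucleotide has zero total posterior weight is a choice made for this example.

```python
from collections import defaultdict

ALPHABET = ("A", "C", "G", "T", "-")


def update_error_rates(observations, site_freqs, error_rates):
    """Return updated rates e*_q(N -> M) from the current estimates.

    observations       : iterable of (site, observed nucleotide) pairs, one per
                         read and site, all with the same quality score q
    site_freqs[h][N]   : P_h(N), the current frequency of N at site h
    error_rates[(N, M)]: current e_q(N -> M)
    (Hypothetical data layout, chosen for this sketch only.)
    """
    numer = defaultdict(float)  # posterior mass of true N where M was observed
    denom = defaultdict(float)  # posterior mass of true N over all observations
    for site, observed in observations:
        freq = site_freqs[site]
        # Posterior P(r = N | n = observed), as in equation 22.26.
        joint = {N: freq[N] * error_rates[(N, observed)] for N in ALPHABET}
        total = sum(joint.values())
        if total == 0.0:
            continue  # assumption: skip observations the model cannot explain
        for N in ALPHABET:
            p = joint[N] / total
            numer[(N, observed)] += p
            denom[N] += p
    # Ratio of the two sums; fall back to the old rate if N never carries weight.
    return {
        (N, M): numer[(N, M)] / denom[N] if denom[N] > 0.0 else error_rates[(N, M)]
        for N in ALPHABET
        for M in ALPHABET
    }
```

Because every observation contributes its posterior weight to exactly one numerator per true nucleotide $N$, the updated rates for each $N$ with non-zero weight sum to one over $M$.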