The model for the Fixed Ploidy Variant Detection tool

The statistical model for the Fixed Ploidy Variant Detection tool consists of two parts: a model for the possible site types, $ S$, and their prior probabilities, $ f_s, s \in S$, and a model for the sequencing errors, $ e$.

Prior site type probabilities:
The set of possible site types is determined entirely by the assumed ploidy, and consists of the possible underlying nucleotide allele combinations that can exist within an individual with the specified number of alleles. For example, if the specified ploidy is 2, the individual has two alleles, and the nucleotide at each allele can be an $ A$, a $ C$, a $ G$, a $ T$ or a $ -$. The set of possible types for the diploid individual's sites is thus:

$\displaystyle S = \{ A/A, A/C, A/G, A/T, A/-, C/C, C/G, C/T, C/-, G/G, G/T, G/-, T/T, T/-, -/-\}.$

Note that, because the alleles cannot be distinguished from each other, there are not 5 $ \times$ 5 = 25 possible site types but only 15 (for example, the allele combination $ A/C$ is indistinguishable from the allele combination $ C/A$).
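
The counting argument above can be checked with a short Python sketch. It is only an illustration, not the tool's implementation: the names `ALPHABET` and `site_types` are hypothetical, and the closed-form count used in the last line is the standard formula for unordered selections with repetition rather than something stated in this manual.

```python
from itertools import combinations_with_replacement
from math import comb

# The five symbols that can occur at each allele (names are illustrative).
ALPHABET = ("A", "C", "G", "T", "-")


def site_types(ploidy):
    """Return every unordered allele combination for the given ploidy."""
    return [
        "/".join(combo)
        for combo in combinations_with_replacement(ALPHABET, ploidy)
    ]


diploid = site_types(2)
print(len(diploid))  # 15
print(diploid[:5])   # ['A/A', 'A/C', 'A/G', 'A/T', 'A/-']

# Number of unordered selections of `ploidy` items from 5 symbols.
print(len(site_types(3)) == comb(5 + 3 - 1, 3))  # True
```

Because the combinations are unordered, $ A/C$ appears once and $ C/A$ never does, which is why the diploid case yields 15 site types rather than 25.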

We let $ f_s$ denote the prior probabilities of the site types $ s \in S$. The prior probabilities of the site types are the frequencies of the true site types in the mapping. These values are unknown and need to be estimated from the data.

Error probabilities:
The model for the sequencing errors describes the probabilities with which the sequencing machine produces the nucleotide $ M$ when it should have produced the nucleotide $ N$, where $ M, N \in \{A, C, G, T, -\}$. When quality values are available for the nucleotides in the reads, we let each quality value have its own error model; if not, a single error model is assumed for all nucleotides. Each error model has the following 25 parameters:

$\displaystyle \{e(N \rightarrow M) \vert N,M \in \{A, C, G, T, -\}\}.$

The values of these parameters are also unknown, and hence also need to be estimated from the data.
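
To make the shape of one error model concrete, the following Python sketch builds a table with one entry per pair $ (N, M)$. It is a minimal illustration under stated assumptions: the function name `placeholder_error_model` and the uniform error rate are hypothetical, the numbers are placeholders rather than estimates, and the sketch assumes that for each true nucleotide $ N$ the probabilities over all $ M$ sum to one.

```python
# The five symbols a read position can show (names are illustrative).
ALPHABET = ("A", "C", "G", "T", "-")


def placeholder_error_model(error_rate):
    """Build a table e[(N, M)] using a hypothetical uniform error rate.

    The real parameters are unknown and estimated from the data; the
    values here are placeholders that only show the 5 x 5 structure.
    """
    off_diagonal = error_rate / (len(ALPHABET) - 1)
    return {
        (n, m): (1.0 - error_rate) if n == m else off_diagonal
        for n in ALPHABET
        for m in ALPHABET
    }


model = placeholder_error_model(0.01)
print(len(model))  # 25

# Assumption: each row (fixed N) is a probability distribution over M.
for n in ALPHABET:
    row_sum = sum(model[(n, m)] for m in ALPHABET)
    assert abs(row_sum - 1.0) < 1e-12

# With quality values, one such table per quality value (values hypothetical).
models_by_quality = {q: placeholder_error_model(10 ** (-q / 10)) for q in (10, 20, 30)}
```

The last line maps each quality value to its own table, mirroring the statement that each quality value has its own error model; converting a quality value to an error rate with the Phred formula is an assumption made here only to produce plausible placeholder numbers.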