The model for the Low Frequency Variant Caller

For the low frequency variant caller, there are a large number of possible hypotheses that can explain why the observed set of nucleotides is present in the sample at a given site. The most general situation is that the sample is a mixture of variants corresponding to all four nucleotides together with some missing nucleotides (gaps in the read mapping). We can write this hypothesis as [A/C/G/T/-]. At the other end of the spectrum, we have sites that are made up of only a single nucleotide (or only gaps in the read mapping). We can write these as [A], [C], etc. Each of these hypotheses can be described by a multinomial model.
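To give a sense of how large the hypothesis space is, the following minimal sketch enumerates candidate hypotheses in the bracket notation used above. It assumes that every non-empty combination of the five symbols A, C, G, T and - counts as a hypothesis; the text does not state which combinations the caller actually evaluates, and the names `SYMBOLS` and `enumerate_hypotheses` are made up for illustration.

```python
from itertools import combinations

# The four nucleotides plus the gap symbol, in the order used in the text.
SYMBOLS = ["A", "C", "G", "T", "-"]


def enumerate_hypotheses(symbols=SYMBOLS):
    """List every non-empty subset of symbols in [X/Y/...] notation.

    Assumption: all non-empty subsets are valid hypotheses. This is an
    illustration, not a description of the caller's internals.
    """
    hypotheses = []
    for size in range(1, len(symbols) + 1):
        for subset in combinations(symbols, size):
            hypotheses.append("[" + "/".join(subset) + "]")
    return hypotheses


hypotheses = enumerate_hypotheses()
```

Under this assumption there are 2^5 - 1 = 31 hypotheses, running from the single-symbol cases such as [A] up to the most general one, [A/C/G/T/-].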

The hypotheses where a single nucleotide is present in the sample have no free parameters (there is just one frequency parameter, and it must be 1). A hypothesis stating that a site is a mixture of two different nucleotides, e.g. [A/G], has one free parameter, since there are frequencies for two nucleotides but they have to sum to one. In general, a hypothesis involving k symbols has k - 1 free parameters, so the most general hypothesis, [A/C/G/T/-], has four.
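The parameter counting and the multinomial model can be sketched in a few lines of Python. This is an illustration only, not the caller's implementation: the function names are hypothetical, and the likelihood ignores sequencing errors, so any observed symbol that the hypothesis does not allow gets probability zero.

```python
import math


def free_parameters(hypothesis):
    """Number of free frequency parameters for a hypothesis such as '[A/G]'.

    The frequencies must sum to one, so k symbols leave k - 1 free.
    """
    symbols = hypothesis.strip("[]").split("/")
    return len(symbols) - 1


def multinomial_log_likelihood(counts, freqs):
    """Log-probability of the observed counts under a multinomial model.

    counts: dict mapping symbol -> number of reads showing that symbol.
    freqs:  dict mapping symbol -> frequency under the hypothesis.
    Assumption: no error model; symbols missing from freqs have frequency 0.
    """
    n = sum(counts.values())
    log_lik = math.lgamma(n + 1)  # log(n!)
    for symbol, count in counts.items():
        log_lik -= math.lgamma(count + 1)  # minus log(count!)
        if count > 0:
            p = freqs.get(symbol, 0.0)
            if p == 0.0:
                return float("-inf")  # observed symbol is impossible here
            log_lik += count * math.log(p)
    return log_lik


# Example: 3 reads with A and 1 read with G at a site.
counts = {"A": 3, "G": 1}
ll_mixture = multinomial_log_likelihood(counts, {"A": 0.75, "G": 0.25})
ll_single = multinomial_log_likelihood(counts, {"A": 1.0})
```

In this toy example, [A/G] with frequencies 0.75 and 0.25 explains the counts with finite log-likelihood, whereas [A] alone cannot explain the G read at all. That second result is a consequence of leaving out an error model in the sketch.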