The model for the Low Frequency Variant Caller

The purpose of the Low Frequency Variant Caller is to call variants in samples with unknown ploidy (e.g. samples of limited tumor purity from cancer patients) using read mapping data. It detects germline as well as somatic variants, and may also be used on samples from high-ploidy organisms or on pooled samples. Like the Fixed Ploidy Variant Caller, it detects Single Nucleotide Variants (SNVs), Multiple Nucleotide Variants (MNVs), insertions and deletions, as well as replacements (combinations of neighboring insertions, deletions and SNVs for which the positions are ambiguous).

The algorithm behind the Low Frequency Variant Caller relies on Multinomial models for the presence of different nucleotide alleles at a given site and on an error model for the sequencing (the error model is identical to that of the Fixed Ploidy Variant Caller). The Multinomial models are of the kind "there are $q$ different nucleotide alleles present at the site with frequencies $f_i$, $i=1,\ldots,q$, $\sum_{i=1}^q f_i=1$", where the number of alleles, $q$, differs between models. The models that are evaluated at each site are given in Table 20.1.


Table 20.1: The Multinomial models evaluated at each site. $X, Y, Z, W$ and $V$ each take one of the values $A, C, G, T$ or $-$, and are pairwise distinct. Free parameters$^*$: the parameters that are free in each of the Multinomial models of the Low Frequency Variant Caller.
Model | Alleles present at the site | Description | Free parameters$^*$
$M_x$ | $x$ | The only allele present at the site is $x$ | none
$M_{x,y}$ | $x$ and $y$ | $x$ is present at frequency $1-f$, $y$ at frequency $f$ | $f$
$M_{x,y,z}$ | $x$, $y$ and $z$ | $x$ is present at frequency $1-(f_1+f_2)$, $y$ at frequency $f_1$ and $z$ at frequency $f_2$ | $f_1$ and $f_2$
$M_{x,y,z,w}$ | $x$, $y$, $z$ and $w$ | $x$ is present at frequency $1-(f_1+f_2+f_3)$, $y$ at frequency $f_1$, $z$ at frequency $f_2$ and $w$ at frequency $f_3$ | $f_1$, $f_2$ and $f_3$
$M_{x,y,z,w,v}$ | $x$, $y$, $z$, $w$ and $v$ | $x$ is present at frequency $1-(f_1+f_2+f_3+f_4)$, $y$ at frequency $f_1$, $z$ at frequency $f_2$, $w$ at frequency $f_3$ and $v$ at frequency $f_4$ | $f_1$, $f_2$, $f_3$ and $f_4$


In words, model $M_x$ can be described as: "There is really only the $X$ nucleotide allele present at the site; all other nucleotides are due to errors", and model $M_{x,y,z}$ as: "There are really only the nucleotide alleles $X$, $Y$ and $Z$ present at the site; all other nucleotides are due to errors". The hypotheses where a single nucleotide is present in the sample have no free parameters (there is just one frequency parameter and it must be 1). A hypothesis stating that a site is a mixture of two different nucleotides, e.g. [A/G], has one free parameter, since there are frequencies for two nucleotides but they have to sum to one. In general, a model with $q$ alleles has $q-1$ free frequency parameters.
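To make the model space concrete, the following minimal Python sketch enumerates the candidate models for one site together with their number of free frequency parameters. It assumes that a model is identified by the unordered set of alleles it contains (so $M_{x,y}$ and $M_{y,x}$ count as the same model); the text does not state this explicitly. The names `ALLELES` and `enumerate_models` are hypothetical and used for illustration only.

```python
from itertools import combinations

# The five possible allele values at a site; "-" denotes a gap.
ALLELES = ("A", "C", "G", "T", "-")


def enumerate_models(alleles=ALLELES):
    """Return (alleles_present, n_free_parameters) for every candidate model.

    Assumption: a model is an unordered set of q distinct alleles.
    Its q frequencies must sum to 1, leaving q - 1 free parameters.
    """
    models = []
    for q in range(1, len(alleles) + 1):
        for present in combinations(alleles, q):
            models.append((present, q - 1))
    return models


if __name__ == "__main__":
    for present, n_free in enumerate_models():
        print("M_{" + ",".join(present) + "}", "free parameters:", n_free)
```

Under this assumption there are 5 + 10 + 10 + 5 + 1 = 31 candidate models per site, ranging from zero free parameters for the single-allele models to four for the model containing all five alleles.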

Parameter estimation relies on the Maximum Likelihood principle and, as in the Fixed Ploidy Variant Caller, the EM algorithm is used for estimating the parameters of the model. Given an initial set of parameter values for the error rates, the different Multinomial models are evaluated at each site by finding the maximum likelihood estimates of the frequency parameters for each model. The model that offers the best explanation of the data is chosen as the current guess of the true allelic situation at that site; the comparison adjusts for the number of parameters in each Multinomial model, using a criterion adapted from the Akaike Information Criterion. Given the chosen models, the error rates are re-estimated. Given the new error estimates, the maximum log likelihoods for all possible Multinomial models are evaluated again and updated frequencies are produced. This procedure is performed a total of four times. After the final round of estimation, the Multinomial model that offers the best explanation of the data is chosen as the winning model, and variants are called according to that model.
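The per-site evaluation described above can be illustrated with a simplified Python sketch. It is not the variant caller's implementation and makes several assumptions: the error model is reduced to a single fixed symmetric error rate `eps` (the actual error rates are re-estimated between rounds), the data at a site are plain allele counts, and the plain Akaike Information Criterion stands in for the adapted criterion, whose exact form is not given here. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

ALLELES = ("A", "C", "G", "T", "-")


def emission(observed, true, eps):
    """P(read shows `observed` | true allele is `true`), symmetric errors (assumed)."""
    return 1.0 - eps if observed == true else eps / (len(ALLELES) - 1)


def fit_model(counts, present, eps, n_iter=200):
    """EM estimate of the allele frequencies of one model.

    Returns (frequencies, maximized log likelihood). Requires 0 < eps < 1.
    """
    freqs = {a: 1.0 / len(present) for a in present}
    total = sum(counts.values())
    for _ in range(n_iter):
        # E-step: expected number of reads originating from each allele.
        expected = {a: 0.0 for a in present}
        for obs, n in counts.items():
            weights = {a: freqs[a] * emission(obs, a, eps) for a in present}
            norm = sum(weights.values())
            for a in present:
                expected[a] += n * weights[a] / norm
        # M-step: frequencies are the normalized expected counts.
        freqs = {a: expected[a] / total for a in present}
    log_lik = sum(
        n * math.log(sum(freqs[a] * emission(obs, a, eps) for a in present))
        for obs, n in counts.items()
        if n > 0
    )
    return freqs, log_lik


def select_model(counts, eps=0.01):
    """Return (score, alleles_present, frequencies) of the best-scoring model."""
    best = None
    for q in range(1, len(ALLELES) + 1):
        for present in combinations(ALLELES, q):
            freqs, log_lik = fit_model(counts, present, eps)
            score = 2 * (q - 1) - 2 * log_lik  # plain AIC as a stand-in
            if best is None or score < best[0]:
                best = (score, present, freqs)
    return best


if __name__ == "__main__":
    site_counts = {"A": 90, "C": 0, "G": 10, "T": 0, "-": 0}
    score, present, freqs = select_model(site_counts)
    print(present, {a: round(f, 3) for a, f in freqs.items()})
```

In this sketch, a site with 90 reads showing A and 10 showing G is best explained by the two-allele model containing A and G: the single-allele model would have to attribute all ten G reads to sequencing error, while models with additional alleles pay a penalty for extra parameters without improving the likelihood.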

Below we describe in detail how we choose among competing models and derive the updating equations for the EM estimation of the frequency and error rate parameters.