The model for the Low Frequency Variant Detection tool

The purpose of the Low Frequency Variant Detection tool is to call variants in samples with unknown ploidy (e.g., samples of limited tumor purity) using read mapping data. It detects germline as well as somatic variants, and may also be used on samples from high-ploidy organisms or on pooled samples. Like the Fixed Ploidy Variant Detection tool, it detects Single Nucleotide Variants (SNVs), Multiple Nucleotide Variants (MNVs), insertions and deletions, as well as replacements (combinations of neighboring insertions, deletions and SNVs for which the positions are ambiguous).

The Low Frequency Variant Detection algorithm relies on multinomial models to determine the presence of different alleles at a given site, and on an error model to account for sequencing errors. The error model employed here is the same as the one used in the Fixed Ploidy Variant Detection tool, described in the Fixed Ploidy Variant Detection model section.

The multinomial models are of the kind "there are $ q$ different nucleotide alleles present at the site with frequencies $ f_i$, $ i=1,\ldots,q$, $ \sum_{i=1}^q f_i=1$", where the number of alleles, $ q$, differs between models. The models that are evaluated at each site are given in Table 28.1.


Table 28.1: The multinomial models evaluated at each site. $ x, y, z, w$ and $ v$ each take on one of the values $ A, C, G, T$ or $ -$, and are pairwise distinct. Free parameters$ ^*$: the parameters that are free in each of the multinomial models of the Low Frequency Variant Detection tool.
Model | Alleles present at the site | Description | Free parameters$ ^*$
$ M_x$ | $ x$ | The only allele present at the site is $ x$ | none
$ M_{x,y}$ | $ x$ and $ y$ | $ x$ is present at frequency $ 1-f$, $ y$ at frequency $ f$ | $ f$
$ M_{x,y,z}$ | $ x$, $ y$ and $ z$ | $ x$ is present at frequency $ 1-(f_1+f_2)$, $ y$ at frequency $ f_1$ and $ z$ at frequency $ f_2$ | $ f_1$ and $ f_2$
$ M_{x,y,z,w}$ | $ x$, $ y$, $ z$ and $ w$ | $ x$ is present at frequency $ 1-(f_1+f_2+f_3)$, $ y$ at frequency $ f_1$, $ z$ at frequency $ f_2$ and $ w$ at frequency $ f_3$ | $ f_1$, $ f_2$ and $ f_3$
$ M_{x,y,z,w,v}$ | $ x$, $ y$, $ z$, $ w$ and $ v$ | $ x$ is present at frequency $ 1-(f_1+f_2+f_3+f_4)$, $ y$ at frequency $ f_1$, $ z$ at frequency $ f_2$, $ w$ at frequency $ f_3$ and $ v$ at frequency $ f_4$ | $ f_1$, $ f_2$, $ f_3$ and $ f_4$


In words, model $ M_x$ can be described as: "There is really only the nucleotide allele $ x$ present at the site; all other nucleotides are due to errors", and model $ M_{x,y,z}$ as: "There are really only the nucleotide alleles $ x$, $ y$ and $ z$ present at the site; all other nucleotides are due to errors". The hypotheses where a single nucleotide is present in the sample have no free parameters (there is just one frequency parameter and it must be 1). A hypothesis stating that a site is a mixture of two different nucleotides, e.g. [A/G], has one free parameter: there are frequencies for two nucleotides, but they have to sum to one. In general, a model with $ q$ alleles has $ q-1$ free frequency parameters.
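To make the model space concrete, the following is a minimal Python sketch that enumerates the candidate models and their free-parameter counts. It assumes that a model is identified by its unordered set of alleles, which the text does not state explicitly. The names `SYMBOLS` and `candidate_models` are hypothetical and are not part of the tool.

```python
from itertools import combinations

# Symbols an allele may take at a site: the four nucleotides plus the gap symbol.
SYMBOLS = ("A", "C", "G", "T", "-")


def candidate_models(symbols=SYMBOLS):
    """Yield (alleles, n_free) for every non-empty set of distinct alleles.

    Assumption: a model is identified by its unordered allele set.
    """
    for q in range(1, len(symbols) + 1):
        for alleles in combinations(symbols, q):
            # q frequencies constrained to sum to one leave q - 1 free parameters.
            yield alleles, q - 1


models = list(candidate_models())
```

Under this assumption, `models` holds one entry per allele set, from the five single-allele models with no free parameters up to the single five-allele model with four.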

Parameter estimation relies on the Maximum Likelihood principle and, as in the Fixed Ploidy Variant Detection tool, the EM algorithm is used for estimating the parameters of the model. Given an initial set of parameter values for the error rates, the different multinomial models are evaluated at each site by finding the maximum likelihood estimates of the frequency parameters for each model. The model that offers the best explanation of the data (after adjusting for the number of parameters in the multinomial model, using a criterion adapted from the Akaike Information Criterion) is chosen as the current guess of the true allelic situation at that site and, given that guess, the error rates are re-estimated. Given the new error estimates, the maximum log likelihoods for all possible multinomial models are evaluated again and updated frequencies are produced. This procedure is performed a total of four times. After the final round of estimation, the multinomial model that offers the best explanation of the data is chosen as the winning model, and variants are called according to that model.
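The per-site model fitting and selection step can be illustrated with a small Python sketch. It is not the tool's implementation and rests on several simplifying assumptions: a hypothetical symmetric error model with a single fixed error rate `err` (the tool uses and re-estimates its own error model), the plain Akaike Information Criterion $ 2k - 2\ln L$ as a stand-in for the tool's adapted criterion, and a single round of estimation rather than four. All function names are illustrative.

```python
import math
from itertools import combinations

SYMBOLS = ("A", "C", "G", "T", "-")


def emission(observed, true, err):
    """P(observed | true allele) under an assumed symmetric error model."""
    return 1.0 - err if observed == true else err / (len(SYMBOLS) - 1)


def fit_model(counts, alleles, err, n_iter=200):
    """EM estimate of the allele frequencies of one model (error rate fixed)."""
    freqs = {a: 1.0 / len(alleles) for a in alleles}
    total = sum(counts.values())
    for _ in range(n_iter):
        expected = {a: 0.0 for a in alleles}
        for obs, n in counts.items():
            if n == 0:
                continue
            # E-step: posterior probability that a read showing `obs`
            # originated from each allele of the model.
            weights = {a: freqs[a] * emission(obs, a, err) for a in alleles}
            norm = sum(weights.values())
            for a in alleles:
                expected[a] += n * weights[a] / norm
        # M-step: frequencies are the expected fractions of reads per allele.
        freqs = {a: expected[a] / total for a in alleles}
    loglik = sum(
        n * math.log(sum(freqs[a] * emission(obs, a, err) for a in alleles))
        for obs, n in counts.items()
        if n > 0
    )
    return freqs, loglik


def select_model(counts, err=0.01):
    """Return (aic, alleles, freqs) for the model with the lowest plain AIC."""
    best = None
    for q in range(1, len(SYMBOLS) + 1):
        for alleles in combinations(SYMBOLS, q):
            freqs, loglik = fit_model(counts, alleles, err)
            aic = 2 * (q - 1) - 2 * loglik  # q - 1 free frequency parameters
            if best is None or aic < best[0]:
                best = (aic, alleles, freqs)
    return best


# Hypothetical read counts at one site: mostly A, with a minority of G.
site_counts = {"A": 90, "G": 10, "C": 0, "T": 0, "-": 0}
aic, alleles, freqs = select_model(site_counts)
```

In the full procedure described above, the error rates would then be re-estimated given the chosen models, and the fit repeated. The sketch omits that outer loop because the error model is specified elsewhere.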

Below we describe in detail how we choose among competing models and derive the updating equations for the EM estimation of the frequency and error rate parameters.