The Low Frequency Variant Caller

The purpose of the Low Frequency Variant Caller is to call variants in samples with unknown ploidy (e.g. samples of limited tumor purity from cancer patients) using read mapping data. It detects germline as well as somatic variants, and may also be used on samples from high-ploidy organisms or on pooled samples. Like the Fixed Ploidy Variant Caller, it detects Single Nucleotide Variants (SNVs), Multiple Nucleotide Variants (MNVs), insertions, deletions, and replacements (combinations of neighboring insertions, deletions and SNVs for which the positions are ambiguous).

The algorithm behind the Low Frequency Variant Caller combines multinomial models for the presence of variants with an error model for the sequencing (identical to that of the Fixed Ploidy Variant Caller). The multinomial models are of the kind "there are $k$ different variants with frequencies $f_i$, $i=1,\ldots,k$, $\sum_{i=1}^k f_i=1$", where the number of variants, $k$, differs between models. Parameter estimation relies on the Maximum Likelihood principle and, as in the Fixed Ploidy Variant Caller, the EM algorithm is used for estimating the parameters of the model.

Given an initial set of parameter values for the error rates, preliminary variants are called by examining the log likelihoods of the data at each site for a set of multinomial models, each evaluated at the maximum likelihood estimates of that model's parameters. The model that offers the best explanation of the data, after adjusting for the number of parameters in the multinomial model, is chosen as the current guess of the true situation at that site. Given these choices, the error rates are re-estimated. With the new error estimates, the log likelihoods for all possible multinomial models are evaluated again and updated variant calls are produced. This procedure is performed a total of four times. In contrast to the Fixed Ploidy Variant Caller, the Low Frequency Variant Caller does not use a Bayesian approach for choosing among the multinomial models (since there are no meaningful prior site types). Instead, it uses a criterion based on the Akaike Information Criterion (AIC) to choose among the competing multinomial model hypotheses for explaining the data.
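
To make the per-site model choice more concrete, the following Python sketch selects the number of variants $k$ at a single site by comparing penalized log likelihoods. It is an illustration only and not the variant caller's actual implementation. Several things in it are assumptions rather than details from the description above: the plain AIC form $2(k-1) - 2\log L$ (the caller is only said to use a criterion based on AIC), a single uniform substitution error rate `err`, the restriction of candidate models to the $k$ most frequently observed bases, and all function names. The re-estimation of error rates over four rounds is left out; the error rate is held fixed.

```python
import math

BASES = "ACGT"


def emission(observed, true_base, err):
    """P(observed base | true base), assuming uniform substitution errors."""
    return 1.0 - err if observed == true_base else err / 3.0


def fit_frequencies(counts, alleles, err, n_iter=50):
    """EM estimate of the variant frequencies f_i, with err held fixed."""
    total = sum(counts.values())
    freqs = {a: 1.0 / len(alleles) for a in alleles}
    for _ in range(n_iter):
        expected = {a: 0.0 for a in alleles}
        for base, n in counts.items():
            if n == 0:
                continue
            # E-step: share the n reads showing `base` among the alleles.
            weights = {a: freqs[a] * emission(base, a, err) for a in alleles}
            norm = sum(weights.values())
            for a in alleles:
                expected[a] += n * weights[a] / norm
        # M-step: frequencies are the expected read fractions.
        freqs = {a: expected[a] / total for a in alleles}
    return freqs


def log_likelihood(counts, freqs, err):
    """Log likelihood of the observed base counts under one multinomial model."""
    ll = 0.0
    for base, n in counts.items():
        if n == 0:
            continue
        p = sum(f * emission(base, a, err) for a, f in freqs.items())
        ll += n * math.log(p)
    return ll


def call_site(counts, err, max_k=4):
    """Return (alleles, freqs, aic) for the model with the lowest AIC.

    Assumption: the model with k variants uses the k most frequent bases.
    err must be greater than 0 so that every observed base has nonzero
    probability.
    """
    ranked = sorted(BASES, key=lambda b: counts.get(b, 0), reverse=True)
    best = None
    for k in range(1, max_k + 1):
        alleles = ranked[:k]
        freqs = fit_frequencies(counts, alleles, err)
        ll = log_likelihood(counts, freqs, err)
        aic = 2 * (k - 1) - 2 * ll  # k - 1 free frequency parameters
        if best is None or aic < best[2]:
            best = (alleles, freqs, aic)
    return best


if __name__ == "__main__":
    site = {"A": 80, "C": 0, "G": 20, "T": 0}
    alleles, freqs, aic = call_site(site, err=0.01)
    print(alleles, freqs, aic)
```

In this sketch, a model with more variants always fits at least as well as a smaller one, so the penalty term $2(k-1)$ is what keeps a site with a handful of mismatching reads from being explained by an extra variant when sequencing error is a sufficient explanation.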