Probabilistic variant detection

The Probabilistic Variant Caller identifies variants in a sample by using a probabilistic model built from read mapping data. The tool can detect variants in data sets from haploid (e.g. bacteria), diploid (e.g. human) and polyploid (e.g. cancer and higher plants) samples with high sensitivity and specificity.

The algorithm combines a Bayesian model with a Maximum Likelihood approach. The Maximum Likelihood approach is used to calculate the prior and error probabilities for the Bayesian model.

Parameters are calculated on the mapped reads alone without considering the reference sequence. At every position in the genome, the combination of nucleotides observed in the reads is used to determine the probability of each combination of alleles (e.g. homozygous A/A, heterozygous A/G, heterozygous A/C etc.). These probabilities are then used to find the most likely allele combination (e.g. A/G) for each position. The most likely combination is compared with the reference allele, and if it differs from the reference sequence it can be called as a variant. Please refer to the white paper at http://www.clcbio.com/white-paper/ for more information including benchmarks.
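The per-position reasoning described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names are hypothetical, the error rate is a fixed assumed value, and the prior over allele combinations is assumed to be uniform, whereas the tool estimates prior and error probabilities from the mapped reads with its Maximum Likelihood step.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"


def genotype_posteriors(observed, error_rate=0.01, ploidy=2):
    """Return {allele combination: posterior} for the bases seen at one position.

    Assumptions (not taken from the tool): a fixed error rate, a uniform
    prior over allele combinations, and independent reads.
    """
    log_likelihoods = {}
    for genotype in combinations_with_replacement(BASES, ploidy):
        total = 0.0
        for base in observed:
            # A read is assumed to come from any allele copy with equal chance;
            # a miscalled base is assumed equally likely to be any other base.
            p = sum(
                (1.0 - error_rate) if allele == base else error_rate / 3.0
                for allele in genotype
            ) / ploidy
            total += log(p)
        log_likelihoods["".join(genotype)] = total

    # With a uniform prior the posterior is the normalised likelihood.
    peak = max(log_likelihoods.values())
    weights = {g: exp(ll - peak) for g, ll in log_likelihoods.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


def call_position(observed, reference_base, **kwargs):
    """Pick the most likely allele combination, then compare it with the reference."""
    posteriors = genotype_posteriors(observed, **kwargs)
    best = max(posteriors, key=posteriors.get)
    is_variant = any(allele != reference_base for allele in best)
    return best, posteriors[best], is_variant
```

For example, call_position("AAAAAGGGGG", "A") selects the heterozygous combination A/G and flags the position as a variant, while ten reads that all show A at a position with reference A give A/A and no variant. Note that the reference base is only used in the final comparison, not in the probability calculation.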

Variants that are adjacent are reported as one. For example, two SNVs next to each other will be reported as one MNV. Similarly, an SNV and an adjacent deletion will be reported as one replacement. Note that variants are only reported as one when they are supported by the same reads.
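The merging rule can be sketched in Python as follows. This is an illustration only: the Variant class and merge_adjacent function are hypothetical, and "supported by the same reads" is interpreted here as the two variants having identical sets of supporting read names, which is an assumption rather than the tool's documented criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    position: int      # position of the first affected reference base
    ref: str           # reference allele ("" for an insertion)
    alt: str           # observed allele ("" for a deletion)
    reads: frozenset   # names of the reads supporting the variant

    @property
    def kind(self):
        if not self.ref:
            return "Insertion"
        if not self.alt:
            return "Deletion"
        if len(self.ref) == len(self.alt):
            return "SNV" if len(self.ref) == 1 else "MNV"
        return "Replacement"


def merge_adjacent(variants):
    """Merge directly adjacent variants that share the same supporting reads."""
    merged = []
    for variant in sorted(variants, key=lambda v: v.position):
        last = merged[-1] if merged else None
        adjacent = (
            last is not None
            and last.position + len(last.ref) == variant.position
        )
        # Assumption: "same reads" means identical supporting read sets.
        if adjacent and last.reads == variant.reads:
            merged[-1] = Variant(
                last.position,
                last.ref + variant.ref,
                last.alt + variant.alt,
                last.reads,
            )
        else:
            merged.append(variant)
    return merged
```

With this sketch, two SNVs at positions 10 and 11 with the same supporting reads become one MNV, and an SNV followed directly by a one-base deletion becomes one replacement. If the supporting reads differ, the variants stay separate.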

The size of the insertions and deletions that can be found depends on how the reads are mapped: only indels that are spanned by reads will be detected. This means that the reads have to align both before and after the indel. In order to detect larger insertions and deletions, please use the InDels and Structural Variation tool instead.
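The spanning requirement can be expressed as a simple coordinate check. The sketch below is illustrative only and uses hypothetical function names; it assumes half-open reference coordinates, where a read covers [read_start, read_end) and an indel affects [indel_start, indel_end), with indel_start equal to indel_end for an insertion.

```python
def spans_indel(read_start, read_end, indel_start, indel_end):
    """True if the read aligns to at least one base before and after the indel.

    Coordinates are half-open reference intervals (an assumption of this sketch).
    """
    return read_start < indel_start and read_end > indel_end


def count_spanning_reads(reads, indel_start, indel_end):
    """Count the (start, end) read alignments that span the indel."""
    return sum(
        spans_indel(start, end, indel_start, indel_end) for start, end in reads
    )
```

A read that starts or ends inside the indel, or exactly at its boundary, does not count as spanning it, which is why long indels relative to the read length cannot be detected this way.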

Please note that the variants reported by the structural variation tool can be fed into the local realignment tool to re-adjust the alignment of the reads so that they span the indels. This makes some of the indels detected by the structural variation tool ready to be picked up by the probabilistic variant detection.

Note: In the current version, the probabilistic variant detection is not designed to detect minor variants (like rare alleles) with a frequency of less than 15%. If you are expecting an allele frequency of less than 15%, we recommend setting a higher ploidy level during your analysis or, alternatively, using the quality-based variant detection algorithm (see Quality-based variant detection) with a post-filtering step for average base quality and forward-reverse read balance.

Image SNP-example
Figure 26.12: An example of a heterozygous variant surrounded by a lot of noise from sequencing errors.


