Fixed Ploidy Variant Detection

Fixed Ploidy Variant Detection is designed for variant calling in resequencing data using a fixed ploidy model (haploid to tetraploid). Ploidy values above tetraploid are not supported, as the number of possible site types becomes too large for feasible estimation and computation. For samples with ploidy values above tetraploid, we recommend Low Frequency or Basic Variant Detection instead.

This tool is designed for short reads. As a result, using it with Oxford Nanopore or PacBio long reads may lead to excessive runtime and memory usage, particularly for (1) whole-genome sequencing or (2) datasets that include regions with very high coverage. In addition, homopolymer errors are more prevalent in long-read sequencing, and many of these errors may be reported as variants. The CLC LightSpeed Module provides solutions suitable for long-read sequencing data.

Fixed Ploidy Variant Detection relies on two models:

A model for the possible 'site types', which depends on the user-specified ploidy parameter. For a diploid organism there are two alleles, and thus the site types are A/A, A/C, A/G, A/T, A/-, C/C, and so on until -/-.
A model for the sequencing errors, which specifies the probability that a given base in the sample is called as a different base in the read. The error model is estimated from the data prior to calling the variants (see The Error Model estimation).
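As an illustration of the first model, the site types for a given ploidy can be enumerated as unordered combinations of the four bases and the gap symbol. This is only a sketch: the function name `site_types` and the `SYMBOLS` constant are hypothetical and are not part of the tool.

```python
from itertools import combinations_with_replacement

# Four nucleotides plus '-' for a gap, as in the site types listed above.
SYMBOLS = "ACGT-"

def site_types(ploidy):
    """List every unordered combination of `ploidy` alleles, e.g. 'A/C'."""
    return ["/".join(alleles)
            for alleles in combinations_with_replacement(SYMBOLS, ploidy)]

# Number of site types for haploid to tetraploid models.
for ploidy in range(1, 5):
    print(ploidy, len(site_types(ploidy)))
```

Under this enumeration the number of site types grows from 5 (haploid) to 15, 35 and 70 (tetraploid), which gives a sense of why higher ploidy values become costly to model.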

Given the estimated error model and the data observed at the site, the algorithm calculates the probability of each of the site types. One of those site types is homozygous for the reference: it stipulates that whatever differences from the reference nucleotide are observed in the reads are due to sequencing errors. The remaining site types stipulate that at least one of the alleles in the sample is different from the reference. The sum of the probabilities for these latter site types is the posterior probability that the sample contains at least one allele that differs from the reference at this site. We refer to this posterior probability as the 'variant probability'.
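The calculation described above can be sketched with a deliberately simplified model. The sketch assumes a single uniform error rate (each base is misread as any of the four other symbols with equal probability) and a flat prior over site types. The tool itself estimates its error model from the data, so the numbers here are illustrative only, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import exp, log

SYMBOLS = "ACGT-"  # four bases plus the gap symbol

def site_type_posteriors(counts, ploidy, error_rate=0.01):
    """Posterior of each site type, assuming a uniform error rate and a flat prior."""
    log_like = {}
    for site_type in combinations_with_replacement(SYMBOLS, ploidy):
        total = 0.0
        for base, n in counts.items():
            # Probability of reading `base`, averaged over the alleles of the site type.
            p = sum((1 - error_rate) if allele == base else error_rate / 4
                    for allele in site_type) / ploidy
            total += n * log(p)
        log_like[site_type] = total
    # Normalise relative to the best site type to avoid numerical underflow.
    top = max(log_like.values())
    weights = {st: exp(ll - top) for st, ll in log_like.items()}
    norm = sum(weights.values())
    return {st: w / norm for st, w in weights.items()}

def variant_probability(posteriors, ref, ploidy):
    """Sum of the posteriors of all site types except homozygous reference."""
    hom_ref = (ref,) * ploidy
    return sum(p for st, p in posteriors.items() if st != hom_ref)

# Reference is T; 12 reads show T and 10 reads show C.
post = site_type_posteriors({"T": 12, "C": 10}, ploidy=2)
print(variant_probability(post, ref="T", ploidy=2))
```

With reads that all match the reference, the same functions return a variant probability close to zero, because the homozygous reference site type explains the data almost entirely.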

The tool has two parameters, 'Ploidy' and 'Required variant probability' (figure 32.5):

The 'ploidy' is the ploidy of the analyzed sample. This determines the site types that are considered in the model. The tool strongly depends on the chosen ploidy, so the validity of the value for the sample should be carefully considered.
For more information about ploidy please see Ploidy and sensitivity.
The 'Required variant probability' is the minimum probability of the 'variant site' required for a variant to be called. Note that this is not the minimum probability of the individual variant. For the Fixed Ploidy Variant Detection tool, if a variant site, and not the variant itself, passes the variant probability threshold, then the variant with the highest probability at that site is reported, even if the probability of that particular variant is less than the threshold. For example, if the required variant probability is set to 0.9, the individual probability of the called variant might be less than 0.9, as long as the probability of the entire variant site is greater than 0.9.
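The distinction between the site probability and the probability of the individual variant can be made concrete with a small sketch. The function `call_site` and the posterior values are hypothetical. The sketch also assumes that the reported variant is chosen among the non-reference site types and that a site probability equal to the threshold passes. Neither detail is stated in this manual.

```python
def call_site(posteriors, hom_ref, required_probability):
    """Apply the threshold to the whole site, then report the most probable variant."""
    # Probability that the site is variant: everything except homozygous reference.
    site_probability = sum(p for st, p in posteriors.items() if st != hom_ref)
    if site_probability < required_probability:
        return None  # the site does not pass, so nothing is reported
    # Most probable non-reference site type (assumption, see the text above).
    best = max((st for st in posteriors if st != hom_ref), key=posteriors.get)
    return best, posteriors[best], site_probability

# Made-up posteriors for a diploid site where the reference is T.
posteriors = {("T", "T"): 0.05, ("C", "C"): 0.55, ("C", "G"): 0.40}
print(call_site(posteriors, hom_ref=("T", "T"), required_probability=0.9))
```

Here the site passes with a probability of about 0.95, and C/C is reported even though its own probability is only 0.55.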

Figure 32.5: The Fixed Ploidy Variant Detection parameters.

Ploidy and sensitivity

The Fixed Ploidy Variant Detection tool has two parameters. The ploidy level you set defines the statistical model that will be used during the variant detection analysis and thereby also defines what will be reported. The number of alleles that a variant may have depends on the value chosen for the ploidy parameter. For example, if you choose a ploidy of 2, then the variant at a site could be a homozygote (two alleles the same in the sample, but different from the reference), or a heterozygote (two alleles different from each other in the sample, with at least one of them different from the reference). If you choose a ploidy of 3, then the variant at a site could be a homozygote (three alleles the same in the sample, but different from the reference), or a heterozygote (the three alleles not all the same in the sample, with at least one of them different from the reference).

The variant probability parameter defines how good the evidence has to be at a particular site for the tool to report a variant at that location. If the site passes this threshold, then the variant with the highest probability at that site will be reported.

The sensitivity of the tool can be altered by changing these parameters. To increase sensitivity, you could decrease the variant probability setting, so that more sites are reported, or increase the ploidy, which adds extra allele types.

For example, suppose a sample with a ploidy of 2 has many Cs and a few Gs at a particular location where the reference is a T. There is strong enough evidence that the actual position differs from the reference, so the variant with the highest probability at this location will be reported. In the diploid model, all the possibilities will have been tested (e.g. A|A, A|C, ... C|C, C|G, C|T, and so on). In this example, C|C had the highest probability, and as long as the relative prevalence of Gs is low compared to Cs (that is, the probability of C|C stays higher than that of C|G), C|C will be reported. But in a case where the sample has a ploidy of 3, the model will test all the triploid possibilities (e.g. A|A|A, A|A|C, A|A|G, ... C|C|A, C|C|C, C|C|G, and so on). For the same site, if the evidence in the reads results in the variant C|C|G having a higher probability than C|C|C, then C|C|G will be the variant reported. This shows that by increasing the ploidy we have increased the sensitivity of the tool, reporting a variant that represents the reads with a G as well as those with a C at that position.
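The effect of the ploidy setting in this example can be reproduced with a toy calculation. The sketch below assumes a uniform 1% error rate and a flat prior over site types, so the most probable site type is the one with the highest likelihood. The read counts (18 C, 2 G) and the function name `best_site_type` are made up for illustration and do not reflect the tool's internal error model.

```python
from itertools import combinations_with_replacement
from math import log

SYMBOLS = "ACGT-"  # four bases plus the gap symbol

def best_site_type(counts, ploidy, error_rate=0.01):
    """Most likely site type under a uniform error rate and a flat prior."""
    def log_likelihood(site_type):
        # Each read base is explained by the average over the alleles of the site type.
        return sum(
            n * log(sum((1 - error_rate) if allele == base else error_rate / 4
                        for allele in site_type) / ploidy)
            for base, n in counts.items()
        )
    best = max(combinations_with_replacement(SYMBOLS, ploidy), key=log_likelihood)
    return "|".join(best)

reads = {"C": 18, "G": 2}  # many Cs, a few Gs; the reference is T
print(best_site_type(reads, ploidy=2))
print(best_site_type(reads, ploidy=3))
```

Under these assumptions the diploid model favours C|C, while the triploid model favours C|C|G, because a site type with one G allele in three is a better match for a small fraction of G reads than one with one G allele in two.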
