CLC Manuals - clcsupport.com

Assessing the quality of the neighborhood bases

The variant detection will look at each position in the mapping to determine if there is an SNV, MNV, replacement, deletion or insertion at this position.

Variants that are adjacent are reported as one. E.g. two SNVs next to each other will be reported as one MNV. Similarly, an SNV and an adjacent deletion will be reported as one replacement. Note that variants are only reported as one when they are supported by the same reads.

The size of insertions and deletions that can be found depend on how the reads are mapped: Only indels that are spanned by reads will be detected. This means that the reads have to align both before and after the indel. In order to detect larger insertions and deletions, please use the InDels and Structural Variation tool instead.

Please note that the variants reported by the structural variation tool can be fed into the local realignment tool to re-adjust the alignment of the reads to span the indels, making some of the indels detected by the structural variation ready to be picked up by the quality-based variant detection. In order to make a qualified assessment, the quality-based variant detection also considers the general quality of the neighboring bases. The Neighborhood radius is used to determine how far away from the current variant this quality assessment should extend, and it can be specified in the upper part of the dialog. Note that at the ends of the read, an asymmetric window of the specified length is used.

If the mapping is based on local alignment of the reads, there will be some reads with un-aligned ends (these ends are faded when you look at the mapping). These unaligned ends are not included in the scanning for variants but they are included in the quality filtering (elaborated below).

In figure 31.2, you can see an example with a neighborhood radius of 5. The current position is high-lighted, and the horizontal high-lighting marks the nucleotides considered for a read with the radius set to 5.

Image snp_explanation_2
Figure 31.2: An example of a neighborhood radius of 5 nucleotides.

For each read and within the given radius,^31.1 the following two parameters are used to assess the quality:

Minimum neighborhood quality. The average quality score of the nucleotides in a read within the specified radius has to exceed this threshold for the base to be included in the calculation for this position (learn more about importing quality scores from different sequencing platforms in Import high-throughput sequencing data).
Maximum gap and mismatch count. The number of gaps and mismatches allowed within the window length of the read. Note that this is excluding the "mismatch" or gap that is considered a potential variant. If there are more gaps or mismatches than this threshold within the radius, this read will not be included in the variant calculation at this position. Unaligned regions (the faded parts of a read) also count as mismatches, even if some of the bases match.

Note that for sequences without quality scores, the quality score settings will have no effect. In this case only the gap/mismatch threshold will be used for filtering low quality reads.

Figure 31.2 shows an example of a read with a mismatch, marked in dark blue. The mismatch is inside the radius of 5 nucleotides.

When looking at a position near the end of a read (like the read at the bottom in figure 31.2), the window will be asymmetric as shown in figure 31.3.

Image snp_explanation_3
Figure 31.3: A window near the end of a read.

Besides looking horizontally within a window for each read, the quality of the central base is also examined: Minimum quality of central base. This is the quality score for the central base, i.e. the bases in the column high-lighted in figure 31.4. Bases with a quality score below this value are not considered in the variant calculation at this position.

Image snp_explanation_1
Figure 31.4: A column of central bases in the neighborhood.

In addition to low-quality reads, reads can also be filtered further:

Ignore non-specific matches: This will ignore all reads that are marked as non-specific matches (see Non-specific matches). This is generally recommended, since there is no way of knowing whether the reads and thereby the variant are mapped to the correct position.
Ignore broken pairs: This will ignore all reads that come from broken pairs (see Mapping paired reads). We recommend to switch on the 'Ignore broken reads' filter in case data included paired-reads. As paired-reads have a larger overall alignment with the reference genome, the alignment is more trustworthy than an alignment with a single read, because the probability that the pair could map somewhere else is lower. However, variants in regions with larger deletions, insertions or rearrangements will be ignored, as broken pairs are often indicators for these kinds of events. Note that if you have mapped a combination of single and paired reads, the reads that were marked as single when running the mapping will still be part of the variant detection, even if you have chosen to ignore broken pairs.

Please note that all the filtering described here means that sometime there is a difference between the coverage of the mapping and the actual counts reported for a variant. The difference would be the number of reads that have been filtered before variant calling.

Footnotes

... radius,^31.1: The radius is defined as the number of positions in the local alignment between that particular read and the reference sequence (for de novo assembly it would be the consensus sequence).)

Browse the manual

Assessing the quality of the neighborhood bases

Footnotes