Gap placement
In the case of insertions or deletions in homopolymeric or repetitive regions, the precise placement of the insertion or deletion cannot be determined from the data. An example is shown in figure 21.17.
Figure 21.17: Three A's in the reference (top) have been replaced by two A's in the reads (shown in red). The gap is placed towards the 5' end, but could have been placed towards the 3' end with an equally good mapping score for the read.
In this example, three A's in the reference (top) have been replaced by two A's in the reads (shown in red). The gap is placed towards the 5' end (left side), but it could have been placed towards the 3' end with an equally good mapping score for the read, as shown in figure 21.18.
Figure 21.18: Three A's in the reference (top) have been replaced by two A's in the reads (shown in red). The gap is placed towards the 3' end, but could have been placed towards the 5' end with an equally good mapping score for the read.
Because either placement of the gap is arbitrary, the goal of the mapper is to place gaps consistently on the same side for all reads.
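The ambiguity can be illustrated with a minimal Python sketch. It is not the read mapper's implementation; the function name `equivalent_deletion_starts` and the short reference sequence are hypothetical and chosen only for illustration. The sketch lists every position at which a deletion of a given length produces exactly the same sequence, which is why the data alone cannot distinguish between the placements.

```python
def equivalent_deletion_starts(reference: str, start: int, length: int) -> list[int]:
    """Return all 0-based start positions at which deleting `length` bases
    from `reference` gives the same sequence as deleting at `start`."""
    target = reference[:start] + reference[start + length:]
    return [
        s
        for s in range(len(reference) - length + 1)
        if reference[:s] + reference[s + length:] == target
    ]


# Hypothetical reference with a tract of three A's at positions 3, 4 and 5.
reference = "GCTAAAGTC"

# A read carrying only two A's is explained equally well by deleting any one of them.
placements = equivalent_deletion_starts(reference, start=3, length=1)
print(placements)       # [3, 4, 5]
print(min(placements))  # leftmost (5') placement
print(max(placements))  # rightmost (3') placement
```

All listed placements give the same read sequence, so a mapper has to pick one by convention.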
Many insertions and deletions in homopolymeric or repetitive regions reported in the public databases dbSNP and 1000Genomes were identified from mappings made with tools such as BWA and Bowtie, which place insertions or deletions at the left side of a homopolymeric tract. To facilitate the comparison of variant results with such public resources, the CLC bio Map Reads to Reference tool, as of version 6.5 of the CLC Cancer Research Workbench, places insertions and deletions in homopolymeric tracts at the left-hand side.
This is a change from earlier versions of the CLC Cancer Research Workbench (version 6.0.5 and earlier), where the CLC bio read mapper placed insertions and deletions in homopolymeric tracts at the right-hand side of the homopolymer, as viewed in the Workbench.
As a consequence, insertion and deletion variants called in homopolymeric regions will be at different positions relative to the reference depending on whether the underlying mapping was run in version 6.0.5 or earlier, or in version 6.5 or later. Thus, if sample variant tracks are to be compared in the CLC Cancer Research Workbench, we recommend re-running mappings so that all samples to be compared are mapped with the same version range: either all with the mapping tool in version 6.5 of the CLC Cancer Research Workbench or higher, or all with version 6.0.5 or lower.
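The following Python sketch shows why variant calls from the two placement conventions do not line up. It is an illustration under stated assumptions, not the Workbench's algorithm: the helper `shift_deletion`, the reference sequence, and the 0-based coordinates are all hypothetical.

```python
def shift_deletion(reference: str, start: int, length: int, leftmost: bool = True) -> int:
    """Slide a deletion as far left (or right) as possible without changing
    the sequence that results from applying it. Positions are 0-based."""
    if leftmost:
        # Moving one base left is neutral if the base entering the deleted
        # span equals the base leaving it.
        while start > 0 and reference[start - 1] == reference[start + length - 1]:
            start -= 1
    else:
        # Moving one base right is neutral under the mirrored condition.
        while start + length < len(reference) and reference[start] == reference[start + length]:
            start += 1
    return start


reference = "GCTAAAGTC"  # hypothetical reference, A tract at positions 3 to 5

# The same one-base deletion reported under each convention.
left_call = shift_deletion(reference, start=4, length=1, leftmost=True)
right_call = shift_deletion(reference, start=4, length=1, leftmost=False)

print(left_call)                # 3, left-hand placement
print(right_call)               # 5, right-hand placement
print(left_call == right_call)  # False: a position-based comparison sees two different variants
```

Because the reported coordinates differ, a position-based comparison of variant tracks treats the same event as two distinct variants unless both mappings used the same placement.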