It is important to take sequencing errors into account when filtering duplicate reads. Imagine an example with 100 duplicates of a 100 bp read. With a 0.1 % probability of a sequencing error at each base, roughly 10 of these reads will contain at least one error. If the algorithm removed only exact copies, the 90 identical reads would be collapsed to a single representative, leaving 11 reads of which 10 contain sequencing errors. This is a big problem, since the correct sequence is then represented only once.
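The arithmetic behind "roughly 10 of 100 reads" can be checked directly. This small sketch assumes independent per-base errors (an assumption for illustration, not stated in the text):

```python
def prob_read_has_error(read_length, per_base_error_rate):
    """Probability that a read contains at least one sequencing error,
    assuming errors occur independently at each base."""
    return 1 - (1 - per_base_error_rate) ** read_length

# For a 100 bp read with a 0.1 % per-base error rate:
p = prob_read_has_error(100, 0.001)
# p is about 0.095, so out of 100 duplicate reads roughly 10 are
# expected to carry at least one error, matching the example above.
```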
To address this issue, the duplicate read removal program accounts for sequencing errors when it identifies duplicate reads. Specifically, reads are considered duplicates if:
- they share a common sequence of at least 20 bases (10 bases for paired reads) at the beginning of the read, or at any of four other regions distributed evenly across the read, and
- the rest of the read has an alignment score above 80 % of the optimal score, where the optimal score is the score a read would get if it aligned perfectly to the consensus sequence for a group of duplicates.
Please note that meeting these similarity thresholds is not in itself enough for reads to be marked as duplicates; the thresholds only define how different reads are allowed to be and still be considered duplicates. The duplicates themselves are identified as explained in Looking for neighbors.
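The two criteria above can be sketched as follows. This is a simplified illustration, not the program's actual implementation: the seed positions, the equal-length assumption, the simple per-base match score in place of a proper alignment score, and the names `seed_positions` and `is_duplicate` are all assumptions made for the example.

```python
SEED_LEN = 20          # shared-sequence length; 10 for paired reads
SCORE_THRESHOLD = 0.8  # fraction of the optimal (perfect-match) score

def seed_positions(read_len, seed_len=SEED_LEN):
    """Start of the read, plus four other regions spread evenly across it."""
    step = (read_len - seed_len) // 4
    return [0] + [i * step for i in range(1, 5)]

def is_duplicate(read, consensus):
    """Sketch of the two criteria: a shared seed at one of five regions,
    then a simple match score on the rest of the read against the group
    consensus (the real program uses a proper alignment score)."""
    if len(read) != len(consensus):
        return False  # simplification: assume equal read lengths
    for pos in seed_positions(len(read)):
        if read[pos:pos + SEED_LEN] == consensus[pos:pos + SEED_LEN]:
            # Score the rest of the read, excluding the matched seed.
            rest_r = read[:pos] + read[pos + SEED_LEN:]
            rest_c = consensus[:pos] + consensus[pos + SEED_LEN:]
            score = sum(a == b for a, b in zip(rest_r, rest_c))
            optimal = len(rest_r)  # score of a perfect alignment
            return score > SCORE_THRESHOLD * optimal
    return False
```

For example, a 100 bp read that matches the consensus everywhere except a handful of bases outside the first 20 would pass: the seed at position 0 matches, and the remaining 80 bases score well above 80 % of the perfect-match score.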