Sequencing errors in duplicates

It is important to take sequencing errors into account when filtering duplicate reads. Consider an example with 100 duplicates of a 100 bp read. If each base has a random 0.1 % probability of a sequencing error, about 10 of these reads are expected to contain an error. If the algorithm only collapsed the 90 identical reads, the 10 reads with sequencing errors would be left behind. This is a big problem, since the correct sequence would then be represented only once, alongside 10 erroneous variants of it.
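The arithmetic in this example can be checked with a short sketch. It assumes the 0.1 % error rate applies independently to each base, and it uses a made-up 100 bp read and hypothetical helper names (expected_reads_with_errors, with_error); it is not the duplicate removal program's own code. It shows roughly how many reads are expected to carry an error, and how many distinct sequences remain if only exact copies are collapsed.

```python
def expected_reads_with_errors(n_reads: int, read_length: int, per_base_error: float) -> float:
    """Expected number of reads that contain at least one sequencing error."""
    # A read is error-free only if every one of its bases is error-free.
    p_read_has_error = 1.0 - (1.0 - per_base_error) ** read_length
    return n_reads * p_read_has_error


def with_error(seq: str, pos: int) -> str:
    """Return seq with the base at pos substituted by a different base."""
    new_base = "A" if seq[pos] != "A" else "C"
    return seq[:pos] + new_base + seq[pos + 1:]


expected = expected_reads_with_errors(n_reads=100, read_length=100, per_base_error=0.001)
print(f"{expected:.1f} of 100 reads are expected to carry an error")  # prints 9.5

# Illustration: 90 exact copies plus 10 copies with one substitution each,
# placed at different positions (an assumption made for this example).
read = "ACGT" * 25  # made-up 100 bp read
reads = [read] * 90 + [with_error(read, 10 * i) for i in range(10)]

# Collapsing only exact duplicates keeps every erroneous variant.
unique_reads = set(reads)
print(len(unique_reads))  # prints 11: the correct read once, plus 10 variants
```

Exact matching therefore reduces 100 reads to 11 rather than to 1, which is why the similarity between reads has to be taken into account.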

To address this issue, the duplicate read removal program accounts for sequencing errors when it identifies duplicate reads. Specifically, reads are considered duplicates if:

Please note that these thresholds for similarity are not enough for reads to be marked as duplicates - they just define how different reads are allowed to be and still be considered duplicates. Rather, the duplicates are identified as explained in Looking for neighbors.



Footnotes

8.1 (... bases) For paired reads, this is only 10 bases.