Sequencing errors in duplicates

clc_remove_duplicates accounts for sequencing errors when it identifies duplicate reads. This is done by defining the limits of how different reads can be and still be considered duplicates.

Reads can be considered duplicates if:

An illustration of the problem these checks are addressing: If a set of reads contained 100 duplicates of a particular 100bp read, and ther was a random 0.1 % probability of a sequencing error, 10 of those duplicates would, on average, contain an error. Without accounting for this, only the 90 identical reads would be removed, leaving the 10 duplicate reads with errors in the output.