Looking for neighbors

clc_remove_duplicates works directly on the sequencing reads. It identifies duplicates by looking for "neighboring" reads, that is, reads that share most of their sequence with the read in question but start at a small offset. These neighbors are used to judge whether coverage is generally high for that sequence. If it is not, the read in question is marked as a duplicate.
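
The neighbor idea can be illustrated with a small Python sketch. This is not the implementation used by clc_remove_duplicates: the function names, the thresholds (max_offset, min_overlap, min_neighbors) and the exact-match comparison are all assumptions made for illustration. The sketch treats a repeated read as a duplicate only when few shifted neighbors support high coverage, and it compares only the overlapping part of two reads, so reads of different lengths can still be neighbors.

```python
from collections import Counter


def shifted_match(a, b, offset, min_overlap):
    """True if b matches a when b starts `offset` bases into a.

    Only the overlapping region is compared, so a and b may differ in length.
    """
    overlap = min(len(a) - offset, len(b))
    if overlap < min_overlap:
        return False
    return a[offset:offset + overlap] == b[:overlap]


def is_neighbor(a, b, max_offset, min_overlap):
    """True if a and b share their sequence at a small, non-zero offset."""
    return any(
        shifted_match(a, b, k, min_overlap) or shifted_match(b, a, k, min_overlap)
        for k in range(1, max_offset + 1)
    )


def find_duplicates(reads, max_offset=3, min_overlap=8, min_neighbors=2):
    """Return indices of reads marked as duplicates (hypothetical heuristic).

    A sequence seen more than once is considered suspicious. If it has fewer
    than `min_neighbors` neighboring reads, coverage is assumed to be low and
    every copy after the first is marked as a duplicate.
    """
    counts = Counter(reads)
    low_coverage = set()
    for seq, n in counts.items():
        if n < 2:
            continue  # a read seen once cannot be a duplicate
        neighbors = sum(
            is_neighbor(seq, other, max_offset, min_overlap)
            for other in counts
            if other != seq
        )
        if neighbors < min_neighbors:
            low_coverage.add(seq)

    seen = set()
    duplicate_indices = []
    for i, read in enumerate(reads):
        if read in low_coverage:
            if read in seen:
                duplicate_indices.append(i)  # keep the first copy only
            else:
                seen.add(read)
    return duplicate_indices
```

With these assumed defaults, a read that occurs twice with no shifted neighbors has its second copy marked, while a read that occurs twice among several overlapping reads from the same region is left alone, since the repeat is plausibly explained by high coverage.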

On some sequencing platforms, such as 454, reads vary in length. The algorithm takes this into account.