Sequencing errors in duplicates
clc_remove_duplicates accounts for sequencing errors when it identifies duplicate reads. It does this by setting limits on how much reads can differ and still be considered duplicates.
Reads can be considered duplicates if:
- They share enough common sequence: single reads share a common sequence of at least 20 bases at the start, or at any of four other regions distributed evenly across the read. For paired reads, the length of common sequence required is 10 bases.
- The remainder of the read has an alignment score above 80% of the optimal score, where the optimal score is what a read would score if it aligned perfectly to the consensus for that group of duplicates.
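As a rough illustration of how the two checks above might fit together, the Python sketch below tests one single read against the consensus of a duplicate group. It is not the clc_remove_duplicates implementation: the function names, the exact placement of the five seed regions, and the use of an ungapped match count in place of a real alignment score are all assumptions made for this example.

```python
def seed_starts(read_len, seed_len, n_seeds=5):
    """Start offsets for the seed regions: the read start plus four more,
    spread evenly across the read (assumed placement)."""
    last = read_len - seed_len
    return [round(i * last / (n_seeds - 1)) for i in range(n_seeds)]


def looks_like_duplicate(read, consensus, seed_len=20, min_fraction=0.8):
    """Return True if `read` passes both checks against `consensus`.

    Simplification (assumption): reads have equal length and the
    'alignment score' is the number of matching positions, so the
    optimal score equals the number of positions compared.
    """
    if len(read) != len(consensus) or len(read) < seed_len:
        return False
    for start in seed_starts(len(read), seed_len):
        end = start + seed_len
        # Check 1: an exact common sequence of seed_len bases at this region.
        if read[start:end] != consensus[start:end]:
            continue
        # Check 2: the remainder must score above 80% of the optimal score.
        rest = [(a, b) for i, (a, b) in enumerate(zip(read, consensus))
                if not start <= i < end]
        score = sum(a == b for a, b in rest)
        if score > min_fraction * len(rest):
            return True
    return False


def mutate(seq, positions):
    """Introduce a substitution at each given position (helper for the demo)."""
    bases = list(seq)
    for p in positions:
        bases[p] = "C" if bases[p] == "A" else "A"
    return "".join(bases)


consensus = "ACGT" * 25                                   # a 100 bp read
one_error = mutate(consensus, [50])                       # single sequencing error
error_in_every_seed = mutate(consensus, [10, 30, 50, 70, 90])
unrelated = "TGCA" * 25

print(looks_like_duplicate(one_error, consensus))
print(looks_like_duplicate(error_in_every_seed, consensus))
print(looks_like_duplicate(unrelated, consensus))
```

In this sketch, a read with a single error is still treated as a duplicate because at least one seed region matches exactly and the rest of the read scores well. A read with an error in every seed region fails the first check even though it is 95% identical overall.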
To illustrate the problem these checks address: if a set of reads contained 100 duplicates of a particular 100 bp read, and there was a random 0.1% probability of a sequencing error at each base, about 10 of those duplicates would, on average, contain an error. Without accounting for this, only the roughly 90 identical reads would be removed, leaving the 10 duplicate reads with errors in the output.
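The figure of about 10 reads can be checked with a short calculation. The sketch below assumes that the 0.1% error probability applies independently to each base of the read; the variable names are illustrative only.

```python
read_length = 100        # bases per read
n_duplicates = 100       # copies of the same read
per_base_error = 0.001   # 0.1% chance of an error at each base (assumed independent)

# Probability that a read contains at least one error.
p_read_has_error = 1 - (1 - per_base_error) ** read_length

# Expected number of duplicates that contain at least one error.
expected_with_error = n_duplicates * p_read_has_error

print(f"P(read has an error) = {p_read_has_error:.4f}")
print(f"Expected reads with an error = {expected_with_error:.1f}")
```

Under these assumptions the expected count comes out at roughly 9.5 reads, which the text rounds to 10.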