Algorithm details and parameters

The algorithm rests on the following assumption: mapped reads from duplicated DNA fragments share a mapping orientation (that is, they map to the same strand) and, depending on that orientation, share a start coordinate (forward reads), an end coordinate (reverse reads), or both (paired-end reads).
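The assumption above can be read as a rule for choosing which coordinate identifies a candidate set of duplicates. The following Python sketch illustrates that reading only. The MappedRead record, its field names, and the functions duplicate_key and candidate_groups are hypothetical and are not part of the tool.

```python
from collections import defaultdict
from typing import NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical minimal record for a mapped read (illustration only)."""
    name: str
    start: int            # leftmost mapped coordinate
    end: int              # rightmost mapped coordinate
    reverse: bool         # True if the read maps to the reverse strand
    paired: bool = False  # True if the record describes a paired-end fragment


def duplicate_key(read):
    """Return the coordinates that duplicates of this read would share."""
    if read.paired:
        # Paired-end reads are expected to share both coordinates.
        return ("paired", read.reverse, read.start, read.end)
    if read.reverse:
        # Reverse reads are expected to share the end coordinate.
        return ("reverse", read.end)
    # Forward reads are expected to share the start coordinate.
    return ("forward", read.start)


def candidate_groups(reads):
    """Collect the names of reads that could be duplicates of each other."""
    groups = defaultdict(list)
    for read in reads:
        groups[duplicate_key(read)].append(read.name)
    return dict(groups)


reads = [
    MappedRead("a", 100, 150, reverse=False),
    MappedRead("b", 100, 140, reverse=False),
    MappedRead("c", 90, 150, reverse=True),
    MappedRead("d", 100, 150, reverse=True),
]
groups = candidate_groups(reads)
```

In this example, reads a and b fall into the same candidate set because they are forward reads with the same start coordinate, and reads c and d fall into another because they are reverse reads with the same end coordinate.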

The parameter "Maximum representation of minority sequence (percent)" controls when two groups of reads that differ by one mismatch are treated as duplicates. If it is set to 20%, the two groups are duplicates when the smaller group contains at most 20% of the total number of reads in both groups.

The algorithm first makes groups of reads that are exactly equal. Pairs of these groups are then merged if they are considered duplicates. If two groups have no mismatches, they are equal except for read length and are always merged. If two groups have one mismatch, they are merged if the smaller group contains at most the threshold percentage of the total number of reads in both groups. The threshold is halved for each additional mismatch: if the initial threshold is 20%, it is 10% for two mismatches, 5% for three mismatches, 2.5% for four mismatches, and so on.
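The merge rule can be sketched in Python as follows. This is a minimal illustration and not the tool's implementation. The function names threshold_for and should_merge are hypothetical, and the comparison assumes that the smaller group's share is measured against the total number of reads in both groups, as in the description of the parameter.

```python
def threshold_for(mismatches, initial_percent=20.0):
    """Threshold in percent for a given number of mismatches (1 or more).

    The initial threshold applies to one mismatch and is halved for each
    additional mismatch.
    """
    if mismatches < 1:
        raise ValueError("a threshold only applies to one or more mismatches")
    return initial_percent / 2 ** (mismatches - 1)


def should_merge(size_a, size_b, mismatches, initial_percent=20.0):
    """Decide whether two exactly-equal read groups are merged."""
    if mismatches == 0:
        # Groups differ only in read length and are always merged.
        return True
    smaller = min(size_a, size_b)
    # Share of the smaller group, in percent of the reads in both groups.
    share = 100.0 * smaller / (size_a + size_b)
    return share <= threshold_for(mismatches, initial_percent)


# Thresholds for one to four mismatches with an initial threshold of 20%.
thresholds = [threshold_for(m) for m in range(1, 5)]
```

With an initial threshold of 20%, a group of 2 reads and a group of 8 reads that differ by one mismatch are merged, because the smaller group holds 20% of the 10 reads. The same two groups are not merged if they differ by two mismatches, because the threshold is then 10%.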

The output contains one read for each group of reads that remains after the merging steps. The read chosen to represent a group is the read with the highest average base quality score, taken from the biggest of the exactly equal read groups that were merged into it.
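The selection of the output read can be sketched in Python as follows. This is an illustration under stated assumptions: each read is represented as a hypothetical (name, base qualities) pair, the function name choose_representative is invented for the example, and ties are broken by taking the first candidate, which the description above does not specify.

```python
from statistics import mean


def choose_representative(exact_groups):
    """Pick the output read for one merged group.

    exact_groups is a list of exactly-equal read groups; each group is a
    list of (name, base_qualities) pairs.
    """
    # Take the biggest exactly-equal group (first one wins on a tie).
    biggest = max(exact_groups, key=len)
    # Within it, take the read with the highest average base quality.
    return max(biggest, key=lambda read: mean(read[1]))


merged_group = [
    [("r1", [30, 30]), ("r2", [40, 38]), ("r3", [20, 25])],
    [("r4", [41, 41])],
]
name, qualities = choose_representative(merged_group)
```

Here read r4 has the highest average quality overall, but it is not chosen because it belongs to the smaller exactly equal group. The chosen read is r2, the best read in the biggest group.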