Deduplication
Deduplication can be used to collapse reads that likely represent the same original DNA fragment.
Reads are deduplicated through the following steps:
- Read pairs whose outer positions are identical are considered duplicates. The outer positions are usually the 5' ends of R1 and R2, including unaligned bases.
- For each group of duplicate reads, a consensus sequence is calculated:
  - At conflicting positions, the most common base is included in the consensus read.
  - If the conflicting bases are equally represented, the consensus base is determined in one of two ways:
    - When one of the bases at the conflicting position is identical to the reference symbol, the reference symbol is included in the consensus read.
    - When none of the bases at the conflicting position is identical to the reference symbol, an N is inserted in the consensus read.
- In the read mapping, the duplicate read pairs are replaced with the consensus sequence.
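The steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation. The names `group_by_outer_positions`, `consensus_base`, and `consensus_sequence`, as well as the dictionary keys `chrom`, `r1_start`, `r2_start`, and `seq`, are hypothetical. The sketch assumes that all reads in a group are ungapped, have the same length, and cover the same reference window, and that in a tie only the tied bases are compared with the reference symbol.

```python
from collections import Counter, defaultdict


def group_by_outer_positions(pairs):
    """Group read pairs that share identical outer positions.

    Each pair is a dict with hypothetical keys: 'chrom', 'r1_start' and
    'r2_start' (the 5' ends of R1 and R2) and 'seq' (the sequence).
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["r1_start"], pair["r2_start"])
        groups[key].append(pair["seq"])
    return dict(groups)


def consensus_base(bases, ref_base):
    """Pick the consensus base for one position of a duplicate group."""
    counts = Counter(bases)
    top = max(counts.values())
    winners = [base for base, n in counts.items() if n == top]
    if len(winners) == 1:
        # A single most common base (or no conflict at all).
        return winners[0]
    # Tie: assumption that only the tied bases are checked against the reference.
    if ref_base in winners:
        return ref_base
    return "N"


def consensus_sequence(seqs, ref):
    """Build the consensus over equal-length, ungapped sequences."""
    columns = zip(*seqs)  # one tuple of bases per position
    return "".join(consensus_base(col, r) for col, r in zip(columns, ref))
```

For example, `consensus_sequence(["ACGT", "ACGT", "ACTT"], "ACGT")` returns `"ACGT"` because G is the most common base at the third position. With only two reads, `consensus_sequence(["ACGT", "ACTT"], "ACGT")` also returns `"ACGT"` because the tie is resolved by the reference symbol, whereas `consensus_sequence(["ACGT", "ACTT"], "ACAT")` returns `"ACNT"`.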
Q-scores are assigned to the bases in the consensus read as follows:
- The Q-score assigned to a base in the consensus read is calculated as the average of the quality scores of the underlying bases.
- At conflicting positions where there is a most common base, the Q-score assigned to the base in the consensus read is calculated as the average of the quality scores of the bases that carry the winning nucleotide.
- If the conflicting bases are equally represented, the consensus base will be either the reference symbol or N. In both cases, no Q-score is assigned to the base.
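The Q-score rules above can be illustrated with a short Python sketch for a single position. The function name `consensus_quality` is hypothetical, and the sketch assumes that "average" means the plain, unrounded arithmetic mean of the Phred scores. How the tool rounds or stores the value is not stated here. A missing Q-score is represented as `None`.

```python
from collections import Counter


def consensus_quality(bases, quals, ref_base):
    """Return (consensus_base, q_score) for one position of a duplicate group.

    bases: the base observed in each duplicate read at this position.
    quals: the matching Phred quality scores, in the same order.
    """
    counts = Counter(bases)
    top = max(counts.values())
    winners = [base for base, n in counts.items() if n == top]
    if len(winners) > 1:
        # Tie: reference symbol or N, and no Q-score is assigned.
        base = ref_base if ref_base in winners else "N"
        return base, None
    winner = winners[0]
    # Average only the scores of the bases that match the winning base.
    # Without a conflict this is the average over all underlying bases.
    winning_quals = [q for b, q in zip(bases, quals) if b == winner]
    return winner, sum(winning_quals) / len(winning_quals)
```

For example, `consensus_quality("GGT", [30, 20, 40], "G")` returns `("G", 25.0)` because only the two G bases contribute to the average, while `consensus_quality("GT", [30, 40], "G")` returns `("G", None)` because the tie is resolved by the reference symbol.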
Because deduplication relies on read pairs from the same fragment having identical outer positions, quality trimming can reduce the number of reads that are deduplicated: trimming may shift the ends of some reads, so that true duplicates no longer share the same outer positions.