The purpose of the tool is to reduce the data set so that it contains only one copy of each duplicated sequence. The challenge is to achieve this without removing identical or almost identical reads that arise from high coverage of certain regions, e.g. repeat regions or highly expressed exons in transcriptome sequencing. The algorithm takes sequencing errors into account (see below).
The approach taken here is based on the raw sequencing data alone, without any knowledge of how the reads map to a reference sequence. This makes it well suited for both de novo assembly and resequencing purposes.
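To illustrate what working from raw reads alone can look like, here is a minimal Python sketch that groups reads by their sequence without any reference. It is not the tool's algorithm: the function name `collapse_exact_duplicates` and the toy reads are hypothetical, only exact matches are collapsed, sequencing errors are ignored, and identical reads that stem from genuine high coverage would be collapsed as well, which is exactly the pitfall described above.

```python
from collections import Counter


def collapse_exact_duplicates(reads):
    """Naive, reference-free collapse of identical reads (illustration only).

    Returns the distinct sequences in first-seen order together with the
    number of copies observed for each sequence.
    """
    counts = Counter(reads)
    # dict.fromkeys keeps the first occurrence of each sequence, in order.
    unique = list(dict.fromkeys(reads))
    return unique, counts


# Toy input (hypothetical reads): one sequence is present three times.
reads = ["ACGTACGT", "ACGTACGT", "CGTACGTT", "ACGTACGT", "GTACGTTA"]
unique, counts = collapse_exact_duplicates(reads)

for seq in unique:
    print(seq, counts[seq])
```

The per-sequence counts are the useful part of this sketch: a sequence that is far more abundant than the other reads around it is a candidate duplicate, whereas uniformly high counts are more consistent with high coverage.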
- Looking for neighbors
- Sequencing errors in duplicates
- Paired data
- Known limitations
- Example of duplicate read removal