The clc_remove_duplicates

The duplicate read removal tool filters out duplicate reads. It is specifically designed to handle duplicate reads arising from PCR amplification, which can have a negative effect on downstream analysis because a given sequence is represented in artificially high numbers.

The tool reduces the data set so that each duplicated sequence is represented by a single copy. The challenge is to achieve this without removing identical or almost identical reads that arise legitimately from high coverage of certain regions, e.g. repeat regions or highly expressed exons in transcriptome sequencing. The algorithm takes sequencing errors into account (see below).

The approach taken here is based on the raw sequencing data alone, without any knowledge of how the reads map to a reference sequence. The tool is therefore suitable for both de novo assembly and resequencing purposes.
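
To make the basic idea concrete, the following is a minimal Python sketch of reference-free duplicate removal. It is not the algorithm used by clc_remove_duplicates: it only collapses exact duplicates, ignores sequencing errors and paired data, and does nothing to protect legitimately high-coverage regions. The function name `collapse_exact_duplicates` and the example reads are hypothetical and chosen purely for illustration.

```python
from typing import Iterable, List


def collapse_exact_duplicates(reads: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each read sequence, preserving input order.

    Illustrative only: exact string matching, no tolerance for sequencing
    errors, and no reference sequence involved.
    """
    seen = set()   # sequences already encountered
    unique = []    # one copy of each distinct sequence
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


# Hypothetical raw reads; "ACGTAC" appears three times.
reads = ["ACGTAC", "TTGACA", "ACGTAC", "ACGTAC", "GGCATA"]
unique_reads = collapse_exact_duplicates(reads)
print(unique_reads)
```

A naive filter like this would also discard identical reads that come from genuinely high coverage, which is exactly the problem the tool is designed to avoid.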

Subsections

Looking for neighbors
Sequencing errors in duplicates
Paired data
Known limitations
Example of duplicate read removal
