The clc_overlap_reads Program

In cases where paired end library preparation methods use a relatively short fragment size, some read pairs will overlap. These overlapping reads can be handled as standard paired-end data.

However, in some situations it can be useful to merge the overlapping pairs into a single read. The benefit is that you get longer reads, and that the quality improves (normally the quality drops towards the end of a read, and by overlapping the ends of two reads, the consensus read now reflects two read ends instead of just one).

This joining of overlapping reads can be done using the clc_overlap_reads program. It aligns the ends of each read within pairs to see if there is evidence that they overlap. If the alignment of these read ends is relatively good, the reads are joined into one read and put in an output file for single (joint) reads. If there is no evidence of the reads overlapping, the original pair of reads is put in an output file for paired reads.

The nucleotides in the overlapping region of a joint read are assigned a quality score of 40 (very high quality) if the two reads agree on the nucleotide. Otherwise the nucleotide with the highest quality is chosen and its quality score is retained. Use the "-f 33" option if the quality offset value in the input fastq files is based on the ASCII character 33. This is the most common situation.

By default, the alignment between the ends of two reads must have a minimum length of 10 positions and a minimum similarity of 90% for the reads to be considered overlapping. These parameters can be adjusted using the various options for the program. The default is that the first read of each pair is a forward read and the other one is a backward read. This can also be adjusted.

Further details of the overlap reads options are provided in Options for All Programs.