The clc_overlap_reads Program
When paired-end library preparation methods use a relatively short fragment size, some read pairs will overlap. These overlapping reads can be handled as standard paired-end data. However, in some situations it can be useful to merge the overlapping pairs into a single read. The benefit is that you get longer reads and that the quality improves: quality normally drops towards the end of a read, and by overlapping the ends of two reads, the consensus read reflects two read ends instead of just one.
This joining of overlapping reads can be done using the clc_overlap_reads program. It aligns the ends of each read within pairs to see if there is evidence that they overlap. If the alignment of these read ends is relatively good, the reads are joined into one read and put in an output file for single (joint) reads. If there is no evidence of the reads overlapping, the original pair of reads is put in an output file for paired reads.
The nucleotides in the overlapping region of a joint read are assigned a quality score of 40 (very high quality) if the two reads agree on the nucleotide. Otherwise, the nucleotide with the higher quality score is chosen and its quality score is retained.
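The scoring rule for the overlapping region can be illustrated with a minimal Python sketch. This is not the program's implementation: the function name merge_overlap is hypothetical, the two segments are assumed to be already aligned without gaps and given in the same orientation, and the tie-breaking choice (the first read wins when the scores are equal) is an assumption, since the text does not say what happens in that case.

```python
AGREE_QUALITY = 40  # score assigned when both reads agree on the nucleotide


def merge_overlap(bases1, quals1, bases2, quals2):
    """Merge two aligned, equal-length overlap segments (hypothetical helper).

    bases1/bases2 are strings, quals1/quals2 are lists of integer scores.
    Returns the consensus string and its list of quality scores.
    """
    if not (len(bases1) == len(quals1) == len(bases2) == len(quals2)):
        raise ValueError("aligned segments must have equal length")

    merged_bases = []
    merged_quals = []
    for b1, q1, b2, q2 in zip(bases1, quals1, bases2, quals2):
        if b1 == b2:
            # Agreement: keep the nucleotide and assign the fixed high score.
            merged_bases.append(b1)
            merged_quals.append(AGREE_QUALITY)
        elif q1 >= q2:
            # Disagreement: keep the better-scoring nucleotide and its score.
            # Assumption: ties go to the first read.
            merged_bases.append(b1)
            merged_quals.append(q1)
        else:
            merged_bases.append(b2)
            merged_quals.append(q2)
    return "".join(merged_bases), merged_quals


if __name__ == "__main__":
    print(merge_overlap("ACGT", [30, 20, 10, 35], "ACTT", [25, 22, 30, 12]))
```

In this example the third position disagrees (G with score 10 against T with score 30), so the T is kept with score 30, while the three agreeing positions all receive a score of 40.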
By default, the alignment between the ends of two reads must have a minimum length of 10 positions and a minimum similarity of 90% for the reads to be considered overlapping. These parameters can be adjusted using the various options for the program. The default is that the first read of each pair is a forward read and the other one is a backward read. This can also be adjusted.
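To make the two default thresholds concrete, the following Python sketch tests whether a pair overlaps. It is a simplification under stated assumptions, not a description of how clc_overlap_reads works internally: it compares the read ends without gaps, it reverse-complements the backward read before comparing, and it reports the longest candidate overlap that passes both thresholds. The function names find_overlap and reverse_complement are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_overlap(forward, backward, min_length=10, min_similarity=0.9):
    """Return the overlap length between the end of the forward read and the
    start of the reverse-complemented backward read, or 0 if no candidate
    reaches min_length positions with at least min_similarity identity.

    Assumption: ungapped comparison, longest passing overlap preferred.
    """
    backward_rc = reverse_complement(backward)
    longest = min(len(forward), len(backward_rc))
    for length in range(longest, min_length - 1, -1):
        tail = forward[len(forward) - length:]
        head = backward_rc[:length]
        matches = sum(1 for a, b in zip(tail, head) if a == b)
        if matches / length >= min_similarity:
            return length
    return 0


if __name__ == "__main__":
    # A 32 bp fragment sequenced with 22 bp reads from both ends,
    # so the two reads share the middle 12 positions.
    fragment = "TTTTTTTTTT" + "ACGTACGTACGT" + "GGGGGGGGGG"
    forward = fragment[:22]
    backward = reverse_complement(fragment[10:])
    print(find_overlap(forward, backward))
```

Raising min_length or min_similarity in this sketch makes joining more conservative, which mirrors the effect of adjusting the corresponding program parameters.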
In cases where reads contain an adapter sequence at the read ends, the adapter sequence needs to be removed (see figure 8.3). The option -autoadapter enables free gaps at the read ends and leaves out unaligned read ends, i.e. any adapter sequences will not be included in the result.
Further details of the overlap reads options are provided in Options for All Programs.
Figure 8.21: Illustration of two reads originating from a fragment that is shorter than the read length. Both reads contain adapter sequences at the read ends, which need to be removed when overlapping the reads.