The clc_split_reads Program (for 454 paired data)
The 454 sequencing technology can produce paired read files where the two paired read fragments are contained within the same read, separated by a linker sequence. The linker may be placed anywhere in the read or even outside the read, so not all the reads will necessarily contain a pair. The clc_split_reads program finds the linker sequence and creates two new files, one with paired reads and one with unpaired reads.
Like adapter regions, linker regions may contain sequencing errors. With this in mind, the clc_split_reads tool identifies likely linker sequences by initially carrying out an alignment between the known linker sequence and each read. The alignment is global in terms of the linker and local in terms of the reads. That is, the whole linker must align to part of the read. Matching positions in these alignments score 1, while each mismatch costs 2 and each gap costs 3. For alignments found at the ends of reads, any non-matching linker bases that extend beyond the end of the read are not penalized.
By default, a region is considered a good enough match to identify a linker if it aligns with a score of at least 10, or of at least the linker length when the linker is shorter than 10 bases.
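The scoring scheme and default threshold described above can be sketched as a small dynamic program. This is an illustrative sketch only, not the clc_split_reads implementation: the names linker_score and min_linker_score and the example linker are hypothetical, and treating overhanging linker bases as free at either end of the read is one reading of the description.

```python
MATCH, MISMATCH, GAP = 1, -2, -3  # scores as stated in the text


def linker_score(linker: str, read: str) -> int:
    """Best score for aligning the linker against part of the read.

    Hypothetical helper. The alignment is global in the linker and local in
    the read, except that linker bases hanging over either end of the read
    are not penalized.
    """
    n = len(read)
    # prev[j]: best score for the linker prefix seen so far, ending at read[:j].
    # Row 0 is all zeros, so the alignment may begin anywhere in the read.
    prev = [0] * (n + 1)
    best = 0  # a linker lying wholly outside the read scores 0
    for i in range(1, len(linker) + 1):
        # Column 0 stays 0: linker bases before the read start are free.
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            diag = prev[j - 1] + (MATCH if linker[i - 1] == read[j - 1] else MISMATCH)
            cur[j] = max(diag, prev[j] + GAP, cur[j - 1] + GAP)
        # The rest of the linker may overhang the read end for free.
        best = max(best, cur[n])
        prev = cur
    # Whole linker aligned, ending anywhere in the read.
    return max(best, max(prev))


def min_linker_score(linker: str, default: int = 10) -> int:
    """Default acceptance threshold: 10, or the linker length if shorter."""
    return min(default, len(linker))


LINKER = "ACGTACGTACGT"  # made-up 12-base linker, for illustration only
exact = linker_score(LINKER, "TTTT" + LINKER + "GGGG")   # full internal copy
one_mismatch = linker_score(LINKER, "TTTTACGTAAGTACGTGGGG")  # one substitution
at_read_end = linker_score(LINKER, "GGGGACGTAC")         # read ends mid-linker
```

With this made-up linker, the exact internal copy reaches the threshold, while a single substitution (11 matches, 1 mismatch) drops the score just below it.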
If the best match to the linker sequence scores between 0 and 9, the read is still split, but it is handled as follows:
- The two parts of the split read are put into the singles list, not the paired list. Because the linker did not match with a good enough score, the identified match location may be wrong. Marking such fragments as a pair would feed unreliable paired distance information into a de novo assembly, which could mislead the assembly process.
- Any linker matches identified at the end of the read are also trimmed. This extra trimming step is carried out because the internal linker match identified might not have been correct.
The following situations are particularly detrimental to de novo assembly, and the clc_split_reads program tries to ensure they are avoided:
- Reads contain some remaining section of the linker sequence
- Reads are categorized as a pair when they should not be
In some cases, the read starts or ends in the middle of the linker. The linker sequence is still removed, and the read is written to the file of unpaired reads. If only a few nucleotides at the start or end of the read overlap with the linker, they are removed as well, even though they may not actually come from the linker; this applies even when only a single nucleotide could come from the linker. The rationale is that it is better to discard a few nucleotides and be sure that no linker sequence is left, since remaining linker sequence is problematic for de novo assembly.
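The end trimming described above can be illustrated with a simplified sketch. It assumes exact matches only, whereas the tool itself uses the error-tolerant alignment described earlier; the function name trim_partial_linker and the example linker are hypothetical.

```python
def trim_partial_linker(read: str, linker: str) -> str:
    """Remove partial linker copies from both ends of a read.

    Simplified, hypothetical sketch: only exact matches are considered.
    """
    # Longest linker prefix that the read ends with, down to a single base.
    for k in range(min(len(read), len(linker)), 0, -1):
        if read.endswith(linker[:k]):
            read = read[:-k]
            break
    # Longest linker suffix that the read starts with, down to a single base.
    for k in range(min(len(read), len(linker)), 0, -1):
        if read.startswith(linker[-k:]):
            read = read[k:]
            break
    return read


LINKER = "ACGTACGTACGT"  # made-up linker, for illustration only
three_bases = trim_partial_linker("GGGGACG", LINKER)  # ends with linker start "ACG"
single_base = trim_partial_linker("GGGGA", LINKER)    # final "A" could be linker
at_start = trim_partial_linker("GTGGGG", LINKER)      # starts with linker end "GT"
```

Note that the single trailing A is removed even though it may be genuine sequence, which mirrors the conservative trade-off described in the text.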
The -m option can be used to specify the minimum read length. Only reads this long or longer will be reported. The default value is 15. This becomes important when the linker is close to the start or end of the read, and only a small fragment is left on one side of the linker. If that small fragment is below the specified minimum length, it is discarded along with the linker. The remaining part of the read will be written to the unpaired file.
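How the minimum length interacts with the score threshold can be sketched as follows. This is a hedged illustration rather than the tool's code: the function split_at_linker and its parameters start, end and score (the linker match coordinates and alignment score) are hypothetical, while the defaults of 10 and 15 are the values given in the text.

```python
from typing import List, Optional, Tuple


def split_at_linker(
    read: str,
    start: int,
    end: int,
    score: int,
    min_score: int = 10,  # default linker score threshold from the text
    min_len: int = 15,    # default minimum read length (the -m option)
) -> Tuple[Optional[Tuple[str, str]], List[str]]:
    """Split a read around a linker match at read[start:end].

    Hypothetical sketch. Returns (pair, singles): pair is set only when both
    fragments are long enough and the linker match is trusted.
    """
    left, right = read[:start], read[end:]
    # Fragments shorter than the minimum length are discarded with the linker.
    kept = [frag for frag in (left, right) if len(frag) >= min_len]
    if len(kept) == 2 and score >= min_score:
        return (left, right), []
    # Weak match or a missing mate: whatever survives is reported unpaired.
    return None, kept


linker = "N" * 12  # placeholder linker region, for illustration only
good = split_at_linker("A" * 20 + linker + "C" * 20, 20, 32, score=12)
weak = split_at_linker("A" * 20 + linker + "C" * 20, 20, 32, score=9)
short = split_at_linker("A" * 5 + linker + "C" * 20, 5, 17, score=12)
```

In the third call the 5-base fragment falls below the minimum length, so only the 20-base fragment is kept, and it is reported as unpaired.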
Further details of the options for this tool are provided in Options for All Programs.