The clc_sort_pairs Program

The clc_sort_pairs program takes two SOLiD read files, or two Ion Torrent read files, as input and writes two output files: one containing paired reads and one containing unpaired reads. The reads are sorted by read name. This tool is necessary because, for these data types, reads are paired by name rather than by position within the file, as they are for Illumina data. The reads therefore need to be sorted, and the paired reads separated from the single reads.
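The idea of pairing by name can be illustrated with a minimal Python sketch. This is not the implementation used by clc_sort_pairs; the function `split_pairs`, its `pair_key` parameter, and the in-memory dictionaries are all hypothetical and chosen only for illustration. It assumes each read name is unique within its file and that two reads belong together when `pair_key` gives the same value for both names.

```python
def split_pairs(reads_a, reads_b, pair_key=lambda name: name):
    """Split two {read_name: sequence} mappings into pairs and singles.

    Hypothetical helper for illustration only. Two reads are treated as a
    pair when pair_key(name) is identical for a read in each file.
    """
    by_key_a = {pair_key(name): (name, seq) for name, seq in reads_a.items()}
    by_key_b = {pair_key(name): (name, seq) for name, seq in reads_b.items()}

    shared = by_key_a.keys() & by_key_b.keys()

    # Paired reads, sorted by the shared key so mates sit side by side.
    paired = [(by_key_a[k], by_key_b[k]) for k in sorted(shared)]

    # Reads whose mate is missing from the other file.
    unpaired = [by_key_a[k] for k in sorted(by_key_a.keys() - shared)]
    unpaired += [by_key_b[k] for k in sorted(by_key_b.keys() - shared)]
    return paired, unpaired


# Example with SOLiD-style names: the key is the name without its final tag.
file_1 = {"1_2_3_F3": "T0120", "1_2_4_F3": "T3321"}
file_2 = {"1_2_3_R3": "G1102", "9_9_9_R3": "G0013"}
pairs, singles = split_pairs(file_1, file_2, lambda n: n.rsplit("_", 1)[0])
```

In this example `pairs` holds the two reads named 1_2_3, and `singles` holds the two reads that have no mate in the other file.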

To properly handle the input sequence data, the read names within the files must match certain patterns. These are:

Ion Torrent:

[anytextinfo]:[number]:[number]

SOLiD:

[number]_[number]_[number]_[R3/F3/F5/F5-P2/F5-BC]
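To check read names before running the tool, the two patterns above could be written as regular expressions. The following Python sketch is one possible reading of the patterns, not the tool's own validation logic: it assumes that [number] means one or more decimal digits and that [anytextinfo] means any non-empty text. The names `ION_TORRENT_NAME`, `SOLID_NAME`, and `classify_read_name` are hypothetical.

```python
import re

# Assumed reading of [anytextinfo]:[number]:[number]
ION_TORRENT_NAME = re.compile(r"^.+:\d+:\d+$")

# Assumed reading of [number]_[number]_[number]_[R3/F3/F5/F5-P2/F5-BC]
SOLID_NAME = re.compile(r"^\d+_\d+_\d+_(R3|F3|F5-P2|F5-BC|F5)$")


def classify_read_name(name):
    """Return 'solid', 'ion_torrent', or None if neither pattern matches."""
    if SOLID_NAME.match(name):
        return "solid"
    if ION_TORRENT_NAME.match(name):
        return "ion_torrent"
    return None
```

For example, `classify_read_name("1_22_333_F3")` returns `"solid"`, while a name that fits neither pattern returns `None`.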

In the case of SOLiD data, reads in one file of the pair should end in one of the patterns above, and reads in the other file of the pair should end with one of the other patterns. For example, one file might contain reads with names ending in R3, while reads in the other file have names ending in F3, or one file might contain reads with names ending in F3, while reads in the other file have names ending in F5, and so on.

Reads within a given file must be named consistently. That is, if a read has a name ending with a particular pattern, for example F5, then all reads in that file must have names ending in F5.
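The consistency requirement can be checked ahead of time with a short script. The Python sketch below is an illustration under assumptions, not part of clc_sort_pairs: the function `file_tag` is hypothetical, and it expects the read names of one SOLiD file to have already been extracted into a list.

```python
import re

# Tags taken from the SOLiD name pattern; the regex itself is an assumption.
SOLID_TAG = re.compile(r"_(R3|F3|F5-P2|F5-BC|F5)$")


def file_tag(read_names):
    """Return the single tag shared by all read names in one file.

    Raises ValueError if a name has no recognised tag, if the names mix
    several tags, or if no names were supplied.
    """
    tags = set()
    for name in read_names:
        match = SOLID_TAG.search(name)
        if match is None:
            raise ValueError(f"no recognised tag in read name: {name}")
        tags.add(match.group(1))
    if len(tags) != 1:
        raise ValueError(f"expected exactly one tag, found: {sorted(tags)}")
    return tags.pop()
```

Calling `file_tag` on each input file gives the two tags, which can then be compared against the combinations that are not allowed.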

Please note that in the case of SOLiD data, the following combinations for the read names in a pair of files are not allowed:

F3 + F3
F5 + F5
R3 + R3
R3 + F5
F5 + R3

As mentioned in the Input Data section earlier in the manual, if a read in a .csfasta format file contains one or more . symbols, its full sequence will be converted to contain only N characters.
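The effect of this rule can be pictured with a small Python sketch. It is an illustration only: the function `mask_read` is hypothetical, and the assumption that the result has the same number of characters as the input string (including the leading primer base of a color-space read) is not confirmed by this manual.

```python
def mask_read(sequence):
    """Replace the whole read with N characters if it contains a '.' symbol.

    Assumption: the output keeps the same number of characters as the input.
    """
    if "." in sequence:
        return "N" * len(sequence)
    return sequence
```

Under these assumptions, a read such as `T0120.3` becomes a run of seven N characters, while `T01203` is returned unchanged.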

Further details are provided in Options for All Programs.