Read Mapping

The read mapper, clc_mapper, maps a list of sequencing reads to a set of reference sequences, collectively referred to as the reference genome.

For each sequencing read, the read mapper reports the location(s) in the reference genome from which that read most likely originated. The reported location is determined by the following procedure:

  1. Each base position of the read is treated as the start position of a seed candidate, and a search is carried out for the longest stretches of matching bases between the reference genome and the read.

  2. The end positions of the seeds are then determined by extending each seed for as long as it still has a fully matching sequence in the reference index.

  3. Seeds are reduced down to 2/3 of the length of the longest one.

  4. Finally, the seeds are examined in detail using a banded Smith-Waterman algorithm. Seeds from paired reads are examined together.
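The seed search in steps 1 to 3 can be sketched as follows. This is a minimal illustration, not the mapper's actual implementation: it scans the reference by brute force instead of using a reference index, the function names are hypothetical, and step 3 is interpreted here as keeping only seeds that are at least 2/3 as long as the longest seed, which is an assumption about the wording above.

```python
def longest_exact_match(read, start, reference):
    """Return (length, ref_pos) of the longest exact match of read[start:]
    found anywhere in the reference (brute-force scan, first hit wins ties)."""
    best_len, best_pos = 0, -1
    for ref_pos in range(len(reference)):
        length = 0
        while (start + length < len(read)
               and ref_pos + length < len(reference)
               and read[start + length] == reference[ref_pos + length]):
            length += 1
        if length > best_len:
            best_len, best_pos = length, ref_pos
    return best_len, best_pos


def find_seeds(read, reference, min_seed=15):
    """Steps 1 and 2: treat every read position as a seed start and extend
    the seed for as long as it matches the reference exactly."""
    seeds = []
    for start in range(len(read)):
        length, ref_pos = longest_exact_match(read, start, reference)
        if length >= min_seed:  # minimum seed size of 15 bp
            seeds.append((start, ref_pos, length))
    return seeds


def filter_seeds(seeds, fraction=2 / 3):
    """Step 3 (assumed interpretation): keep seeds whose length is at
    least `fraction` of the longest seed's length."""
    if not seeds:
        return []
    longest = max(length for _, _, length in seeds)
    return [seed for seed in seeds if seed[2] >= fraction * longest]
```

Each seed is a tuple of (start position in the read, position in the reference, seed length). The surviving seeds would then be passed to the detailed alignment in step 4.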

The seed lengths in this mapping tool are variable, with a minimum size of 15 bp. The variable seed length makes it possible to identify short seeds whose alignment score is higher than that of longer seeds. This leads to a better mapping of some reads and improves the chance of identifying the optimal mapping, especially for reads with high error rates.
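To illustrate how candidate seeds could be compared by alignment score, here is a minimal sketch of a banded Smith-Waterman scoring function. It only evaluates cells within a fixed distance of the seed's diagonal. The function name, the band width, the linear gap cost, and the score values are all assumptions chosen for illustration, not documented settings of the mapper.

```python
def banded_smith_waterman(read, ref, diagonal, band=3,
                          match=1, mismatch=-2, gap=-3):
    """Best local alignment score of `read` against `ref`, restricted to
    cells where |(j - i) - diagonal| <= band.

    `diagonal` is the seed's reference position minus its read position.
    Cells outside the band are treated as score 0 (a fresh local start).
    """
    best = 0
    prev = {}  # scores of the previous row, keyed by reference column j
    for i in range(1, len(read) + 1):
        curr = {}
        lo = max(1, i + diagonal - band)
        hi = min(len(ref), i + diagonal + band)
        for j in range(lo, hi + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            score = max(
                0,                           # start a new local alignment
                prev.get(j - 1, 0) + sub,    # match or mismatch
                prev.get(j, 0) + gap,        # read base against a gap
                curr.get(j - 1, 0) + gap,    # reference base against a gap
            )
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best
```

Under these assumptions, each seed would be scored by calling the function with that seed's diagonal, and the seed with the highest score would give the reported location. A short seed can win this comparison if the region around it aligns with fewer mismatches and gaps than the region around a longer seed.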


