Overview of color space mapping

Color space mapping is done using the legacy mapping tool, as released in version 3.x of the CLC Assembly Cell. This tool is based on a four-stage seeding approach and a seed index representing the reference genome. The mapper is able to ignore incorrect colors without obscuring the rest of the read alignment. The mapping algorithm iterates over the input reads, mapping each read individually by applying the following procedure:

  1. Seed sequences of 30 nucleotides each are sampled from every third position of the input read.
  2. These seeds are looked up in the index and the resulting candidate alignment locations are examined using a banded Smith-Waterman alignment.
  3. If no valid results are found, the mapping is retried three more times with shorter seeds sampled from every individual position of the read.

As soon as any of these four ordered attempts yields one or more valid mapping results, the procedure stops and the highest-scoring mapping is reported. If multiple mappings share the same highest score, one is chosen at random.
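
The procedure above can be sketched in a few lines of Python. This is an illustrative sketch only, not the mapper's actual implementation: it works on plain nucleotide strings rather than colors, the seed lengths used for the three retries (24, 18 and 12) are assumed values because the text only says the seeds get shorter, and a simple ungapped match/mismatch score stands in for the banded Smith-Waterman alignment. All function names are hypothetical.

```python
import random
from collections import defaultdict

# (seed_length, sampling_step) for the four ordered attempts.
# Only the first pair (30, every third position) comes from the text;
# the shorter seed lengths are assumptions made for this sketch.
ATTEMPTS = [(30, 3), (24, 1), (18, 1), (12, 1)]


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_starts(read, index, k, step):
    """Look up sampled seeds and convert hits to candidate read start positions."""
    starts = set()
    for offset in range(0, len(read) - k + 1, step):
        for pos in index.get(read[offset:offset + k], ()):
            starts.add(pos - offset)
    return starts


def ungapped_score(read, reference, start, match=1, mismatch=-2):
    """Stand-in for banded Smith-Waterman: score an ungapped placement."""
    if start < 0 or start + len(read) > len(reference):
        return None
    window = reference[start:start + len(read)]
    return sum(match if a == b else mismatch for a, b in zip(read, window))


def map_read(read, reference, min_score, attempts=ATTEMPTS, rng=random):
    """Return (start, score) of the best mapping, or None if nothing is valid."""
    for k, step in attempts:
        # A real mapper would build its index once, not once per read.
        index = build_index(reference, k)
        valid = []
        for start in sorted(candidate_starts(read, index, k, step)):
            score = ungapped_score(read, reference, start)
            if score is not None and score >= min_score:
                valid.append((start, score))
        if valid:
            # Stop at the first attempt that yields a valid result;
            # ties for the highest score are broken at random.
            best = max(score for _, score in valid)
            return rng.choice([hit for hit in valid if hit[1] == best])
    return None
```

In this sketch, a 40-nucleotide read with a single mismatch in the middle has no intact 30-nucleotide seed, so the first attempt finds nothing and the read is only placed once a retry uses seeds short enough to avoid the mismatch.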

The scoring system for color space mapping includes the same parameters discussed in the Scoring Schemes section below, plus one additional parameter to account for color space errors. When this penalty is applied at a particular aligned position of a read, the rest of that read is subject to a phase shift, corresponding to a color correction applied to the remainder of the read. This changes the overall score for the mapping of the read to the reference, and the mapping with the highest score is the one retained. This concept is explained in more detail below.
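
To see why a single color error affects the remainder of a read, it can help to look at the standard two-base color encoding, in which each color is determined by a pair of adjacent nucleotides. The sketch below is a small illustration of that encoding, not the mapper's own code, and the function names are hypothetical. It shows that one miscalled color changes every nucleotide decoded after it, which is the kind of shift the color correction has to compensate for.

```python
# Two-bit codes for the standard two-base color encoding:
# the color of a dinucleotide is the XOR of the codes of its two bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(bases):
    """Return the first base and the list of colors for a nucleotide string."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]
    return bases[0], colors


def decode(first_base, colors):
    """Rebuild the nucleotide string from the first base and the colors."""
    current = CODE[first_base]
    decoded = [first_base]
    for color in colors:
        current ^= color  # each color gives the transition to the next base
        decoded.append(BASES[current])
    return "".join(decoded)


first, colors = encode("ACGTAC")   # colors == [1, 3, 1, 3, 1]
print(decode(first, colors))       # ACGTAC

# Miscall the third color (1 read as 0): every later base changes.
wrong = colors[:2] + [0] + colors[3:]
print(decode(first, wrong))        # ACGGCA
```

Here the single wrong color turns the decoded tail TAC into GCA, even though the remaining colors are all correct.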