Overview of base space mapping
The base space read mapping tool is based on an uncompressed suffix array that represents the entire reference genome in a single data structure.
The algorithm iterates over input reads, mapping each read individually by applying the following procedure:
- A search is carried out for the longest stretches of matching bases between the reference genome and a read by considering each base position of the read as a start position of a seed candidate.
- End-positions of seeds are then determined by elongating the seeds as long as there are fully matching rows in the suffix array.
- Finally, a maximum of 100 seeds is examined in detail using a banded Smith-Waterman algorithm.
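The seeding steps above can be illustrated with a small sketch. This is not the clc_mapper implementation, which is not described in detail here; it is a toy version under stated assumptions: the suffix array is built naively, the helper names (build_suffix_array, extend_seed, find_seeds) are hypothetical, and the banded Smith-Waterman stage is omitted. Only the 15 bp minimum seed length and the cap of 100 seeds are taken from the text.

```python
def build_suffix_array(ref):
    # Naive construction by sorting all suffixes; suitable for toy inputs only.
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def _char_at(ref, sa, rank, offset):
    # Character at `offset` within the suffix of the given rank.
    # "" marks the end of the suffix and sorts before every base.
    pos = sa[rank] + offset
    return ref[pos] if pos < len(ref) else ""


def _narrow(ref, sa, lo, hi, offset, c):
    # Within rows [lo, hi), which all share their first `offset` characters,
    # binary-search the sub-range whose next character equals c.
    a, b = lo, hi
    while a < b:  # lower bound
        m = (a + b) // 2
        if _char_at(ref, sa, m, offset) < c:
            a = m + 1
        else:
            b = m
    new_lo, b = a, hi
    while a < b:  # upper bound
        m = (a + b) // 2
        if _char_at(ref, sa, m, offset) <= c:
            a = m + 1
        else:
            b = m
    return new_lo, a


def extend_seed(ref, sa, read, start):
    # Elongate the seed starting at read[start] for as long as at least one
    # suffix array row still matches fully.
    lo, hi, length = 0, len(sa), 0
    while start + length < len(read):
        new_lo, new_hi = _narrow(ref, sa, lo, hi, length, read[start + length])
        if new_lo == new_hi:
            break
        lo, hi, length = new_lo, new_hi, length + 1
    positions = sorted(sa[lo:hi]) if length > 0 else []
    return length, positions


def find_seeds(ref, sa, read, min_len=15, max_seeds=100):
    # Treat every read position as a seed candidate start, keep seeds of at
    # least min_len bases, and retain the longest max_seeds of them.
    seeds = []
    for start in range(len(read)):
        length, positions = extend_seed(ref, sa, read, start)
        if length >= min_len:
            seeds.append((start, length, positions))
    seeds.sort(key=lambda s: -s[1])
    return seeds[:max_seeds]


ref = "ACGTACGTTAGC"
sa = build_suffix_array(ref)
read = "TTACGTTAGG"
# min_len is lowered here only because the toy reference is very short.
for start, length, positions in find_seeds(ref, sa, read, min_len=5):
    print(start, length, positions)
```

In this toy run the longest seed starts at read position 1, spans 8 bases, and matches the reference at position 3. Ranking seeds by length is a simplification chosen for the sketch; the text does not say how the tool selects the seeds it examines.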
The seed length in this mapping tool is variable, with a minimum of 15 bp. The variable seed length enables the identification of short seeds whose alignment score is higher than that of longer seeds. This leads to better mapping of some reads and improves the chance of identifying the optimal mapping, especially for reads with high error rates.
The memory consumption of the clc_mapper tool is bounded from below by 5N bytes, where N equals the size of the reference genome in bases. So, for example, the suffix array of the human genome, which is approximately 3 gigabases, consumes 15 gigabytes of main memory. While this requirement may appear large, it allows for extremely good performance in both the seeding and extension stages, resulting in a very fast run time. Examples are provided in our white paper on the base space read mapper, available from our website: http://www.clcbio.com/files/whitepapers/whitepaper-on-CLC-read-mapper.pdf
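The memory example above (about 3 gigabases requiring 15 gigabytes) implies roughly 5 bytes of suffix array per reference base. Assuming that ratio holds for other genomes, a lower bound on main memory can be estimated as follows. The function name and constant are hypothetical, and actual usage will be higher than this bound.

```python
# Assumption: 5 bytes per base, inferred from the 3 gigabase / 15 gigabyte example.
BYTES_PER_BASE = 5


def min_suffix_array_bytes(genome_size_bases, bytes_per_base=BYTES_PER_BASE):
    # Lower bound on memory for the suffix array, in bytes.
    return genome_size_bases * bytes_per_base


human = min_suffix_array_bytes(3_000_000_000)
print(f"{human / 1e9:.0f} GB")  # 15 GB
```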