Read mapping
Indexing
When provided with a reference genome, LightSpeed first generates a Burrows-Wheeler based index of all the sequences. After the first run, the index is cached and reused on later runs.
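As an illustration of what a Burrows-Wheeler based index builds on, the following minimal sketch computes the Burrows-Wheeler transform of a short sequence by sorting its rotations. It is a textbook construction for small inputs only; the function name `bwt` and the `$` sentinel are assumptions made for the example and say nothing about how LightSpeed stores or caches its index.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler transform of `text`.

    Naive construction: sort all rotations of text + sentinel and take
    the last column. Fine for illustration, far too slow for a genome.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input sequence")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("ACGTACGA"))
```

Production indexes are typically built from a suffix array rather than from explicit rotations, which keeps memory and run time manageable for whole genomes.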
Read mapping and read pairs
LightSpeed maps reads to the indexed reference sequence.
The two reads of a read pair are first mapped individually and then paired, in the following steps:
- Seeding: All possible stretches of exact matches (seeds) to the reference are identified. Seeds that are a sub-match of a longer seed, or that are shorter than 2/3 of the length of the longest seed, are skipped.
- Extension: Seeds are extended using a Needleman-Wunsch-based method. Only extensions scoring at least 4/5 of the score of the highest-scoring extension are kept.
- Pairing: A search through all combinations of extensions is conducted, and all proper pairs with the maximum score are collected.
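The two thresholds in the steps above can be sketched as simple filters. This is a minimal illustration, not LightSpeed's implementation: the `Seed` type, the function names, and the interpretation of "sub-match" (a seed contained in a longer seed on the same read/reference diagonal) are assumptions, and scores are assumed to be non-negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    read_start: int  # 0-based start of the exact match in the read
    ref_start: int   # 0-based start of the exact match in the reference
    length: int      # number of matching bases


def is_submatch(inner: Seed, outer: Seed) -> bool:
    """True if `inner` is contained in `outer` on the same diagonal (assumed definition)."""
    same_diagonal = (inner.ref_start - inner.read_start
                     == outer.ref_start - outer.read_start)
    return (inner != outer
            and same_diagonal
            and outer.read_start <= inner.read_start
            and inner.read_start + inner.length <= outer.read_start + outer.length)


def filter_seeds(seeds: list[Seed]) -> list[Seed]:
    """Drop sub-matches and seeds shorter than 2/3 of the longest seed."""
    if not seeds:
        return []
    longest = max(s.length for s in seeds)
    return [s for s in seeds
            if 3 * s.length >= 2 * longest          # length >= 2/3 * longest
            and not any(is_submatch(s, o) for o in seeds)]


def filter_extensions(scores: list[int]) -> list[int]:
    """Keep extension scores of at least 4/5 of the best score."""
    if not scores:
        return []
    best = max(scores)
    return [x for x in scores if 5 * x >= 4 * best]  # score >= 4/5 * best
```

Integer comparisons such as `3 * length >= 2 * longest` avoid floating-point rounding at the threshold boundary.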
Read pairs that did not map well or were not paired go through a second, more thorough round of seeding:
- Secondary seeding: A search is conducted for shorter seeds that are sub-matches of longer seeds.
- Secondary extension: All seeds, including those shorter than 2/3 of the length of the longest seed, are extended.
- Pairing: A search through all combinations of extensions is conducted, that is, extensions from both the primary and the secondary extension.
If there are multiple paired extensions with the highest score, one of the pairs is selected at random and the read pair is reported as non-specific.
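The tie-breaking rule can be sketched as follows. The function name `select_pair`, the `(pair_id, score)` tuple layout, and the optional random generator argument are assumptions made for this example.

```python
import random
from typing import Optional


def select_pair(pairs: list[tuple[str, int]],
                rng: Optional[random.Random] = None) -> tuple[tuple[str, int], bool]:
    """Pick one top-scoring pair and report whether the choice was ambiguous.

    Returns (chosen_pair, non_specific). `non_specific` is True when more
    than one pair shares the highest score, in which case the choice is random.
    """
    if not pairs:
        raise ValueError("no paired extensions to choose from")
    rng = rng or random.Random()
    best = max(score for _, score in pairs)
    top = [pair for pair in pairs if pair[1] == best]
    return rng.choice(top), len(top) > 1
```

Passing a seeded `random.Random` instance makes the selection reproducible in this sketch; whether LightSpeed exposes a comparable option is not stated here.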
The distance at which two reads can be considered a pair is estimated from a subset of the reads. If there is not enough data to estimate the distance, a default insert size of 1-1000 base pairs is used. Read pairs that map within the expected distance of each other are considered pairs; read pairs that map further apart are considered broken pairs.
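A minimal sketch of this classification is shown below. The names `expected_range` and `classify_pair` are hypothetical, the estimation step itself is not modelled, and treating distances below the lower bound as broken is an assumption, since only the "further away" case is described above.

```python
from typing import Optional

DEFAULT_RANGE = (1, 1000)  # default insert size in base pairs


def expected_range(estimated: Optional[tuple[int, int]] = None) -> tuple[int, int]:
    """Use the estimated distance range if available, else the default."""
    return estimated if estimated is not None else DEFAULT_RANGE


def classify_pair(distance: int,
                  estimated: Optional[tuple[int, int]] = None) -> str:
    """Return "paired" if the distance is within the expected range, else "broken"."""
    low, high = expected_range(estimated)
    return "paired" if low <= distance <= high else "broken"
```

For example, a pair mapping 1500 bp apart is broken under the default range but paired if the estimated range is (200, 2000).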
During read mapping, alignment of the unaligned ends of specifically mapped read pairs is reattempted, allowing for one mismatch. The mismatch is, however, not accepted at the last position of the read. Note that this process is not carried out within the first and last 10 bases of a chromosome.
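The acceptance rule for an unaligned end can be sketched as a small check. This is one possible reading under stated assumptions: the function name `can_extend_end` is hypothetical, the end is compared base by base without gaps, it is oriented so that the terminal base of the read comes last, and the chromosome-edge rule is interpreted as "the aligned window must not overlap the first or last 10 bases".

```python
EDGE = 10  # bases at each chromosome end where no re-alignment is attempted


def can_extend_end(read_end: str, chromosome: str, ref_start: int) -> bool:
    """Decide whether an unaligned read end may be aligned at `ref_start`.

    Accepts at most one mismatch, and never a mismatch at the last
    position of the read. `ref_start` is a 0-based chromosome coordinate.
    """
    if not read_end:
        return False
    ref_stop = ref_start + len(read_end)
    # Skip windows touching the first or last EDGE bases (assumed interpretation).
    if ref_start < EDGE or ref_stop > len(chromosome) - EDGE:
        return False
    window = chromosome[ref_start:ref_stop]
    mismatches = [i for i, (a, b) in enumerate(zip(read_end, window)) if a != b]
    if len(mismatches) > 1:
        return False
    return not mismatches or mismatches[0] != len(read_end) - 1
```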
The algorithm has been optimized for the typical read length and error profile of Illumina 150 bp paired-end reads.