Mapping parameters
Clicking Next leads to the parameters for the read mapping (see figure 20.3).
Figure 20.3: Setting parameters for the mapping.
The first parameter allows the mismatch cost to be adjusted:
- Match score The positive score for a match between the read and the reference sequence. It is set by default to 1 but can be adjusted up to 10.
- Mismatch cost The cost of a mismatch between the read and the reference sequence. Ambiguous nucleotides such as "N", "R" or "Y" in read or reference sequences are treated as a mismatches and any column with one of these symbols will therefore be penalized with the mismatch cost.
- Linear gap cost The cost of a gap is computed directly from the
length of the gap and the insertion or deletion cost. This model
often favors small, fragmented gaps over long contiguous gaps.
If you choose linear gap cost, you must set the insertion cost and the
deletion cost:
- Insertion cost
- The cost of an insertion in the read (a gap in the reference sequence). The cost of an insertion of length will be
Insertion cost (20.1)
- Deletion cost
- The cost of a deletion in the read (gap in the read sequence). The cost of a deletion of length will be
Deletion cost (20.2)
- Affine gap cost An extra cost associated with opening a gap is
introduced such that long contiguous gaps are favored over short
gaps.
If you chose affine gap cost, you must also set the cost of opening an
insertion or a deletion:
- Insertion open cost
- The cost of opening an insertion in the read (a gap in the reference sequence).
- Insertion extend cost
- The cost of extending an insertion in the read (a gap in the reference sequence) by one column.
- Deletion open cost
- The cost of a opening a deletion in the read (gap in the read sequence).
- Deletion extend cost
- The cost of extending a deletion in the read (gap in the read sequence) by one column.
Using, affine gap cost, an insertion of length is penalized by
Insertion open cost Insertion extend cost (20.3)
and a deletion of length is penalized byDeletion open cost Deletion extend cost (20.4)
In this way long consecutive gaps get a lower cost per column than small fragmented gaps and they are therefore favored.
The score of a match between the read and the reference is set to 1 by default. Adjusting the cost parameters above can improve the mapping quality, e.g. when the read error rate is high or the reference is expected to differ significantly from the sequenced organism. For example, if the reads are expected to contain many insertions and/or deletions, it can be a good idea to lower the insertion and deletion costs to allow more of such errors. However, one should also consider the possible drawbacks when adjusting these settings. For example, reducing the insertion and deletion cost increases the risk of mapping reads to the wrong positions in the reference.
Figure 20.4: An alignment of a read where a region of 35bp at the start of the read is unaligned while the remaining 57 nucleotides matches the reference.
Figure 20.4 shows an example using linear gap cost where the read mapper is unable to map a region in a read due to insertions in the read and mismatches between the read and the reference. The aligned region of the read has a total of 57 matching nucleotides which result in an alignment score of 57 which is optimal when using the default cost for mismatches and insertions/deletions (2 and 3 respectively). If the mapper had aligned the remaining 35bp of the read as shown in Figure 20.5 using the default scoring scheme, the score would become:
(20.5) |
In this case, the alignment shown in Figure 20.4 is optimal since it has the highest score. However, if either the cost of deletions or mismatches were reduced by one, the score of the alignment shown in Figure 20.5 would become 61 and 58, respectively, and thus make it optimal.
Figure 20.5: An alignment of a read containing a region with several mismatches and deletions. By reducing the default cost of either mismatches or deletions the read mapper can make an alignment that spans the full length of the read.
Once the optimal alignment of the read is found, based on the cost parameters described above, a filtering process determines whether this match is good enough for the read to be included in the output. The filtering threshold is determined by two factors:
- Length fraction The minimum percentage of the total alignment length that must match the reference sequence at the selected similarity fraction. A fraction of 0.5 means that at least half of the alignment must match the reference sequence before the read is included in the mapping (if the similarity fraction is set to 1). Note, that the minimal seed (word) size for read mapping is 15 bp, so reads shorter than this will not be mapped.
- Similarity fraction The minimum percentage identity between the aligned region of the read and the reference sequence. For example, if the identity should be at least 80% for the read to be included in the mapping, set this value to 0.8. Note that the similarity fraction relates to the length fraction, i.e. when the length fraction is set to 50% then at least 50% of the alignment must have at least 80% identity (see figure 20.6).
Figure 20.6: A read containing 59 nucleotides where the total alignment length is 60. The part of the alignment that gave rise to the optimal score has length 58 which excludes 2 bases at the left end of the read. The length fraction of the matching region in this example is therefore 58/60 = 0.97. Given a minimum length fraction of 0.5, the similarity fraction of the alignment is computed as the maximum similarity fraction of any part of the alignment which constitute at least 50% of the total alignment. In this example the marked region in the alignment with length 30 (50% of the alignment length) has a similarity fraction of 0.83 which will satisfy the default minimum similarity fraction requirement of 0.8. - Global alignment By default, mapping is done with local alignment of the reads to the reference. The advantage of performing local
alignment instead of global alignment is that the ends are automatically left unaligned if there are many differences from the reference at the ends. For many sequencing platforms, the quality of the bases drop along the read, and a local alignment approach is desirable. Note that the aligned region has to be greater than the length threshold set. If global alignment is preferred, it can be enabled with a checkbox as shown in figure 20.3.
- Color space alignment When mapping data in color space (data from SOLiD systems), the color space checkbox is enabled, and a corresponding cost for color errors can be set. If you do not have color space data, these will be disabled and are not relevant. For more details about this, please see the section on Color space which explains how color space mapping is performed in greater detail.