Extract Consensus Sequence
Using the Extract Consensus Sequence tool, a consensus sequence can be extracted from all kinds of read mappings, including those generated from de novo assembly or RNA-seq analyses. In addition, you can extract a consensus sequence from nucleotide BLAST results.
Note: Consensus sequences can also be extracted when viewing a read mapping by right-clicking on the name of the consensus or reference sequence, or on a selection of the reference sequence, and selecting the option Extract New Consensus Sequence from the menu that appears. The same option is available from the graphical view of BLAST results when right-clicking on a selection of the subject sequence.
To start the Extract Consensus Sequence tool, go to:
Toolbox | Resequencing Analysis | Extract Consensus Sequence
In the first step, select the read mappings or nucleotide BLAST results to work with.
In the next step, options affecting how the consensus sequence is determined are configured (see figure 27.46).
Figure 27.46: Specifying how the consensus sequence should be extracted.
Handling low coverage regions
The first step is to define a low coverage threshold. Consensus sequence is not generated for reference positions with coverage at or below the threshold specified.
The default value is 0, which means that a reference base is considered to have low coverage when no reads cover this position. Using this threshold, if just a single read covered a particular position, only that read would contribute to the consensus at that position. Setting a higher threshold gives more confidence in the consensus sequence produced.
There are several options for how low coverage regions should be handled:
- Remove regions with low coverage. When using this option, no consensus sequence is created for the low coverage regions. There are two ways of creating the consensus sequence from the remaining contiguous stretches of high coverage: either the consensus sequence is split into separate sequences wherever there is a low coverage region, or the low coverage region is simply ignored and the high coverage regions are joined directly. When the regions are joined, an annotation is added at the position where a low coverage region was removed in the consensus sequence produced (see below).
- Insert 'N' ambiguity symbols. This simply adds Ns for each base in the low coverage region. An annotation is added for the low coverage region in the consensus sequence produced (see below).
- Fill from reference sequence. This option uses the sequence from the reference to construct the consensus sequence for low coverage regions. An annotation is added for the low coverage region in the consensus sequence produced (see below).
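The low coverage options above can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function name, the mode names ("split", "join", "insert_n", "fill_reference") and the per-position coverage list are assumptions made for illustration only. A position counts as low coverage when its coverage is at or below the threshold, as described above.

```python
from typing import List


def handle_low_coverage(consensus: str, reference: str, coverage: List[int],
                        threshold: int = 0, mode: str = "insert_n") -> List[str]:
    """Apply one low coverage strategy and return the resulting sequence(s).

    The mode names are hypothetical labels for the options described in the text.
    Consensus symbols at low coverage positions are ignored.
    """
    if not (len(consensus) == len(reference) == len(coverage)):
        raise ValueError("consensus, reference and coverage must have equal length")

    pieces: List[str] = []   # finished sequences (more than one only for "split")
    current: List[str] = []  # sequence currently being built

    for cons_base, ref_base, depth in zip(consensus, reference, coverage):
        if depth > threshold:
            current.append(cons_base)      # adequate coverage: keep the consensus symbol
        elif mode == "insert_n":
            current.append("N")            # one N per low coverage position
        elif mode == "fill_reference":
            current.append(ref_base)       # copy the reference base
        elif mode == "split":
            if current:                    # close the current stretch
                pieces.append("".join(current))
                current = []
        elif mode == "join":
            continue                       # drop the position, keep building
        else:
            raise ValueError(f"unknown mode: {mode}")

    if current:
        pieces.append("".join(current))
    return pieces
```

With a threshold of 0, only positions with no covering reads are treated as low coverage; raising the threshold to 1 also excludes positions covered by a single read.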
Handling conflicts
Settings are provided in the lower part of the wizard for configuring how conflicts or disagreement between the reads should be handled when building a consensus sequence in regions with adequate coverage.
- Vote. When reads disagree at a given position, the base present in the majority of the reads at that position is used for the consensus.
  - If the most common symbol is a gap, then the consensus symbol is a gap.
  - If there are equal numbers of gaps and non-gaps, then the consensus symbol is one of the non-gap symbols.
  - When choosing between symbols that are equally common, we choose in the order A - C - G - T.
  - Ambiguous symbols cannot be chosen.
If the Use quality score option is also selected, quality scores are used to decide the base to use for the consensus sequence, rather than the number of reads. The quality scores for each base at a given position in the mapping are summed, and the base with the highest total quality score at a given position is used in the consensus. If two bases have the same total quality score at a location, we follow the order of preference listed above.
Information about biological heterozygous variation in the data is lost when the Vote option is used. For example, in a diploid genome, if two different alleles are present in an almost even number of reads, only one will be represented in the consensus sequence.
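The voting rules can be sketched in Python as follows. This is an illustrative reading of the rules, not the tool's implementation. The function name vote is hypothetical, and two points are assumptions: a gap is chosen only when its weight is strictly greater than that of the best nucleotide, and, when quality scores are used, the caller supplies a score for every symbol in the column, gaps included.

```python
from typing import Dict, Optional, Sequence

BASE_ORDER = "ACGT"  # tie-break preference from the text: A, then C, then G, then T


def vote(symbols: Sequence[str],
         qualities: Optional[Sequence[int]] = None) -> Optional[str]:
    """Return the voted consensus symbol for one column of a read mapping.

    Without qualities, every read counts once; with qualities, the scores are
    summed per symbol. Ambiguity symbols such as N are ignored. Returns None
    when the column holds no nucleotide or gap at all.
    """
    if qualities is not None and len(qualities) != len(symbols):
        raise ValueError("symbols and qualities must have equal length")

    weights: Dict[str, int] = {}
    for i, symbol in enumerate(symbols):
        symbol = symbol.upper()
        if symbol == "-" or (len(symbol) == 1 and symbol in BASE_ORDER):
            weight = 1 if qualities is None else qualities[i]
            weights[symbol] = weights.get(symbol, 0) + weight

    candidates = [base for base in BASE_ORDER if base in weights]
    if not candidates:
        return "-" if "-" in weights else None

    # max() returns the first maximal element, so ties follow the A-C-G-T order.
    best_base = max(candidates, key=lambda base: weights[base])

    # Assumption: a gap wins only when strictly more common than the best base.
    if weights.get("-", 0) > weights[best_base]:
        return "-"
    return best_base
```

For example, a column with two A reads of quality 10 each and one C read of quality 40 gives A by read count but C when quality scores are summed.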
- Insert ambiguity codes. When reads disagree at a given position, an ambiguity code representing the bases at that position is used in the consensus. (The IUPAC ambiguity codes used can be found in the Appendix.)
Unlike the Vote option, some level of information about biological heterozygous variation in the data is retained using this option.
To avoid the situation where a different base in a single read could lead to an ambiguity code in the consensus sequence, the following options can be configured:
- Noise threshold. The fraction of reads in which a base must be present at a given position for that base to contribute to an ambiguity code. The default value is 0.1, i.e. for a base to contribute to an ambiguity code, it must be present in at least 10 % of the reads at that position.
- Minimum nucleotide count. The minimum number of reads in which a particular base must be present, at a given position, for that base to contribute to the consensus.
If no nucleotide passes these two thresholds at a given position, that position is omitted from the consensus sequence.
If the Use quality score option is also selected, summed quality scores are used for conflict handling instead of numbers of reads. To contribute to an ambiguity code, the summed quality score for a base at a given position must pass the noise threshold.
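The two thresholds can be illustrated with a short Python sketch based on read counts. It is not the tool's implementation: the function name ambiguity_call is hypothetical, the default of 1 for the minimum nucleotide count is a placeholder, and the fraction is computed over the A, C, G and T reads in the column only (gaps and ambiguity symbols are ignored), which is an assumption. The code table holds the standard IUPAC nucleotide codes.

```python
from collections import Counter
from typing import Optional, Sequence

BASES = set("ACGT")

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they represent.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def ambiguity_call(symbols: Sequence[str], noise_threshold: float = 0.1,
                   min_count: int = 1) -> Optional[str]:
    """Return the ambiguity code for one column, or None if the position is omitted."""
    counts = Counter(s.upper() for s in symbols if s.upper() in BASES)
    total = sum(counts.values())
    if total == 0:
        return None

    # A base contributes only if it passes both thresholds.
    passing = [
        base for base in "ACGT"
        if counts[base] > 0
        and counts[base] >= min_count
        and counts[base] / total >= noise_threshold
    ]
    if not passing:
        return None  # no nucleotide passed: the position is left out
    return IUPAC_CODES["".join(passing)]
```

With the default noise threshold, a G seen in 1 of 20 reads (5 %) is treated as noise and the column is called A, whereas a G seen in 2 of 10 reads (20 %) yields the code R.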
In the next step, output options are configured (figure 27.47).
Figure 27.47: Choose to add annotations to the consensus sequence.
Consensus annotations
The annotations that can be added to the consensus sequence provide information about conflicts that have been resolved and about low coverage regions (unless you have chosen to split the consensus sequence). Note that for large data sets, a very high number of annotations may be generated, which will cause the tool to take longer to complete. The results will also take up much more disk space.
For stand-alone read mappings, it is possible to transfer existing annotations to the consensus sequence. Since the consensus sequence produced may be broken up, the annotations will also be broken up, and thus may not have the same length as before. In some cases, gaps and low-coverage regions will lead to differences in the sequence coordinates between the input data and the new consensus sequence. The annotations copied will be placed in the region on the consensus that corresponds to the region on the input data, but the actual coordinates might have changed.
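The coordinate shift described above can be pictured with a small Python sketch for the case where low coverage regions are removed and the remaining regions are joined. It is purely illustrative: the function name remap_annotation, the 0-based half-open intervals and the kept mask are assumptions, not part of the tool.

```python
from typing import List, Optional, Tuple


def remap_annotation(start: int, end: int,
                     kept: List[bool]) -> Optional[Tuple[int, int]]:
    """Map the input interval [start, end) onto a joined consensus.

    kept[i] is True when input position i is present in the consensus.
    Returns the new half-open interval, or None if nothing of the
    annotation survives.
    """
    if not 0 <= start <= end <= len(kept):
        raise ValueError("annotation lies outside the input sequence")

    # new_index[i] is the consensus coordinate of input position i (if kept).
    new_index: List[int] = []
    next_coordinate = 0
    for is_kept in kept:
        new_index.append(next_coordinate)
        if is_kept:
            next_coordinate += 1

    surviving = [new_index[i] for i in range(start, end) if kept[i]]
    if not surviving:
        return None
    return surviving[0], surviving[-1] + 1
```

An annotation covering input positions 1 to 5 that spans a removed region of two bases ends up three bases long on the consensus, which shows why transferred annotations may be shorter than the originals.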
Track-based read mappings do not themselves contain annotations and thus the options related to transferring annotations, "Transfer annotations from the reference sequence" and "Keep annotations already on consensus", cannot be selected for this type of input.
Copied/transferred annotations will contain the same qualifier text as the original. That is, the text is not updated. As an example, if the annotation contains 'translation' as qualifier text, this translation will be copied to the new sequence and will thus reflect the translation of the original sequence, not the new sequence, which may differ.
Quality scores on the consensus sequence
The resulting consensus sequence (or sequences) will have quality scores assigned if quality scores were found in the reads used to call the consensus. For a given consensus symbol $X$, we compute its quality score from the "column" in the read mapping. Let $Y$ be the sum of all quality scores corresponding to the "column" below $X$, and let $Z$ be the sum of all quality scores from that column that supported $X$ (see footnote 27.1). Let $Q = Z - (Y - Z)$; then we will assign $X$ the quality score $q$, where

\[
q = \begin{cases}
64 & \text{if } Q > 64 \\
0 & \text{if } Q < 0 \\
Q & \text{otherwise}
\end{cases}
\]
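As a worked illustration, the Python sketch below computes a consensus quality score for one column. It assumes that the score is the summed quality of supporting reads minus the summed quality of non-supporting reads, limited to the range 0 to 64; the function name and the boolean supports list are hypothetical.

```python
from typing import Sequence


def consensus_quality(qualities: Sequence[int], supports: Sequence[bool],
                      cap: int = 64) -> int:
    """Quality score for one consensus symbol.

    qualities: quality scores of all reads in the column.
    supports:  True where the read supports the called consensus symbol.
    Assumes score = supporting sum minus non-supporting sum, clamped to [0, cap].
    """
    if len(qualities) != len(supports):
        raise ValueError("qualities and supports must have equal length")

    total = sum(qualities)                                             # all reads
    supporting = sum(q for q, s in zip(qualities, supports) if s)      # supporting reads
    raw = supporting - (total - supporting)
    return max(0, min(cap, raw))
```

For a column with qualities 30, 30 and 20, where only the first two reads support the called symbol, the result is 60 - 20 = 40. If all three reads support it, the raw value of 80 is capped at 64.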
Footnotes
- 27.1: By supporting a consensus symbol, we understand the following: when conflicts are resolved using voting, only the reads having the symbol that is eventually called are said to support the consensus. When ambiguity codes are used instead, all reads contribute to the called consensus, and thus the sum of the supporting quality scores equals the sum of all quality scores in the column ($Z = Y$).