Extract Consensus Sequence

Using the Extract Consensus Sequence tool, a consensus sequence can be extracted from all kinds of read mappings, including those generated from de novo assembly or RNA-Seq analyses. In addition, you can extract a consensus sequence from nucleotide BLAST results.

Note: Consensus sequences can also be extracted when viewing a read mapping by right-clicking on the name of the consensus or reference sequence, or a selection of the reference sequence, and selecting the option Extract New Consensus Sequence from the menu that appears. The same option is available from the graphical view of BLAST results when right-clicking on a selection of the subject sequence.

To start the Extract Consensus Sequence tool, go to:

Toolbox | Resequencing Analysis | Extract Consensus Sequence

In the first step, select the read mappings or nucleotide BLAST results to work with.

In the next step, options affecting how the consensus sequence is determined are configured (see figure 30.43).

Figure 30.43: Specifying how the consensus sequence should be extracted.

Handling low coverage regions

The first step is to define a low coverage threshold. No consensus sequence is generated for reference positions with coverage at or below the specified threshold.

The default value is 0, which means that a reference base is considered to have low coverage when no reads cover this position. Using this threshold, if just a single read covered a particular position, only that read would contribute to the consensus at that position. Setting a higher threshold gives more confidence in the consensus sequence produced.

There are several options for how low coverage regions should be handled:

Remove regions with low coverage. When using this option, no consensus sequence is created for the low coverage regions. There are two ways of creating the consensus sequence from the remaining contiguous stretches of high coverage: either the consensus sequence is split into separate sequences wherever there is a low coverage region, or the low coverage region is simply ignored and the flanking high coverage regions are joined directly. In the latter case, an annotation is added at the position where a low coverage region was removed from the consensus sequence produced (see below).
Insert 'N' ambiguity symbols. This simply adds Ns for each base in the low coverage region. An annotation is added for the low coverage region in the consensus sequence produced (see below).
Fill from reference sequence. This option uses the sequence from the reference to construct the consensus sequence for low coverage regions. An annotation is added for the low coverage region in the consensus sequence produced (see below).
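
As a rough illustration of the options above, the following Python sketch applies each strategy to a toy consensus. The function name, the mode names and the per-position input lists are assumptions made for this example and do not correspond to anything in the Workbench. Annotations are not modelled.

```python
def handle_low_coverage(consensus, reference, coverage, threshold=0, mode="remove_split"):
    """Apply one low coverage strategy (hypothetical helper, for illustration only).

    consensus and reference are equal-length strings, coverage is a list of
    per-position read counts. Returns a list of sequences; more than one
    sequence is only possible in "remove_split" mode.
    """
    pieces, current = [], []
    for cons_base, ref_base, cov in zip(consensus, reference, coverage):
        low = cov <= threshold  # low coverage means at or below the threshold
        if not low:
            current.append(cons_base)
        elif mode == "insert_n":
            current.append("N")
        elif mode == "fill_reference":
            current.append(ref_base)
        elif mode == "remove_split":
            # Close the current stretch and start a new sequence.
            if current:
                pieces.append("".join(current))
            current = []
        elif mode == "remove_join":
            continue  # drop the position; neighbouring stretches are joined
        else:
            raise ValueError(f"unknown mode: {mode}")
    if current:
        pieces.append("".join(current))
    return pieces


consensus = "ACGTTGCA"  # symbols at low coverage positions are ignored
reference = "ACCTAGCA"
coverage = [5, 5, 0, 0, 5, 5, 5, 5]

for mode in ("remove_split", "remove_join", "insert_n", "fill_reference"):
    print(mode, handle_low_coverage(consensus, reference, coverage, mode=mode))
```

With the default threshold of 0, only the two uncovered positions are treated as low coverage. Raising the threshold to 5 in this toy example would mark every position as low coverage.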

Handling conflicts

Settings are provided in the lower part of the wizard for configuring how conflicts or disagreement between the reads should be handled when building a consensus sequence in regions with adequate coverage.

Vote. When reads disagree at a given position, the base present in the majority of the reads at that position is used for the consensus.
- When two or more symbols have equal support, we choose in the order A - C - G - T.
- Ambiguous symbols cannot be chosen.
If the Use quality score option is also selected, quality scores, rather than the number of reads, are used to decide which base to use for the consensus sequence. The quality scores for each base at a given position in the mapping are summed, and the base with the highest total quality score at that position is used in the consensus. If two bases have the same total quality score at a position, we follow the order of preference listed above.
Information about biological heterozygous variation in the data is lost when the Vote option is used. For example, in a diploid genome, if two different alleles are present in an almost even number of reads, only one will be represented in the consensus sequence.
Insert ambiguity codes. When reads disagree at a given position, an ambiguity code representing the bases at that position is used in the consensus. (The IUPAC ambiguity codes used can be found in the Appendix.)
Unlike the Vote option, some level of information about biological heterozygous variation in the data is retained using this option.
To avoid the situation where a different base in a single read could lead to an ambiguity code in the consensus sequence, the following options can be configured:
- Noise threshold. The fraction of reads in which a base must be present at a given position for that base to contribute to an ambiguity code. The default value is 0.1, i.e. for a base to contribute to an ambiguity code, it must be present in at least 10% of the reads at that position.
- Minimum nucleotide count. The minimum number of reads a particular base must be present in, at a given position, for that base to contribute to the consensus.
If no nucleotide passes these two thresholds at a given position, that position is omitted from the consensus sequence.
If the Use quality score option is also selected, summed quality scores are used instead of numbers of reads for conflict handling. To contribute to an ambiguity code, the summed quality scores for a base at a given position must pass the noise threshold.
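
The two conflict handling strategies can be sketched in Python for a single column of a read mapping. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names, the representation of a column as a list of symbols, and the default of 1 for the minimum nucleotide count are all hypothetical. Quality-weighted handling is shown for voting only.

```python
from collections import Counter

ORDER = "ACGT"  # order of preference used to break ties

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}.items()}


def vote(column, qualities=None):
    """Return the winning base for one column, or None if no A/C/G/T is present.

    If qualities is given, summed quality scores replace read counts.
    """
    weights = qualities if qualities is not None else [1] * len(column)
    totals = {base: 0 for base in ORDER}
    for base, weight in zip(column, weights):
        if base in totals:  # ambiguous symbols cannot be chosen
            totals[base] += weight
    if not any(totals.values()):
        return None
    # max() keeps the first maximal item, so iterating over ORDER
    # resolves ties in the order A, C, G, T.
    return max(ORDER, key=lambda base: totals[base])


def ambiguity_code(column, noise_threshold=0.1, min_count=1):
    """Return the IUPAC code for the bases passing both thresholds, else None.

    min_count=1 is an assumed default chosen for this example.
    """
    counts = Counter(base for base in column if base in ORDER)
    total = len(column)
    kept = frozenset(
        base for base, n in counts.items()
        if n >= min_count and n / total >= noise_threshold
    )
    return IUPAC.get(kept)  # None means the position would be omitted


print(vote(list("AACC")))                  # tie between A and C
print(vote(list("AAC"), [10, 10, 30]))     # quality-weighted vote
print(ambiguity_code(list("AAAAAAAAAG")))  # G is present in 10% of reads
```

In the last call, G is present in exactly one of ten reads, so it just meets the 0.1 noise threshold and the column is reported as R (A or G). With `min_count=2` the same column would yield A.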

In the next step, output options are configured (figure 30.44).

Figure 30.44: Choose to add annotations to the consensus sequence.

Consensus annotations

Annotations can be added to the consensus sequence, providing information about resolved conflicts, gaps relative to the reference (deletions) and low coverage regions (if the option to split the consensus sequence was not selected). Note that for large data sets, many such annotations may be generated, which will take more time and take up more disk space.

For stand-alone read mappings, it is possible to transfer existing annotations to the consensus sequence. Since the consensus sequence produced may be broken up, the annotations will also be broken up, and thus may not have the same length as before. In some cases, gaps and low-coverage regions will lead to differences in the sequence coordinates between the input data and the new consensus sequence. The annotations copied will be placed in the region on the consensus that corresponds to the region on the input data, but the actual coordinates might have changed.
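
To illustrate why coordinates can change, the sketch below maps a position on the input sequence to its position on a consensus from which low coverage regions were removed and the remaining stretches joined. The function name and the representation of removed regions as 0-based, half-open (start, end) intervals are assumptions for this example. It does not describe how the tool itself places annotations.

```python
def shift_position(ref_pos, removed_regions):
    """Map a 0-based input position to the joined consensus (hypothetical helper).

    removed_regions is a list of non-overlapping half-open (start, end)
    intervals that were removed. Returns None if ref_pos was itself removed.
    """
    shift = 0
    for start, end in sorted(removed_regions):
        if ref_pos >= end:
            shift += end - start  # region lies entirely before the position
        elif ref_pos >= start:
            return None  # position falls inside a removed region
    return ref_pos - shift


removed = [(10, 20), (30, 35)]
for pos in (5, 12, 25, 40):
    print(pos, "->", shift_position(pos, removed))
```

An annotation spanning a removed region would, under the same assumptions, end up shorter on the consensus than on the input sequence.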

Track-based read mappings do not themselves contain annotations and thus the options related to transferring annotations, "Transfer annotations from the reference sequence" and "Keep annotations already on consensus", cannot be selected for this type of input.

Copied/transferred annotations will contain the same qualifier text as the original. That is, the text is not updated. As an example, if the annotation contains 'translation' as qualifier text, this translation will be copied to the new sequence and will thus reflect the translation of the original sequence, not the new sequence, which may differ.

Quality scores on the consensus sequence

The resulting consensus sequence (or sequences) will have quality scores assigned if quality scores were found in the reads used to call the consensus. For a given consensus symbol $X$ we compute its quality score from the "column" in the read mapping. Let $Y$ be the sum of all quality scores corresponding to the "column" below $X$, and let $Z$ be the sum of all quality scores from that column that supported $X$ (see footnote 30.1). Let $Q = Z - (Y - Z)$, then we will assign $X$ the quality score of $q$ where

$\displaystyle q = \left\{ \begin{array}{lr} 64 & \mbox{if } Q > 64 \\ 0 & \mbox{if } Q < 0 \\ Q & \mbox{otherwise} \end{array}\right.$
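
The clamping step can be sketched in Python as follows. The sketch assumes that Q is the summed quality of the supporting symbols minus the summed quality of the non-supporting symbols in the column; that definition, the function name and the input representation are assumptions of this example.

```python
def consensus_quality(qualities, supports):
    """Quality score for one consensus symbol (illustrative sketch).

    qualities: quality scores of all symbols in the column.
    supports:  booleans, True where the symbol supported the consensus.
    """
    column_sum = sum(qualities)
    support_sum = sum(q for q, s in zip(qualities, supports) if s)
    # Assumed definition: supporting sum minus non-supporting sum.
    raw = support_sum - (column_sum - support_sum)
    return max(0, min(64, raw))  # clamp to the range 0..64


print(consensus_quality([30, 30, 20], [True, True, False]))
print(consensus_quality([40, 40, 40], [True, True, True]))
print(consensus_quality([10, 30], [True, False]))
```

The three calls cover the three branches of the formula: a value inside the range is kept, a value above 64 is capped, and a negative value is raised to 0.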

Footnotes