Extract Consensus Sequence

Using the Extract Consensus Sequence tool, a consensus sequence can be extracted from all kinds of read mappings, including those generated from de novo assembly or RNA-Seq analyses. In addition, you can extract a consensus sequence from nucleotide BLAST results.

Note: Consensus sequences can also be extracted when viewing a read mapping by right-clicking on the name of the consensus or reference sequence, or on a selection of the reference sequence, and selecting the option Extract New Consensus Sequence from the menu that appears. The same option is available from the graphical view of BLAST results when right-clicking on a selection of the subject sequence.

To start the Extract Consensus Sequence tool, go to:

        Toolbox | Resequencing Analysis | Extract Consensus Sequence

In the first step, select the read mappings or nucleotide BLAST results to work with.

In the next step, options affecting how the consensus sequence is determined are configured (see figure 30.43).

Figure 30.43: Specifying how the consensus sequence should be extracted.

Handling low coverage regions

The first step is to define a low coverage threshold. No consensus sequence is generated for reference positions with coverage at or below the specified threshold.

The default value is 0, which means that a reference base is considered to have low coverage when no reads cover this position. Using this threshold, if just a single read covered a particular position, only that read would contribute to the consensus at that position. Setting a higher threshold gives more confidence in the consensus sequence produced.
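The threshold rule can be illustrated with a short sketch. This is not the tool's implementation; the function name `low_coverage_regions` and the representation of coverage as a plain list of per-position read depths are assumptions made for illustration only.

```python
def low_coverage_regions(coverage, threshold=0):
    """Return half-open (start, end) intervals of low coverage.

    `coverage` is a hypothetical list of read depths, one per reference
    position. A position counts as low coverage when its depth is at or
    below `threshold`, mirroring the rule described above.
    """
    regions = []
    start = None  # start of the low coverage run currently being tracked
    for pos, depth in enumerate(coverage):
        if depth <= threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        # The final run extends to the end of the reference.
        regions.append((start, len(coverage)))
    return regions


depths = [0, 0, 3, 1, 0, 5, 2, 0]
print(low_coverage_regions(depths))               # default threshold of 0
print(low_coverage_regions(depths, threshold=1))  # stricter threshold
```

With the default threshold of 0, the position covered by a single read (depth 1) still contributes to the consensus. Raising the threshold to 1 moves that position into a low coverage region.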

There are several options for how low coverage regions should be handled:

Handling conflicts

Settings are provided in the lower part of the wizard for configuring how conflicts or disagreement between the reads should be handled when building a consensus sequence in regions with adequate coverage.

In the next step, output options are configured (figure 30.44).

Figure 30.44: Choose to add annotations to the consensus sequence.

Consensus annotations

Annotations can be added to the consensus sequence, providing information about resolved conflicts, gaps relative to the reference (deletions) and low coverage regions (if the option to split the consensus sequence was not selected). Note that for large data sets, many such annotations may be generated, which will take more time and take up more disk space.

For stand-alone read mappings, it is possible to transfer existing annotations to the consensus sequence. Since the consensus sequence produced may be broken up, the annotations will also be broken up, and thus may not have the same length as before. In some cases, gaps and low-coverage regions will lead to differences in the sequence coordinates between the input data and the new consensus sequence. The annotations copied will be placed in the region on the consensus that corresponds to the region on the input data, but the actual coordinates might have changed.

Track-based read mappings do not themselves contain annotations and thus the options related to transferring annotations, "Transfer annotations from the reference sequence" and "Keep annotations already on consensus", cannot be selected for this type of input.

Copied/transferred annotations will contain the same qualifier text as the original. That is, the text is not updated. As an example, if the annotation contains 'translation' as qualifier text, this translation will be copied to the new sequence and will thus reflect the translation of the original sequence, not the new sequence, which may differ.

Quality scores on the consensus sequence

The resulting consensus sequence (or sequences) will have quality scores assigned if quality scores were found in the reads used to call the consensus. For a given consensus symbol $X$ we compute its quality score from the "column" in the read mapping. Let $Y$ be the sum of all quality scores corresponding to the "column" below $X$, and let $Z$ be the sum of all quality scores from that column that supported $X$. Let $Q = Z - (Y - Z)$, then we will assign $X$ the quality score of $q$ where

$\displaystyle q = \left\{ \begin{array}{ll}
64 & \mbox{if } Q > 64 \\
0 & \mbox{if } Q < 0 \\
Q & \mbox{otherwise}
\end{array} \right.$


Support for a consensus symbol is defined as follows: when conflicts are resolved using voting, only the reads having the symbol that is eventually called are said to support the consensus. When ambiguity codes are used instead, all reads contribute to the called consensus and thus $Y=Z$.
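The calculation of $q$ can be sketched in a few lines of Python. This is an illustration of the formula above, not the tool's own code. The function name `consensus_quality`, the `use_ambiguity_codes` flag and the representation of a column as `(symbol, quality)` pairs are all assumptions.

```python
def consensus_quality(column, consensus_symbol, use_ambiguity_codes=False):
    """Quality score q for one consensus symbol.

    `column` is a hypothetical list of (symbol, quality) pairs, one per
    read covering the position.
    """
    # Y: sum of all quality scores in the column.
    y = sum(quality for _, quality in column)
    if use_ambiguity_codes:
        # With ambiguity codes, every read supports the consensus: Z = Y.
        z = y
    else:
        # With voting, only reads carrying the called symbol support it.
        z = sum(quality for symbol, quality in column
                if symbol == consensus_symbol)
    q_raw = z - (y - z)  # Q = Z - (Y - Z)
    # Clamp Q to the range [0, 64] to obtain q.
    return max(0, min(64, q_raw))


column = [("A", 30), ("A", 20), ("C", 15)]
print(consensus_quality(column, "A"))  # Y = 65, Z = 50, Q = 35
print(consensus_quality(column, "A", use_ambiguity_codes=True))  # Q = 65, capped
```

In the first call the dissenting read lowers the score: $Y = 65$, $Z = 50$, so $Q = 50 - 15 = 35$. In the second call $Y = Z = 65$, so $Q = 65$ and the score is capped at 64.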