Extract consensus sequence

A consensus sequence can be extracted from all kinds of read mappings, including those generated from de novo assembly or RNA-seq analyses. In addition, you can extract a consensus sequence from nucleotide BLAST results. The consensus sequence extraction tool can be run in batch and as part of workflows.

To start the tool:

        Toolbox | NGS Core Tools | Extract Consensus Sequence

This opens a dialog where you can select mappings, either in the form of tracks or read mappings, or BLAST results. Click Next to specify how the consensus sequence should be created (see figure 24.55).

Image extract_consensus_step2
Figure 24.55: Specifying how the consensus sequence should be extracted.

It is also possible to extract a consensus sequence from a mapping view by right-clicking the name of the consensus or reference sequence, or a selection on the reference sequence, and selecting Extract Consensus Sequence.

When extracting a consensus sequence, you can decide how to handle regions with low coverage (a definition of coverage can be found in Reference sequence statistics). The first step is to define a threshold for when coverage is considered low. The default value is 0, which means that low coverage is defined as no coverage (i.e. no reads align to the reference at this position). That means if you have one read covering a given position, it will only be that read that determines the consensus sequence. If you need more confidence that the consensus sequence is correct, we advise raising this value. Setting a higher low coverage threshold will require more mapped reads to construct the consensus sequence.
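The effect of the threshold can be illustrated with a minimal sketch. This is not the tool's implementation; the function name `low_coverage_regions` and the list-of-depths input are hypothetical, and the sketch assumes that positions with coverage at or below the threshold count as low coverage.

```python
def low_coverage_regions(coverage, threshold=0):
    """Return half-open (start, end) intervals where coverage <= threshold.

    `coverage` is a list of per-position read depths (hypothetical input
    format, chosen for illustration only).
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth <= threshold:
            # Open a new low coverage region if we are not already in one.
            if start is None:
                start = pos
        elif start is not None:
            # Coverage rose above the threshold: close the open region.
            regions.append((start, pos))
            start = None
    if start is not None:
        # The sequence ended while still inside a low coverage region.
        regions.append((start, len(coverage)))
    return regions


coverage = [0, 0, 3, 5, 1, 0, 2]
print(low_coverage_regions(coverage))               # default threshold of 0
print(low_coverage_regions(coverage, threshold=1))  # stricter threshold
```

With the default threshold of 0, only positions without any reads are flagged. Raising the threshold to 1 also flags the position covered by a single read, so that read alone no longer determines the consensus there.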

A consensus based on mapped reads cannot be generated in regions where coverage is at or below the low coverage threshold. There are several options for handling these low coverage regions:

In addition to deciding how to handle low coverage regions, you can also decide how to handle conflicts or disagreement between the reads when building a consensus sequence in regions above the low coverage threshold:

Click Next to set the output options (see figure 24.56).

Image extract_consensus_step3
Figure 24.56: Choose to add annotations to the consensus sequence.

The annotations that can be added to the consensus sequence produced by this tool show both resolved conflicts and low coverage regions (unless you have chosen to split the consensus sequence). Please note that for large data sets, this can amount to a very high number of annotations, which will cause the tool to take longer to complete and the result to take up much more disk space.

For stand-alone read mappings, it is possible to transfer existing annotations to the consensus sequence produced. Since the consensus sequence produced may be broken up, the annotations will also be broken up, and may not have the same length as before. In some cases, gaps and low-coverage regions will lead to differences in the sequence coordinates between the input data and the new consensus sequence. The annotations copied will be placed in the region on the consensus that corresponds to the region on the input data, but the actual coordinates might have changed.

Track-based read mappings do not themselves contain annotations and thus the options related to transferring annotations, "Transfer annotations from the reference sequence" and "Keep annotations already on consensus", cannot be selected for this type of input.

Copied/transferred annotations will contain the same qualifier text as the original. That is, the text is not updated. As an example, if the annotation contains 'translation' as qualifier text, this translation will be copied to the new sequence and will thus reflect the translation of the original sequence, not the new sequence, which may differ.

The resulting consensus sequence (or sequences) will have quality scores assigned if quality scores were found in the reads used to call the consensus. For a given consensus symbol $X$, we compute its quality score from the "column" in the read mapping. Let $Y$ be the sum of all quality scores corresponding to the "column" below $X$, and let $Z$ be the sum of all quality scores from that column that supported $X$. Let $Q = Z - (Y - Z)$; then we will assign $X$ the quality score $q$, where

$\displaystyle q = \left\{ \begin{array}{ll}
64 & \mbox{if } Q > 64 \\
0 & \mbox{if } Q < 0 \\
Q & \mbox{otherwise}
\end{array} \right.$


By supporting a consensus symbol, we understand the following: when conflicts are resolved using voting, then only the reads having the symbol that is eventually called are said to support the consensus. When ambiguity codes are used instead, all reads contribute to the called consensus and thus $Y = Z$.
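The quality score calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name `consensus_quality`, the `(symbol, quality)` pair representation of a column, and the `all_support` flag (standing in for the ambiguity code case) are all hypothetical.

```python
def consensus_quality(column, consensus_symbol, all_support=False, cap=64):
    """Quality score q for one consensus symbol.

    `column` is a list of (symbol, quality) pairs, one per read covering
    the position (hypothetical representation for illustration).
    Set `all_support=True` to model ambiguity codes, where every read
    counts as supporting the consensus and therefore Y == Z.
    """
    # Y: sum of all quality scores in the column.
    y = sum(quality for _, quality in column)
    # Z: sum of the quality scores that support the consensus symbol.
    if all_support:
        z = y
    else:
        z = sum(quality for symbol, quality in column
                if symbol == consensus_symbol)
    # Q = Z - (Y - Z): support minus the quality of disagreeing reads.
    q_raw = z - (y - z)
    # Clamp to the range [0, cap], as in the piecewise definition of q.
    return max(0, min(cap, q_raw))


column = [("A", 30), ("A", 25), ("G", 20)]
# Voting: Y = 75, Z = 55, so Q = 55 - 20 = 35.
print(consensus_quality(column, "A"))
# Ambiguity code: Y = Z = 75, so Q = 75, which is capped at 64.
print(consensus_quality(column, "R", all_support=True))
```

In the voting case the two reads showing A support the consensus and the read showing G counts against it. In the ambiguity code case nothing is subtracted, so the score is limited only by the upper bound of 64.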