Annotate with Repeat and Homopolymer Information

The Annotate with Repeat and Homopolymer Information tool annotates variants with repeat and homopolymer information, based on the variant itself and the genome sequence flanking it.

Homopolymers: A variant is considered to be in a homopolymer region if the same nucleotide occurs at least 4 times in a row at the variant's location or, for deletions, next to where the deletion occurred.
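The threshold of 4 can be illustrated with a minimal sketch that counts the run of identical nucleotides covering a position. It uses exact matching only, so it is a simplification of the idea and not the tool's implementation. The function names and the 0-based position are assumptions made for this example.

```python
MIN_HOMOPOLYMER_LENGTH = 4  # threshold stated in the text above


def homopolymer_run_length(sequence: str, position: int) -> int:
    """Length of the run of identical nucleotides covering `position` (0-based)."""
    base = sequence[position]
    start = position
    # Extend the run to the left while the same nucleotide continues.
    while start > 0 and sequence[start - 1] == base:
        start -= 1
    end = position
    # Extend the run to the right while the same nucleotide continues.
    while end + 1 < len(sequence) and sequence[end + 1] == base:
        end += 1
    return end - start + 1


def in_homopolymer(sequence: str, position: int) -> bool:
    """True if the run covering `position` reaches the minimum length."""
    return homopolymer_run_length(sequence, position) >= MIN_HOMOPOLYMER_LENGTH


print(in_homopolymer("ACGTTTTTGA", 4))  # position inside the run of five T's
```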

Repeats: A variant is considered to be in a repeat region if:

The tool looks for homopolymers and repeats in both the reference and sample sequence. The sample sequence is the same as the reference sequence but also contains the variant that is being evaluated.
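As a rough illustration of how a sample sequence relates to the reference, the sketch below applies a single variant to a reference string. The function name, the 0-based start position, and the allele representation (an empty string for the missing side of an insertion or deletion) are assumptions made for this example and may differ from how the tool represents variants internally.

```python
def build_sample_sequence(reference: str, start: int,
                          ref_allele: str, alt_allele: str) -> str:
    """Return the reference with one variant applied (hypothetical helper).

    `start` is the 0-based position of the first reference base the variant
    replaces. An insertion has an empty `ref_allele`; a deletion has an
    empty `alt_allele`.
    """
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError("reference allele does not match the reference sequence")
    return reference[:start] + alt_allele + reference[end:]


# SNV from the first example in this section: A -> C after twelve T's.
reference = "T" * 12 + "ACCC"
sample = build_sample_sequence(reference, 12, "A", "C")
print(reference)
print(sample)
```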

To determine whether there is a homopolymer or repeat in a given reference and sample region, the tool uses a hidden Markov model (HMM). The HMM allows mismatches at single nucleotides if it determines that the mismatching part is still likely to belong to the homopolymer or repeat. There is, however, a maximum number of mismatches allowed between repeating elements in the homopolymer/repeat sequence, and repeating elements are compared differently from single nucleotides. For example, in the sequence TGGTGGTAA, TGG is the repeating element and there is one mismatch between TGG and TAA. The maximum number of mismatches in repeating elements can be set when running the tool.
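The TGGTGGTAA example can be reproduced with a small sketch that splits a sequence into element-sized chunks and counts the chunks that differ from the repeating element. This only illustrates element-level counting. It is not the HMM the tool uses, and the function names and the `max_mismatches` parameter name are assumptions made for this example.

```python
def count_element_mismatches(sequence: str, element: str) -> int:
    """Count element-sized chunks of `sequence` that differ from `element`.

    A chunk counts as one mismatch regardless of how many of its
    nucleotides differ. Trailing bases that do not fill a chunk are ignored.
    """
    size = len(element)
    chunks = [sequence[i:i + size]
              for i in range(0, len(sequence) - size + 1, size)]
    return sum(1 for chunk in chunks if chunk != element)


def within_mismatch_limit(sequence: str, element: str, max_mismatches: int) -> bool:
    """True if the number of mismatching elements does not exceed the limit."""
    return count_element_mismatches(sequence, element) <= max_mismatches


print(count_element_mismatches("TGGTGGTAA", "TGG"))  # TGG, TGG, TAA -> 1
```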

Note: This tool is designed for detecting shorter repeats and potential sequencing errors. Variants longer than 200 bp are therefore not evaluated and will always be marked as not being part of a homopolymer or repeat region.

To run Annotate with Repeat and Homopolymer Information, go to:

        Tools | Resequencing Analysis | Variant Annotation | Annotate with Repeat and Homopolymer Information

The tool takes variant tracks as input.

In the first dialog, select the variant track that should be annotated with repeat and homopolymer information.

In the second dialog, select the reference sequence and specify the maximum number of mismatches.

This tool outputs a report, containing a summary of the results, and a variant track with the following annotations added:

The five examples below illustrate how the reference, sample, and effective sample homopolymer lengths differ (the reference and sample repeat lengths follow the same principles). In each example, the top sequence is the reference, the sequence below it is the sample, and the variant position is marked by *:

1. Two adjacent homopolymers

TTTTTTTTTTTTACCC
TTTTTTTTTTTTCCCC
            *
Reference: 12
Sample: 12
Effective sample: 4


2. One adjacent homopolymer
TTTTTTTTTTTTGATC
TTTTTTTTTTTTCATC
            *
Reference: 12
Sample: 12
Effective sample: 1


3. Homopolymer in sample is shorter than reference
TTTTTTTTTTTTGATC
TTTTTTTTTTTAGATC
           *
Reference: 12
Sample: 11
Effective sample: 11


4. Homopolymer in sample is longer than reference
TTTTTTTTTTTTGCCC
TTTTTTTTTTTTTCCC
            *
Reference: 12
Sample: 13
Effective sample: 13


5. Variant in the middle of the sample homopolymer
TTTTTTTTTTTT
TTATTTTTTTTT
  *
Reference: 12
Sample: 9
Effective sample: 9
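
In the five examples above, the Reference and Sample values happen to equal the longest run of identical nucleotides in the sequence shown. The sketch below checks that observation. It is a reading of these examples only, not a statement of how the tool computes the annotations, and it does not attempt to reproduce the effective sample length. The helper name is an assumption made for this example.

```python
from itertools import groupby


def longest_run(sequence: str) -> int:
    """Length of the longest run of identical characters in `sequence`."""
    return max((len(list(group)) for _, group in groupby(sequence)), default=0)


# (reference, sample) pairs from the five examples above.
examples = [
    ("T" * 12 + "ACCC", "T" * 12 + "CCCC"),  # 1. two adjacent homopolymers
    ("T" * 12 + "GATC", "T" * 12 + "CATC"),  # 2. one adjacent homopolymer
    ("T" * 12 + "GATC", "T" * 11 + "AGATC"),  # 3. sample shorter than reference
    ("T" * 12 + "GCCC", "T" * 13 + "CCC"),   # 4. sample longer than reference
    ("T" * 12, "TT" + "A" + "T" * 9),        # 5. variant inside the homopolymer
]

for number, (reference, sample) in enumerate(examples, start=1):
    print(number, longest_run(reference), longest_run(sample))
```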