Annotate with Repeat and Homopolymer Information
The Annotate with Repeat and Homopolymer Information tool annotates variants with repeat and homopolymer information, based on the variant itself and the genome sequence flanking it.
Homopolymers: A variant is considered to be in a homopolymer region if there are at least 4 consecutive repeats of the same nucleotide at that location, or for deletions, next to where the deletion occurred.
Repeats: A variant is considered to be in a repeat region if:
- For a 2 bp variant, there are at least 4 full copies at that location on the reference, or for deletions, next to where the deletion occurred.
- For a variant of 3 bp or longer, there are at least 3 full copies at that location on the reference, or for deletions, next to where the deletion occurred.
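The copy-number thresholds above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it counts exact, contiguous copies (the tool itself uses an HMM that tolerates mismatches), it treats the variant's own sequence as the repeated unit, and the names `tandem_copies`, `min_copies`, and `in_repetitive_region` are hypothetical, not part of the tool.

```python
def tandem_copies(seq: str, start: int, unit: str) -> int:
    """Count exact back-to-back copies of `unit` in the tandem run covering `start`.

    `start` is a 0-based index at which one copy of `unit` begins.
    Returns 0 if `unit` does not occur at `start`.
    """
    k = len(unit)
    if k == 0:
        raise ValueError("unit must not be empty")
    if seq[start:start + k] != unit:
        return 0
    # Extend to the left in steps of one full unit.
    left = start
    while left - k >= 0 and seq[left - k:left] == unit:
        left -= k
    # Extend to the right in steps of one full unit.
    right = start + k
    while seq[right:right + k] == unit:
        right += k
    return (right - left) // k


def min_copies(unit_length: int) -> int:
    """Thresholds from the text: 4 for homopolymers and 2 bp units, 3 for 3 bp or longer."""
    return 4 if unit_length <= 2 else 3


def in_repetitive_region(seq: str, start: int, unit: str) -> bool:
    """True if the exact-match copy count reaches the threshold for this unit length."""
    return tandem_copies(seq, start, unit) >= min_copies(len(unit))
```

For instance, `in_repetitive_region("ACACACACGT", 2, "AC")` finds four full copies of AC and meets the 2 bp threshold, whereas three consecutive T nucleotides fall short of the homopolymer threshold.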
The tool looks for homopolymers and repeats in both the reference and sample sequence. The sample sequence is the same as the reference sequence but also contains the variant that is being evaluated.
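As a rough illustration of what "the reference sequence with the variant applied" means, the following Python sketch builds a sample sequence from a reference and a single variant. The function name `apply_variant`, the 0-based coordinate convention, and the allele representation (empty string for the missing side of an insertion or deletion) are assumptions made for this example; the tool's internal representation is not described here.

```python
def apply_variant(reference: str, pos: int, ref_allele: str, alt_allele: str) -> str:
    """Return the sample sequence: the reference with one variant applied.

    `pos` is the 0-based start of `ref_allele` in `reference`.
    An insertion has an empty `ref_allele`; a deletion has an empty `alt_allele`.
    """
    end = pos + len(ref_allele)
    if reference[pos:end] != ref_allele:
        raise ValueError("reference allele does not match the reference sequence")
    return reference[:pos] + alt_allele + reference[end:]


# SNV: A -> C at 0-based position 12
sample = apply_variant("TTTTTTTTTTTTACCC", 12, "A", "C")
```

Both the reference string and the resulting `sample` string would then be scanned for homopolymers and repeats around the variant position.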
To determine whether a given reference or sample region contains a homopolymer or repeat, the tool uses a hidden Markov model (HMM). The HMM allows mismatches between single nucleotides if it determines that the mismatching part is still likely to be part of the homopolymer or repeat. There is, however, a maximum number of mismatches allowed between repeating elements in the homopolymer or repeat sequence, and repeating elements are compared differently from single nucleotides. For example, in the sequence TGGTGGTAA, TGG is the repeating element and there is one mismatch, between TGG and TAA. The maximum number of mismatches in repeating elements can be set when running the tool.
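One way to read the TGGTGGTAA example is that mismatches are counted per repeating element rather than per nucleotide: TAA differs from TGG at two bases but is reported as one mismatch. The Python sketch below illustrates that reading only. It is not the HMM the tool uses, the counting rule is an assumption inferred from the example, and the names `mismatching_elements` and `within_mismatch_limit` are hypothetical.

```python
def mismatching_elements(sequence: str, element: str) -> int:
    """Count full-length chunks of `sequence` that differ from `element`.

    The sequence is cut into consecutive chunks of len(element), starting
    at the first base; a trailing partial chunk is ignored.
    """
    k = len(element)
    if k == 0:
        raise ValueError("element must not be empty")
    chunks = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]
    return sum(1 for chunk in chunks if chunk != element)


def within_mismatch_limit(sequence: str, element: str, max_mismatches: int) -> bool:
    """True if the number of mismatching elements does not exceed the limit."""
    return mismatching_elements(sequence, element) <= max_mismatches
```

Under this reading, `mismatching_elements("TGGTGGTAA", "TGG")` gives 1, so the sequence would pass a limit of 1 and fail a limit of 0.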
Note: This tool is designed for detecting shorter repeats and potential sequencing errors. Variants longer than 200 bp are therefore not evaluated and will always be marked as not being part of a homopolymer or repeat region.
To run Annotate with Repeat and Homopolymer Information, go to:
Tools | Resequencing Analysis | Variant Annotation | Annotate with Repeat and Homopolymer Information
The tool takes variant tracks as input.
In the first dialog, select the variant track that should be annotated with repeat and homopolymer information.
In the second dialog, select the reference sequence and specify the maximum number of mismatches.
This tool outputs a report, containing a summary of the results, and a variant track with the following annotations added:
- Homopolymer region: The value is "Yes" if the variant is in a homopolymer region in either the reference or sample sequence, and "No" if it is not.
- Repeat region: The value is "Yes" if the variant is in a repeat region in either the reference or sample sequence, and "No" if it is not.
- Reference homopolymer length: The homopolymer length is found by looking to the left and right of the variant position in the reference sequence. If two adjacent homopolymers are found, the longest one is chosen (see examples below).
- Reference repeat sequence length: The repeat length is found by looking to the left and right of the variant position in the reference sequence. If two adjacent repeats are found, the longest one is chosen.
- Sample homopolymer length: The homopolymer length is found by looking to the left and right of the variant position in the sample sequence. If two adjacent homopolymers are found, the longest one is chosen.
- Effective sample homopolymer length: If the variant is within the homopolymer observed in the reference, this value is the same as the sample homopolymer length. Otherwise, it is the length of the homopolymer that contains the variant in the sample sequence (see examples below).
- Sample repeat sequence length: The repeat length is found by looking to the left and right of the variant position in the sample sequence. If two adjacent repeats are found, the longest one is chosen.
- Reference repeat element: The k-mer that is repeated in the reference sequence (for example TCGA in TCGATCGATCGA).
- Sample repeat element: The k-mer that is repeated in the sample sequence.
The five examples below illustrate how the reference, sample, and effective sample homopolymer lengths differ (the reference and sample repeat lengths follow the same principles). In each example, the first sequence is the reference, the second is the sample, and the variant position is marked by *:
1. Two adjacent homopolymers
TTTTTTTTTTTTACCC
TTTTTTTTTTTTCCCC
            *
Reference: 12
Sample: 12
Effective sample: 4
2. One adjacent homopolymer
TTTTTTTTTTTTGATC
TTTTTTTTTTTTCATC
            *
Reference: 12
Sample: 12
Effective sample: 1
3. Homopolymer in sample is shorter than reference
TTTTTTTTTTTTGATC
TTTTTTTTTTTAGATC
           *
Reference: 12
Sample: 11
Effective sample: 11
4. Homopolymer in sample is longer than reference
TTTTTTTTTTTTGCCC
TTTTTTTTTTTTTCCC
            *
Reference: 12
Sample: 13
Effective sample: 13
5. Variant in the middle of the sample homopolymer
TTTTTTTTTTTT
TTATTTTTTTTT
  *
Reference: 12
Sample: 9
Effective sample: 9
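
The three lengths in the five examples above can be reproduced with a small Python sketch. It is an interpretation of the annotation descriptions, not the tool's implementation: it handles single-nucleotide substitutions only, uses exact matching instead of the HMM, breaks ties between equally long runs arbitrarily, and all function names are hypothetical.

```python
from typing import Tuple

Run = Tuple[int, int]  # inclusive 0-based (start, end) of a run of identical bases


def run_at(seq: str, pos: int) -> Run:
    """Maximal run of identical bases that contains `pos`."""
    start = end = pos
    while start > 0 and seq[start - 1] == seq[pos]:
        start -= 1
    while end < len(seq) - 1 and seq[end + 1] == seq[pos]:
        end += 1
    return start, end


def run_length(run: Run) -> int:
    return run[1] - run[0] + 1


def longest_nearby_run(seq: str, pos: int) -> Run:
    """Longest of: the run containing `pos`, and the runs directly left and right of it."""
    here = run_at(seq, pos)
    candidates = [here]
    if here[0] > 0:
        candidates.append(run_at(seq, here[0] - 1))
    if here[1] < len(seq) - 1:
        candidates.append(run_at(seq, here[1] + 1))
    return max(candidates, key=run_length)


def homopolymer_lengths(reference: str, sample: str, pos: int) -> Tuple[int, int, int]:
    """Return (reference, sample, effective sample) homopolymer lengths for an SNV at `pos`."""
    ref_run = longest_nearby_run(reference, pos)
    sample_len = run_length(longest_nearby_run(sample, pos))
    if ref_run[0] <= pos <= ref_run[1]:
        # Variant lies inside the reference homopolymer: same as the sample length.
        effective = sample_len
    else:
        # Otherwise use the run that actually contains the variant in the sample.
        effective = run_length(run_at(sample, pos))
    return run_length(ref_run), sample_len, effective
```

For example 1, `homopolymer_lengths("TTTTTTTTTTTTACCC", "TTTTTTTTTTTTCCCC", 12)` returns `(12, 12, 4)`: the variant sits outside the 12 bp reference run, so the effective length is that of the CCCC run containing it in the sample.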