Procedure for inspecting breakpoint signatures and inferring Structural Variants

The procedure for inspecting the created LBs and RBs and inferring structural variants from them works as follows:
  1. Calculate unaligned end consensus: For each breakpoint, we calculate the consensus of the unaligned ends. We do this by simple alignment without gaps. Having created the consensus, we exclude the unaligned ends which differ by more than 20% from the consensus, and recalculate the consensus. This prevents 'spuriously' unaligned ends that extend longer than other unaligned ends from impacting the tail of the consensus unaligned end. We say that a consensus unaligned end 'exists' if it is > 4 nucleotides long. For some breakpoints no unaligned end consensus will exist - either because it is too short, or because there are no reads left to calculate the consensus from (e.g. there is no consensus unaligned end for a breakpoint with two reads with completely different unaligned ends, even when they are longer than 4 nucleotides).
  2. 'Self-map' the unaligned end consensus: For each breakpoint, if the consensus unaligned end 'exists' and is long enough to be meaningfully mapped (we use 14nt or longer), we map the consensus unaligned end to the reference surrounding the breakpoint (we search in a window extending from 100 bp upstream to 100 bp downstream of the breakpoint). If it maps, a note is made that it is self-mapped.
  3. 'Link' breakpoint signatures: This step serves to determine which breakpoint signatures originate from the same structural variant. To do this, we go through all breakpoints and create a 'group' of breakpoints consisting of the ones with which the breakpoint is 'linked'. There are two ways in which breakpoints can be linked: either as 'Close' breakpoints or as 'Cross-mapped' breakpoints. These linkings are defined as follows:
    1. 'Close' breakpoints: unaligned end breakpoints are linked as 'close breakpoints' if they are either less than 5 nucleotides or the "length of the unaligned end consensus" nucleotides apart.
    2. 'Cross-mapped' breakpoints: two unaligned end breakpoints, X and Y, are cross mapped if all of the following apply:
      • They are NOT 'Close' breakpoints.
      • One is an LB and the other an RB.
      • The unaligned end of X maps to the reference sequence around Y (we search in a window extending from 100 bp upstream to 100 bp downstream of the breakpoint Y)
      • The unaligned end of Y maps to the reference sequence around X (we search in a window extending from 100 bp upstream to 100 bp downstream of the breakpoint X)
      • the mappings above are on the same strand.
      Note that 'Cross-mapped' breakpoints are also created if there are multiple hits. E.g.: If we have breakpoints A, B, C, D, E and F, and if A maps B,C and D, and C maps to A, E and F, A and C will be cross mapped.
  4. 'Process breakpoint groups': Next we go through all the 'groups' of breakpoints created in the linking step above. Depending on the number (whether there is a single, two or more than two) and on the types of breakpoint signatures in a group, we do the following:
  5. Create structural variant features: For each structural variant signature present after step 4 - except those that extend over too large a region! - , we create a structural variant feature. For those extending over too large a region, visualization is challenging and we instead add multiple features - one for each 'end' of the variant. To allow the user to see which of these 'split features' belong together, we give features that belong to the same structural variant a common 'split group' identifier.
  6. Add repeat information: Augment the predicted structural variants with repeat information: For structural variants for which we have identified the variant sequence (there are some, e.g. larger insertions, for which this is not possible) we attempt to identify if the variant sequence contains (perfect) repeats. We do this by searching the region around the structural variant for perfect repeat sequences. The region searched is 3 times the length of variant around the insertion/deletion point.
  7. Add information regarding 'zygosity' ('Variant ratio'): Augment the predicted breakpoints and structural variants with information related to zygosity: For all unaligned end breakpoints we examine all the reads that cover the breakpoint. We count the number of reads that are mapped across this position and have either an unaligned end or insertions or deletions in their mappings. We call these the 'Non-perfect mapped reads'. The remaining mapped reads that cover the position (that is, those that are mapped in their entire length, and for which there are no insertions or deletions in the alignment) are called 'Perfect mapped reads'. For breakpoints we report the fraction of the 'Non-perfect mapped' reads among all mapped reads (that is: 'Non-perfect mapped'/('Non-perfect mapped'+'Perfect mapped')). For a Structural variant that is generated from more breakpoints, we sum the number of reads for the individual breakpoints and report the fraction of these in the 'Variant ratio' column. This fraction is intended to give some idea of the zygosity of the breakpoint or structural variant. The closer the value to 1, the higher the likelihood that the variant is homozygous. However, as the 'Non-perfect mapped reads' include all reads with any unaligned end or any insertion or deletion, and not only the ones that are in perfect accordance with the breakpoint or structural variant called, it is NOT a perfect indicator of zygosity. It does, however, often give a good indication.
  8. Add information regarding the breakpoints used to construct the structural variant: For each structural variant a number of values related to the unaligned end breakpoints used to construct the predicted variant are recorded. These are provided in order to let the user assess the degree of evidence supporting the predicted structural variant. The values are: The number of breakpoints used to construct the variant, their positions, the mapping scores and sequence complexities of the unaligned ends, as well as the numbers of reads supporting them. The mapping scores are the similarity values between the unaligned end at the region of the reference to which it was mapped. These are values between 0 and 1, and the closer to 1, the better the match and thus the more reliable the inferred variant. The sequence complexity of an unaligned end is calculated as the product of the vocabulary-usage measures for word sizes up to five, and when multiple breakpoints are used to construct a structural variant, the complexity is calculated as the product of the individual complexities of the breakpoints 26.2. The algorithm has been found to predict some (typically longer and/or of type "complex") structural variants from unaligned end breakpoints with large support in terms of unaligned end lengths and supporting reads, but with low-complexity sequences. These are likely to be false positives, and the complexity values are provided to allow the user to be alerted of these. Finally, the number of reads supporting a predicted structural variant is defined to be the sum of the numbers of reads supporting the breakpoints used to construct the structural variant.


Footnotes

... breakpoints26.2
The vocabulary usage for an oligomer of a given size, k, can be defined as the ratio of the actual vocabulary size of a given sequence to the maximal possible vocabulary size for a sequence of that length. For example, the vocabulary usages U1, U2 and U3, for the oligomer AAA are 1/4, 1/16 and 1/64, and for the sequence GAC they are 3/4, 2/16 and 1/64. The complexity of the AAA oligomer is thus $ (1/4)*(1/16)*(1/64) = 2.44 * 10^{-4}$ and of the oligomer GAC $ (3/4)*(2/16)*(1/64) = 1.46 * 10 ^{-3}$