Structural Variant Caller

The Structural Variant Caller identifies structural variants in read mappings based on evidence from unaligned read ends and coverage information. It builds on the same unaligned-end read signatures as the existing InDels and Structural Variants tool, but relies more heavily on statistical reasoning and on more refined components for consensus generation, mapping, and alignment of the unaligned end sequences.

The tool,

The tool has the following limitations:

The tool processes each chromosome in a genome individually, through several steps:

Breakpoint estimation: the tool looks for unaligned read ends at each chromosome position. At each breakpoint, two consensus sequences are constructed across the reads: one for the unaligned end and one for the aligned region. The consensus for the unaligned end is based on a majority count of k-mers, while the consensus for the aligned region uses the nucleotide count in each column. Breakpoints are labeled as either 'left' or 'right'. This labeling is from the perspective of a deletion: a left breakpoint is on the left side of a deletion (which means the reads have a right unaligned end), and a right breakpoint is, conversely, on the right side of the deletion (with a left unaligned end). For WGS applications, the tool makes a probabilistic assessment of how likely the breakpoint is to support a structural variant, based on the coverage, the unaligned end read count, and the specified ploidy of the sample.
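The per-column nucleotide count used for the aligned region can be illustrated with a minimal sketch. The function name, the use of '-' for positions a read does not cover, and the tie-breaking rule are assumptions made for illustration; the tool's actual implementation is not described here.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Return the majority nucleotide in each alignment column.

    Assumes all reads are strings of equal length, with '-' marking
    positions a read does not cover (an illustrative convention).
    """
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            # No read covers this column.
            consensus.append("N")
            continue
        # Ties are resolved in favor of the base seen first.
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


reads = ["ACGTA", "ACGTT", "ACCTA", "-CGTA"]
print(column_consensus(reads))  # ACGTA
```

Single-read errors, such as the C in the third read or the T at the end of the second, are outvoted by the other reads in the same column.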

Coverage and complexity estimation (WGS applications only): each chromosome is divided into bins. The tool then calculates the coverage and the complexity of the reference region in each bin. The complexity is calculated using the Lempel-Ziv complexity measure and is used to avoid calling structural variants in low-complexity regions. The coverage information is stored and, together with breakpoint information, used to find potential copy number variations (see below).
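To give a feel for how a Lempel-Ziv measure separates repetitive from diverse sequence, the sketch below counts the phrases in a simple LZ76-style parsing of each bin. Which Lempel-Ziv variant the tool uses, how it normalizes the value, and the bin size are not stated in this section, so all of these are assumptions, as are the function names.

```python
def lempel_ziv_complexity(seq):
    """Count the phrases in a simple LZ76-style parsing of seq.

    Each phrase is the shortest substring starting at the current position
    that has not occurred earlier in the sequence (overlap allowed).
    """
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from earlier sequence.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def bin_complexities(reference, bin_size):
    """Complexity of each consecutive, non-overlapping bin (hypothetical helper)."""
    return [
        lempel_ziv_complexity(reference[start:start + bin_size])
        for start in range(0, len(reference), bin_size)
    ]


print(bin_complexities("AAAAACGT", 4))  # [2, 4]
```

A homopolymer run such as AAAA parses into few phrases, while a bin of the same length with varied content parses into more, which is what makes the measure useful for flagging low-complexity regions.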

Resolving structural variants: after breakpoints have been established, different combinations of left and right breakpoints are paired together. For each pair, the unaligned and aligned consensus sequences from one breakpoint are aligned to the other breakpoint. The alignment scores from each possible pairing are then stored in a matrix, and a dynamic programming algorithm is used to identify which breakpoints to pair together. Each breakpoint left unmatched in this step is then used on its own to search for additional structural variants, either (1) as smaller insertions or deletions inferred from self-mapping evidence (where the unaligned consensus maps back close to its own location), or (2) as supporting evidence for CNV losses or gains inferred from the coverage analysis (see below).
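The section does not say which dynamic programming formulation is used, so the following is only one possible sketch of selecting pairs from a score matrix. It assumes that left and right breakpoints are each sorted by position, that chosen pairs may not cross, that a pair is only allowed when its score exceeds a hypothetical `min_score`, and that the goal is to maximize the total score. None of these assumptions come from the tool itself.

```python
def pair_breakpoints(scores, min_score=0.0):
    """Choose non-crossing (left, right) index pairs maximizing the total score.

    scores[i][j] is the alignment score between left breakpoint i and
    right breakpoint j. Breakpoints may be left unpaired.
    """
    n_left = len(scores)
    n_right = len(scores[0]) if scores else 0
    best = [[0.0] * (n_right + 1) for _ in range(n_left + 1)]
    for i in range(1, n_left + 1):
        for j in range(1, n_right + 1):
            score = scores[i - 1][j - 1]
            skip = max(best[i - 1][j], best[i][j - 1])
            pair = best[i - 1][j - 1] + score if score > min_score else float("-inf")
            best[i][j] = max(skip, pair)

    # Trace back through the table to recover the chosen pairs.
    pairs = []
    i, j = n_left, n_right
    while i > 0 and j > 0:
        score = scores[i - 1][j - 1]
        if score > min_score and best[i][j] == best[i - 1][j - 1] + score:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


scores = [[9, 1, 0],
          [2, 8, 1]]
print(pair_breakpoints(scores))  # [(0, 0), (1, 1)]
```

In this example the third right breakpoint stays unmatched, which corresponds to the case where a breakpoint is subsequently examined on its own.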

Copy number variation (WGS applications only): the counts in each bin along the chromosome are tested for a statistically significant deviation from a normal distribution, where the normal distribution is modeled on the mean and standard deviation of the counts across all the bins (this part of the algorithm is based on [Yoon et al., 2009]). When several consecutive bins are statistically significant, they are combined using Fisher's method to calculate a total significance value, which is then used to determine if there is a CNV. Each CNV is then used with any nearby breakpoints to determine if there is a deletion or duplication present.
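The two statistical ingredients named above can be sketched with the Python standard library alone. The two-sided test, the use of the population standard deviation, and the function names are assumptions; the tool's thresholds and exact test are not given in this section.

```python
import math
from statistics import mean, pstdev


def bin_p_values(counts):
    """Two-sided p-value of each bin count under a normal model fitted to all bins."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return [1.0] * len(counts)
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return [math.erfc(abs(c - mu) / (sigma * math.sqrt(2))) for c in counts]


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has a closed form for even
    degrees of freedom.
    """
    k = len(p_values)
    # Clamp to avoid log(0) if a p-value underflowed to zero.
    half = -sum(math.log(max(p, 1e-300)) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


counts = [100] * 9 + [10]
print(bin_p_values(counts)[-1])         # p-value of the low-coverage bin
print(fisher_combined_p([0.05, 0.05]))  # combined p-value for two adjacent bins
```

Fisher's method assumes the combined p-values are independent, which is worth keeping in mind when interpreting a combined value for neighboring bins.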

Running the Structural Variant Caller tool

To run the Structural Variant Caller tool, go to:

        Toolbox | Resequencing Analysis | Variant Detection | Structural Variant Caller

Once the tool wizard has opened (figure 25.18), choose the read mapping you would like to analyze. The Structural Variant Caller tool accepts read mappings as either reads tracks or stand-alone read mappings.

Figure 25.18: Select one or several reads tracks or stand-alone read mappings.

In the next wizard step, specify the ploidy and application for the sample you are analyzing (figure 25.19). You can also choose to ignore broken pair reads. Ignoring broken pairs will typically reduce the computational time of the analysis. It may reduce sensitivity, but may also improve precision, depending on the source of the broken pair reads.

Figure 25.19: Set the application parameters for the tool and specify if broken pair reads should be ignored.

In the next steps you are asked to specify filter settings. The settings depend on whether you have specified the whole genome sequencing or the targeted application. The filter settings for the whole genome sequencing application (figure 25.20) are:

Figure 25.20: Set filters for the whole genome sequencing application.

For the targeted application the filters are (figure 25.21):

Figure 25.21: Set filters for the targeted sequencing application.


