Introduction

The Advanced Structural Variant Detection (beta) plugin includes a tool that identifies structural variants based on evidence from unaligned read ends and coverage information. Note that the functionality of the plugin described within this section is in beta. It is under active development and subject to change without notice.

The Advanced Structural Variant Detection (beta) tool is able to identify a greater number of variants with higher precision than the existing InDels and Structural Variants tool in the Workbench toolbox.

This tool comes with certain limitations:

The tool processes each chromosome in a genome individually, through several steps:

Coverage and complexity estimation: Each chromosome is divided into bins, and each bin keeps track of the coverage and the complexity of the region it covers. The complexity is calculated using the Lempel-Ziv complexity measure and is used to avoid calling structural variants in low-complexity regions. The coverage information is used together with the predicted breakpoints to determine copy number variations.
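As an illustration of how a per-bin complexity value can separate repetitive from non-repetitive sequence, the sketch below counts the phrases in an LZ78-style parse of each bin. The text above does not say which Lempel-Ziv variant the tool uses, so the choice of LZ78 is an assumption, and the function names lz78_phrase_count and bin_complexity and the bin_size parameter are hypothetical.

```python
def lz78_phrase_count(seq: str) -> int:
    """Count the phrases in an LZ78-style parse of seq.

    Each phrase is the shortest prefix of the remaining sequence that has
    not been seen as a phrase before. Repetitive sequence yields fewer
    phrases than varied sequence of the same length.
    """
    seen = set()
    phrase = ""
    count = 0
    for ch in seq:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        # Trailing characters that only repeat an earlier phrase.
        count += 1
    return count


def bin_complexity(chrom_seq: str, bin_size: int) -> list[int]:
    """Return one phrase count per fixed-size bin (hypothetical helper)."""
    return [
        lz78_phrase_count(chrom_seq[start:start + bin_size])
        for start in range(0, len(chrom_seq), bin_size)
    ]


if __name__ == "__main__":
    # The homopolymer bin scores lower than the mixed-sequence bin.
    print(bin_complexity("AAAAAAAAACGTACGT", 8))
```

A bin whose count falls below some threshold could then be treated as low-complexity. The threshold itself is not given in the text and would have to be chosen empirically.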

Breakpoint estimation: The tool looks for unaligned read ends at each chromosome position. Two consensus sequences are then built, one for the unaligned region and one for the aligned region, by taking the majority nucleotide in each column. Each breakpoint is labeled as either a 'left' or a 'right' breakpoint. This labeling is from the perspective of a deletion: a left breakpoint is on the left side of a deletion (which means there is a right unaligned end), and a right breakpoint is on the right side of a deletion (which means there is a left unaligned end). The tool then applies a mathematical model that, based on the ploidy of the sample, estimates how likely the breakpoint is to support a structural variant.
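The majority-count consensus described above can be sketched as follows. This is a minimal illustration rather than the tool's implementation: it assumes the unaligned ends have already been extracted as strings that all start at the breakpoint, so that position i of every string falls in the same column. The function name majority_consensus is hypothetical.

```python
from collections import Counter


def majority_consensus(read_ends: list[str]) -> str:
    """Build a consensus by taking the most common base in each column.

    Assumes all strings are anchored at the same position (column 0).
    Shorter reads simply do not vote in columns beyond their length.
    Ties are broken in favor of the base seen first in the column.
    """
    if not read_ends:
        return ""
    length = max(len(s) for s in read_ends)
    consensus = []
    for col in range(length):
        column = [s[col] for s in read_ends if col < len(s)]
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    # Three unaligned ends with one disagreement in each of the last two columns.
    print(majority_consensus(["ACGT", "ACGA", "ACTT"]))
```

For unaligned ends on the left of a breakpoint, the same function could be applied to the reversed strings so that the column nearest the breakpoint is still column 0.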

Copy number variation: The copy number variation (CNV) detection part of the algorithm looks for deletions and duplications. The algorithm uses the counts in each bin to determine whether there is a statistically significant difference when compared with a normal distribution, where the normal distribution is modeled on the mean and standard deviation of the counts across all bins (this part of the algorithm is based on [Yoon et al., 2009]). When several consecutive bins are statistically significant, they are combined using Fisher's method to calculate a total significance value, which is then used to determine whether there is a CNV.
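The two statistical steps above can be sketched with the standard library only. The example assumes a two-sided test against the fitted normal distribution, which the text does not specify, and the function names bin_p_values and fisher_combine are hypothetical. Fisher's method combines k p-values into the statistic -2 * sum(ln p), which follows a chi-square distribution with 2k degrees of freedom. For an even number of degrees of freedom the upper tail has a closed form, which is used here.

```python
import math
from statistics import mean, pstdev


def bin_p_values(counts: list[float]) -> list[float]:
    """Two-sided p-value of each bin count under a normal distribution
    fitted to the mean and standard deviation of all bin counts."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All bins identical: no bin deviates from the model.
        return [1.0] * len(counts)
    return [math.erfc(abs((c - mu) / sigma) / math.sqrt(2)) for c in counts]


def fisher_combine(p_values: list[float]) -> float:
    """Combine p-values (each > 0) with Fisher's method.

    Uses the closed-form chi-square survival function for 2k degrees of
    freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals x / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    counts = [10] * 9 + [100]
    p = bin_p_values(counts)
    print(p[-1])                     # p-value of the outlier bin
    print(fisher_combine([0.05, 0.05]))
```

Combining a run of individually borderline bins gives a smaller total p-value than any single bin, which is why consecutive significant bins are merged before a CNV is called.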

Resolving structural variants: After the potential breakpoints have been established, the consensus sequences for the aligned part of the reads and for the unaligned ends are aligned for pairs of left and right breakpoints. The alignment scores for each possible pairing of left and right breakpoints are stored in a matrix, and a dynamic programming algorithm is used to identify the most likely pairing of breakpoints. Breakpoints that remain unmatched after this pairing are then compared to the detected CNVs in order to resolve additional structural variants based on the coverage information.
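The text does not describe the dynamic programming algorithm in detail, so the following is only one possible sketch of pairing breakpoints from a score matrix. It assumes that left and right breakpoints are each sorted by position and that pairings may not cross, which reduces the problem to a recurrence similar to sequence alignment. The function name best_pairing and the min_score parameter are hypothetical.

```python
def best_pairing(scores: list[list[float]], min_score: float = 0.0):
    """Find the non-crossing pairing with the highest total score.

    scores[i][j] is the alignment score between left breakpoint i and
    right breakpoint j. Pairs scoring at or below min_score are never
    formed. Returns (total_score, [(left_index, right_index), ...]).
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    # best[i][j]: best total using the first i left and first j right breakpoints.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            s = scores[i - 1][j - 1]
            if s > min_score:
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + s)

    # Trace back through the table to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = scores[i - 1][j - 1]
        if s > min_score and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


if __name__ == "__main__":
    total, pairs = best_pairing([[5, 1], [1, 4]])
    print(total, pairs)
```

Breakpoints whose index does not appear in the returned pairs are the unmatched ones that would be passed on to the CNV comparison.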