Copy Number Variant Detection (WGS) (beta)

The Copy Number Variant Detection (WGS) (beta) tool is designed to identify copy number variants (CNVs) from whole genome sequencing (WGS) data including low-pass WGS data.

The tool takes a read mapping as input and is designed to not rely on control samples. This is achieved by estimating the expected coverage for diploid regions via the following steps:

If the tool is unable to find a suitable cluster, the median coverage for all chromosomes will be used as the expected coverage for diploid regions.

The presence of XX or XY chromosomes is automatically determined by the tool based on the observed coverage.

The tool defines non-overlapping windows that mapped reads are divided into, and calculates a coverage in each window that is adjusted for mapping quality and GC content. The resulting coverage in each window is normalized using the expected coverage for diploid regions. If any masking tracks are provided, the coverage will be ignored in the regions defined by these.

CNVs are detected using the normalized window coverage values with a hidden Markov model (HMM). The HMM consists of five different states for diploid chromosomes that each represent a copy number (0, 1, 2, 3, or 4). Only four states (0, 1, 2, or 3) are used for haploid chromosomes, i.e. when there is both an X and a Y chromosome. For each window, we calculate a probability for each state based on the normalized coverage value and the sample purity. The HMM considers each window as an event, and then tries to find the most likely sequence of copy number states that explains these events. The boundaries of detected CNVs are refined by using a window of half the size of the original window.

A coefficient of variation is calculated based on the coverage windows across all chromosomes. This is used to set the copy number state transition probabilities in the HMM. This serves the purpose of making it less likely to detect a CNV when using noisy data. The coefficient of variation is also used for automatically determining the window size.

The HMM calculates a CNV score by initially obtaining the probability of the sequence of events that occurred (i.e. the states it traversed). Next, the probability that there is no CNV is found using the same calculations, but where only copy neutral states are traversed. The CNV score is finally calculated as the log-ratio between these two probabilities.

To run Copy Number Variant Detection (WGS) (beta):

        Tools | LightSpeed (Image lightspeed_folder_open_16_n_p) | Copy Number Variant Detection (WGS) (beta) (Image cnv_16_n_p)

If you are connected to a CLC Server via your Workbench, you will be asked where you would like to run the analysis. We recommend that you run the analysis on a CLC Server when possible.

In the first wizard step, select a read mapping (figure 3.27).

Image CNV_WGS_1
Figure 3.27: Select a read mapping.

Next, specify regions that should be masked (figure 3.28). You can choose to mask centromeres, pseudoautosomal regions, and/or regions included in a blacklist masking track. Windows with more than 25% ambiguous nucleotides, i.e. N, will automatically be masked.

Image CNV_WGS_2
Figure 3.28: Provide tracks for sequence masking.

Centromere regions are repetitive which can result in the detection of false positive CNVs. Pseudoautosomal regions (PAR) are homologous between the X and Y chromosome which similarly can result in the detection of false positive CNVs. Tracks containing centromere regions and PAR are available for hg19 and hg38_no_alt_analysis_set in the Reference Data Manager. Excluding these regions is beneficial for estimating the expected coverage for diploid regions. Identified CNVs cannot overlap with centromere regions and/or PAR. We recommend masking centromere regions and PAR for CNV detection.

Blacklist regions can be used to mask regions where false positive CNVs are often detected, e.g. in regions with high or low mappability. See for example the ENCODE Blacklist described in (regions are available for download in the Data Availability section of the paper):

An identified CNV is discarded if more than half of the region it covers overlaps with blacklist regions. We recommend using blacklist regions such as the ENCODE Blacklist for CNV detection.

In the settings step, the following options are available (figure 3.29):

Image CNV_WGS_3
Figure 3.29: Specify settings for CNV detection.

Next, options that can be used to filter CNVs are available (figure 3.30):

Image CNV_WGS_4
Figure 3.30: Specify filtering for CNV detection.

The Copy Number Variant Detection (WGS) (beta) tool has the following limitations:



Subsections