Copy Number Variant Detection (WGS)

The Copy Number Variant Detection (WGS) tool identifies copy number variants (CNVs) from whole genome sequencing (WGS) data, including low-pass WGS data.

The tool takes a read mapping as input and is designed to not rely on control samples. This is achieved by estimating the expected coverage for diploid regions via the following steps:

If the tool is unable to find a robust cluster, the median coverage for all chromosomes will be used as the expected coverage for diploid regions.

The presence of XX or XY chromosomes is automatically determined by the tool based on the observed coverage. Note that only XX or XY can be assigned by the tool.

The tool divides the reference into non-overlapping windows and assigns the mapped reads to them. For each window, it calculates a coverage that is adjusted for mapping quality and GC content. The coverage in each window is then normalized using the expected coverage for diploid regions. If any masking tracks are provided, coverage in the regions they define is ignored.
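The normalization step described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name, the use of window indices for masking, and the choice of None for masked windows are all assumptions, and the per-window coverage is assumed to be already adjusted for mapping quality and GC content.

```python
def normalized_window_coverage(window_coverage, expected_diploid, masked=None):
    """Divide each window's coverage by the expected diploid coverage.

    window_coverage: per-window coverage values (assumed already adjusted
        for mapping quality and GC content).
    expected_diploid: expected coverage for diploid regions (must be > 0).
    masked: optional set of window indices to ignore (hypothetical
        representation of the masking tracks).
    Masked windows are returned as None.
    """
    if expected_diploid <= 0:
        raise ValueError("expected_diploid must be positive")
    masked = masked or set()
    return [
        None if index in masked else coverage / expected_diploid
        for index, coverage in enumerate(window_coverage)
    ]


if __name__ == "__main__":
    print(normalized_window_coverage([30.0, 15.0, 45.0, 30.0], 30.0, masked={3}))
```

Under this convention, a value near 1.0 corresponds to the expected diploid coverage, and values near 0.5 or 1.5 suggest a loss or gain of one copy in a pure diploid sample.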

CNVs are detected from the normalized window coverage values using a hidden Markov model (HMM). For diploid chromosomes, the HMM has 11 states, each representing a copy number (0-10). Only 10 states (0-9) are used for haploid chromosomes, i.e. when there is both an X and a Y chromosome. For each window, the tool calculates a probability for each state based on the normalized coverage value and the sample purity. The HMM treats each window as an event and finds the most likely sequence of copy number states that explains these events. The boundaries of detected CNVs are then refined using a window of half the size of the original window.
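To illustrate how an HMM can find the most likely sequence of copy number states, the sketch below runs the Viterbi algorithm over normalized coverage values for a diploid chromosome. It is not the tool's implementation. The Gaussian emission model, the purity formula, the uniform initial distribution, and the parameters stay_prob and sigma are assumptions chosen for illustration only.

```python
import math


def expected_ratio(copy_number, purity, neutral=2):
    """Expected normalized coverage for a copy number, assuming the sample
    is a mix of tumor cells (fraction `purity`) and copy-neutral cells."""
    return (purity * copy_number + (1.0 - purity) * neutral) / neutral


def log_emission(value, copy_number, purity, sigma):
    """Log density of a Gaussian centered on the expected ratio (assumed model)."""
    mu = expected_ratio(copy_number, purity)
    return -0.5 * ((value - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi_copy_number(norm_cov, n_states=11, purity=1.0, stay_prob=0.99, sigma=0.15):
    """Return the most likely copy number state for each window."""
    if not norm_cov:
        return []
    states = range(n_states)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))

    # Initialization: uniform prior over states (assumption).
    score = [
        -math.log(n_states) + log_emission(norm_cov[0], s, purity, sigma)
        for s in states
    ]
    back_pointers = []

    # Recursion: for each window, keep the best predecessor of every state.
    for value in norm_cov[1:]:
        new_score, pointers = [], []
        for s in states:
            best_prev = max(
                states, key=lambda p: score[p] + (log_stay if p == s else log_move)
            )
            transition = log_stay if best_prev == s else log_move
            new_score.append(
                score[best_prev] + transition + log_emission(value, s, purity, sigma)
            )
            pointers.append(best_prev)
        score = new_score
        back_pointers.append(pointers)

    # Traceback from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    coverage = [1.0, 1.02, 0.98, 1.5, 1.52, 1.49, 1.51, 1.0, 0.99]
    print(viterbi_copy_number(coverage))
```

In this sketch, a high stay_prob makes the model reluctant to change state, so a single noisy window is unlikely to be called as a CNV, while a run of consistently elevated windows is.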

A coefficient of variation is calculated from the coverage windows across all chromosomes and is used to set the copy number state transition probabilities in the HMM. This makes it less likely that a CNV is detected in noisy data. The coefficient of variation is also used to determine the window size automatically.
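As a small illustration, the coefficient of variation is the standard deviation divided by the mean. The sketch below computes it for a list of window coverage values. The function change_probability is purely hypothetical: the manual does not state how the coefficient is mapped to transition probabilities, so the formula and its parameters are assumptions that only demonstrate the stated direction of the effect (noisier data, fewer state changes).

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    average = mean(values)
    if average == 0:
        raise ValueError("mean coverage is zero; coefficient of variation is undefined")
    return pstdev(values) / average


def change_probability(cv, base=1e-3, scale=10.0):
    """Hypothetical mapping: the probability of changing copy number state
    shrinks as the coefficient of variation grows. Not the tool's formula."""
    return base / (1.0 + scale * cv)


if __name__ == "__main__":
    quiet = [1.0, 1.02, 0.98, 1.01, 0.99]
    noisy = [1.0, 1.4, 0.6, 1.3, 0.7]
    for label, data in (("quiet", quiet), ("noisy", noisy)):
        cv = coefficient_of_variation(data)
        print(label, round(cv, 3), change_probability(cv))
```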

The HMM calculates a CNV score by initially obtaining the probability of the sequence of events that occurred (i.e. the states it traversed). Next, the probability that there is no CNV is found using the same calculations, but where only copy neutral states are traversed. The CNV score is finally calculated as the log-ratio between these two probabilities.
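The score calculation described above can be sketched as a difference of log-probabilities. This is an illustration under assumptions, not the tool's implementation: the Gaussian emission model, the two-valued transition model, the uniform initial distribution, the natural logarithm, and the parameter values are all chosen for the example, and purity is taken to be 1.

```python
import math


def path_log_prob(norm_cov, path, n_states=11, stay_prob=0.99, sigma=0.15, neutral=2):
    """Log-probability of observing norm_cov while traversing `path`
    (assumed Gaussian emissions centered on copy_number / neutral)."""
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))
    total = -math.log(n_states)  # uniform initial distribution (assumption)
    previous = None
    for value, state in zip(norm_cov, path):
        if previous is not None:
            total += log_stay if state == previous else log_move
        mu = state / neutral
        total += -0.5 * ((value - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        previous = state
    return total


def cnv_score(norm_cov, path, neutral=2):
    """Log-ratio between the probability of the traversed path and the
    probability of a path through copy-neutral states only."""
    neutral_path = [neutral] * len(path)
    return path_log_prob(norm_cov, path) - path_log_prob(norm_cov, neutral_path)


if __name__ == "__main__":
    print(cnv_score([1.5, 1.5, 1.5], [3, 3, 3]))
```

With this definition, a path that only visits copy-neutral states scores 0, and the score grows as the observed coverage fits the called copy number better than it fits the neutral state.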

To run Copy Number Variant Detection (WGS):

        Tools | LightSpeed | Copy Number Variant Detection (WGS)

If you are connected to a CLC Server via your Workbench, you will be asked where you would like to run the analysis. We recommend that you run the analysis on a CLC Server when possible.

In the first wizard step, select a read mapping.

Next, options are available for window size selection and sequence masking (figure 3.28):

Tracks containing centromere regions, PAR regions and Umap mappability scores are available for hg19 and hg38_no_alt_analysis_set in the Reference Data Manager. The tracks contain regions that are known to exhibit systematically abnormal coverage for different reasons, and we recommend masking with all three tracks when calling CNVs. Note that windows with more than 25% ambiguous nucleotides, i.e. N, are automatically masked. In addition, detected CNVs are required to have less than 50% overlap with centromere and PAR regions and less than 40% overlap with N-masked regions.
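The thresholds stated above can be expressed as simple checks. The sketch below is illustrative only: the function names, the half-open coordinate convention, the assumption that regions within a list do not overlap each other, and the choice to measure centromere and PAR overlap as one combined fraction are assumptions, not details confirmed by the manual.

```python
def window_is_n_masked(sequence, max_n_fraction=0.25):
    """True if more than 25% of the window's nucleotides are N."""
    if not sequence:
        return False
    return sequence.upper().count("N") / len(sequence) > max_n_fraction


def overlap_fraction(start, end, regions):
    """Fraction of the half-open interval [start, end) covered by `regions`,
    given as (start, end) pairs assumed not to overlap each other."""
    length = end - start
    if length <= 0:
        raise ValueError("interval must have positive length")
    covered = sum(
        max(0, min(end, region_end) - max(start, region_start))
        for region_start, region_end in regions
    )
    return covered / length


def cnv_passes_masking(start, end, centromere_par_regions, n_masked_regions):
    """Apply the two overlap limits: < 50% centromere/PAR and < 40% N-masked."""
    return (
        overlap_fraction(start, end, centromere_par_regions) < 0.50
        and overlap_fraction(start, end, n_masked_regions) < 0.40
    )


if __name__ == "__main__":
    print(window_is_n_masked("ACNN"))
    print(cnv_passes_masking(0, 100, [(0, 49)], [(60, 90)]))
```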

Figure 3.28: Specify how the window size should be determined and provide tracks for sequence masking.

In the sample step, the following options are available (figure 3.29):

Figure 3.29: Provide sample information for CNV detection.

Next, options that can be used to filter CNVs are available (figure 3.30):

Figure 3.30: Specify filtering for CNV detection.

The Copy Number Variant Detection (WGS) tool has the following limitations:


