VCF import
Handling of the genotype (GT) field
The import process for VCF files into CLC Genomics Workbench currently works as follows:
- In cases where GT = ./., no variants are imported at all.
- In cases where GT = X/. or GT = ./X, and X is not zero, a single variant is imported, determined by the value of X.
- In cases where GT = X/Y and X and Y are different (either one may be zero), two independent variants are created.
Note: The GT field is mandatory for import of sample variants (i.e., when FORMAT and sample columns are present).
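The rules above can be summarized in a small sketch. This is an illustration only, not the importer's code: the function name `gt_import_rule` and the returned labels are hypothetical. It assumes unphased diploid genotypes written with "/", and reports cases the list does not describe (such as X/X or ./0) as "not covered" rather than guessing.

```python
def gt_import_rule(gt: str) -> str:
    """Name the documented rule that applies to an unphased, diploid GT value."""
    first, second = gt.split("/")
    if first == "." and second == ".":
        # GT = ./. : nothing is imported.
        return "no variants"
    if first == "." or second == ".":
        called = second if first == "." else first
        # The text only covers a single called allele that is not zero.
        return "one variant" if called != "0" else "not covered"
    if first != second:
        # GT = X/Y with X != Y; either allele may be zero.
        return "two variants"
    # GT = X/X is not described in the list above.
    return "not covered"


for gt in ["./.", "1/.", "./2", "0/1", "1/2"]:
    print(gt, "->", gt_import_rule(gt))
```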
Import of counts
To add variant count values to the imported variants, one of the following tags must be present in your VCF file: CLCAD2, AD, AO, or RO. Where more than one of these is present, they are prioritized in the following order:
- CLCAD2
- AD
- AO and/or RO
Count values will be taken from the tag type with the highest priority, with values for other tags imported as annotations.
For example, if a VCF file has CLCAD2:AD for three possible variants with values 2,3,4:5,6,7, then the CLCAD2 values would be imported as counts, with each variant having a single count value (2,3,4 respectively), while the AD value for each variant would be included as an annotation (5,6,7 respectively).
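The prioritization can be sketched as follows, using the CLCAD2:AD example above. The function name `split_counts` and the returned dictionaries are hypothetical and chosen for illustration. The sketch assumes that all count values are integers and treats AO and RO as sharing the lowest priority level.

```python
# Priority groups from the list above; AO and RO share the lowest level.
PRIORITY_GROUPS = [("CLCAD2",), ("AD",), ("AO", "RO")]
COUNT_TAGS = {tag for group in PRIORITY_GROUPS for tag in group}


def split_counts(format_field: str, sample_field: str):
    """Split the count tags of one sample column into (counts, annotations)."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Keep only the count tags; other FORMAT fields (e.g. GT) are ignored here.
    present = {
        key: [int(v) for v in value.split(",")]
        for key, value in zip(keys, values)
        if key in COUNT_TAGS
    }
    for group in PRIORITY_GROUPS:
        chosen = [tag for tag in group if tag in present]
        if chosen:
            counts = {tag: present[tag] for tag in chosen}
            annotations = {t: v for t, v in present.items() if t not in chosen}
            return counts, annotations
    return {}, {}


counts, annotations = split_counts("CLCAD2:AD", "2,3,4:5,6,7")
print(counts)       # {'CLCAD2': [2, 3, 4]}
print(annotations)  # {'AD': [5, 6, 7]}
```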
Import of multiple samples and multiple VCF files
When importing a single VCF file, you will get a track for each sample contained in the VCF file.
When a VCF file contains information about more than one sample, you can choose to import the samples together into a single variant track, or import each sample into an individual variant track by checking the batch mode button in the lower left corner of the wizard, as shown in figure 7.2. In batch mode, the samples are imported individually into separate track files. In non-batch mode, the variants for one sample are kept in one track, so a sample that is contained in several input files is merged across those files.
If you select multiple VCF files, each containing multiple samples, non-batch mode generates one track file for each unique sample. Batch mode imports each of the original VCF files with its entire content, as if the VCF files were imported one by one, generating a track file for each sample in each file. For example, suppose VCF file 1 contains sample 1 and sample 2, and VCF file 2 contains sample 2 and sample 3. When both files are imported in non-batch mode, you get three track files, one for each of the samples 1, 2, and 3. When they are imported in batch mode, you get four track files: sample 1 from file 1, sample 2 from file 1, sample 2 from file 2, and sample 3 from file 2.
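The two-file example can be reproduced with a short sketch. The function `planned_tracks` and the track labels are hypothetical. The code only enumerates which tracks would result under the rules described above and does not perform any import.

```python
def planned_tracks(files: dict, batch: bool) -> list:
    """List the tracks expected for a mapping of file name -> sample names."""
    if batch:
        # One track per sample per file, as if each file were imported alone.
        return [
            f"{sample} from {name}"
            for name, samples in files.items()
            for sample in samples
        ]
    # Non-batch: one track per unique sample, merged across files.
    unique = []
    for samples in files.values():
        for sample in samples:
            if sample not in unique:
                unique.append(sample)
    return unique


files = {
    "file 1": ["sample 1", "sample 2"],
    "file 2": ["sample 2", "sample 3"],
}
print(planned_tracks(files, batch=False))  # three tracks
print(planned_tracks(files, batch=True))   # four tracks
```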
Import of complex variants with reference overlap
Allelic variants that overlap but do not cover exactly the same range are called complex variants.
To declare that variants are represented using reference overlap, add the line "##refOverlap=true" to the VCF header. If no such line is found in the header, the default is "false", i.e., the file is assumed to contain no reference overlap alleles that need to be replaced by overlapping alleles.
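As a minimal sketch, the header check could look like the following. The function name `uses_reference_overlap` is hypothetical, and the exact, case-sensitive match on the header line is an assumption; the importer's own parsing may be more lenient.

```python
def uses_reference_overlap(header_lines) -> bool:
    """Return True if the header declares reference overlap, else the default False."""
    for line in header_lines:
        # Assumption: the line must match exactly, ignoring surrounding whitespace.
        if line.strip() == "##refOverlap=true":
            return True
    return False


header = ["##fileformat=VCFv4.2", "##refOverlap=true"]
print(uses_reference_overlap(header))                      # True
print(uses_reference_overlap(["##fileformat=VCFv4.2"]))    # False
```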
- Detection of complex regions:
When reading a reference overlap VCF file, a complex region is initiated when overlapping alleles are called on different VCF lines. Complex regions can contain hundreds of complex variants, for example if one allele has a long deletion. Two alleles overlap if they share a reference nucleotide position. An insertion overlaps a non-insertion allele only if it is positioned internally in that allele, not if it is positioned at either boundary.
- Replacing reference overlap alleles in complex regions:
For each position with a complex alternate allele, a number of placeholder reference overlap alleles (refoPloidy) are expected to be present, so that the total number of alleles in the genotype field is equal to the ploidy at that position in the sample genome. For each such position in the complex region, it is then determined how many reference overlap alleles are replaced by overlapping alternate and reference alleles (numReplaced). If any reference overlap alleles remain, they are assigned the allele depth: newAD=origAD*(refoPloidy-numReplaced)/refoPloidy, where origAD is the original allele depth for all reference overlap alleles at the position. In the "Reference overlap and depth estimate" example above (Table 2), the allele depth of the re-imported reference variant will be: newAD=6*(2-1)/2=3. In the "Reference overlap" example above (Table 2), no reference overlap alleles will remain (numReplaced=2).
- Alternative import of "Reference overlap" representation: The method above can be used for both "Reference overlap" and "Reference overlap and depth estimate" representations. However, a VCF file generated with the "Reference overlap" representation can also be imported correctly by simply importing it as if it had no reference overlap, and subsequently removing all reference alleles with zero CLCAD2 allele depth.
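The overlap rule and the depth formula from the list above can be sketched as follows. The function names and the coordinate convention are assumptions made for illustration: alleles are given as 0-based, half-open reference intervals (start, end), and an insertion is an empty interval with start equal to end. How two insertions at the same position are treated is not described above, so the sketch simply reports no overlap for that case.

```python
def overlaps(a, b) -> bool:
    """Decide whether two alleles, given as (start, end) intervals, overlap."""
    a_is_insertion = a[0] == a[1]
    b_is_insertion = b[0] == b[1]
    if a_is_insertion and b_is_insertion:
        return False  # Assumption: this case is not covered by the text.
    if a_is_insertion:
        # Internal means strictly inside, not at either boundary.
        return b[0] < a[0] < b[1]
    if b_is_insertion:
        return a[0] < b[0] < a[1]
    # Two non-insertions overlap if they share a reference position.
    return a[0] < b[1] and b[0] < a[1]


def remaining_ref_overlap_depth(orig_ad, refo_ploidy, num_replaced):
    """newAD = origAD * (refoPloidy - numReplaced) / refoPloidy."""
    return orig_ad * (refo_ploidy - num_replaced) / refo_ploidy


print(overlaps((10, 15), (14, 20)))           # True: position 14 is shared
print(overlaps((10, 10), (10, 15)))           # False: insertion at the boundary
print(remaining_ref_overlap_depth(6, 2, 1))   # 3.0
```

With numReplaced equal to refoPloidy, the function returns zero, which corresponds to the case where no reference overlap alleles remain.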
Read more about complex variants with reference overlap in section Complex variant representations and VCF reference overlap.
Import of variants represented as symbolic alleles
VCF Import supports the following symbolic alleles:
- <DEL> - Deletions
- <INS> - Insertions
- <INV> - Inversions
- <DUP:TANDEM> - Tandem duplications
If possible, variants are imported to standard variant tracks. However, variants longer than 100,000 base pairs and variants that do not contain sufficient sequence information are imported to annotation tracks. Read about track types in Track types.
The following variants are imported to annotation tracks if represented as symbolic alleles in the VCF:
- Deletions longer than 100,000 base pairs
- All insertions
- All inversions
- All tandem duplications
Note that tandem duplications can also be represented as insertions, as described for the InDels output from InDels and Structural Variants in The Structural Variants and InDels output.
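The routing of symbolic alleles can be summarized in a short sketch. The function name `target_track` and the returned labels are hypothetical. Sending symbolic deletions of up to 100,000 base pairs to a variant track is inferred from the lists above rather than stated outright, so treat that branch as an assumption.

```python
MAX_VARIANT_TRACK_LENGTH = 100_000  # base pairs


def target_track(symbolic_allele: str, length: int) -> str:
    """Return the track type a supported symbolic allele would be imported to."""
    if symbolic_allele == "<DEL>":
        if length > MAX_VARIANT_TRACK_LENGTH:
            return "annotation track"
        # Assumption: shorter symbolic deletions go to a standard variant track.
        return "variant track"
    if symbolic_allele in ("<INS>", "<INV>", "<DUP:TANDEM>"):
        # All symbolic insertions, inversions, and tandem duplications.
        return "annotation track"
    raise ValueError(f"unsupported symbolic allele: {symbolic_allele}")


print(target_track("<DEL>", 250_000))      # annotation track
print(target_track("<DEL>", 500))          # variant track
print(target_track("<DUP:TANDEM>", 500))   # annotation track
```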