Annotation and variant formats
File type | Suffix | Import | Export | Description |
VCF | .vcf | X | X | See note below |
GFF | .gff | X | X | To import as an annotation track, see Import tracks. To annotate a sequence or sequence list, see the plugin: http://www.clcbio.com/clc-plugin/annotate-sequence-with-gff-file/ |
GVF | .gvf | X | X | Special version of GFF for variant data. See GFF entry above. |
GTF | .gtf | X | X | Special version of GFF for gene annotation data. See GFF entry above. |
COSMIC variation database | .tsv | X | | Special format for COSMIC data |
BED | .bed | X | | See Import tracks |
Wiggle | .wig | X | | See Import tracks |
UCSC variant database table dump | .txt | X | | See Import tracks |
Complete Genomics master var files | masterVar | X | | Complete Genomics variant data format |
Special note on VCF export
For VCF export, counts from the variant track are put in CLCAD2 tags and coverage in DP tags. The values of the CLCAD2 tag follow the order of REF and ALT, with one value for the REF and one for each ALT. For example, if a homozygous variant has been identified at a certain position, the value of the GT field is 1/1 and the corresponding CLCAD2 value for the reference allele, which is always the first number in the CLCAD2 field, will be 0. Please note that this does not mean that the original mapping had no reads with the reference sequence; it means that the variant track being exported does not contain the reference allele.
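As an illustration of the REF-then-ALT ordering described above, the following minimal Python sketch pairs each CLCAD2 value with its allele. The function name `parse_clcad2` and the example values are hypothetical and are not part of the Workbench; only the ordering rule is taken from the text.

```python
def parse_clcad2(ref, alt, clcad2):
    """Pair each CLCAD2 count with its allele, in REF-then-ALT order.

    ref    -- content of the REF column, e.g. "A"
    alt    -- content of the ALT column, comma-separated, e.g. "G,T"
    clcad2 -- value of the CLCAD2 tag, comma-separated, e.g. "0,20,17"
    """
    alleles = [ref] + alt.split(",")
    counts = [int(value) for value in clcad2.split(",")]
    if len(alleles) != len(counts):
        raise ValueError("CLCAD2 needs one value for REF and one per ALT")
    return list(zip(alleles, counts))


# Hypothetical homozygous call (GT = 1/1): the first value belongs to REF
# and is 0 because the exported variant track holds no reference allele.
print(parse_clcad2("A", "G", "0,37"))  # [('A', 0), ('G', 37)]
```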
When exporting VCF files, there are three options:
- Reference sequence track
- Since the VCF format specifies that reference and allele sequences cannot be empty, deletions and insertions have to be padded with bases from the reference sequence. The export needs access to the reference sequence track in order to find the neighboring bases.
- Enforce diploid export
- This option generates a VCF file in which the allele values in the Genotype (GT) field for haploid variants are reported following the format for diploid variants (i.e. the GT allele values reported are 1/1). This ensures compatibility of the exported VCF file with programs for downstream variant analysis that expect strictly diploid genomes. The Enforce diploid option can be restricted to certain chromosomes, while others are reported as haploid.
If you export a variant track that has been filtered, there can be situations
where there is only one heterozygous variant at a given position. In this
case, the CLC Genomics Workbench will use a "." to denote an unknown genotype, so the GT field will be "1/.".
Note: the "Enforce diploid" option does NOT enforce diploidy for polyploid variant loci. Regardless of this setting, all variant alleles reported during variant calling are included in the exported VCF file.
It is important to note that the Enforce diploid export option creates a VCF file in diploid format, but it cannot resolve inconsistencies in the variant track used as input. If the variant track has three variants at a given position, three genotypes will be output. If the variant track has two variants at the same position that both claim to be homozygous, they will be output as two heterozygous variants. When exporting data created by the variant callers of CLC Genomics Workbench, this is usually not a problem, but data that has been imported into the CLC Genomics Workbench from other sources can be inconsistent with a diploid model when this diploid scheme is applied.
- Exceptions
- Some chromosomes can be excepted from the enforced diploid export. For a human genome, that would be relevant for the mitochondrion and for male X and Y chromosomes. For this option, you can select which chromosomes should be excepted. They will be exported in the standard way without assuming there should be two genotypes, and homozygous calls will just have one value in the GT field.
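The padding of insertions and deletions mentioned under the Reference sequence track option can be sketched as follows. This is a minimal illustration of the general VCF convention (pad with the base to the left, or with the base to the right when the event starts at the first position); it is not the Workbench's own code, and the function name `pad_indel` and the short reference string are hypothetical.

```python
def pad_indel(reference, pos, ref_allele, alt_allele):
    """Return (pos, REF, ALT) so that neither allele is empty.

    reference  -- chromosome sequence as a string
    pos        -- 1-based position of the first affected base; for an
                  insertion, the base that the new sequence is placed before
    ref_allele -- deleted bases, or "" for a pure insertion
    alt_allele -- inserted bases, or "" for a pure deletion
    """
    if ref_allele and alt_allele:
        # Both alleles are non-empty (e.g. an SNV): no padding needed.
        return pos, ref_allele, alt_allele
    if pos > 1:
        # Usual case: prepend the neighboring base to the left.
        pad = reference[pos - 2]
        return pos - 1, pad + ref_allele, pad + alt_allele
    # Event at the first base: no left neighbor, so append the base
    # that follows the event instead.
    pad = reference[pos - 1 + len(ref_allele)]
    return pos, ref_allele + pad, alt_allele + pad


reference = "ACGTTTAG"  # hypothetical reference sequence
print(pad_indel(reference, 5, "TTA", ""))  # (4, 'TTTA', 'T')
print(pad_indel(reference, 3, "", "GG"))   # (2, 'C', 'CGG')
```

Because the padding base has to be looked up at a neighboring position, a function like this cannot work without the reference sequence, which is why the export asks for the reference sequence track.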
Special note on former VCF export
In CLC Genomics Workbench 6.5, the CLCAD field was reported instead of CLCAD2. The difference between CLCAD and CLCAD2 is that the former follows the order in the GT (genotype) field in VCF, while the latter follows the order of the REF and ALT fields in VCF and is therefore more in line with the AD field reported by GATK and other sources.
Special notes on VCF import
The import process for VCF files into the CLC Genomics Workbench currently works as follows:
- For VCF rows that are reporting the reference base no variants are imported
- In cases where GT = 0/0, GT = ./., GT = 0/. or GT = ./0 no variants are imported at all
- In cases where GT = X/. or GT = ./X , and where X is not zero, a single variant is imported depending on the actual value of X
- In cases where GT = X/X and X is not zero, in Genomics Workbench 6.5 this will result in two independent variants. In version 6.5.1 they will be reported as a single homozygous variant
- In cases where GT = X/Y and X and Y are different but either one may be zero, two independent variants are created
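The GT rules listed above can be summarized in a short Python sketch. The function name `imported_alleles` and the `merge_homozygous` flag are hypothetical; accepting the phased separator "|" and single-value (haploid) GT fields are assumptions that go beyond the cases listed.

```python
import re


def imported_alleles(gt, merge_homozygous=True):
    """Return the allele indices that are imported as variants for one GT value.

    merge_homozygous=True  -- X/X gives a single homozygous variant
                              (behavior described for version 6.5.1)
    merge_homozygous=False -- X/X gives two independent variants
                              (behavior described for version 6.5)
    """
    # Assumption: treat phased ("|") and unphased ("/") genotypes alike.
    called = [int(a) for a in re.split(r"[/|]", gt) if a != "."]
    if not any(called):
        # 0/0, ./., 0/. and ./0: nothing is imported.
        return []
    if len(called) == 1:
        # X/. or ./X with X not zero: a single variant.
        return called
    if len(set(called)) == 1:
        # X/X with X not zero.
        return called[:1] if merge_homozygous else called
    # X/Y with X and Y different (either may be zero): two variants.
    return called


for gt in ["0/0", "./.", "1/.", "1/1", "0/1", "1/2"]:
    print(gt, imported_alleles(gt))
```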
Please note that some replacements cannot be interpreted by versions of CLC Genomics Workbench earlier than 7.0 and will therefore not be imported in those versions.
An example of this type of replacement is the following:
chr2 32843292 . TTTA T,TTT 100 PASS DP=44
Due to the VCF interpretation, the initial T base is removed from all alternatives.
In version 6.5.x, only the reference variant (TTA -> TTA) and the first deletion in ALT (TTA -> -) will be imported. The replacement TTA -> TT will not be imported.
In version 7.0 and later versions, all variants will be imported, but the replacement TTA -> TT will be imported as one deletion A -> - .
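One way to arrive at the alleles described for version 7.0 is to strip every leading base that REF and an ALT allele have in common. The sketch below applies that idea to the example row; it is an illustration only, not the Workbench's import code, and the function name `trim_shared_prefix` is hypothetical.

```python
def trim_shared_prefix(ref, alt):
    """Strip the leading bases shared by REF and ALT; "-" marks an empty allele."""
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    return ref[shared:] or "-", alt[shared:] or "-"


# REF and ALT columns of the example row above.
ref, alts = "TTTA", "T,TTT"
for alt in alts.split(","):
    print(trim_shared_prefix(ref, alt))
# ('TTA', '-')
# ('A', '-')
```

The first ALT allele gives the deletion TTA -> -, and the second gives the single-base deletion A -> -, matching the two variants described for version 7.0.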
Special notes on chromosome name synonyms used during import
When importing annotations as tracks, we try to make things simple for the user by having a set of chromosome names that are recognized as synonyms. The check on the chromosome name comparison is made by looking through the chromosomes in the order in which they are registered in the genome. The first match with any of the synonym names for a given chromosome is the chromosome to which the information will be added.
The synonyms applied are:
For any number N from 1 to 22 (inclusive):
N, chrN, chromosome_N, and NC_00000N (the number zero-padded to six digits) are seen as meaning the same thing. As concrete examples:
1 == chr1 == chromosome_1 == NC_000001
22 == chr22 == chromosome_22 == NC_000022
For any number N larger than 23:
N, chrN, chromosome_N are seen as meaning the same thing. As a concrete example:
26 == chr26 == chromosome_26
For chromosome names with letters, not numbers:
X, chrX, chromosome_X and NC_000023 are synonyms.
Y, chrY, chromosome_Y and NC_000024 are synonyms.
M, MT, chrM, chrMT, chromosome_M, chromosome_MT and NC_001807 are synonyms.
The accession numbers in the listings above (NC_XXXXXX) allow the NCBI hg19 human reference names to be matched against the names used by UCSC and, importantly, the names used by Ensembl. Thus, if you have the correct number of chromosomes in a human reference (i.e. 25 references, including the hg19 mitochondrion), that set of tracks can be used as the basis for downloading/importing annotations via Download Genomes, for example.
Note: These rules only apply when importing annotations as tracks, whether directly or via Download Genomes. Synonyms are not applied when doing BAM imports or when using the Annotate with GFF plugin. There, your reference names in the Workbench must exactly match the reference names used in your BAM file or GFF/GTF/GVF file respectively.
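The synonym rules and the first-match lookup described in this section can be sketched in Python as follows. The function names `synonyms` and `find_chromosome` are hypothetical, and the sketch reflects only the rules as written here: N = 23 is not covered by the text, so it is left without synonyms, and names with leading zeros such as chr01 are assumed not to match.

```python
import re

# Letter-named chromosomes and their accession numbers, as listed above.
LETTER_GROUPS = [
    {"X", "chrX", "chromosome_X", "NC_000023"},
    {"Y", "chrY", "chromosome_Y", "NC_000024"},
    {"M", "MT", "chrM", "chrMT", "chromosome_M", "chromosome_MT", "NC_001807"},
]


def synonyms(name):
    """Return the set of names treated as equivalent to `name`."""
    for group in LETTER_GROUPS:
        if name in group:
            return group
    plain = re.fullmatch(r"(?:chr|chromosome_)?([1-9]\d*)", name)
    accession = re.fullmatch(r"NC_(\d{6})", name)
    if plain:
        n = int(plain.group(1))
    elif accession and 1 <= int(accession.group(1)) <= 22:
        n = int(accession.group(1))
    else:
        return {name}
    if 1 <= n <= 22:
        return {str(n), f"chr{n}", f"chromosome_{n}", f"NC_{n:06d}"}
    if n > 23:
        return {str(n), f"chr{n}", f"chromosome_{n}"}
    return {name}  # N = 23 is not described in the text


def find_chromosome(name, genome_chromosomes):
    """Return the first registered chromosome matching `name` or a synonym."""
    wanted = synonyms(name)
    for chromosome in genome_chromosomes:  # order as registered in the genome
        if chromosome in wanted:
            return chromosome
    return None


genome = ["chr1", "chr2", "chrX", "chrM"]  # hypothetical reference names
print(find_chromosome("NC_000001", genome))  # chr1
print(find_chromosome("MT", genome))         # chrM
print(find_chromosome("chr3", genome))       # None
```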