Annotation and variant formats
Please note that all of the annotation and variant formats can be imported as tracks (see Import tracks). GFF, GVF and GTF formats can also be imported as annotations on a standard (i.e., non-track) sequence or sequence list using functionality provided by the Annotate with GFF plugin (http://www.qiagenbioinformatics.com/plugins/annotate-with-gff-file/).
File type | Suffix | Import | Export | Description |
---|---|---|---|---|
Annotation CSV export | .csv | | X | Annotations in csv format |
Annotation Excel 2010 | .xlsx | | X | Annotations in Excel format |
Annotation Excel 97 - 2007 | .xls | | X | Annotations in Excel format |
VCF | .vcf | X | X | See note below |
GFF | .gff | X | | To import as annotation track, see Import tracks. |
GVF | .gvf | X | X | Special version of GFF for variant data. See GFF entry above. |
GTF | .gtf | X | X | Special version of GFF for gene annotation data. See GFF entry above. |
GFF3 | .gff3 | X | X | To import and export as annotation track, see GFF3 format. |
COSMIC variation database | .tsv | X | | Special format for COSMIC data |
BED | .bed | X | X | See Import tracks |
Wiggle | .wig | X | X | See Import tracks |
UCSC variant database table dump | .txt | X | | See Import tracks |
Special note on former VCF export
In CLC Genomics Workbench 6.5, the CLCAD field was reported instead of CLCAD2. The difference between the two is that CLCAD follows the order of the GT (genotype) field in VCF, while CLCAD2 follows the order of the REF and ALT fields in VCF and is therefore more in line with the AD field reported by GATK and other sources.
Special notes on VCF import
Note! Please also see Import tracks.
The import process for VCF files into the CLC Genomics Workbench currently works as follows:
- In cases where GT = ./., no variants are imported at all.
- In cases where GT = X/. or GT = ./X, and where X is not zero, a single variant is imported depending on the actual value of X.
- In cases where GT = X/X and X is not zero, in Genomics Workbench 6.5 this will result in two independent variants. In version 6.5.1 they will be reported as a single homozygous variant.
- In cases where GT = X/Y and X and Y are different but either one may be zero, two independent variants are created.
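The genotype rules above can be sketched in a few lines of Python. This is only an illustration of the listed cases, not the Workbench's actual importer: the function name `imported_alleles` is hypothetical, the sketch follows the 6.5.1 behavior for GT = X/X, and it raises an error for genotypes the list does not describe (such as 0/0 or 0/.).

```python
def imported_alleles(gt: str) -> list[int]:
    """Return the allele indices imported as variants for an unphased diploid GT.

    Hypothetical helper that mirrors the rules listed above. Cases the
    rules do not cover raise ValueError rather than guessing.
    """
    first, second = gt.split("/")
    called = [int(allele) for allele in (first, second) if allele != "."]

    if not called:
        return []  # GT = ./. : nothing is imported

    if len(called) == 1:
        if called[0] == 0:
            raise ValueError(f"GT {gt!r} is not covered by the documented rules")
        return called  # GT = X/. or ./X : a single variant for X

    x, y = called
    if x == y:
        if x == 0:
            raise ValueError(f"GT {gt!r} is not covered by the documented rules")
        return [x]  # GT = X/X : one homozygous variant (6.5.1 behavior)

    return [x, y]  # GT = X/Y : two independent variants (either may be 0)


for gt in ("./.", "1/.", "./2", "1/1", "0/1", "1/2"):
    print(gt, imported_alleles(gt))
```

Phased genotypes written with `|` are outside the scope of this sketch.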
Please note that some replacements cannot be interpreted in versions that are older than CLC Genomics Workbench 7.0; such replacements will therefore not be imported in previous versions of CLC Genomics Workbench.
An example of this type of replacement is the following:
chr2 32843292 . TTTA T,TTT 100 PASS DP=44
Due to the VCF interpretation, the initial T base is removed from all alternatives.
In version 6.5.x, only the reference allele (TTA -> TTA) and the first deletion in ALT (TTA -> -) will be imported. The replacement TTA -> TT will not be imported.
In version 7.0 and later versions, all variants will be imported, but the replacement TTA -> TT will be imported as one deletion, A -> -.
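The interpretation of the example record can be traced with a small Python sketch. It assumes that every ALT allele shares its first base with REF (as in the record above) and that the remaining replacement is reduced by trimming the bases the two alleles have in common from the left. The function names are hypothetical, and the Workbench may place the resulting deletion differently within the repeat.

```python
def strip_anchor(ref: str, alts: list[str]) -> list[tuple[str, str]]:
    """Remove the shared initial base from REF and from every ALT allele."""
    if any(not alt or alt[0] != ref[0] for alt in alts):
        raise ValueError("every ALT allele must start with the first REF base")
    return [(ref[1:], alt[1:]) for alt in alts]


def reduce_pair(ref: str, alt: str) -> tuple[str, str]:
    """Trim the common leading bases of a REF -> ALT pair (assumed reduction)."""
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return (ref or "-", alt or "-")  # "-" marks an empty allele


# REF and ALT taken from the example record: TTTA -> T,TTT
for ref, alt in strip_anchor("TTTA", ["T", "TTT"]):
    print(f"{ref} -> {alt or '-'}", "reduces to", reduce_pair(ref, alt))
```

For the example record this gives TTA -> - for the first alternative and reduces TTA -> TT to the deletion A -> -, matching the description for version 7.0 and later.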
To get a variant count as part of your imported variant, one of the following VCF tags has to be present in your VCF file: CLCAD2, AD, or AO.
The import of the CLCAD2/AD/AO tags is prioritized in the following order:
- CLCAD2
- AD
- AO
If the CLCAD2 is missing, and only AD is present, then AD is used in the "count" column.
For example, if the file has CLCAD2:AD in the FORMAT field and a sample has the values 2,3,4:5,6,7 for three possible variants, then the CLCAD2 tag is imported as the count, so each of the three variants gets a single count value (2, 3, and 4 respectively). The AD tag is imported as an annotation, so all three variants will have "5,6,7" in the AD column, as for any unknown format tag.
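The tag priority described above can be illustrated with a short Python sketch. It is a simplified model rather than the Workbench implementation: `split_counts` and `COUNT_TAG_PRIORITY` are hypothetical names, and the input is just the FORMAT string and one sample string from a VCF line.

```python
COUNT_TAG_PRIORITY = ("CLCAD2", "AD", "AO")  # highest priority first


def split_counts(format_field: str, sample_field: str):
    """Pick the tag used for the count column; keep the rest as annotations."""
    sample = dict(zip(format_field.split(":"), sample_field.split(":")))

    # The first tag in priority order that is present supplies the counts.
    count_tag = next((tag for tag in COUNT_TAG_PRIORITY if tag in sample), None)
    counts = [int(v) for v in sample[count_tag].split(",")] if count_tag else []

    # Every other tag is kept verbatim, like any unknown format tag.
    annotations = {tag: value for tag, value in sample.items() if tag != count_tag}
    return count_tag, counts, annotations


print(split_counts("CLCAD2:AD", "2,3,4:5,6,7"))
print(split_counts("AD", "5,6,7"))
```

With CLCAD2:AD the counts come from CLCAD2 and AD is kept whole as an annotation; when only AD is present, AD supplies the counts.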
Special notes on chromosome name synonyms used during import
When importing annotations as tracks, a set of chromosome names is recognized as synonyms to make things simpler for the user. Chromosome names are compared by looking through the chromosomes in the order in which they are registered in the genome. The first chromosome that matches any of the synonym names is the chromosome to which the information will be added.
The synonyms applied are:
For any number N between 1 and 22 (inclusive):
N, chrN, chromosome_N, and NC_00000N are seen as meaning the same thing. As concrete examples:
1 == chr1 == chromosome_1 == NC_000001
22 == chr22 == chromosome_22 == NC_000022
For any number N larger than 23:
N, chrN, chromosome_N are seen as meaning the same thing. As a concrete example:
26 == chr26 == chromosome_26
For chromosome names with letters, not numbers:
X, chrX, chromosome_X and NC_000023 are synonyms.
Y, chrY, chromosome_Y and NC_000024 are synonyms.
M, MT, chrM, chrMT, chromosome_M, chromosome_MT and NC_001807 are synonyms.
The accession numbers in the listings above (NC_XXXXXX) allow NCBI hg19 human reference names to be matched against the names used by UCSC and, vitally, the names used by Ensembl. Thus, if you have the correct number of chromosomes in a human reference (i.e. 25 references, including the hg19 mitochondria), that set of tracks can be used as the basis for downloading/importing annotations via Download Genomes, for example.
Note: These rules only apply when importing annotations as tracks, whether that is done directly or via Download Genomes. Synonyms are not applied when doing BAM imports or when using the Annotate with GFF plugin. There, your reference names in the Workbench must exactly match the reference names used in your BAM file or GFF/GTF/GVF file respectively.
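The synonym matching used for track import can be sketched in Python as follows. This is an illustration of the rules listed above under a few assumptions, not the Workbench's code: the names `canonical` and `find_chromosome` are hypothetical, matching is treated as case-sensitive, and the number 23 (which the lists above do not mention) is handled like any other number.

```python
import re

# Accession synonyms from the lists above: NC_000001..NC_000022, X, Y and M.
ACCESSIONS = {f"NC_{n:06d}": str(n) for n in range(1, 23)}
ACCESSIONS.update({"NC_000023": "X", "NC_000024": "Y", "NC_001807": "M"})

# Optional "chr" or "chromosome_" prefix followed by a number, X, Y, M or MT.
NAME_PATTERN = re.compile(r"(?:chromosome_|chr)?(\d+|X|Y|MT|M)")


def canonical(name: str) -> str:
    """Map a chromosome name to a key shared by all of its synonyms."""
    if name in ACCESSIONS:
        return ACCESSIONS[name]
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return name  # no synonym rule applies; only an exact match will work
    core = match.group(1)
    return "M" if core == "MT" else core


def find_chromosome(name: str, genome_names: list[str]):
    """Return the first genome chromosome, in registered order, that matches."""
    key = canonical(name)
    for candidate in genome_names:
        if canonical(candidate) == key:
            return candidate
    return None


genome = ["chr1", "chr22", "chrX", "chrM"]
for query in ("NC_000001", "chromosome_22", "X", "MT", "scaffold_7"):
    print(query, "->", find_chromosome(query, genome))
```

Because the genome is scanned in its registered order, a genome that contains two synonymous names would always receive the data on the first of them.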