Introduction
Annotate with GFF File makes it very easy to annotate a sequence with annotations from a GFF (Generic Feature Format) or GTF (Gene Transfer Format) file. A GFF/GTF file does not contain any sequence information; it only contains a list of annotations. You can read more about the formats at http://www.sanger.ac.uk/resources/software/gff/spec.html and http://mblab.wustl.edu/GTF22.html.
There are many different versions of GFF and GTF. We support a large part of the GFF3 definition (see http://www.sequenceontology.org/gff3.shtml), and we also support the GTF format as defined at http://mblab.wustl.edu/GTF22.html. In other words, most GFF3 files can be used to annotate sequences using this tool.
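To make the difference between the two formats concrete, here is a minimal Python sketch of reading one annotation line. It is not the plugin's own parser. It assumes the usual nine tab-separated columns, with attributes written as key "value"; pairs in GTF and as key=value pairs in GFF3. The function names parse_attributes and parse_line are hypothetical, and the sketch ignores semicolons inside quoted values and URL-escaped characters.

```python
def parse_attributes(column):
    """Parse the ninth column in either GTF or GFF3 style (simplified)."""
    attrs = {}
    for part in column.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        if "=" in part and '"' not in part.split("=", 1)[0]:
            # GFF3 style: key=value
            key, value = part.split("=", 1)
        else:
            # GTF style: key "value"
            key, _, value = part.partition(" ")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def parse_line(line):
    """Split one annotation line into named fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqname, source, ftype, start, end, score, strand, phase, attributes = cols
    return {
        "seqname": seqname,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": parse_attributes(attributes),
    }
```

Note that each parsed line carries only coordinates and descriptive fields, which is why the sequence itself has to be supplied separately.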
The GFF and GTF files can contain various types of annotations. In general, the Annotate with GFF File action adds the annotation in each of the lines in the file to the chosen sequence, at the position or region in which the file specifies that it should go, and with the annotation type, name, description etc. as given in the file. However, special treatment is given to annotations of the types CDS, exon, mRNA, transcript and gene. For these, the following applies:
- A gene annotation is generated for each gene_id. The region annotated extends from the leftmost to the rightmost positions of all annotations that have the gene_id (gtf-style).
- CDS annotations that have the same transcript_id are joined to one CDS annotation (gtf-style). Similarly, CDS annotations that have the same parent are joined to one CDS annotation (gff-style).
- If there is more than one exon annotation with the same transcript_id, these are joined to one mRNA annotation. If there is only one exon annotation with a particular transcript_id, and no CDS with this transcript_id, a transcript annotation is added instead of the exon annotation (gtf-style).
- Exon annotations that have the same mRNA as parent are joined to one mRNA annotation. Similarly, exon annotations that have the same transcript as parent are joined to one transcript annotation (gff-style).
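The gtf-style rules above can be sketched in Python as follows. This is an illustration of the rules as described, not the plugin's implementation. The function name apply_gtf_rules and the input layout (one dictionary per line with type, start, end, gene_id and transcript_id) are assumptions made for the example. The text does not say what happens to a single exon whose transcript also has a CDS; the sketch assumes it is kept as an exon annotation.

```python
from collections import defaultdict


def apply_gtf_rules(features):
    """Group GTF-style features into joined annotations.

    Returns a list of (annotation_type, identifier, regions) tuples,
    where regions is a sorted list of (start, end) pairs.
    """
    by_gene = defaultdict(list)
    cds_by_tx = defaultdict(list)
    exon_by_tx = defaultdict(list)

    for f in features:
        region = (f["start"], f["end"])
        by_gene[f["gene_id"]].append(region)
        if f["type"] == "CDS":
            cds_by_tx[f["transcript_id"]].append(region)
        elif f["type"] == "exon":
            exon_by_tx[f["transcript_id"]].append(region)

    annotations = []

    # One gene per gene_id, spanning leftmost to rightmost position.
    for gene_id, regions in by_gene.items():
        left = min(start for start, _ in regions)
        right = max(end for _, end in regions)
        annotations.append(("gene", gene_id, [(left, right)]))

    # CDS lines sharing a transcript identifier become one joined CDS.
    for tx, regions in cds_by_tx.items():
        annotations.append(("CDS", tx, sorted(regions)))

    # Several exons become one mRNA; a lone exon without a CDS
    # becomes a transcript.
    for tx, regions in exon_by_tx.items():
        if len(regions) > 1:
            annotations.append(("mRNA", tx, sorted(regions)))
        elif tx not in cds_by_tx:
            annotations.append(("transcript", tx, sorted(regions)))
        else:
            # Assumption: case not covered by the text, exon kept as is.
            annotations.append(("exon", tx, sorted(regions)))

    return annotations
```

For example, a transcript with two exons and two CDS lines yields one gene, one joined CDS and one mRNA annotation, while a transcript with a single exon and no CDS yields a gene and a transcript annotation.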
Note that genes and transcripts are linked by name only (not by position, ID, etc.). For a comprehensive source of genomic annotation of genes and transcripts, we refer to the Ensembl web site at http://www.ensembl.org/info/data/ftp/index.html. On this page, you can download GTF files that can be used to annotate genomes for use in other analyses in the CLC Genomics Workbench.
This manual will show two examples of how to use the plugin to annotate a genome for the purposes of RNA-Seq analysis in the CLC Genomics Workbench version 6.5.x and earlier.
If you are using the CLC Genomics Workbench and are interested in standard reference genomic data, please also take a look at the Download Genomes tool as described in the CLC Genomics Workbench manual at: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Download_reference_genome_data.html.