GFF3 format

A GFF3 file contains a list of various types of annotations that can be linked together with "Parent" and "ID" tags.
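
As an illustration of how "ID" and "Parent" tags link annotations, the following minimal Python sketch reads the attributes column of a few GFF3 lines and groups child features under their parents. It is not the importer's own code. The function names `parse_attributes` and `children_by_parent` and the example records are hypothetical, and for simplicity every attribute value is split on commas.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse the ninth GFF3 column into a dict mapping tag -> list of values."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # GFF3 escapes reserved characters with URL-style percent encoding.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def children_by_parent(gff3_lines):
    """Map each parent ID to the (type, ID) pairs of the features that name it as Parent."""
    children = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        attrs = parse_attributes(columns[8])
        feature_id = attrs.get("ID", [None])[0]
        for parent_id in attrs.get("Parent", []):
            children[parent_id].append((columns[2], feature_id))
    return dict(children)


# Hypothetical example records (tab-separated, nine columns each).
example = [
    "##gff-version 3",
    "chr1\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
    "chr1\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001",
    "chr1\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00001;Parent=mRNA00001",
]
links = children_by_parent(example)
```

Here `links` maps the gene to its mRNA and the mRNA to its exon, which is the hierarchy that the "Parent" and "ID" tags encode.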

Here are some examples of common tags used by the format:

Figure 7.3 exemplifies how tags are used to create annotations.

Figure 7.3: Example of a GFF3 file and the corresponding annotations from https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md.

In the CLC Genomics Workbench, the GFF3 importer creates an output track for each feature type present in the file. In addition, it generates two aggregate tracks: an "(RNA)" track that combines all RNA feature types into one track (i.e., all the children of "mature_transcript", which is the parent of "mRNA", which in turn is the parent of "NSD_transcript"), and a "(Gene)" track that includes genes and gene-like annotation types such as ncRNA_gene, plastid_gene, and tRNA_gene. The "(RNA)" and "(Gene)" tracks differ from the tracks whose names end in "_mRNA" and "_Gene" in that they compile all relevant annotations in a single track, making them the tracks of choice for subsequent analyses (RNA-Seq, for example).

Naming of features

A feature is named using the first of the following qualifiers that is present, in order of priority:

  1. the "Name" of the feature
  2. the "Name" of the first named parent of the feature
  3. the "ID" of the feature
  4. the "ID" of the first parent
  5. the type of the feature
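
The priority order above can be sketched as a small Python function. This is an illustration of the listed rules only, not the importer's implementation. The name `feature_name` and the dictionary layout (optional "Name" and "ID" keys plus a "type" key, with parents given in file order) are assumptions made for this example.

```python
def feature_name(feature, parents):
    """Pick a display name for a feature following the five-step priority list.

    feature: dict with a 'type' key and optional 'Name' and 'ID' keys.
    parents: list of parent dicts (same layout), in the order they are listed.
    """
    # 1. the "Name" of the feature
    if feature.get("Name"):
        return feature["Name"]
    # 2. the "Name" of the first named parent
    for parent in parents:
        if parent.get("Name"):
            return parent["Name"]
    # 3. the "ID" of the feature
    if feature.get("ID"):
        return feature["ID"]
    # 4. the "ID" of the first parent
    if parents and parents[0].get("ID"):
        return parents[0]["ID"]
    # 5. the type of the feature
    return feature["type"]
```

For instance, an exon without a "Name" whose second parent is named "P2" would be called "P2", because a named parent takes priority over the exon's own "ID".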

Several examples of naming strategies are depicted in Figure 7.6.

Figure 7.6: Naming of features.

Merged CDS features have a slightly different naming scheme. First, if a CDS feature in the GFF3 file has more than one parent, one CDS feature is created in the CLC Genomics Workbench for each parent, and each is merged with all other CDS features from the GFF3 file that share that parent. The name is then chosen in the following prioritized order:

  1. the "Name" of the feature, if all the constituent CDS features have the same "Name".
  2. the "Name" of the first named parent of the feature, if it has a name.
  3. the "Name" of the first of the merged CDS features with a name.
  4. the "ID" of the first of the merged CDS features with an ID.
  5. the "ID" of the parent.
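
A minimal Python sketch of the merged CDS scheme follows, assuming each CDS is represented as a dictionary with optional "Name" and "ID" keys and a "Parent" list. The helper names `group_cds_by_parent` and `merged_cds_name` are hypothetical, and because each merged CDS belongs to exactly one parent in this sketch, rule 2 is reduced to checking that single parent.

```python
from collections import defaultdict


def group_cds_by_parent(cds_features):
    """Place each CDS in one group per parent, so a CDS with two parents appears twice."""
    groups = defaultdict(list)
    for cds in cds_features:
        for parent_id in cds.get("Parent", []):
            groups[parent_id].append(cds)
    return dict(groups)


def merged_cds_name(parts, parent):
    """Name one merged CDS built from 'parts', all of which share 'parent'."""
    names = [part.get("Name") for part in parts]
    # 1. the shared "Name", if every constituent CDS carries the same one
    if names and names[0] and all(name == names[0] for name in names):
        return names[0]
    # 2. the "Name" of the parent, if it has one
    if parent.get("Name"):
        return parent["Name"]
    # 3. the first constituent "Name", then 4. the first constituent "ID"
    for key in ("Name", "ID"):
        for part in parts:
            if part.get(key):
                return part[key]
    # 5. the "ID" of the parent
    return parent.get("ID")


# Hypothetical example: cds1 has two parents, cds2 has one.
cds = [
    {"ID": "cds1", "Parent": ["mRNA1", "mRNA2"]},
    {"ID": "cds2", "Name": "B", "Parent": ["mRNA1"]},
]
groups = group_cds_by_parent(cds)
name_for_mrna1 = merged_cds_name(groups["mRNA1"], {"ID": "mRNA1"})
name_for_mrna2 = merged_cds_name(groups["mRNA2"], {"ID": "mRNA2", "Name": "T2"})
```

In this example the CDS merged under mRNA1 takes the name of its only named part, while the CDS under mRNA2 takes the name of its parent.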

For features with the same ID, the naming scheme is as follows:

  1. the "Name" of the feature, if all have the same "Name".
  2. If there is a set of common parents for the features and one of the common parents has a "Name", the name of the first common parent with a "Name" is used.
  3. If at least one feature has a name, the name of the first feature with a name is used.
  4. the "ID" of the first of the features
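
The rules for features sharing an ID can be sketched in the same way. The function name `shared_id_name` and the data layout are assumptions for illustration. In particular, "first common parent" is interpreted here as the first entry in the first feature's "Parent" list that all the features share, which may differ from how the importer orders parents.

```python
def shared_id_name(features, parents_by_id):
    """Name a group of features that all carry the same ID.

    features: non-empty list of dicts with an 'ID' key and optional 'Name' and 'Parent' keys.
    parents_by_id: dict mapping a parent ID to that parent's dict.
    """
    names = [feature.get("Name") for feature in features]
    # 1. the shared "Name", if all features have the same one
    if names[0] and all(name == names[0] for name in names):
        return names[0]
    # 2. the "Name" of the first common parent that has one
    common = [
        parent_id
        for parent_id in features[0].get("Parent", [])
        if all(parent_id in feature.get("Parent", []) for feature in features)
    ]
    for parent_id in common:
        parent_name = parents_by_id.get(parent_id, {}).get("Name")
        if parent_name:
            return parent_name
    # 3. the "Name" of the first feature that has one
    for name in names:
        if name:
            return name
    # 4. the "ID" of the first of the features
    return features[0]["ID"]
```

With two features whose only shared parent is "g2", the result is the name of "g2", even if the first feature also lists a named parent "g1" that the second feature lacks.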

Limits of the GFF3 importer

Features are imported only if their SeqID (i.e., the value in the first column of the GFF3 file) can be matched to the name of a chromosome in the genome. Matching need not be exact (see Special notes on chromosome names synonyms used during import). However, in some cases it may be necessary to manually edit either the names of the genomic sequences (for example, in a FASTA file) or the SeqIDs in the GFF3 file so that they match. Features without a match are not imported. The number of skipped features is reported in the importer log.
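
Before importing, it can be useful to check which SeqIDs in a GFF3 file have no counterpart among the chromosome names. The sketch below is a hypothetical helper, `unmatched_seqids`, that uses exact string matching only. It is therefore stricter than the importer, which also accepts chromosome name synonyms, so a SeqID reported here may still be matched during import.

```python
from collections import Counter


def unmatched_seqids(gff3_lines, chromosome_names):
    """Count feature lines per SeqID that does not exactly match a chromosome name."""
    known = set(chromosome_names)
    unmatched = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # the rest of the file is sequence data, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqid = line.split("\t", 1)[0]
        if seqid not in known:
            unmatched[seqid] += 1
    return unmatched


# Hypothetical example lines and chromosome names.
lines = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t10\t.\t+\t.\tID=g1",
    "scaffold_7\t.\tgene\t5\t50\t.\t+\t.\tID=g2",
    "scaffold_7\t.\tmRNA\t5\t50\t.\t+\t.\tID=m2;Parent=g2",
    "##FASTA",
    ">chr1",
    "ACGT",
]
report = unmatched_seqids(lines, ["chr1", "chr2"])
```

Any SeqID listed in `report` is a candidate for renaming, either in the GFF3 file or in the genome sequence names.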

The start and stop positions of a feature cannot extend beyond the ends of a chromosome, unless the chromosome is explicitly marked as circular, which is indicated by « and » at the beginning and the end of the sequence.

Trying to import a file in which a feature extends beyond the end of a linear chromosome will fail. One option is to delete the offending feature and start the import again. Another option is to make the track circular. To do so, convert the linear track into a sequence using the Convert from Tracks tool. Open the sequence and right-click on its name to choose the option "Make sequence circular" from the drop-down menu. Then convert the now circular sequence back into a track using the Convert to Tracks tool. Importing the GFF3 file should now work.
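
To locate the offending features before deciding between these options, the coordinates in the GFF3 file can be compared against the chromosome lengths. The following Python sketch assumes the lengths are known and supplied as a dictionary. The function name `features_beyond_end` and its parameters are hypothetical and not part of the CLC Genomics Workbench.

```python
def features_beyond_end(gff3_lines, chromosome_lengths, circular=()):
    """Return (line number, SeqID, start, end) for features outside a linear chromosome.

    chromosome_lengths: dict mapping chromosome name to its length in bases.
    circular: names of chromosomes marked as circular, which are not checked.
    """
    offending = []
    for number, line in enumerate(gff3_lines, start=1):
        if line.startswith("##FASTA"):
            break  # no feature lines after this directive
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        seqid, start, end = columns[0], int(columns[3]), int(columns[4])
        length = chromosome_lengths.get(seqid)
        if length is None or seqid in circular:
            continue  # unknown or circular chromosome: nothing to check here
        # GFF3 coordinates are 1-based and inclusive.
        if start < 1 or end > length:
            offending.append((number, seqid, start, end))
    return offending


# Hypothetical example: the second line runs past the end of a 5000-base chromosome.
gff3 = [
    "chr1\t.\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\t.\tgene\t4900\t5100\t.\t+\t.\tID=g2",
]
problems = features_beyond_end(gff3, {"chr1": 5000})
```

The reported line numbers point to the features that would need to be deleted, or that indicate the chromosome should be made circular.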

The following instances are not supported: