GFF3 format
A GFF3 file contains a list of various types of annotations that can be linked
together with "Parent" and "ID" tags.
Here are some example of a few common tags used by the format:
- ID IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e., a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.
- Parent A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. A parent ID can only be used to indicate a `part of' relationship.
- Name The name that will be displayed as a label in the track view. Unlike IDs, there is no requirement that the Name be unique within the file.
Figure 6.3 exemplifies how tags are used to create annotations.
Figure
6.3: Example of a GFF3 file and the corresponding annotations from https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md.
In the workbench, the GFF3 importer will create an output track for each feature type present in the file.
- Gene-like types. These are types described in the Sequence Ontology as being subtypes of genes, e.g. ncRNA_gene, plastid_gene, tRNA_gene. Gene-like types are gathered together into an aggregated track with a name of the form "myFileName (Gene)". We recommend that users use this file in RNA-Seq.
- Transcript-like types. These are types described in the Sequence Ontology as being subtypes of transcripts that are neither primary transcripts (i.e., they do not require further processing to become functional), nor fusion transcripts. Again, there are several dozen, such as mRNA, lnc_RNA, threonyl_RNA. Transcript-like types are gathered together into an aggregated track with a name of the form "myFileName (RNA)". We recommend that users use this file in RNA-Seq.
- Exons. Where possible, exons are merged into their
parent features. For example, the output of the lines shown in figure 6.4 will be a single
mRNA feature with four exonic regions (from 1300 to 1500, 3000 to 3902, 5000 to 5500,and 7000 to 9000), and no exon features will be output on their own.
Figure 6.4: Exons will be merged into their parent features when the parent is not a "gene-like" type.However, in cases where the parent is of a "gene-like" type, exons are output as their own independent features in the exon track. Finding a lot of features in the exon track can suggest a problem with the file being imported. However, with large databases, this is more likely to be due to the database creators choosing to represent pseudogenes as exons with no transcript.
- CDS CDS regions with the same parent are joined
together into a single spliced feature. If CDS features do not have a parent
they are instead joined based on their ID, as for any other feature
(described below)
- Features with the same ID Regardless of the
feature type, features that have the same ID are merged into a single spliced
feature. For example, the output of the following figure 6.5 will be a single cDNA_match
feature with regions (1050..1500, 5000..5500, 7000..9000).
Figure 6.5: Features that have the same ID are merged into a single spliced feature.
Naming of features
When one of the following qualifiers is present, it will be used for naming in the prioritized order:
- the "Name" of the feature
- the "Name" of the first named parent of the feature
- the "ID" of the feature
- the "ID" of the first parent
- the type of the feature
Several examples of naming strategies are depicted in figure 6.6.
Figure
6.6: Naming of features.
Merged CDS features have a slightly different naming scheme. First, if a CDS feature in the GFF3 file has more than one parent, we create one CDS feature in the workbench for each parent, and each is merged with all other CDS features from the GFF3 file that has the parent feature as parent as well. The naming is then done in the following prioritized order:
- the "Name" of the feature, if all the constituent CDS features have the same "Name".
- the "Name" of the first named parent of the feature, if it has a name.
- the "Name" of the first of the merged CDS features with a name.
- the "ID" of the first of the merged CDS features with an ID.
- the "ID" of the parent.
For features with the same ID, the naming scheme is as follows:
- the "Name" of the feature, if all have the same "Name".
- If there is a set of common parents for the features and one of the common parents have a "Name", the name of the first common parent with a "Name" is used.
- If at least one feature has a name, the name of the first feature with the name is used.
- the "ID" of the first of the features
Limits of the GFF3 importer
Features are imported only if their SeqID (i.e., the value in the first column of the gff3) can be matched to the name of a chromosome in the genome. Matching need not be exact (see Special notes on chromosome names synonyms used during import). However, in some cases it may be necessary to manually edit either the names of the genomic sequences (for example in a fasta file), or the SeqIDs in the GFF3 file so that they match. Features without a match aren't imported. You can see the number of skipped features in the importer log.
The start and stop position of a feature cannot extend beyond the ends of a chromosome, unless the chromosome is explicitly marked as circular, which is indicated by « and » at the beginning and the end of the sequence.
Trying to import such a file will fail. One option is to delete the feature that extends beyond the end of the chromosome and to start the import again.
The following instances are not supported:
- Interpreting SOFA accession numbers. The type of the feature is constrained to be either: (a) a term from the "lite" sequence ontology, SOFA; or (b) a SOFA accession number, distinguished using the syntax SO:000000. The importer recognizes terms from SOFA as well as terms from the full Sequence Ontology, but will not translate accession numbers to types. So for example, features with type SO:0000316 will not be
interpreted as "CDS" but will be handled like any other type.
- The fasta directive ##FASTA. This FASTA section situated at the end of a GFF3 file specifies sequences of ESTs as well as of contigs. The GFF3 importer will ignore these sequences.
- Alignments. An aligned feature is handled as a single region, with the Gap
and Target attributes added as annotations. We do not use Gap and Target to
show how the feature aligns.
- Comment lines. We do not interpret lines
beginning with a #. Especially relevant are lines "##sequence-region seqid
start end" which some parsers use to perform bounds checking of features. Our
bounds checking is instead performed against the user-supplied genome.