IMGT
For the IMGT format, the header contains 15 elements, separated by "|". Only the following are read and used during import:
- (1) Accession number(s).
- (2) The segment name, including chain, segment type, ID and allele, in the format: <chain><type><ID>*<allele>, for example "TRAV1*01".
Chain and segment type are the same as for IMSEQ. For B cell constant genes, the segment type is instead the letter corresponding to the encoded isotype.
- (3) Species.
- (4) Allele functionality: F (functional), P (pseudogene) or ORF (open reading frame).
- (5) Extracted label(s): EX1, CH1 and C-REGION for C segments, and V-REGION, D-REGION, J-REGION for V, D, J segments, respectively.
- (8) The codon start position, counting from 1, or "NR" for non-coding labels.
- (9) The number of nucleotides added at the 5' end, in the format +n.
The IMGT database contains chains, segment types and labels that are not listed above and are not supported. These are silently ignored.
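The header fields listed above can be extracted with a few lines of Python. This is only a minimal sketch, not the importer itself: the `ImgtHeader` type, the `parse_imgt_header` function and the example accession "X00000" are hypothetical, and the sketch assumes that the chain is always three letters and the segment type a single letter.

```python
import re
from typing import NamedTuple, Optional


class ImgtHeader(NamedTuple):
    accession: str
    chain: str
    segment_type: str
    segment_id: str
    allele: str
    species: str
    functionality: str
    label: str
    codon_start: Optional[int]  # None when element (8) is "NR" or empty
    added_5prime: int           # 0 when element (9) is empty


# Assumption: <chain> is three letters and <type> is one letter.
NAME_RE = re.compile(r"^([A-Z]{3})([A-Z])([^*]*)\*(.+)$")


def parse_imgt_header(line: str) -> Optional[ImgtHeader]:
    """Return the fields used during import, or None if the header is unusable."""
    fields = [f.strip() for f in line.lstrip(">").split("|")]
    if len(fields) < 9:
        return None
    name = NAME_RE.match(fields[1])
    if name is None:
        return None
    chain, segment_type, segment_id, allele = name.groups()
    codon_start = int(fields[7]) if fields[7].isdigit() else None
    added = re.fullmatch(r"\+(\d+)", fields[8])
    return ImgtHeader(
        accession=fields[0],
        chain=chain,
        segment_type=segment_type,
        segment_id=segment_id,
        allele=allele,
        species=fields[2],
        functionality=fields[3],
        label=fields[4],
        codon_start=codon_start,
        added_5prime=int(added.group(1)) if added else 0,
    )


# Made-up header for illustration only; elements (6), (7) and (10)-(15) are placeholders.
example = ">X00000|TRAV1*01|Homo sapiens|F|V-REGION|e6|e7|1|+2|e10|e11|e12|e13|e14|e15|"
header = parse_imgt_header(example)
```

Elements that are not listed above are split off but never inspected, which mirrors the description that only seven of the 15 elements are used.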
While the IMSEQ format provides the position of the conserved amino acid, this needs to be calculated for the IMGT format. For this, the V region needs to be provided with gaps such that the conserved amino acid is found at approximately position 104 in the translated amino acid sequence. When downloading sequences from the IMGT database in fasta format, the "F+ORF+in-frame P nucleotide sequences with IMGT gaps" should be used. Alternatively, the corresponding "nt-WithGaps-F+ORF+inframeP" flat file can be downloaded from IMGT/GENE-DB.
If using custom reference data that is not downloaded from the IMGT database, it is recommended to use the IMSEQ format and specify the position of the conserved amino acid.
When importing files in the IMGT format, the following options are available (see figure 2.14):
- Which allele functionality(ies) should be imported. At least one must be chosen.
- Which species should be imported. After choosing the fasta file, the desired species can be chosen from the list of species identified in the file.
If element (9) in the header is not empty, the corresponding number of nucleotides is removed from the 5' end of the sequence.
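The import options and the 5' trimming can be pictured as a simple filter over the parsed records. The sketch below assumes each record is a plain dictionary with the hypothetical keys `species`, `functionality`, `added_5prime` and `sequence`; the function names are illustrative and do not correspond to anything in the software.

```python
def species_in_records(records):
    """List the species found in the file, as offered for selection."""
    return sorted({rec["species"] for rec in records})


def select_records(records, species, functionalities):
    """Keep records matching the chosen options and trim added 5' nucleotides."""
    if not functionalities:
        # At least one allele functionality must be chosen.
        raise ValueError("choose at least one allele functionality")
    selected = []
    for rec in records:
        if rec["species"] in species and rec["functionality"] in functionalities:
            trimmed = rec["sequence"][rec["added_5prime"]:]
            selected.append({**rec, "sequence": trimmed})
    return selected


# Toy records with placeholder sequences.
records = [
    {"species": "Homo sapiens", "functionality": "F", "added_5prime": 2, "sequence": "NNACGT"},
    {"species": "Mus musculus", "functionality": "P", "added_5prime": 0, "sequence": "ACGT"},
]
kept = select_records(records, {"Homo sapiens"}, {"F", "ORF"})
```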
Identification of the conserved amino acid
The nucleotide sequence (with IMGT gaps for the V segments), starting from the position given in element (8) of the header, is first translated to amino acids using the standard genetic code. The position of the conserved amino acid is then calculated and, if identified, converted to the position of the first nucleotide in the corresponding codon. Segments where the amino acid cannot be identified are silently ignored.
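The translation and the conversion from an amino acid position back to a nucleotide position can be sketched as follows. The handling of gaps is an assumption, since it is not specified here: a codon made entirely of IMGT gap characters ("...") is kept as a "." placeholder so that later positions retain their numbering, and any other codon that cannot be translated becomes "?". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt, codon_start=1):
    """Translate from the 1-based codon start; '*' marks a stop codon."""
    nt = nt.upper()
    amino_acids = []
    for i in range(codon_start - 1, len(nt) - 2, 3):
        codon = nt[i:i + 3]
        if codon == "...":
            amino_acids.append(".")  # assumption: full gap codon keeps its slot
        else:
            amino_acids.append(CODON_TABLE.get(codon, "?"))
    return "".join(amino_acids)


def codon_first_nucleotide(aa_position, codon_start=1):
    """1-based position of the first nucleotide of a 1-based amino acid position."""
    return codon_start + 3 * (aa_position - 1)
```

With a codon start of 1, amino acid position 104 corresponds to nucleotide position 1 + 3 × 103 = 310 in the gapped sequence.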
For the V segments, the amino acid position is calculated as follows:
- If the amino acid at position 104 is C, then position 104 is used.
- Otherwise, the position of the last occurrence of C after position 104 is used, if present.
- Otherwise, if the amino acid at position 104, 105 or 103 (in this order) is one base pair mutation away from C and not a stop codon (i.e. R, S, C, F, G, W, Y), then this position is used.
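The three V segment rules above can be written as a short function. This is a sketch under the assumption that the input is the translated, gapped amino acid sequence and that positions are counted from 1; `v_conserved_position` is a hypothetical name.

```python
# Amino acids listed above as one base pair mutation away from C.
NEAR_CYS = set("RSCFGWY")


def v_conserved_position(aa):
    """Return the 1-based position of the conserved amino acid, or None."""

    def at(position):
        return aa[position - 1] if 1 <= position <= len(aa) else None

    # Rule 1: C at position 104.
    if at(104) == "C":
        return 104
    # Rule 2: last C after position 104 (0-based index 104 is position 105).
    last = aa.rfind("C", 104)
    if last != -1:
        return last + 1
    # Rule 3: near-C amino acid at position 104, 105 or 103, in this order.
    for position in (104, 105, 103):
        if at(position) in NEAR_CYS:
            return position
    return None
```

A return value of `None` corresponds to a segment that is silently ignored.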
For the J segments, all 3 open reading frames (starting from nucleotide position 1, 2 or 3) are used. Note that "." below denotes any amino acid. The amino acid position is calculated as follows:
- The amino acid sequence "(F|W)G.G", if present, is identified.
- The open reading frame that contains the amino acid sequence, no stop codon and has the lowest nucleotide starting position, if any, is used.
- Otherwise, the open reading frame that contains the amino acid sequence and at least one stop codon, if any, is used. If multiple open reading frames meet this criterion, none are used.
- Otherwise, the amino acid sequences "(F|W)X.G" and "(F|W)G.X", if present, are identified. Here, X stands in for one of the two G residues and denotes the amino acids that are one base pair mutation away from G and not a stop codon (i.e. A, R, S, C, D, E, V, W).
- For each of the two amino acid sequences, the position is calculated as above.
- If both amino acid sequences are present, the position that is closest to the end of the sequence is used.
V and J segments for which the amino acid position cannot be successfully identified are silently ignored.
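The J segment procedure can be sketched as below. Several points are assumptions, because they are not spelled out above: "." is taken to exclude stop codons, only the first match within each open reading frame is considered, the relaxed motifs are also tried when several stop-containing frames match the strict motif, and the returned value is the 1-based nucleotide position of the F/W codon. All names are hypothetical.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

STRICT = r"[FW]G[^*]G"
X_SET = "ARSCDEVW"  # the X amino acids listed above
RELAXED = (rf"[FW][{X_SET}][^*]G", rf"[FW]G[^*][{X_SET}]")


def translate(nt):
    """Translate complete codons; untranslatable codons become '?'."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "?") for i in range(0, len(nt) - 2, 3))


def position_for_motif(nt, motif):
    """Apply the frame selection rules to one motif."""
    without_stop, with_stop = [], []
    for frame in (0, 1, 2):  # frames start at nucleotide 1, 2 or 3
        aa = translate(nt[frame:])
        match = re.search(motif, aa)  # assumption: first match only
        if match is None:
            continue
        position = frame + 3 * match.start() + 1
        (with_stop if "*" in aa else without_stop).append(position)
    if without_stop:
        return without_stop[0]  # frame with the lowest starting position
    if len(with_stop) == 1:
        return with_stop[0]
    return None  # no match, or ambiguous stop-containing frames


def j_conserved_position(nt):
    """Return the 1-based nucleotide position of the F/W codon, or None."""
    nt = nt.upper()
    position = position_for_motif(nt, STRICT)
    if position is not None:
        return position
    found = [p for p in (position_for_motif(nt, m) for m in RELAXED) if p is not None]
    return max(found) if found else None  # closest to the end of the sequence
```

For the synthetic sequence "ACTTTGGTGCTGGT", only the frame starting at nucleotide 3 translates to "FGAG", so the function returns 3.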