Import Immune Reference Segments

The Import Immune Reference Segments tool imports reference sequences for V, D, J and C segments from a fasta file. These sequences are needed when running "Immune Repertoire Analysis" for either T or B cell receptor repertoires (TCR and BCR, respectively); see Immune Repertoire Analysis.

The importer can be found here:

        Import | Import Immune Reference Segments

The importer can be used to import fasta files that are either in the IMSEQ [Kuchenbecker et al., 2015] or IMGT [Lefranc et al., 2009] format (see figure 8.4).

Image immune_importer
Figure 8.4: The available options when importing immune reference segments.

Both formats support allele numbering for the gene segments. If "Import only the first allele" is ticked, only segments without an allele, or with an allele number equal to 1 (so "01" is also valid), are imported. Otherwise, all segments are imported.
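The first-allele rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the importer's actual code; the function name `is_first_allele` and the example segment names are hypothetical, and the handling of non-numeric allele labels is an assumption.

```python
def is_first_allele(allele):
    """Return True if a segment is kept when only the first allele is imported.

    `allele` is the allele string taken from the header, or None/"" if absent.
    """
    if not allele:
        return True  # segments without an allele are kept
    try:
        return int(allele) == 1  # "1" and "01" both denote allele number 1
    except ValueError:
        return False  # assumption: non-numeric labels are not the first allele


# Hypothetical (gene, allele) pairs for illustration only
segments = [("TRAV1", "01"), ("TRAV1", "02"), ("TRAJ5", None), ("TRBV2", "1")]
kept = [name for name, allele in segments if is_first_allele(allele)]
print(kept)
```

Comparing the allele as an integer rather than as a string is what makes "1" and "01" equivalent.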

The two formats differ in how the sequence header is parsed for identifying the gene segment and related information, and how the conserved amino acids in the V and J segments are identified.

When saving the results, the reference data for TCR, BCR, or both can be saved. The wizard will show an error message if an output option is ticked for which no relevant reference sequences are available.

The importer handles only one fasta file at a time. If two or more fasta files are imported separately, the resulting sequence lists can subsequently be combined into one list using the Create Sequence List tool.

IMSEQ

For the IMSEQ format, the header contains the following elements, separated by "|":

Currently, only the heavy (IGH) and the light κ and λ (IGK and IGL) chain types are supported for B cells.

Any segments with an unsupported chain or segment type are silently ignored.

IMGT

For the IMGT format, the header contains 15 elements, separated by "|". Only the following are read and used during import:

The IMGT database contains chains, segment types and labels that are not listed above and are not supported. These are silently ignored.

While the IMSEQ format provides the position of the conserved amino acid, this needs to be calculated for the IMGT format. For this, the V region needs to be provided with gaps such that the conserved amino acid is found at approximately position 104 in the translated amino acid sequence. When downloading sequences from the IMGT database in fasta format, the "F+ORF+in-frame P nucleotide sequences with IMGT gaps" should be used. Alternatively, the corresponding "nt-WithGaps-F+ORF+inframeP" flat file can be downloaded from IMGT/GENE-DB.

If using custom reference data that is not downloaded from the IMGT database, it is recommended to use the IMSEQ format and specify the position of the conserved amino acid.

When importing files in the IMGT format, the following options are available (see figure 8.4):

If element (9) in the header is not empty, the corresponding number of nucleotides is removed from the 5' end of the sequence.
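As a rough illustration of this trimming step, the sketch below splits a header on "|" and removes nucleotides from the 5' end when element (9) is set. The function name `trim_five_prime` and the placeholder header are hypothetical, and the sketch assumes that element (9) holds a plain integer such as "+2" or "2". How the importer counts IMGT gap characters when trimming is not stated here, so the sketch does not treat them specially.

```python
def trim_five_prime(header, sequence):
    """Remove nucleotides from the 5' end as indicated by header element (9)."""
    fields = header.lstrip(">").split("|")
    # Element (9) is the ninth "|"-separated field, i.e. index 8
    element_9 = fields[8].strip() if len(fields) > 8 else ""
    if not element_9:
        return sequence  # empty element: nothing to remove
    n = int(element_9)  # assumption: an integer such as "+2" or "2"
    if n <= 0:
        return sequence
    return sequence[n:]


# Placeholder header with 15 fields; only element (9) carries a value
fields = ["-"] * 15
fields[8] = "+2"
header = ">" + "|".join(fields)
print(trim_five_prime(header, "ggatgcat"))
```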

Identification of the conserved amino acid

The nucleotide sequence (with IMGT gaps for the V segments), starting from the position given in element (8) of the header, is first translated to amino acids using the standard genetic code. The position of the conserved amino acid is then calculated and, if identified, converted to the position of the first nucleotide in the corresponding codon. Segments where the amino acid cannot be identified are silently ignored.
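The translation and the conversion from an amino acid position back to a nucleotide position can be sketched as follows. This is a minimal illustration, not the importer's implementation: the function names are hypothetical, codons containing anything other than A, C, G or T (such as gap characters) are translated to "X" as an assumption, and the residue looked up in the usage lines is chosen purely for illustration.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(sequence, codon_start=1):
    """Translate starting at the 1-based nucleotide position `codon_start`.

    Codons with non-ACGT characters become "X" (an assumption); a trailing
    incomplete codon is dropped.
    """
    seq = sequence.upper()
    starts = range(codon_start - 1, len(seq) - 2, 3)
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in starts)


def codon_first_nucleotide(aa_index, codon_start=1):
    """Map a 0-based amino acid index to the 1-based first nucleotide of its codon."""
    return codon_start + 3 * aa_index


sequence = "aTATTACTGTGCT"
protein = translate(sequence, codon_start=2)
aa_index = protein.index("C")  # illustrative lookup only
print(protein, codon_first_nucleotide(aa_index, codon_start=2))
```

In the usage lines, translation starts at nucleotide 2, so the third amino acid (index 2) maps back to nucleotide position 2 + 3 × 2 = 8.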

For the V segments, the amino acid position is calculated as follows:

For the J segments, all 3 open reading frames (starting from nucleotide position 1, 2 or 3) are used. Note that "." below denotes any amino acid. The amino acid position is calculated as follows:

V and J segments for which the amino acid position cannot be successfully identified are silently ignored.
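The three-frame search used for J segments can be sketched as below. The motif patterns the importer applies are not reproduced here, so the pattern is passed in by the caller; the one in the usage line is only an example. The function name `find_in_three_frames` is hypothetical, and returning the first frame that matches is an assumption. A regular expression is used because "." in a regular expression matches any character, mirroring the "any amino acid" notation above.

```python
import re
from itertools import product

# Standard genetic code in the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def find_in_three_frames(sequence, pattern):
    """Search reading frames 1, 2 and 3 for `pattern`.

    Returns (frame, position), where position is the 1-based index of the
    first nucleotide of the codon at which the match starts, or None if no
    frame matches.
    """
    seq = sequence.upper()
    for frame in (1, 2, 3):
        starts = range(frame - 1, len(seq) - 2, 3)
        protein = "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in starts)
        match = re.search(pattern, protein)
        if match:
            # assumption: the first matching frame wins
            return frame, frame + 3 * match.start()
    return None  # such a segment would be ignored


# Example pattern for illustration only
result = find_in_three_frames("ACTTTGGAAAAGGAACC", "FG.G")
print(result)
```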

Output from Import Immune Reference Segments

The importer outputs a sequence list that can be used for immune repertoire analysis. This list can be added to a custom reference data set, to be used in workflows. See http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Custom_Sets.html for details.

The sequence list contains the reference sequences for the V, D, J and C segments, named in the format <chain>-<type>-<ID>*<allele>, for example "TRA-V-1*01". Note that for B cell constant genes, the letter corresponding to the encoded isotype is used instead of the segment type.

If the gene segment does not have an allele, or if "Import only the first allele" is ticked, *<allele> is not added to the name.
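The naming rule can be illustrated with a small helper. The function `segment_name` and its parameters are hypothetical and only mirror the format described above; the isotype substitution for B cell constant genes is not covered by this sketch.

```python
def segment_name(chain, segment_type, gene_id, allele=None, first_allele_only=False):
    """Build a name of the form <chain>-<type>-<ID>*<allele>.

    The allele suffix is omitted when the segment has no allele or when
    only the first allele is imported.
    """
    name = f"{chain}-{segment_type}-{gene_id}"
    if allele and not first_allele_only:
        name += f"*{allele}"
    return name


print(segment_name("TRA", "V", "1", "01"))
print(segment_name("TRA", "V", "1", "01", first_allele_only=True))
print(segment_name("TRA", "V", "1"))
```

The first call reproduces the "TRA-V-1*01" example; the other two drop the allele suffix.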

By ticking "Show annotations" and "Region" in the Side Panel "Annotation layout" and "Annotation types" groups, respectively, the location of the conserved amino acid can be visualized (see figure 8.5).

Image ref_seq_motif_view
Figure 8.5: Visualizing the location of the conserved amino acid.

The table view of the sequence list shows the chain and segment type of each sequence, and for the IMGT format, also the accession number(s) and species (see figure 8.6).

Image ref_seq_table_view
Figure 8.6: Table view of imported sequence list showing the name, species and accession number when imported using the IMGT format.