The clonotype identification algorithm
The algorithm for identifying the clonotypes is composed of three sequential steps described below.
Assembly
All reads originating from the same barcode are collected and:
- Barcodes containing ambiguous nucleotides are discarded.
- Barcodes with fewer than 5 UMIs are discarded.
- Barcodes with more than 80,000 reads are down-sampled to about 80,000 reads.
- Remaining reads are de novo assembled into contigs.
- Contigs shorter than 60 nucleotides are discarded.
- The reads are mapped back to the valid contigs and the contigs are adjusted by the mapped reads.
- Contigs and barcodes of low quality are discarded. The following are required:
  - Contigs should have an average coverage of at least 5.
  - Contigs should have at least 20 mapped reads.
  - If more than four contigs are assembled, contigs should have an average coverage of at least the median average coverage of all contigs.
  - Barcodes should have at least 3 UMIs mapped to high-quality contigs.
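The quality thresholds above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the `Contig` record, its field names, and the function names are hypothetical. It also assumes that the median coverage is taken over the contigs that passed the length filter, which the description above does not state.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Contig:
    # Hypothetical record for one assembled contig of a single barcode.
    name: str
    length: int
    avg_coverage: float
    mapped_reads: int
    umis: frozenset  # UMIs of the reads mapped to this contig


def high_quality_contigs(contigs):
    """Apply the contig-level thresholds to the contigs of one barcode."""
    # Contigs shorter than 60 nucleotides are discarded first.
    contigs = [c for c in contigs if c.length >= 60]
    # The median rule only applies when more than four contigs were assembled.
    # Assumption: the median is taken over the length-filtered contigs.
    cutoff = median(c.avg_coverage for c in contigs) if len(contigs) > 4 else 0.0
    return [
        c for c in contigs
        if c.avg_coverage >= 5 and c.mapped_reads >= 20 and c.avg_coverage >= cutoff
    ]


def barcode_passes(contigs, min_umis=3):
    """Return (passes, kept_contigs) for one barcode."""
    kept = high_quality_contigs(contigs)
    # Count distinct UMIs across all high-quality contigs of the barcode.
    umis = set().union(*(c.umis for c in kept))
    return len(umis) >= min_umis, kept
```

A barcode is retained only if `barcode_passes` returns `True` as its first value; the second value lists the contigs that would go on to the trimming step.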
The assembly summary reports:
- the number of input barcodes and reads;
- the number of processed barcodes (those without ambiguous nucleotides) and reads (those left after down-sampling);
- the number of barcodes that have been discarded;
- the number of barcodes and high-quality contigs that have been successfully assembled.
Trimming the C gene segments
The C gene segments need trimming prior to clonotype identification. The contigs are therefore trimmed with the following settings:
- The ends of the contigs are trimmed such that at most 2 ambiguous nucleotides are left in the contig.
- The C gene segments, provided as a Trim Adapter List in "Trim constant regions", are removed.
- Contigs shorter than 60 nucleotides after trimming are discarded.
The trimming summary reports the average length of the contigs before and after trimming, and how many barcodes and contigs remain after trimming.
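The ambiguous-nucleotide trimming and the length filter can be sketched as follows. This is one possible interpretation, not the tool's documented behavior: it keeps the longest stretch of the contig that contains at most 2 ambiguous symbols, treating anything other than A, C, G or T as ambiguous. The function name and parameters are hypothetical, and removal of the C gene segments is not covered.

```python
def trim_ambiguous_ends(seq, max_ambiguous=2, min_length=60):
    """Keep the longest window of seq with at most max_ambiguous ambiguous
    symbols; return None if the result is shorter than min_length."""
    best_start, best_end = 0, 0
    start = 0
    count = 0  # ambiguous symbols inside the current window
    for end, base in enumerate(seq):
        if base.upper() not in "ACGT":
            count += 1
        # Shrink the window from the left until it is valid again.
        while count > max_ambiguous:
            if seq[start].upper() not in "ACGT":
                count -= 1
            start += 1
        # Remember the longest valid window (the first one wins on ties).
        if end + 1 - start > best_end - best_start:
            best_start, best_end = start, end + 1
    trimmed = seq[best_start:best_end]
    return trimmed if len(trimmed) >= min_length else None
```

For example, a contig with two `N` symbols at each end and one in the middle would lose one end entirely and part of the other, so that only two `N` symbols remain in the returned sequence.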
Clonotype identification
T-cell receptors come in two varieties, αβ or γδ T-cell receptors, with the αβ type being by far the most abundant. Each chain is encoded by a gene that undergoes somatic recombination. In the somatic recombination process, gene segments are joined together with random nucleotides added at the junction sites. α and γ chains are the result of V and J gene segment recombination, while β and δ chains are the result of V, D and J gene segment recombination.
The V gene segment contains a conserved cysteine amino acid marking the beginning of the CDR3 region and the J gene segment contains a conserved phenylalanine amino acid marking the end of the CDR3 region. The CDR3 region is highly variable.
Clonotyping a contig consists of identifying which V and J gene segments are used and extracting the CDR3 region. The D gene segments are not identified here.
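As an illustration of the CDR3 extraction, the sketch below cuts out and translates the CDR3 once the positions of the two conserved codons are known. The function name and arguments are hypothetical. It assumes that the codon start positions (0-based, in contig coordinates) have already been obtained from the V and J mappings, and that the conserved cysteine and phenylalanine are included in the reported CDR3, a convention that varies between tools.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def extract_cdr3(contig, cys_start, phe_start):
    """Return (nucleotide CDR3, amino-acid CDR3) for a contig.

    cys_start and phe_start are the 0-based start positions of the conserved
    cysteine and phenylalanine codons (assumed known from the V/J mappings).
    """
    cdr3_nt = contig[cys_start:phe_start + 3].upper()
    if len(cdr3_nt) % 3 != 0:
        # The two conserved codons are not in the same reading frame.
        return cdr3_nt, None
    # Codons containing ambiguous symbols are translated as "X".
    cdr3_aa = "".join(
        CODON_TABLE.get(cdr3_nt[i:i + 3], "X") for i in range(0, len(cdr3_nt), 3)
    )
    return cdr3_nt, cdr3_aa
```

A return value of `None` for the amino-acid sequence flags a junction in which the two conserved codons are out of frame with each other.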
The identification of V and J gene segments is done by mapping the contigs against references containing all V or all J gene segments, provided in "Reference segments".
Depending on the length of the gene segment that is covered by the contig and the diversity of the gene segment, it might not be possible to unambiguously detect the gene segment. In this case, all possible gene segments are reported.
After the initial clonotyping of the contigs, merging of clonotypes is performed as follows:
- If a clonotype has an ambiguously assigned V and/or J, it will be merged, if possible, into a clonotype with the same CDR3 and unambiguous V and J assignment.
- If two clonotypes exist with the same V and J, but differing by a single nucleotide in the CDR3 sequence, the clonotype with fewer reads will be merged into the other.
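The two merging rules can be sketched in Python as shown below. This is an illustrative reading of the rules, not the tool's implementation, and several details are assumptions: an ambiguous clonotype is considered mergeable when the target's V and J are among its candidate segments, ties between several possible targets are resolved in favor of the one with the most reads, and "a single nucleotide" is interpreted as a single substitution between CDR3 sequences of equal length. The `Clonotype` record and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Clonotype:
    v: frozenset  # candidate V gene segments (one element if unambiguous)
    j: frozenset  # candidate J gene segments (one element if unambiguous)
    cdr3: str     # CDR3 nucleotide sequence
    reads: int


def is_ambiguous(c):
    return len(c.v) > 1 or len(c.j) > 1


def one_substitution(a, b):
    # Assumption: a single-nucleotide difference means one mismatch, no indels.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def merge_clonotypes(clonotypes):
    # Work on copies so the input records are left untouched.
    copies = [Clonotype(c.v, c.j, c.cdr3, c.reads) for c in clonotypes]
    clear = [c for c in copies if not is_ambiguous(c)]
    leftover = []
    # Rule 1: fold ambiguous clonotypes into a compatible unambiguous one.
    for amb in (c for c in copies if is_ambiguous(c)):
        targets = [c for c in clear
                   if c.cdr3 == amb.cdr3 and c.v <= amb.v and c.j <= amb.j]
        if targets:
            max(targets, key=lambda c: c.reads).reads += amb.reads
        else:
            leftover.append(amb)
    # Rule 2: same V and J, CDR3 one substitution apart; fewer reads merge
    # into more. Visiting clonotypes by descending read count makes the
    # larger clonotype the surviving one.
    survivors = []
    for c in sorted(clear + leftover, key=lambda c: c.reads, reverse=True):
        parent = next((s for s in survivors
                       if s.v == c.v and s.j == c.j
                       and one_substitution(s.cdr3, c.cdr3)), None)
        if parent is not None:
            parent.reads += c.reads
        else:
            survivors.append(c)
    return survivors
```

The function returns the surviving clonotypes sorted by decreasing read count, with the reads of merged clonotypes added to the clonotype that absorbed them.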