Immune Repertoire

T cell receptors consist of two chains: either an $\alpha$ and a $\beta$ chain, or a $\gamma$ and a $\delta$ chain. In humans the $\alpha\beta$ type is the most abundant. Each chain is encoded by a gene that undergoes somatic recombination, a process in which gene segments are joined and random nucleotides are inserted at the junction sites. $\alpha$ and $\gamma$ chains are the result of VJ recombination, whereas $\beta$ and $\delta$ chains are the result of VDJ recombination. In VJ recombination one V gene segment is joined to a J gene segment; in VDJ recombination a D gene segment is additionally placed between the V and the J segment.

The V segment of the T cell receptor contains a conserved cysteine residue marking the beginning of the so-called CDR3 region, and the J segment contains a conserved phenylalanine residue marking the end of the CDR3 region. Because random nucleotides are included at the junctions between segments, the CDR3 region is highly variable.
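As an illustration of how the two conserved residues delimit the CDR3 region, a minimal Python sketch follows. It assumes that the read has already been translated in the correct frame and that the positions of the conserved cysteine and phenylalanine are known from the matched reference segments. The function name, the example sequence and the choice to include both anchor residues in the returned CDR3 are assumptions made for this example, not a description of the tool's implementation.

```python
def extract_cdr3(protein: str, cys_index: int, phe_index: int) -> str:
    """Return the CDR3 from the conserved Cys through the conserved Phe.

    Both anchor residues are included in the result (an assumed convention).
    """
    if phe_index <= cys_index:
        raise ValueError("the phenylalanine must lie after the cysteine")
    if protein[cys_index] != "C":
        raise ValueError("expected the conserved cysteine at cys_index")
    if protein[phe_index] != "F":
        raise ValueError("expected the conserved phenylalanine at phe_index")
    return protein[cys_index:phe_index + 1]


# Hypothetical translated read fragment; index 2 is the Cys, index 15 the Phe.
cdr3 = extract_cdr3("YLCASSLGQGAYEQYFGPG", cys_index=2, phe_index=15)
print(cdr3)
```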

Clonotyping a read consists of identifying which V segment and which J segment are used and extracting the CDR3 region. A database with V segment and J segment reference sequences is used for identifying the V and J segments and for finding the start position (V) or end position (J) of the CDR3 region.

The tool assumes that reads are either paired-end, with read 1 containing V(D)J in reverse and read 2 containing V, or single-end, containing V(D)J in reverse. Primers are often designed to target the constant (C) region following the J region. In that case the constant part should be trimmed from the read prior to the analysis. This can be done using the Trim Reads tool and the trim adapter lists provided as reference data.

V segments are rather long ($>200$ bp), whereas J segments are relatively short ($\approx 50$-$70$ bp). The identification of V and J segments is therefore performed using two different strategies. For V segments the Map Reads to Reference tool is used internally. Where the Map Reads to Reference tool would normally either map non-specific matches randomly or ignore them, a read that matches multiple segments will provisionally have all these segments assigned. In a subsequent merging step a specific segment may be assigned to the read. The heuristics used in the Map Reads to Reference tool would lead to misidentified segments if applied to the short J segments. Instead, a strategy similar to IMSEQ [Kuchenbecker et al., 2015] is used. In brief, a 15 bp subsequence of each full segment, called a Segment Core Fragment (SCF), is first aligned pairwise to the read to find candidates for full pairwise alignment. If the pairwise alignment of an SCF to the read has a sufficiently small number of errors, the corresponding segment is nominated as a candidate. A full pairwise alignment is then made for all candidate segments. If there is a sufficiently good match among the full alignments, that segment is assigned to the read.
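The two-stage idea behind the J segment search can be sketched in Python as follows. This is a simplified illustration and not the tool's or IMSEQ's implementation: it replaces true pairwise alignment with an ungapped (Hamming distance) comparison, and the position of the core fragment within the segment (`core_start`), the thresholds (`max_core_errors`, `max_error_rate`) and all function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equally long strings."""
    return sum(x != y for x, y in zip(a, b))


def best_core_hit(read: str, core: str, max_errors: int):
    """Best ungapped placement of the core fragment in the read.

    Returns (offset, errors), or None if no placement is within max_errors.
    """
    best = None
    for i in range(len(read) - len(core) + 1):
        errors = hamming(read[i:i + len(core)], core)
        if errors <= max_errors and (best is None or errors < best[1]):
            best = (i, errors)
    return best


def assign_j_segment(read, segments, core_start=0, core_len=15,
                     max_core_errors=2, max_error_rate=0.1):
    """Return the name of the best matching segment, or None."""
    best_name, best_rate = None, None
    for name, seq in segments.items():
        # Stage 1: the short core fragment nominates candidate segments.
        core = seq[core_start:core_start + core_len]
        hit = best_core_hit(read, core, max_core_errors)
        if hit is None:
            continue
        # Stage 2: compare the full segment, placed according to the core hit,
        # over the part that overlaps the read.
        seg_start = hit[0] - core_start
        lo = max(seg_start, 0)
        hi = min(seg_start + len(seq), len(read))
        seg_part = seq[lo - seg_start:hi - seg_start]
        rate = hamming(read[lo:hi], seg_part) / len(seg_part)
        if rate <= max_error_rate and (best_rate is None or rate < best_rate):
            best_name, best_rate = name, rate
    return best_name


# Made-up reference sequences, for illustration only.
segments = {
    "J1": "ACGTTGCAAGGCTTAACCGGATCGATCC",
    "J2": "TTGACCGTAAGCTAGGCTAACGTTAGCA",
}
print(assign_j_segment("GATTACA" + segments["J1"], segments))
```

Screening with the short core fragment first keeps the more expensive full comparison limited to a handful of candidate segments.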

After the initial identification of clonotypes, some clonotypes are merged to reduce false positives due to sequencing errors and to resolve ambiguities. Clonotype merging is performed in two steps. The first step tries to resolve ambiguous segment assignments. Some of the reference V and J segments have a large degree of sequence identity, e.g. in mouse a recent duplication event has resulted in multiple paralogous V segments with a sequence identity of more than 97%. If a sequencing read does not cover the sites where paralogous segments differ, the segment cannot be unambiguously identified. In these cases all possible V and/or J segments will be listed, separated by commas. However, there might be reads with the same CDR3 nucleotide sequence where the segment can be uniquely determined. It is unlikely that two different clonotypes would share the same CDR3 and have almost identical V and J segments. We thus merge clonotypes with ambiguous segments into clonotypes with unambiguous segments, provided they share the same CDR3 sequence and their V and J segment assignments overlap.
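The first merging step can be sketched as follows. The data layout (a dictionary keyed by a set of V segments, a set of J segments and the CDR3 sequence) and the function name are assumptions made for this example. The text does not say what happens when several unambiguous clonotypes qualify as targets; the sketch arbitrarily picks the one with the most reads.

```python
def merge_ambiguous(clonotypes):
    """Merge ambiguous clonotypes into matching unambiguous ones.

    clonotypes maps (v_segments, j_segments, cdr3) -> read count, where
    v_segments and j_segments are frozensets of candidate segment names.
    """
    def is_unambiguous(key):
        return len(key[0]) == 1 and len(key[1]) == 1

    unambiguous = [k for k in clonotypes if is_unambiguous(k)]
    merged = {k: clonotypes[k] for k in unambiguous}

    for key, count in clonotypes.items():
        if is_unambiguous(key):
            continue
        v, j, cdr3 = key
        # Targets share the CDR3 and overlap in both V and J assignments.
        targets = [k for k in unambiguous
                   if k[2] == cdr3 and (k[0] & v) and (k[1] & j)]
        if targets:
            # Assumption: prefer the target with the highest original count.
            best = max(targets, key=lambda k: clonotypes[k])
            merged[best] += count
        else:
            merged[key] = count  # stays ambiguous
    return merged


# Segment names are used purely as labels in this example.
v_single = frozenset({"TRBV12-3"})
v_multi = frozenset({"TRBV12-3", "TRBV12-4"})
j_single = frozenset({"TRBJ2-1"})
result = merge_ambiguous({
    (v_single, j_single, "TGTGCC"): 10,
    (v_multi, j_single, "TGTGCC"): 4,   # merged into the clonotype above
    (v_multi, j_single, "TGTAAA"): 3,   # no unambiguous partner
})
print(result)
```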

The second merging step tries to correct sequencing errors in the CDR3 region. If left uncorrected, sequencing errors in the CDR3 region of a highly expressed clonotype would result in multiple clonotypes being reported. In this step, clonotypes are merged if their V and J segments are identical and their CDR3 regions are sufficiently similar. The read counts of one clonotype are merged into a clonotype with higher expression. The minimal expression ratio between two clonotypes eligible for merging can be adjusted. For two CDR3 regions to be deemed sufficiently similar, two types of errors are considered: errors occurring in positions of low quality and errors occurring anywhere within the CDR3 region. A position is considered to be of low quality if its average quality is more than a set number of standard deviations below the mean of the average qualities of all positions in the CDR3 sequence. The two error budgets can be combined.
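A minimal Python sketch of this second step is given below. Several details are assumptions rather than documented behaviour: only CDR3 sequences of equal length are compared (substitutions only), the quality profile of the lower-expressed clonotype is used to find low-quality positions, and "combining" the budgets is interpreted as charging mismatches at low-quality positions to the low-quality budget first and the remainder to the general budget. All names and default values are hypothetical.

```python
from dataclasses import dataclass, replace
from statistics import mean, pstdev


@dataclass
class Clonotype:
    v: str
    j: str
    cdr3: str
    count: int
    avg_quals: list  # average base quality at each CDR3 position


def low_quality_positions(avg_quals, n_sd=1.0):
    """Positions whose average quality is more than n_sd SDs below the mean."""
    mu, sd = mean(avg_quals), pstdev(avg_quals)
    return {i for i, q in enumerate(avg_quals) if q < mu - n_sd * sd}


def similar(cdr3_a, cdr3_b, low_q, max_low_q_errors=1, max_any_errors=1):
    """True if the mismatches fit within the two (combined) error budgets."""
    if len(cdr3_a) != len(cdr3_b):
        return False
    mismatches = [i for i, (x, y) in enumerate(zip(cdr3_a, cdr3_b)) if x != y]
    in_low_q = sum(i in low_q for i in mismatches)
    charged_to_low_q = min(in_low_q, max_low_q_errors)
    return len(mismatches) - charged_to_low_q <= max_any_errors


def merge_sequencing_errors(clonotypes, min_ratio=10.0, n_sd=1.0,
                            max_low_q_errors=1, max_any_errors=1):
    """Merge low-count clonotypes into similar, more highly expressed ones."""
    kept = []
    for c in sorted(clonotypes, key=lambda c: c.count, reverse=True):
        low_q = low_quality_positions(c.avg_quals, n_sd)
        for parent in kept:
            if ((parent.v, parent.j) == (c.v, c.j)
                    and parent.count >= min_ratio * c.count
                    and similar(parent.cdr3, c.cdr3, low_q,
                                max_low_q_errors, max_any_errors)):
                parent.count += c.count
                break
        else:
            kept.append(replace(c))  # copy so the input is not modified
    return kept


clonotypes = [
    Clonotype("V1", "J1", "TGTGCCAGC", 100, [30] * 9),
    Clonotype("V1", "J1", "TGTGCCAGT", 5, [30] * 8 + [10]),  # likely an error
    Clonotype("V1", "J1", "TGTAAAAGC", 50, [30] * 9),
    Clonotype("V2", "J1", "TGTGCCAGC", 2, [30] * 9),
]
for c in merge_sequencing_errors(clonotypes):
    print(c.v, c.j, c.cdr3, c.count)
```

Processing clonotypes in order of decreasing read count ensures that counts always flow from the less expressed clonotype into the more expressed one.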