The clonotype identification algorithm
The algorithm for identifying the clonotypes is composed of three sequential steps described below.
Assembly
All reads originating from the same barcode are collected and processed as follows:
- Barcodes containing ambiguous nucleotides are discarded.
- Barcodes with fewer than 5 UMIs are discarded.
- Barcodes with more than 80,000 reads are down-sampled to about 80,000 reads.
- Remaining reads are de novo assembled into contigs.
- Contigs shorter than 60 nucleotides are discarded.
- The reads are mapped back to the valid contigs and the contigs are adjusted by the mapped reads.
- Contigs and barcodes of low quality are discarded. The following are required:
  - Contigs should have an average coverage of at least 5.
  - Contigs should have at least 20 mapped reads.
  - If more than four contigs are assembled, contigs should have an average coverage of at least the median average coverage of all contigs.
  - Barcodes should have at least 3 UMIs mapped to high-quality contigs.
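The filtering thresholds above can be sketched in a few lines of Python. This is only an illustration of the listed rules, not the tool's implementation: the function names, the dictionary keys (`length`, `avg_coverage`, `mapped_reads`) and the use of uniform random sampling for down-sampling are assumptions made for the example. The sketch also assumes that the "more than four contigs" rule is evaluated on the contigs that passed the length filter.

```python
import random
from statistics import median

MIN_UMIS_PER_BARCODE = 5
MAX_READS_PER_BARCODE = 80_000
MIN_CONTIG_LENGTH = 60
MIN_AVG_COVERAGE = 5
MIN_MAPPED_READS = 20
MIN_UMIS_ON_GOOD_CONTIGS = 3


def barcode_passes_prefilter(barcode, umi_count):
    """Barcode-level checks applied before assembly."""
    has_ambiguous = any(base not in "ACGT" for base in barcode.upper())
    return not has_ambiguous and umi_count >= MIN_UMIS_PER_BARCODE


def downsample(reads, seed=0):
    """Cap the number of reads per barcode (assumption: uniform sampling)."""
    if len(reads) <= MAX_READS_PER_BARCODE:
        return list(reads)
    return random.Random(seed).sample(reads, MAX_READS_PER_BARCODE)


def high_quality_contigs(contigs):
    """Contig-level checks applied after the reads are mapped back.

    Each contig is a dict with the hypothetical keys
    'length', 'avg_coverage' and 'mapped_reads'.
    """
    candidates = [c for c in contigs if c["length"] >= MIN_CONTIG_LENGTH]
    coverage_floor = MIN_AVG_COVERAGE
    # The median rule only applies when more than four contigs exist.
    if len(candidates) > 4:
        med = median(c["avg_coverage"] for c in candidates)
        coverage_floor = max(coverage_floor, med)
    return [
        c for c in candidates
        if c["avg_coverage"] >= coverage_floor
        and c["mapped_reads"] >= MIN_MAPPED_READS
    ]


def barcode_is_retained(umis_on_high_quality_contigs):
    """Final barcode-level check after contig filtering."""
    return umis_on_high_quality_contigs >= MIN_UMIS_ON_GOOD_CONTIGS
```

With five contigs whose average coverages are 6, 8, 10, 12 and 14, the median is 10, so only the three contigs with coverage 10 or higher would be kept under these assumptions.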
The assembly summary reports:
- the number of input barcodes and reads;
- the number of processed barcodes (those without ambiguous nucleotides) and reads (those left after down-sampling);
- the number of barcodes that have been discarded;
- the number of barcodes and high-quality contigs that have been successfully assembled.
Trimming
Prior to clonotype identification, the contigs are trimmed with the following settings:
- The ends of the contigs are trimmed using a "Quality limit" of 0.05 and a "Maximum number of ambiguities" of 2; see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Quality_trimming.html.
- Contigs shorter than 60 nucleotides after trimming are discarded.
The trimming summary reports the average length of the contigs before and after trimming, and how many barcodes and contigs remain after trimming.
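As a rough illustration of how the two trimming settings can act on a contig, the following Python sketch uses a running-sum (Mott-style) quality trim followed by a search for the longest region with at most 2 ambiguous nucleotides. It is a simplified sketch under stated assumptions and not the tool's implementation; the linked manual page is the authoritative description. The function names, the use of Phred scores as input, and the order in which the two trims are applied are assumptions made for the example.

```python
MIN_LENGTH_AFTER_TRIMMING = 60


def quality_trim_bounds(phred_scores, quality_limit=0.05):
    """Return (start, end) of the region kept by a running-sum quality trim.

    Each Phred score is converted to an error probability, which is
    subtracted from the limit. The running sum of these values is reset to
    zero whenever it becomes negative. The kept region runs from the first
    positive running sum to the position of its highest value.
    """
    running = 0.0
    best = 0.0
    start = None
    end = 0
    for i, q in enumerate(phred_scores):
        running = max(0.0, running + quality_limit - 10 ** (-q / 10))
        if running > 0 and start is None:
            start = i
        if running > best:
            best = running
            end = i + 1
    return (0, 0) if start is None else (start, end)


def ambiguity_trim_bounds(sequence, max_ambiguities=2):
    """Return (start, end) of the longest region with few enough ambiguities."""
    best = (0, 0)
    left = 0
    count = 0
    for right, base in enumerate(sequence):
        if base.upper() not in "ACGT":
            count += 1
        # Shrink the window from the left until it is valid again.
        while count > max_ambiguities:
            if sequence[left].upper() not in "ACGT":
                count -= 1
            left += 1
        if right + 1 - left > best[1] - best[0]:
            best = (left, right + 1)
    return best


def trim_contig(sequence, phred_scores):
    """Trim a contig; return None if it is too short afterwards.

    Assumption: quality trimming is applied before ambiguity trimming.
    """
    q_start, q_end = quality_trim_bounds(phred_scores)
    kept = sequence[q_start:q_end]
    a_start, a_end = ambiguity_trim_bounds(kept)
    trimmed = kept[a_start:a_end]
    return trimmed if len(trimmed) >= MIN_LENGTH_AFTER_TRIMMING else None
```

A contig that falls below 60 nucleotides after trimming yields `None` here, mirroring the rule that such contigs are discarded.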
Clonotype identification
Clonotyping a contig consists of identifying which V, D, J and C segments from the reference data are used, and extracting the CDR3 region found between the conserved amino acids.
The identification of the segments is done by mapping the contigs against the references provided in "Reference segments".
Depending on the length and diversity of the segment that is covered by the contig, it might not be possible to unambiguously detect the segment. In this case, all possible segments are reported.
The V and J segments are required for successfully clonotyping a contig, because otherwise the CDR3 cannot be determined.
The D and C segments are optional. Note that whether or not these two segment types are identified can lead to the tool reporting clonotypes as the same or as different clonotypes.
After the initial clonotyping of the contigs, merging of clonotypes identified for the same barcode is performed as follows:
- If a clonotype has ambiguously assigned segments, it will be merged, if possible, into a clonotype that has the same CDR3 and less ambiguous segments, where those segments are a subset of the former clonotype's segments.
- If two clonotypes have the same segments but differ by a single nucleotide in the CDR3 sequence, the clonotype with fewer reads will be merged into the other.
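The two merging rules can be illustrated with a small Python sketch. It is not the tool's implementation, and several details are assumptions made for the example: the `Clonotype` class and its fields are hypothetical, read counts are summed on merging, a "single nucleotide" difference is taken to mean one substitution between CDR3 sequences of equal length, clonotypes with equal read counts are left unmerged, and the ambiguity rule is applied before the single-nucleotide rule.

```python
from dataclasses import dataclass


@dataclass
class Clonotype:
    cdr3: str
    segments: dict  # segment type ("V", "D", "J", "C") -> frozenset of names
    reads: int


def is_ambiguous(clonotype):
    return any(len(names) > 1 for names in clonotype.segments.values())


def is_less_ambiguous_subset(target, source):
    """True if target's segments are a strictly smaller subset of source's."""
    if target.segments.keys() != source.segments.keys():
        return False
    pairs = [(target.segments[k], source.segments[k]) for k in source.segments]
    return all(t <= s for t, s in pairs) and any(t < s for t, s in pairs)


def differs_by_one_nucleotide(a, b):
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def ambiguity_rule(source, target):
    return (is_ambiguous(source)
            and source.cdr3 == target.cdr3
            and is_less_ambiguous_subset(target, source))


def single_mismatch_rule(source, target):
    return (source.segments == target.segments
            and differs_by_one_nucleotide(source.cdr3, target.cdr3)
            and source.reads < target.reads)


def merge_once(clonotypes, can_merge):
    """Merge the first mergeable pair; return None if no pair qualifies."""
    for source in clonotypes:
        for target in clonotypes:
            if source is not target and can_merge(source, target):
                target.reads += source.reads
                return [c for c in clonotypes if c is not source]
    return None


def merge_barcode_clonotypes(clonotypes):
    """Apply both rules to the clonotypes of one barcode."""
    # Work on copies so the input objects are left unchanged.
    current = [Clonotype(c.cdr3, dict(c.segments), c.reads) for c in clonotypes]
    for rule in (ambiguity_rule, single_mismatch_rule):
        while (merged := merge_once(current, rule)) is not None:
            current = merged
    return current
```

For example, a clonotype with V assigned to two candidate segments would be absorbed by a clonotype with the same CDR3 and only one of those V segments, after which a low-read clonotype whose CDR3 differs by one substitution would be absorbed as well.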