Create UMI Reads from Reads
The tool Create UMI Reads from Reads generates a single consensus read, called a UMI read, from reads that have the same or a similar UMI and a similar sequence. As input the tool takes reads that have been processed by the Remove and Annotate with Unique Molecular Index tool. It is recommended that the input reads are also trimmed for adapters and homopolymers.
The output of the tool is a list of consensus UMI reads. Optional outputs include a QC Report and a list of discarded reads. The algorithm is loosely inspired by Mash (Ondov 2016), Linclust (Steinegger 2017) and Calib (Orabi 2017).
Known limitations: When provided with paired-end reads, the tool outputs consensus UMI reads for R1 of the pair only.
The algorithm involves clustering similar reads, merging reads in a cluster with the same UMI, and then filtering the merged reads. These steps are described in detail below:
Grouping reads
The tool makes extensive use of the minHash concept for Locality-Sensitive Hashing (LSH) to find clusters of similar reads. Briefly, all k-mers of each read are hashed with a number of hash functions, and for each hash function the lowest hash value over all the k-mers is recorded. Two reads that share a lowest hash value are likely to share a k-mer and are linked together. In practice, a single hash function is not sufficiently specific to link related reads, so sets of hash functions are used: reads are only linked if every hash function in the set has the same lowest value for the two reads. The linked reads can be thought of as forming a graph, where each read is a node and edges connect linked reads. Clusters of reads form disconnected subgraphs, that is, the reads in a cluster are linked only to each other.
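The linking scheme described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the hash function (seeded BLAKE2b), the function names, the parameter names (`num_hashes`, `group_size`) and the default values are all assumptions chosen for illustration, and reads are assumed to be at least `k` bases long.

```python
import hashlib
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def min_hashes(read, k, num_hashes):
    """For each seeded hash function, record the lowest hash value over all k-mers.

    The seeded BLAKE2b hash is an illustrative choice, not the tool's hash.
    """
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        lowest = min(
            int.from_bytes(
                hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for kmer in kmers(read, k)
        )
        sketch.append(lowest)
    return sketch


def cluster_reads(reads, k=8, num_hashes=4, group_size=2):
    """Link reads that agree on every lowest value within a group of hash functions.

    Returns the connected components (clusters) as sorted lists of read indices.
    """
    if num_hashes % group_size != 0:
        raise ValueError("num_hashes must be a multiple of group_size")

    parent = list(range(len(reads)))  # union-find over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    buckets = {}  # (group start, lowest values in that group) -> first read seen
    for idx, read in enumerate(reads):
        sketch = min_hashes(read, k, num_hashes)
        for start in range(0, num_hashes, group_size):
            key = (start, tuple(sketch[start:start + group_size]))
            if key in buckets:
                parent[find(idx)] = find(buckets[key])  # add an edge
            else:
                buckets[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


reads = ["ACGTACGTACGTAAAA", "ACGTACGTACGTAAAA", "TTTTGGGGCCCCTTGG"]
clusters = cluster_reads(reads)
```

In this sketch a larger `group_size` makes linking stricter, and a larger number of groups gives two similar reads more chances to be linked.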
The tool uses LSH in three rounds:
1) In the first round reads are 'coarse clustered', so that it is very likely that similar reads are in the same cluster. The clusters are 'coarse' because it is also likely that unrelated reads are in the same cluster. The UMI is not used during coarse clustering.
2) In the second round each coarse cluster is 'fine clustered' by applying LSH again, but using different hash settings in order to get a more precise clustering. Reads in the resulting clusters are 'merged' if they have the same UMI.
3) In the third round all reads for a coarse cluster (merged or not) are again 'fine clustered' by LSH. Each disconnected subgraph is then pruned to only keep links between reads where (i) the UMIs differ by one mismatch, and (ii) either one of the reads is not merged, or the 'weight' of one of the reads is more than twice the weight of the other (this is the directional method in Smith 2017). The 'weight' of a merged read is the number of raw reads that were used to form the merged read.
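The pruning rule in the third round can be expressed as a small predicate. The following Python sketch follows the wording above literally; the function names are hypothetical, and representing an unmerged read as a read with weight 1 is an assumption made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def keep_link(umi_a, weight_a, umi_b, weight_b):
    """Decide whether a link between two reads survives pruning.

    Assumption: an unmerged read is represented by weight 1.
    """
    # (i) the UMIs must differ by exactly one mismatch
    if hamming(umi_a, umi_b) != 1:
        return False
    # (ii) either one read is unmerged, or one weight is more than twice the other
    either_unmerged = weight_a == 1 or weight_b == 1
    one_dominates = weight_a > 2 * weight_b or weight_b > 2 * weight_a
    return either_unmerged or one_dominates


examples = [
    keep_link("ACGT", 10, "ACGA", 3),  # one mismatch, 10 > 2 * 3
    keep_link("ACGT", 4, "ACGA", 3),   # one mismatch, neither weight dominates
    keep_link("ACGT", 10, "AGGA", 1),  # two mismatches
]
```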
Merging reads
The merging of similar UMI reads is performed using Multiple Sequence Alignment (MSA) based on the SPOA C++ library (https://github.com/rvaser/spoa). In general terms, the POA method creates a graph whose nodes represent the bases of the first sequence, and aligns the second sequence to the graph. Aligned bases of the two sequences are then merged in the graph. Every subsequent sequence is aligned to the graph (using a modified version of the usual dynamic programming algorithms) and merged into it. As bases are merged into the graph, the edges between the nodes are updated to reflect their 'weight', which corresponds to the number of times they occur across the sequences. Finally, when all sequences are aligned, the graph is traversed to find the 'heaviest' path, which is the consensus sequence.
When sequences are aligned, mismatching nodes are connected by a special type of edge. Using these edges, it is possible to find all mismatches of a base in the consensus read. With this information in hand, the quality consensus is calculated using a method similar to [Hiatt et al., 2013], which is also used in the other UMI tools. Every node keeps both a match and a mismatch quality value, because it is not known at the time of graph construction whether the node will be a match or not. Every time a new sequence is added and a new node is merged, the quality values are updated using the quality scores of the new sequence. After the consensus has been calculated, the values in each node reflect the improved quality scores that result from multiple sequences calling the same base. The quality consensus is obtained by combining the match and mismatch quality values as described in [Hiatt et al., 2013].
The tool can be found in the Toolbox at:
Tools | QIAseq Panel Expert Tools | QIAseq DNA Panel Expert Tools | Create UMI Reads from Reads
In the first dialog, select the reads saved as a sequence list (figure 5.41).
Figure 5.41: Select the sequencing reads by double-clicking on the file name or by clicking once on the file name and then on the arrow pointing to the right hand side.
The next dialog of the tool allows you to configure several parameters (figure 5.42).
Figure 5.42: Basic settings of the Create UMI Reads from Reads tool.
The following parameters currently have no effect on the settings used by the tool. They are provided so that a workflow in which they are set can receive application-specific parameter updates in future plugin releases.
- Sequencing technology: Illumina or Ion Torrent. If Ion Torrent is selected, the available Read structure options do not include paired-end reads.
- Analysis type: RNA Expression, RNAscan Fusions, or 3' UPX.
The other parameters have an effect:
- Minimum average quality score: UMI reads are discarded if their average Q-score is lower than this value.
- Minimum UMI read length: UMI reads shorter than this value are discarded.
- Minimum group size: The tool only creates a UMI read if the number of reads in the UMI group is at least this value.
- Enable advanced settings: use custom settings for grouping reads rather than the application-specific values determined by the combination of "Sequencing technology" and "Analysis type". This means that it will no longer matter if you choose Illumina or Ion Torrent as sequencing technology, and the Analysis type will also be disregarded. Ticking this box will enable an Advanced Settings wizard step.
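As an illustration of how the three filtering parameters act together, here is a minimal Python sketch. The `UmiRead` class, its field names and the use of an arithmetic mean of Phred scores as the "average Q-score" are assumptions; the tool may compute the average differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UmiRead:
    """Hypothetical container for a consensus UMI read."""
    sequence: str
    qualities: List[int]  # Phred scores, one per base
    group_size: int       # number of raw reads behind this UMI read


def passes_filters(read, min_avg_quality, min_length, min_group_size):
    """Return True if the UMI read satisfies all three thresholds."""
    if read.group_size < min_group_size:
        return False
    if len(read.sequence) < min_length:
        return False
    if not read.qualities:
        return False
    # Assumption: the average Q-score is the arithmetic mean of the Phred scores.
    average = sum(read.qualities) / len(read.qualities)
    return average >= min_avg_quality


good = UmiRead("ACGTACGT", [30] * 8, group_size=3)
short = UmiRead("ACG", [30] * 3, group_size=3)
low_quality = UmiRead("ACGTACGT", [10] * 8, group_size=3)
singleton = UmiRead("ACGTACGT", [30] * 8, group_size=1)

kept = [
    r for r in (good, short, low_quality, singleton)
    if passes_filters(r, min_avg_quality=20, min_length=5, min_group_size=2)
]
```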
The advanced settings are split into 'coarse clustering settings' and 'fine clustering settings' (figure 5.43). These are respectively used in the 'coarse clustering' and 'fine clustering' steps described above.
Figure 5.43: Advanced settings of the Create UMI Reads from Reads tool.
As both clustering steps use the same algorithm, their parameters are similar.
- Hasher type:
- Simple k-mer hasher: Hashes are computed for every k-mer of a read.
- Spaced seed hasher: Hashes are computed over multiple subsets of positions within the k-mer. For example, a spaced seed hasher of length 5 might make shorter 3-mers of positions 1,2,5 and 1,3,5.
- K-mer length:
- Simple k-mer hasher: A length in the range 2-32. Shorter k-mers lead to coarser clusters.
- Spaced seed hasher: A length of 5, 8, 12, or 16. Shorter k-mers lead to coarser clusters.
- Number of hashes: The total number of hash functions applied to each k-mer of the read.
- Similarity factor / Minimum similarity (same UMI) / Minimum similarity (similar UMI): hash functions are divided into groups of this size. For two reads to be linked, all the hash functions in the group must have the same minimum hash values for the two reads. Therefore the number of hashes must be an exact multiple of this number. The higher this value, the finer the clusters.
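Two of the settings above lend themselves to a short illustration: the spaced seed example (positions 1,2,5 and 1,3,5 of a 5-mer) and the requirement that the number of hashes is an exact multiple of the group size. The Python sketch below uses hypothetical function names and is not the tool's implementation; in particular, the position subsets the tool actually uses are not documented here.

```python
def spaced_seeds(kmer, masks):
    """Extract one shorter seed per mask; masks hold 1-based positions in the k-mer."""
    return ["".join(kmer[p - 1] for p in mask) for mask in masks]


def hash_groups(num_hashes, group_size):
    """Split hash function indices into consecutive groups of group_size."""
    if num_hashes % group_size != 0:
        raise ValueError("number of hashes must be an exact multiple of the group size")
    return [
        list(range(start, start + group_size))
        for start in range(0, num_hashes, group_size)
    ]


# The masks from the example in the text.
masks = [(1, 2, 5), (1, 3, 5)]
seeds_a = spaced_seeds("ACGTA", masks)
seeds_b = spaced_seeds("ACTTA", masks)  # same 5-mer with a mismatch at position 3

groups = hash_groups(6, 3)
```

Here the two 5-mers differ at position 3 but still share the seed built from positions 1,2,5, which shows how a spaced seed can tolerate a mismatch that would change the hash of the full k-mer.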
The remaining parameters are:
- Segment length: Only compute hashes over this number of bases at the start of a read. By only clustering on the start of the read, we reduce the chance of merging reads with different start positions (and which therefore likely come from different fragments) into a consensus UMI read.
- Set ambiguous nucleotide to N (checked by default): This option determines whether to attempt to error correct UMI reads, or to replace conflict positions with 'N' to indicate that there is uncertainty about the true nucleotide. Error correction works by majority vote, and is useful when the predominant source of error is expected to be sequencing error. Replacing conflict positions with 'N' is more conservative, effectively discarding a position. This can be desirable when errors are introduced by library preparation.
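The difference between the two behaviours of "Set ambiguous nucleotide to N" can be sketched in Python as follows. This is a simplified illustration only: it assumes the reads are already aligned and of equal length, it treats any disagreement in a column as a conflict, and it ignores quality scores, none of which is confirmed to match the tool's actual criteria. The function names are hypothetical.

```python
from collections import Counter


def consensus_column(bases, set_ambiguous_to_n=True):
    """Consensus for one alignment column.

    Assumption: any disagreement among the bases counts as a conflict.
    """
    counts = Counter(bases)
    if len(counts) == 1:
        return bases[0]          # all reads agree
    if set_ambiguous_to_n:
        return "N"               # conservative: mark the position as uncertain
    # Majority vote; on a tie, the base seen first wins (Counter ordering).
    return counts.most_common(1)[0][0]


def consensus(aligned_reads, set_ambiguous_to_n=True):
    """Column-wise consensus of equal-length, already aligned reads."""
    return "".join(
        consensus_column(column, set_ambiguous_to_n)
        for column in zip(*aligned_reads)
    )


aligned = ["ACGT", "ACGT", "ACTT"]
masked = consensus(aligned, set_ambiguous_to_n=True)
corrected = consensus(aligned, set_ambiguous_to_n=False)
```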