The tool Create UMI Reads from Reads generates a single consensus read, called a UMI read, from reads that have the same or similar UMI, and similar sequences. It can be used to process single end reads and paired end reads, including reads generated by duplex sequencing methods. Reads must be preprocessed by the Remove and Annotate with Unique Molecular Index tool before running Create UMI Reads from Read. It is recommended that the input reads are also trimmed for adapters and homopolymers.
This tool outputs a list of consensus UMI reads. Optional outputs include a QC Report and a list of discarded reads.
The algorithm is loosely inspired by Mash (Ondov 2016), Linclust (Steinegger 2017) and Calib (Orabi 2017) and involves clustering similar reads, merging reads in a cluster with the same UMI, and then filtering the merged reads. These steps are described in detail below:
The tool makes extensive use of the minHash concept for Locality-Sensitive Hashing (LSH) to find clusters of similar reads. Briefly, all k-mers of each read are hashed with a number of hash functions, and for each of these hash functions the lowest hash value over all the k-mers is recorded. Two reads that share a lowest hash value are likely to share a k-mer and are linked together. In practice, a single hash function is not sufficiently specific to link related reads, and so sets of hash functions are used. Reads are only linked if each hash function in the set has the same lowest value for the two reads. The linked reads can be thought of as forming a graph, where each read is a node, and edges are made between the linked reads. Clusters of reads form disconnected subgraphs i.e. they are only linked to each other.
The tool uses LSH in three rounds:
1) In the first round reads are 'coarse clustered', so that it is very likely that similar reads are in the same cluster. The clusters are 'coarse' because it is also likely that unrelated reads are in the same cluster. The UMI is not used during coarse clustering.
2) In the second round each coarse cluster is 'fine clustered' by applying LSH again, but using different hash settings in order to get a more precise clustering. Reads in the resulting clusters are 'merged' if they have the same UMI.
3) In the third round all reads for a coarse cluster (merged or not) are again 'fine clustered' by LSH. Each disconnected subgraph is then pruned to only keep links between reads where (i) the UMIs differ by one mismatch, and (ii) either one of the reads is not merged, or the `weight' of one of the reads is more than twice the weight of the other (this is the directional method in Smith 2017). The `weight' of a merged read is the number of raw reads that were used to form the merged read.
The merging of similar UMI reads is performed using Multiple Sequence Alignment (MSA) based on the SPOA C++ library (https://github.com/rvaser/spoa). In general terms, the POA method creates a graph whose nodes represent the bases of the first sequence, and aligns the second sequence to the graph. Aligned bases of the first/second sequence are then merged in the graph. Every new sequence is aligned to the graph (using a modified version of the usual dynamic programming algorithms) and merged into the graph. As bases are merged into the graph, the edges between the nodes are updated to reflect their `weight', which corresponds to the number of times they occurred in multiple sequences. Finally, when all sequences are aligned, the graph is traversed to find the `heaviest' path, which is the consensus sequence.
When sequences are aligned, mismatching nodes are connected by a special type of edge. Using these edges, it is possible to find all mismatches of a base in the consensus read. With this information in hand, the quality consensus is calculated using a method similar to [Hiatt et al., 2013], which is also used in the other UMI tools. Every node keeps a match and a mismatch quality value (because it is not possible to know at the time of graph construction if it will be a match or not). Every time a new sequence is added and a new node is merged the quality values are updated using the quality scores of the new sequence. After calculating the consensus, the values in each node reflect the improved quality scores due to multiple sequences calling the same base. The quality consensus is obtained by processing the match and a mismatch quality value accordingly as described in [Hiatt et al., 2013].
Reads produced by the TSO (TruSight Oncology 500, Illumina) duplex sequencing protocol can be processed to create a duplex consensus if reads from both strands are available. To create this consensus, reads probably originating from the same molecule and strand are grouped based on their similarity, UMI sequence and strand information. Single-strand consensus sequences are generated from these groups. Then, the consensus sequences of the opposing strands are combined to generate the duplex consensus sequences.
Remove and Annotate with Unique Molecular Index for further information).
Create UMI Reads from Reads can be found in the Toolbox at:
Tools | QIAseq Panel Expert Tools () | QIAseq DNA Panel Expert Tools () | Create UMI Reads from Reads ()
In the first dialog, select the sequence list containing the reads (figure 5.41).
The next dialog allows you to configure basic settings for this tool, as shown in figure 5.42 and described below.
- Read structure The read type and how they should be processed. The options are:
- Single end reads When selected, each step of the algorithm described above is performed on each read.
- Paired end reads When selected, each step of the algorithm described above is performed on read 1 (R1) and read 2 (R2) separately. During the hashing step, specific hash nodes are created only for R1. The purpose is to keep track of the UMI barcode and the LSH hashes. To link R1 and R2 a paired UMI reads object is created. In the final part of merging, reads are only combined in paired reads if R1 and R2 are similar. Strand specificity is not considered when this option is selected.
- Paired end reads (discard read 2) Only the R1 member of a pair is processed. Discarding R2 is useful when it does not contain any biological information.
- Paired end reads with Duplex UMI (TSO) Appropriate for reads generated using the the TSO protocol. A duplex consensus will be generated. Here, both reads are hashed and added to the list for grouping. R1 hashes together with "alpha + beta" UMI, and R2 hashes together with "beta + alpha" UMI. In this way, reads from both strands end up in the same "coarse" groups. In the final part of merging, reads are only combined in paired reads if R1 and R2 are similar, the same as for the "Paired end reads" case, described above.
- Minimum UMI read length: UMI reads shorter than this value will be discarded.
- Minimum average quality score: UMI reads will be discarded, if their average Q-score is lower than the value specified here.
- Minimum group size: The tool will only create a UMI read if the number of reads in the UMI group is at least the size specified here.
- Keep duplex consensus reads only: When selected, only UMI reads built from reads representing both strands are retained. This is only relevant when a duplex UMI option has been selected for the "Read structure" setting.
- Enable advanced settings: When enabled, the advanced settings, available in the next wizard step, can be configured.
The next wizard step lists the advanced settings, which are organized in three categories: "Consensus options", "Coarse grouping" options, used for the coarse clustering step, and "Fine grouping" options, using for the fine clustering step (figure 5.43).
As both clustering steps use the same algorithm, their parameters are similar.
- Hasher type:
- Simple k-mer hasher: Hashes are computed for every k-mer of a read.
- Spaced seed hasher: Hashes are computed over multiple subsets of positions within the k-mer. For example, a spaced seed hasher of length 5 might make shorter 3-mers of positions 1,2,5 and 1,3,5.
- K-mer length:
- Simple k-mer hasher: A length in the range 2-32. Shorter k-mers lead to coarser clusters.
- Spaced seed hasher: A length of 5, 8, 12, or 16. Shorter k-mers lead to coarser clusters.
- Number of hashes: the total number of hash functions applied to each k-mer of the read
- Similarity factor / Minimum similarity (same UMI) / Minimum similarity (similar UMI): hash functions are divided into groups of this size. For two reads to be linked, all the hash functions in the group must have the same minimum hash values for the two reads. Therefore the number of hashes must be an exact multiple of this number. The higher this value, the finer the clusters.
The remaining parameters are:
- Segment length: Only compute hashes over this number of bases at the start of a read. By only clustering on the start of the read, we reduce the chance of merging reads with different start positions (and which therefore likely come from different fragments) into a consensus UMI read.
- Set ambiguous nucleotide to N (checked by default): This option determines whether to attempt to error correct UMI reads, or to replace conflict positions with 'N' to indicate that there is uncertainty about the true nucleotide. Error correction works by majority vote, and is useful when the predominant source of error is expected to be sequencing error. Replacing conflict positions with 'N' is more conservative, effectively discarding a position. This can be desired when errors are introduced by library preparation.