Create UMI Reads from Reads

The tool Create UMI Reads from Reads generates a single consensus read, called a UMI read, from reads that have the same or a similar UMI and a similar sequence. As input, the tool takes reads that have been processed by the Remove and Annotate with Unique Molecular Index tool. It is recommended that the input reads are also trimmed for adapters and homopolymers.

The output of the tool is a list of consensus UMI reads. Optional outputs include a QC Report and a list of discarded reads. The algorithm is loosely inspired by Mash (Ondov 2016), Linclust (Steinegger 2017) and Calib (Orabi 2017).

Known limitations: When provided with paired-end reads, the tool outputs consensus UMI reads for R1 of the pair only.

The algorithm involves clustering similar reads, merging reads in a cluster with the same UMI, and then filtering the merged reads. These steps are described in detail below:

Grouping reads

The tool makes extensive use of the minHash concept for Locality-Sensitive Hashing (LSH) to find clusters of similar reads. Briefly, all k-mers of each read are hashed with a number of hash functions, and for each of these hash functions the lowest hash value over all the k-mers is recorded. Two reads that share a lowest hash value are likely to share a k-mer and are linked together. In practice, a single hash function is not sufficiently specific to link related reads, and so sets of hash functions are used. Reads are only linked if each hash function in the set has the same lowest value for the two reads. The linked reads can be thought of as forming a graph, where each read is a node, and edges are made between the linked reads. Clusters of reads form disconnected subgraphs, i.e., the reads in a cluster are linked only to each other.
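The linking scheme described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the k-mer size, the number and composition of the hash function sets, the use of seeded BLAKE2 as the hash family, and the function names are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def min_hash(seq, k, seed):
    """Lowest hash value over all k-mers, for the hash function selected by `seed`."""
    return min(
        int.from_bytes(
            hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest(), "big"
        )
        for kmer in kmers(seq, k)
    )


def lsh_clusters(reads, k=8, hash_sets=((0, 1), (2, 3), (4, 5))):
    """Link reads that agree on every lowest hash value within one set of
    hash functions, then return the disconnected subgraphs as index lists.
    The default k and hash_sets are illustrative assumptions only."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for seeds in hash_sets:
        buckets = defaultdict(list)
        for idx, read in enumerate(reads):
            if len(read) < k:
                continue  # a read shorter than k has no k-mers to hash
            key = tuple(min_hash(read, k, seed) for seed in seeds)
            buckets[key].append(idx)
        for members in buckets.values():
            # All reads in one bucket are linked to each other.
            for other in members[1:]:
                parent[find(other)] = find(members[0])

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Requiring agreement on more hash functions per set makes a link more specific, while using more sets makes it more likely that related reads are linked at least once. This trade-off is what distinguishes a 'coarse' from a 'fine' configuration in the rounds described below.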

The tool uses LSH in three rounds:

1) In the first round reads are 'coarse clustered', so that it is very likely that similar reads are in the same cluster. The clusters are 'coarse' because it is also likely that unrelated reads are in the same cluster. The UMI is not used during coarse clustering.

2) In the second round each coarse cluster is 'fine clustered' by applying LSH again, but using different hash settings in order to get a more precise clustering. Reads in the resulting clusters are 'merged' if they have the same UMI.

3) In the third round all reads for a coarse cluster (merged or not) are again 'fine clustered' by LSH. Each disconnected subgraph is then pruned to only keep links between reads where (i) the UMIs differ by one mismatch, and (ii) either one of the reads is not merged, or the 'weight' of one of the reads is more than twice the weight of the other (this is the directional method in Smith 2017). The 'weight' of a merged read is the number of raw reads that were used to form the merged read.
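The pruning rule of the third round can be written down as a short Python sketch. The `Read` class, its field names and the helper functions are hypothetical and exist only for this illustration. Conditions (i) and (ii) are taken directly from the description above.

```python
from dataclasses import dataclass


@dataclass
class Read:
    umi: str
    weight: int = 1       # number of raw reads behind this read
    merged: bool = False  # True if the read is the result of a merge


def umi_mismatches(a, b):
    """Number of mismatching positions between two UMIs of equal length."""
    return sum(x != y for x, y in zip(a, b))


def keep_link(a, b):
    """Return True if the link between reads a and b survives pruning."""
    if len(a.umi) != len(b.umi):
        return False  # assumption: UMIs of different length are never linked
    # (i) the UMIs differ by exactly one mismatch
    one_mismatch = umi_mismatches(a.umi, b.umi) == 1
    # (ii) either one of the reads is not merged ...
    one_unmerged = not a.merged or not b.merged
    # ... or one weight is more than twice the other
    dominant = a.weight > 2 * b.weight or b.weight > 2 * a.weight
    return one_mismatch and (one_unmerged or dominant)


def prune(reads, links):
    """Keep only the links (pairs of read indices) that satisfy the rule."""
    return [(i, j) for i, j in links if keep_link(reads[i], reads[j])]
```

For example, a merged read with weight 10 stays linked to a merged read with weight 3 whose UMI differs by one base, but not to a merged read with weight 6, because 10 is not more than twice 6.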

Merging reads

The merging of similar UMI reads is performed using Multiple Sequence Alignment (MSA) based on the SPOA C++ library (https://github.com/rvaser/spoa). In general terms, the partial order alignment (POA) method creates a graph whose nodes represent the bases of the first sequence, and aligns the second sequence to the graph. Aligned bases of the two sequences are then merged in the graph. Every new sequence is aligned to the graph (using a modified version of the usual dynamic programming algorithms) and merged into the graph. As bases are merged into the graph, the edges between the nodes are updated to reflect their 'weight', which corresponds to the number of sequences in which they occurred. Finally, when all sequences are aligned, the graph is traversed to find the 'heaviest' path, which is the consensus sequence.
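The final traversal step can be sketched in Python as follows. This is a simplified illustration and not the SPOA algorithm: it assumes that the graph has already been built, that the nodes are numbered in topological order, and that the 'heaviest' path is the one with the greatest summed edge weight. The function name and the graph representation are hypothetical.

```python
def heaviest_path(bases, edges):
    """Return the consensus along the path with the greatest summed edge weight.

    bases: the base of each node; nodes are assumed to be numbered in
           topological order, so every edge (u, v) has u < v.
    edges: dict mapping (u, v) to a positive weight, i.e. the number of
           sequences that pass from node u to node v.
    """
    if not bases:
        return ""
    n = len(bases)
    incoming = [[] for _ in range(n)]
    for (u, v), weight in edges.items():
        incoming[v].append((u, weight))

    score = [0] * n     # best summed weight of any path ending at each node
    prev = [None] * n   # predecessor on that best path
    for v in range(n):
        for u, weight in incoming[v]:
            if score[u] + weight > score[v]:
                score[v] = score[u] + weight
                prev[v] = u

    # Walk back from the node with the best score to recover the path.
    node = max(range(n), key=lambda v: score[v])
    path = []
    while node is not None:
        path.append(bases[node])
        node = prev[node]
    return "".join(reversed(path))


# Graph for the three reads ACGT, ACGT and AGGT:
# node 0 = A, node 1 = C, node 2 = G (the variant base), node 3 = G, node 4 = T
bases = ["A", "C", "G", "G", "T"]
edges = {(0, 1): 2, (1, 3): 2, (0, 2): 1, (2, 3): 1, (3, 4): 3}
consensus = heaviest_path(bases, edges)  # "ACGT"
```

In this example the branch through C is supported by two reads and the branch through the variant G by only one, so the consensus follows the C branch.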

When sequences are aligned, mismatching nodes are connected by a special type of edge. Using these edges, it is possible to find all mismatches of a base in the consensus read. With this information in hand, the quality consensus is calculated using a method similar to [Hiatt et al., 2013], which is also used in the other UMI tools. Every node keeps a match and a mismatch quality value (because it is not possible to know at the time of graph construction if it will be a match or not). Every time a new sequence is added and a new node is merged, the quality values are updated using the quality scores of the new sequence. After calculating the consensus, the values in each node reflect the improved quality scores due to multiple sequences calling the same base. The quality consensus is obtained by processing the match and mismatch quality values as described in [Hiatt et al., 2013].

The tool can be found in the Toolbox at:

        Tools | QIAseq Panel Expert Tools | QIAseq DNA Panel Expert Tools | Create UMI Reads from Reads

In the first dialog, select the reads saved as a sequence list (figure 5.41).

Image umireadsfromreadsinput
Figure 5.41: Select the sequencing reads by double-clicking on the file name or by clicking once on the file name and then on the arrow pointing to the right-hand side.

The next dialog of the tool allows you to configure several parameters (figure 5.42).

Image umireadsfromreadsbasicsettings
Figure 5.42: Basic settings of the Create UMI Reads from Reads tool.

The following parameters currently have no effect on the settings used by the tool. They are provided so that when they are set in a workflow, then that workflow will be able to receive application-specific parameter updates in future plugin releases.

The other parameters have an effect:

The advanced settings are split into 'coarse clustering settings' and 'fine clustering settings' (figure 5.43). These are respectively used in the 'coarse clustering' and 'fine clustering' steps described above.

Image umireadsfromreadsadvancedsettings
Figure 5.43: Advanced settings of the Create UMI Reads from Reads tool.

As both clustering steps use the same algorithm, their parameters are similar.

The remaining parameters are: