Create UMI Reads from Reads

The tool Create UMI Reads from Reads generates a single consensus read, called a UMI read, from reads that have the same or similar UMI and similar sequences. It can process single-end and paired-end reads, including reads generated by duplex sequencing methods. Reads must be preprocessed by the Remove and Annotate with Unique Molecular Index tool before running Create UMI Reads from Reads. It is recommended that the input reads are also trimmed for adapters and homopolymers.

This tool outputs a list of consensus UMI reads. Optional outputs include a QC Report and a list of discarded reads.

The algorithm is loosely inspired by Mash (Ondov 2016), Linclust (Steinegger 2017) and Calib (Orabi 2017) and involves clustering similar reads, merging reads in a cluster with the same UMI, and then filtering the merged reads. These steps are described in detail below:

Grouping reads

The tool makes extensive use of the minHash concept for Locality-Sensitive Hashing (LSH) to find clusters of similar reads. Briefly, all k-mers of each read are hashed with a number of hash functions, and for each of these hash functions the lowest hash value over all the k-mers is recorded. Two reads that share a lowest hash value are likely to share a k-mer and are linked together. In practice, a single hash function is not sufficiently specific to link related reads, so sets of hash functions are used. Reads are only linked if each hash function in the set has the same lowest value for the two reads. The linked reads can be thought of as forming a graph, where each read is a node and edges connect the linked reads. Clusters of reads form disconnected subgraphs, i.e., the reads in a cluster are only linked to each other.
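The linking step described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names, the k-mer size, the use of seeded BLAKE2b as the hash functions, the rule that a match on any one set of hash functions is enough to link two reads, and the union-find bookkeeping are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Seeded 64-bit hash; each seed stands in for one hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def signature(seq, k, seeds):
    """Lowest hash value over all k-mers, one value per hash function."""
    ks = kmers(seq, k)
    return tuple(min(hash_kmer(km, s) for km in ks) for s in seeds)


def cluster_reads(reads, k=8, seed_sets=((1, 2), (3, 4))):
    """Link reads whose lowest hash values agree for every hash function
    in at least one set (assumption), and return the connected components
    as sorted lists of read indices."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for seeds in seed_sets:
        first_seen = {}
        for i, read in enumerate(reads):
            if len(read) < k:
                continue  # too short to contain a k-mer
            sig = signature(read, k, seeds)
            if sig in first_seen:
                # Same lowest values for the whole set: add an edge.
                parent[find(i)] = find(first_seen[sig])
            else:
                first_seen[sig] = i

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Calling `cluster_reads(["ACGTACGTTGCA", "ACGTACGTTGCA", "CCCCCCCCCCCC"])` places the two identical reads in one cluster and the unrelated read in its own. Requiring agreement across a whole set of hash functions makes a link more specific, while trying several sets gives related reads more than one chance to be linked.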

The tool uses LSH in three rounds:

1) In the first round reads are 'coarse clustered', so that it is very likely that similar reads are in the same cluster. The clusters are 'coarse' because it is also likely that unrelated reads are in the same cluster. The UMI is not used during coarse clustering.

2) In the second round each coarse cluster is 'fine clustered' by applying LSH again, but using different hash settings in order to get a more precise clustering. Reads in the resulting clusters are 'merged' if they have the same UMI.

3) In the third round all reads for a coarse cluster (merged or not) are again 'fine clustered' by LSH. Each disconnected subgraph is then pruned to only keep links between reads where (i) the UMIs differ by one mismatch, and (ii) either one of the reads is not merged, or the 'weight' of one of the reads is more than twice the weight of the other (this is the directional method in Smith 2017). The 'weight' of a merged read is the number of raw reads that were used to form the merged read.
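The pruning rule in the third round can be written as a small predicate. The sketch below is illustrative only: the `Read` record, its field names and the function names are hypothetical, and an unmerged read is assumed to have weight 1.

```python
from dataclasses import dataclass


@dataclass
class Read:
    umi: str       # UMI sequence
    weight: int    # number of raw reads behind this read (assumed 1 if unmerged)
    merged: bool   # True if the read was formed by merging raw reads


def umi_mismatches(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def keep_link(a, b):
    """Apply conditions (i) and (ii) to a link between two reads."""
    # (i) the UMIs must differ by exactly one mismatch
    if umi_mismatches(a.umi, b.umi) != 1:
        return False
    # (ii) either one of the reads is not merged ...
    if not a.merged or not b.merged:
        return True
    # ... or one weight is more than twice the other
    return a.weight > 2 * b.weight or b.weight > 2 * a.weight
```

For example, a link between merged reads with UMIs `ACGT` (weight 10) and `ACGA` (weight 3) is kept, whereas the same pair with weights 4 and 3 is pruned.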

Merging reads

The merging of similar UMI reads is performed using Multiple Sequence Alignment (MSA) based on the SPOA C++ library (https://github.com/rvaser/spoa). In general terms, the partial order alignment (POA) method creates a graph whose nodes represent the bases of the first sequence, and aligns the second sequence to the graph. Aligned bases of the two sequences are then merged in the graph. Each subsequent sequence is aligned to the graph (using a modified version of the usual dynamic programming algorithms) and merged into it. As bases are merged into the graph, the edges between the nodes are updated to reflect their 'weight', which corresponds to the number of sequences in which they occur. Finally, when all sequences are aligned, the graph is traversed to find the 'heaviest' path, which is the consensus sequence.
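The final traversal step can be illustrated with a simplified Python sketch. It does not use SPOA and does not perform the alignment: it assumes the sequences are already aligned, gap-free and of equal length, so that each graph node is simply a (position, base) pair. The function names and the virtual start node are hypothetical choices for this sketch.

```python
from collections import defaultdict


def build_graph(seqs):
    """Edge weights: the number of sequences that traverse each edge."""
    edges = defaultdict(int)
    for s in seqs:
        # A virtual start node lets the first base carry a weight too.
        path = [(-1, "^")] + list(enumerate(s))
        for u, v in zip(path, path[1:]):
            edges[(u, v)] += 1
    return edges


def heaviest_path(seqs):
    """Consensus as the path with the highest total edge weight."""
    if not seqs or not seqs[0]:
        raise ValueError("need at least one non-empty sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("this sketch assumes equal-length, pre-aligned reads")

    edges = build_graph(seqs)
    start = (-1, "^")
    best = {start: (0, None)}  # node -> (best score so far, predecessor)
    # Sorting by (position, base) visits edges in topological order.
    for (u, v), w in sorted(edges.items()):
        score = best[u][0] + w
        if v not in best or score > best[v][0]:
            best[v] = (score, u)

    # Pick the best-scoring node at the last position, then trace back.
    ends = sorted(n for n in best if n[0] == length - 1)
    node = max(ends, key=lambda n: best[n][0])
    bases = []
    while node != start:
        bases.append(node[1])
        node = best[node][1]
    return "".join(reversed(bases))
```

With the inputs `["ACGT", "ACGT", "AGGT"]`, the path through `ACGT` has total weight 10 and the path through `AGGT` has total weight 8, so the sketch returns `ACGT`. A real POA graph also handles insertions and deletions, which this simplification leaves out.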

When sequences are aligned, mismatching nodes are connected by a special type of edge. Using these edges, it is possible to find all mismatches of a base in the consensus read. With this information in hand, the quality consensus is calculated using a method similar to [Hiatt et al., 2013], which is also used in the other UMI tools. Every node keeps a match and a mismatch quality value (because it is not possible to know at the time of graph construction whether it will be a match or not). Every time a new sequence is added and a new node is merged, the quality values are updated using the quality scores of the new sequence. After calculating the consensus, the values in each node reflect the improved quality scores due to multiple sequences calling the same base. The quality consensus is obtained by processing the match and mismatch quality values as described in [Hiatt et al., 2013].

Duplex UMI

Reads produced by the TSO (TruSight Oncology 500, Illumina) duplex sequencing protocol can be processed to create a duplex consensus if reads from both strands are available. To create this consensus, reads that likely originate from the same molecule and strand are grouped based on their similarity, UMI sequence and strand information. Single-strand consensus sequences are generated from these groups. Then, the consensus sequences of the opposing strands are combined to generate the duplex consensus sequences.

Background information:

A protocol is considered duplex if the unique molecular barcode(s) attached to the target molecules include enough information to determine if a read originates from the same or from the opposite strand of the same target molecule. For the TSO protocol, the dual stranded reads have different barcodes attached to each end of the fragments, resulting in each read having two different barcodes named alpha and beta. Reads with the same orientation have identical alpha and beta tags, whereas for reverse complements of the reads, the alpha tag of one read matches the beta tag of the other and vice versa (see Remove and Annotate with Unique Molecular Index for further information).
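The tag relationship described above can be expressed as a short Python sketch. The function names are hypothetical, and the sketch assumes that tags are compared as exact strings, with no mismatch tolerance and no reverse-complementing of the tag sequences themselves; the tool may handle these details differently.

```python
def strand_relation(tags_a, tags_b):
    """Classify two reads by their (alpha, beta) tag pairs."""
    alpha_a, beta_a = tags_a
    alpha_b, beta_b = tags_b
    if (alpha_a, beta_a) == (alpha_b, beta_b):
        return "same"       # identical alpha and beta tags
    if (alpha_a, beta_a) == (beta_b, alpha_b):
        return "opposite"   # alpha matches beta and vice versa
    return "unrelated"


def molecule_key(tags):
    """Order-independent key shared by both strands of one molecule."""
    return tuple(sorted(tags))
```

Under these assumptions, `strand_relation(("AAC", "GTT"), ("GTT", "AAC"))` returns `"opposite"`, and both reads share the key `("AAC", "GTT")`, so single-strand consensus reads from opposing strands could be paired by that key.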

Running the tool

Create UMI Reads from Reads can be found in the Toolbox at:

        Toolbox | Biomedical Genomics Analysis | UMI Tools | Create UMI Reads from Reads

In the first dialog, select the sequence list containing the reads (figure 4.6).

Image umireadsfromreadsinput
Figure 4.6: Select the sequencing reads by double-clicking on the file name, or by clicking once on the file name and then on the arrow pointing to the right-hand side.

The next dialog allows you to configure basic settings for this tool, as shown in figure 4.7 and described below.

Image umireadsfromreadsbasicsettings
Figure 4.7: Basic settings of the Create UMI Reads from Reads tool.

The next wizard step lists the advanced settings, which are organized in three categories: "Consensus options", "Coarse grouping" options, used for the coarse clustering step, and "Fine grouping" options, used for the fine clustering step (figure 4.8).

Image umireadsfromreadsadvancedsettings
Figure 4.8: Advanced settings of the Create UMI Reads from Reads tool.

As both clustering steps use the same algorithm, their parameters are similar.

The remaining parameters are:

Click Next to Open or Save the sequence list of merged UMI reads. It is also possible to generate a report that will indicate how many reads were ignored and the reason why they were not included in a UMI read. The report also contains group size (see UMI group sizes) and quality score statistics useful for QC.

This report can be used together with the Combine Reports tool (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Combine_Reports.html).