Calculate Unique Molecular Index Groups
The Calculate Unique Molecular Index Groups tool annotates mapped reads with a "Unique Molecular Index group ID" that is identical for all reads determined to belong to the same UMI. The tool can be found in the Toolbox at:
Toolbox | Biomedical Genomics Analysis | UMI Tools | Calculate Unique Molecular Index Groups
In the first dialog (figure 4.2), select a read mapping of reads that were previously annotated with UMI annotations.
Figure 4.2: Select a read mapping made from reads whose UMI was removed and annotated on the sequences.
The grouping of reads into UMI groups works as follows:
- The tool first groups reads that
- start at the same position, based on the end of the read to which the UMI is ligated. This depends on the read structure used in the Remove and Annotate with Unique Molecular Index tool: for example, if the UMI was removed from the start of read 2, this tool groups reads where the start of read 2 maps to the same position,
- are from the same strand, and
- have identical UMIs.
- Groups are then candidates for merging if
- their start positions are sufficiently close, as defined by the Window size parameter, and
- their UMIs are similar enough, as defined by the Fuzzy match Unique Molecular Indices parameter.
Merging is only done if the larger group is sufficiently large compared to the smaller group as defined by the parameters described below. If a smaller group can be merged into multiple larger groups that are equally good in terms of similarity of UMI and start position as well as group size, the group will not be merged.
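The initial grouping step described above can be illustrated with a minimal Python sketch. This is not the tool's implementation; the Read record, its field names, and the initial_groups function are hypothetical and only mirror the three criteria listed (same start position, same strand, identical UMI).

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical record; the field names are illustrative only.
    name: str
    start: int   # mapped start of the read end to which the UMI was ligated
    strand: str  # "+" or "-"
    umi: str


def initial_groups(reads):
    """Group reads that share start position, strand, and exact UMI."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.start, read.strand, read.umi)].append(read.name)
    return dict(groups)


reads = [
    Read("r1", 100, "+", "ACGT"),
    Read("r2", 100, "+", "ACGT"),
    Read("r3", 100, "-", "ACGT"),  # same position and UMI, other strand
    Read("r4", 101, "+", "ACGA"),  # nearby position, one UMI mismatch
]
groups = initial_groups(reads)
```

In this sketch r1 and r2 form one group, while r3 and r4 each start their own group. Whether the r4 group would later be merged into the r1/r2 group depends on the merging parameters.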
It is possible to change the following parameters (figure 4.3):
Figure 4.3: Select a read mapping made from reads whose UMI was removed and annotated on the sequences.
- Fuzzy match Unique Molecular Indices: Method for deciding which UMIs are considered similar enough for merging:
- Do not fuzzy match: Groups will be merged only if they match exactly.
- Allow one mismatch: Groups will be merged if they are at most one mismatch apart.
- Allow one mismatch/deletion/insertion: Groups will be merged if they are at most one mismatch, deletion or insertion apart.
- Distance: Groups will be merged if their edit distance (also called Levenshtein distance) is smaller than the value given in Max UMI distance. Note that if the distance is greater than 1, the groups also have to satisfy a stricter requirement for the ratio between their sizes.
- Max UMI distance: The maximum edit distance allowed if Distance is selected as Fuzzy match Unique Molecular Indices.
- Exclude ambiguously mapped reads: This option is checked by default.
- Maximum relative size difference between merged groups: Small groups are merged into bigger ones if the size ratio between the two groups is smaller than this value (0.1 by default). The distance between two groups is defined as the number of differences between their UMIs (which can only be greater than one if "Distance" is chosen as Fuzzy match Unique Molecular Indices), plus one if their start positions are not the same. The size ratio parameter is raised to the power of the distance; for example, with the default value and a distance of two, the smaller group should be at most 0.1^2 = 0.01 times the size of the larger group.
- Always merge singleton groups: When this option is checked, a singleton UMI group (a group that contains only one read) is merged with a non-singleton group at distance 1 even if the "Maximum relative size difference between merged groups" threshold is not met.
- Window size: Groups will be merged if the difference between their start positions is less than this value.
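The size requirement can be made concrete with a small Python sketch. It assumes that the edit distance is the standard Levenshtein distance and that the comparison is "at most" (the text above uses both "smaller than" and "at most"). The function names and the always_merge_singletons argument are hypothetical and are not part of the tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two UMIs, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def group_distance(umi_a: str, start_a: int, umi_b: str, start_b: int) -> int:
    """UMI differences, plus one if the start positions differ."""
    return edit_distance(umi_a, umi_b) + (1 if start_a != start_b else 0)


def size_allows_merge(small: int, large: int, distance: int,
                      max_ratio: float = 0.1,
                      always_merge_singletons: bool = False) -> bool:
    """Check the size requirement; max_ratio is raised to the power of distance."""
    if always_merge_singletons and small == 1 and large > 1 and distance == 1:
        return True  # singleton exception described above
    # Assumption: "at most" is treated as inclusive here.
    return small <= large * max_ratio ** distance


d = group_distance("ACGT", 100, "ACGA", 101)  # one mismatch plus shifted start
```

With the default ratio of 0.1, a group of 5 reads passes the size requirement against a group of 100 reads at distance 1 (limit 10 reads) but not at distance 2 (limit 1 read).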
Click Next to choose whether to Open or Save the resulting read mapping of reads which now have a "UMI group ID" annotation.
A report can also be generated. It contains:
- A summary table with the following information:
- Reads in input: Reads that were aligned to the reference
- Reads mapped multiple places (discarded): Reads that aligned to the reference in multiple places and were thus discarded
- Groups merged
- Groups not merged due to >1 candidate of equal size
- Group size table and plots described in UMI group sizes.
Note: Very large group sizes (in most cases, more than 10 reads in a UMI group is not desirable) can indicate problems, such as quality issues with the sample. They can also indicate that the sequencing depth could be reduced.
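As a rough illustration of how group sizes relate to this note, the following Python sketch tabulates group sizes from a list of UMI group IDs (one entry per read) and reports the fraction of groups above a size limit. The function name and the input format are hypothetical; the limit of 10 is taken from the note above, and the report itself is produced by the tool, not by this code.

```python
from collections import Counter


def group_size_summary(group_ids, large_threshold=10):
    """Return (histogram of group sizes, fraction of groups above threshold)."""
    sizes = Counter(group_ids)           # reads per UMI group ID
    histogram = Counter(sizes.values())  # number of groups for each group size
    if not sizes:
        return histogram, 0.0
    n_large = sum(1 for size in sizes.values() if size > large_threshold)
    return histogram, n_large / len(sizes)


# Hypothetical input: one UMI group ID per read.
ids = ["g1"] * 12 + ["g2"] * 3 + ["g3"] * 3
histogram, fraction_large = group_size_summary(ids)
```

Here one of the three groups contains more than 10 reads, so a third of the groups would be above the limit.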
This report can be used together with the Combine Reports tool (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Combine_Reports.html).