UMI grouping
All of the LightSpeed tools can group reads based on Unique Molecular Identifiers (UMIs).
The UMI sequence is recorded and removed from the reads before trimming and mapping. After the reads have been mapped, reads with similar UMI sequences and mapping positions are merged into a consensus UMI read.
The consensus is calculated following these rules:
- At conflicting positions, the most common base is included in the consensus read.
- If the conflicting bases are equally represented, the consensus can be generated in two ways:
  - When one of the bases at the conflicting position is identical to the reference symbol, the reference symbol is included in the consensus read.
  - When none of the bases at the conflicting position is identical to the reference symbol, an N is inserted in the consensus read.
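The consensus rules above can be illustrated with a minimal Python sketch. It is not the implementation used by the tools. The function names consensus_base and consensus_read are hypothetical, the reads are assumed to be already aligned to the same reference coordinates and of equal length, and a tie is assumed to be resolved in favor of the reference only when the reference symbol is one of the tied bases.

```python
from collections import Counter


def consensus_base(bases, reference_base):
    """Pick the consensus symbol for one aligned position (illustrative only)."""
    counts = Counter(bases)
    top = max(counts.values())
    # All bases that share the highest count at this position.
    tied = [base for base, n in counts.items() if n == top]
    if len(tied) == 1:
        # A single most common base wins.
        return tied[0]
    if reference_base in tied:
        # Tie that includes the reference symbol: use the reference symbol.
        return reference_base
    # Tie that does not include the reference symbol: insert N.
    return "N"


def consensus_read(reads, reference):
    """Build a consensus from equal-length, pre-aligned reads (an assumption)."""
    columns = zip(*reads)
    return "".join(consensus_base(col, ref) for col, ref in zip(columns, reference))


print(consensus_read(["ACGT", "ACGT", "ACTT"], "ACGT"))  # ACGT (majority wins)
print(consensus_read(["AC", "AT"], "AC"))                # AC (tie, reference is C)
print(consensus_read(["AC", "AT"], "AG"))                # AN (tie, reference is neither)
```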
The following options can be used to adjust how raw reads are grouped into UMI reads:
- Minimum group size: UMI reads must consist of at least this many raw reads. UMI reads based on fewer reads than the minimum group size are discarded.
- Maximum UMI differences: Only reads that have this number or fewer differences between their UMI sequences can be merged into UMI reads.
- UMI window size: Only reads that start within this many bases of each other can be merged into UMI reads. Both R1 and R2 are considered.
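To make the interplay of the three options concrete, the following Python sketch groups reads with a simple greedy strategy. It does not reflect how the tools group reads internally, which is not described here. The Read type, the function names, and the greedy comparison against the first read of each group are all assumptions made for illustration. The sketch also assumes that UMI differences are counted as mismatches between equal-length UMIs and that both the R1 and the R2 start positions must fall within the window.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read pair: UMI plus mapped start positions."""
    umi: str
    r1_start: int
    r2_start: int


def umi_differences(a, b):
    """Count mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def group_reads(reads, min_group_size, max_umi_differences, window_size):
    """Greedy grouping sketch; each read is compared to the first read of a group."""
    groups = []
    for read in reads:
        for group in groups:
            seed = group[0]
            if (umi_differences(read.umi, seed.umi) <= max_umi_differences
                    and abs(read.r1_start - seed.r1_start) <= window_size
                    and abs(read.r2_start - seed.r2_start) <= window_size):
                group.append(read)
                break
        else:
            # No compatible group was found, so this read starts a new one.
            groups.append([read])
    # Groups smaller than the minimum group size are discarded.
    return [group for group in groups if len(group) >= min_group_size]


reads = [
    Read("AAAA", 100, 300),
    Read("AAAT", 101, 300),  # one UMI difference, one base away: same group
    Read("CCCC", 100, 300),  # different UMI: separate group
    Read("AAAA", 500, 700),  # same UMI but outside the window: separate group
]
kept = group_reads(reads, min_group_size=2, max_umi_differences=1, window_size=5)
print(len(kept))  # 1
```

With a minimum group size of 2, only the group containing the first two reads is kept; the two singleton groups are discarded.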
Note that the maximum number of reads used for creating a UMI consensus read is 20,000. Therefore, UMI groups with more than 20,000 reads will be merged into more than one consensus UMI read.