General notes on UMIs
Some NGS library preparation protocols use Unique Molecular Indexes (UMIs) to improve performance, for example by:
- identifying and correcting sequencing errors to allow higher sensitivity in variant calling
- eliminating library amplification and sequencing biases in RNA quantification
UMIs are usually located on the reads. The UMIs on the imported reads can be processed by tools delivered by the Biomedical Genomics Analysis plugin.
Various platforms offer the option to remove the UMIs from the reads and instead add the UMI information to the read headers in the FASTQ file. During import, UMIs are extracted from read headers if the header of the first read in the file contains UMI information in one of the following two formats:
<sample info>:<UMI> <read number>:<more sample info>
<sample info>_<UMI> <read number>:<more sample info>
The read header must contain exactly one space, between the <UMI> and <read number>. The imported sequences are annotated with the <UMI>. The allowed characters in the <UMI> are A, C, G, T and N. For paired reads, the <UMI> may contain one + (plus sign), separating the UMIs for each read in the pair, in which case the reads are annotated with the concatenated UMIs, i.e. the <UMI> without the +.
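
The header rules above can be illustrated with a short Python sketch. This is not the importer's actual implementation. The function name extract_umi and the regular expression are hypothetical, and the sketch assumes that the UMI is the last colon- or underscore-separated field before the single space and that only uppercase characters are accepted.

```python
import re

# Hypothetical pattern for the two header formats described above:
#   <sample info>:<UMI> <read number>:<more sample info>
#   <sample info>_<UMI> <read number>:<more sample info>
# Assumption: the UMI is the last ':'- or '_'-separated field before the space.
HEADER_RE = re.compile(
    r"^(?P<sample>\S*)[:_]"
    r"(?P<umi>[ACGTN]+(?:\+[ACGTN]+)?)"  # A, C, G, T, N and at most one '+'
    r" (?P<read>[^\s:]+):(?P<rest>\S*)$"
)


def extract_umi(header):
    """Return the UMI annotation for a FASTQ read header, or None.

    A '+' separating the UMIs of a read pair is dropped, so the result
    is the concatenation of the two UMIs.
    """
    header = header.rstrip("\n").lstrip("@")
    # The header must contain exactly one space.
    if header.count(" ") != 1:
        return None
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return match.group("umi").replace("+", "")


if __name__ == "__main__":
    examples = [
        "@M0:1:FC:1:1101:1000:2000:ACGTNACG 1:N:0:1",  # colon format
        "@M0:1:FC_ACGT+TTGA 1:N:0:1",                  # underscore format, paired UMIs
        "@M0:1:FC:1:1101:1000:2000 1:N:0:1",           # no UMI in header
    ]
    for example in examples:
        print(example, "->", extract_umi(example))
```

The example headers are made up for illustration. A header with no UMI field, with more than one space, or with characters other than A, C, G, T and N in the UMI position yields None in this sketch.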