Overview

The large gap mapper maps reads to a reference while allowing for large gaps in the mapping. It was developed to support transcript discovery using RNA-seq data, since it can map RNA-seq reads that span introns without requiring prior transcript annotations. However, it can equally well be used for mapping any other type of reads, for example to find large deletions in genomic data.

The large gap mapper works by iteratively applying the standard read mapper of the CLC Genomics Workbench to each read as follows:

  1. Find the best match for the read.
  2. If the match is good enough (according to the settings, see below), the read is mapped to this position.
  3. If there is an unaligned end which is long enough for the mapper to handle (17 bp for standard mapping, 18 bp for mapping in color space), this part of the read is used as input to step 1.
  4. This continues until no read has an unaligned end that is long enough to be mapped again (17 bp for standard mapping, 18 bp for mapping in color space). For 100 bp reads, this usually means at most three rounds of mapping (corresponding to a read that spans two introns).

The matched region of the read identified in the first round of the mapping is called the seed segment (or just 'seed'). Matched regions found in later rounds are called non-seed segments.
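
The iteration and the seed/non-seed distinction can be illustrated with a small Python sketch. This is not the Workbench implementation: the function best_match is a hypothetical stand-in for the standard read mapper that only finds the longest exact match, the "good enough" settings are reduced to a single minimum length, and the names Segment, best_match and large_gap_map are invented for this example. The 17 bp threshold is taken from the text and is read here as "17 bp or longer", which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# From the text: 17 bp for standard mapping (18 bp in color space).
MIN_SEGMENT_LEN = 17


@dataclass
class Segment:
    read_start: int  # inclusive position in the read
    read_end: int    # exclusive position in the read
    ref_start: int   # position of the match in the reference
    is_seed: bool    # True only for the segment found in the first round


def best_match(fragment: str, reference: str,
               min_len: int) -> Optional[Tuple[int, int, int]]:
    """Toy stand-in for the read mapper (assumption): return the longest
    exact match as (offset in fragment, length, reference position)."""
    for length in range(len(fragment), min_len - 1, -1):
        for offset in range(len(fragment) - length + 1):
            pos = reference.find(fragment[offset:offset + length])
            if pos != -1:
                return offset, length, pos
    return None  # nothing of at least min_len matched


def large_gap_map(read: str, reference: str,
                  min_len: int = MIN_SEGMENT_LEN) -> List[Segment]:
    segments: List[Segment] = []
    pending = [(0, len(read))]  # unaligned stretches still to be tried
    while pending:
        start, end = pending.pop()
        if end - start < min_len:
            continue  # too short for the mapper to handle
        hit = best_match(read[start:end], reference, min_len)
        if hit is None:
            continue  # no good enough match; stretch stays unaligned
        offset, length, ref_pos = hit
        seg_start = start + offset
        # The first segment ever found is the seed.
        segments.append(Segment(seg_start, seg_start + length, ref_pos,
                                is_seed=not segments))
        # Unaligned ends on either side go back to step 1.
        pending.append((start, seg_start))
        pending.append((seg_start + length, end))
    return sorted(segments, key=lambda s: s.read_start)


# Made-up sequences: two exons separated by a 44 bp intron.
exon1 = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"
intron = "GT" + "T" * 40 + "AG"
exon2 = "CCATGGTTAACGCGTATTCGAGCTCAGTCA"
reference = exon1 + intron + exon2
read = exon1[-25:] + exon2[:20]  # a read spanning the intron

for seg in large_gap_map(read, reference):
    print(seg)
```

With these made-up sequences, the first round should match the first 25 bp of the read (the seed), and the remaining 20 bp unaligned end should be mapped in a second round as a non-seed segment on the other side of the intron. A real mapper would also tolerate mismatches and apply the scoring settings described below, which this sketch leaves out.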