Map Reads to Reference
Read mapping is a fundamental step in most applications of high-throughput sequencing data. The CLC Genomics Workbench includes read mapping in several other tools (such as the Map Reads to Contigs tool and RNA-Seq Analysis), but this chapter focuses on the core read mapping algorithm. At the end of the chapter you can find descriptions of the read mapping reports and of a tool for merging read mappings.
There are two different versions of the core mapper: one for color space data, and one for base space data. At http://www.qiagenbioinformatics.com/support/resources/ you can find white papers with detailed benchmarks and descriptions of both algorithms.
In addition, the mapper has been improved to work with PacBio reads and with reads longer than 500 bp. Before the Map Reads to Reference tool starts mapping, it checks the input sequence list(s) to decide which mapping algorithm to use:
- If color space information is available, the reads are mapped in color space. This is done using the legacy mapper (the standard mapper in CLC Genomics Workbench version 5.5 and earlier), which does not make use of the read group information.
- If no color space information is available and the input sequence list(s) have the read group set to "PacBio", a specialized mapping algorithm is applied that is better suited for mapping long reads with many sequencing errors.
- If no color space information is available and the read group of the input sequence list(s) is not set to "PacBio", the reads are mapped using our standard mapping algorithm. The standard mapping algorithm uses the same seeding method for all input reads, but different extension methods for long (>500 bp) and short reads.
It is possible to mix sequence lists that have the read group "PacBio" with sequence lists that have a different read group in the same mapping. In this case the appropriate mapping algorithm is applied to each sequence list.
In contrast, it is not possible to mix color space and base space data.
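The selection rules above can be summarized as a small decision procedure. The following Python sketch is purely illustrative and is not the Workbench's implementation: the `SequenceList` class, the `choose_mappers` and `extension_method` functions, and the returned labels are hypothetical names introduced here, and the exact, case-sensitive comparison against "PacBio" is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SequenceList:
    """Hypothetical stand-in for an input sequence list."""
    name: str
    has_color_space: bool = False
    read_group: Optional[str] = None


def choose_mappers(inputs: List[SequenceList]) -> Dict[str, str]:
    """Return the mapping algorithm that would apply to each input list."""
    flags = [s.has_color_space for s in inputs]
    # Color space and base space data cannot be combined in one mapping.
    if any(flags) and not all(flags):
        raise ValueError("cannot mix color space and base space data")

    choice = {}
    for s in inputs:
        if s.has_color_space:
            # Legacy mapper; read group information is ignored.
            choice[s.name] = "legacy color space mapper"
        elif s.read_group == "PacBio":  # assumption: exact string match
            choice[s.name] = "PacBio mapper"
        else:
            choice[s.name] = "standard mapper"
    return choice


def extension_method(read_length: int) -> str:
    """Standard mapper only: seeding is shared, extension depends on length."""
    return "long-read extension" if read_length > 500 else "short-read extension"
```

For example, calling `choose_mappers` with one list whose read group is "PacBio" and one with a different read group assigns a different algorithm to each list, while passing a color space list together with a base space list raises an error.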
Subsections
- Selecting reads and reference
- Including or excluding regions (masking)
- Mapping parameters
- Mapping paired reads
- Non-specific matches
- Gap placement
- Mapping computational requirements
- Reference caching