References and indexes
The read mapper supports mapping against a mixture of linear and circular reference sequences.
Reference sequences are typically provided as one or more FASTA files, which can be passed to the mapper using the -d/--references
parameter. Files containing sequences that are to be interpreted as circular must each be prefixed with the -z/--circular
parameter:
clc_mapper -d linear.fa -z circular.fa linear_again.fa ...
Internally, the provided sequences are converted into a reference index, allowing the read mapper to efficiently search and navigate them.
If you work repeatedly with the same large genomes, such as human, you can save a lot of time by writing the index to disk using the -n/--indexoutput
parameter, e.g.:
clc_mapper -d chr1.fa chr2.fa ... chrY.fa -z mito.fa -n human.idx ...
The next time you need to use that human reference, simply provide it as a reference:
clc_mapper -d human.idx ...
Note that the size of the index, both on disk and in memory, is comparable to the cumulative size of the reference sequences in FASTA format. The index typically uses slightly less than one byte per base. For example, a human index needs around 2.8 gigabytes of space.
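Because the index uses slightly less than one byte per base, the total number of bases in the reference FASTA files gives a rough upper bound, in bytes, on the space the index will need. The following is a minimal sketch of that count using standard Unix tools; count_bases is a hypothetical helper name and is not part of the read mapper.

```shell
# count_bases: hypothetical helper, not part of the CLC tools.
# Prints the total number of sequence characters in the given FASTA files.
# Header lines (starting with '>') and all whitespace are excluded.
count_bases() {
    cat "$@" | grep -v '^>' | tr -d '[:space:]' | wc -c | tr -d ' '
}
```

Calling count_bases with the same FASTA files that are passed to -d and -z prints a single number. Read as a byte count, that number is only an estimate of the index size, since the actual size depends on the mapper.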