Input Files

The formats in the following table are recognized as valid input by one or more of the CLC Assembly Cell tools. Not every listed format can be used for sequence reads, and not every listed format can be used for reference sequences in read mappings.

The software detects input file formats automatically by inspecting the file contents. The filename, including its extension, plays no role in determining the input format.
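To illustrate what content-based detection can look like, here is a minimal Python sketch. It is not the detection logic used by the CLC Assembly Cell tools, which is not documented here. The function names sniff_format and is_colorspace are hypothetical, and the rules (a leading ">" for Fasta, "@" for Fastq, the ".sff" magic bytes for Sff, a "LOCUS" line for GenBank, and a primer base followed by color calls for csfasta) are common conventions for these formats that are assumed for the example.

```python
import re

# Assumption: a color-space sequence line is one primer base followed by
# color calls 0-3, with '.' marking a missing call.
_COLORSPACE = re.compile(rb"[ACGTacgt][0-3.]+")


def is_colorspace(line: bytes) -> bool:
    """Return True if the line looks like a color-space sequence line."""
    return _COLORSPACE.fullmatch(line) is not None


def sniff_format(data: bytes) -> str:
    """Guess a sequence format from the leading bytes of a file.

    Illustrative heuristic only; the filename is never consulted.
    """
    # SFF is a binary format whose header starts with the bytes ".sff".
    if data.startswith(b".sff"):
        return "sff"

    # Ignore blank lines and '#' comment lines (seen at the top of csfasta files).
    lines = [ln.strip() for ln in data.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith(b"#")]
    if not lines:
        return "unknown"

    first = lines[0]
    if first.startswith(b"LOCUS"):
        return "genbank"
    if first.startswith(b"@"):
        return "fastq"
    if first.startswith(b">"):
        # Fasta and csfasta share the '>' header; the sequence line tells them apart.
        if len(lines) > 1 and is_colorspace(lines[1]):
            return "csfasta"
        return "fasta"
    return "unknown"
```

Calling sniff_format on the first few kilobytes of a file is enough for this heuristic, since only the first record is examined.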

Format      Reads   References
Fasta       +       +
Fastq       +       -
csfasta*    +       -
Sff**       +       -
GenBank     -       +

*If a read in a .csfasta file contains one or more . symbols, its full sequence is converted to contain only N characters whenever it is used or output by any of the Assembly Cell tools.
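Because a single . symbol causes the whole read to be replaced by N characters, it can be useful to know in advance how many reads are affected. The following Python sketch lists them. The function name reads_with_missing_calls is hypothetical, and the sketch assumes the usual csfasta layout of a ">" header line followed by a single sequence line, with optional "#" comment lines.

```python
from typing import Iterable, Iterator, Tuple


def reads_with_missing_calls(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read name, sequence) for csfasta reads that contain a '.' symbol.

    Assumes one sequence line per record.
    """
    name = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            name = line[1:]  # remember the current read name
        elif name is not None and "." in line:
            yield name, line


if __name__ == "__main__":
    sample = ["# example", ">r1", "T0123", ">r2", "T01.3"]
    for read_name, seq in reads_with_missing_calls(sample):
        print(read_name, seq)
```

The function accepts any iterable of lines, so an open file handle can be passed in directly.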

**Please note that paired 454 data needs to be pre-processed using the clc_split_reads program.

All CLC Assembly Cell programs except clc_remove_duplicates and clc_mapper_legacy accept gzip-compressed read and reference data as input.
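Since the two programs named above do not accept gzip input, a pipeline may want to check whether a file is compressed before handing it to them. The Python sketch below does this by testing for the gzip magic bytes (0x1f 0x8b) at the start of the data. The helper names is_gzip and must_decompress_first are hypothetical and not part of CLC Assembly Cell.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

# Programs listed in this section as not supporting gzip-compressed input.
TOOLS_WITHOUT_GZIP_SUPPORT = {"clc_remove_duplicates", "clc_mapper_legacy"}


def is_gzip(head: bytes) -> bool:
    """Return True if the data starts with the gzip magic bytes."""
    return head[:2] == GZIP_MAGIC


def must_decompress_first(head: bytes, tool: str) -> bool:
    """Return True if gzip data is about to be passed to a tool that cannot read it."""
    return is_gzip(head) and tool in TOOLS_WITHOUT_GZIP_SUPPORT


if __name__ == "__main__":
    packed = gzip.compress(b">r1\nACGT\n")
    print(must_decompress_first(packed, "clc_remove_duplicates"))
```

Only the first two bytes of a file need to be read for this check, so it is cheap even for very large inputs.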

Single reference sequences longer than 2 Gb ($2 \cdot 10^9$ bases) and reads longer than 100,000 bases are not supported.
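These limits can be checked before running any of the tools. The Python sketch below computes sequence lengths from Fasta-formatted lines and reports the records that exceed a given limit. The names fasta_lengths and oversized are hypothetical, and the parser is deliberately minimal: it assumes plain Fasta input with ">" header lines.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_REFERENCE_BASES = 2 * 10**9  # limit for a single reference sequence
MAX_READ_BASES = 100_000         # limit for a single read


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (name, length) for each record in Fasta-formatted lines."""
    name, length = None, 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            name, length = line[1:], 0
        elif name is not None:
            length += len(line)  # sequences may span several lines
    if name is not None:
        yield name, length


def oversized(lines: Iterable[str], limit: int) -> List[Tuple[str, int]]:
    """Return the records whose length is strictly greater than the limit."""
    return [(n, size) for n, size in fasta_lengths(lines) if size > limit]
```

For example, oversized(open_reference_lines, MAX_REFERENCE_BASES) would list references that are too long, where open_reference_lines stands for any iterable of lines from a reference file. The same call with MAX_READ_BASES applies to reads.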