De novo assembly outputs

The output of the clc_assembler is a fasta file containing all the contig sequences.

The file therefore contains no information about where the reads are placed, how they align, or what the coverage levels are. If this information is needed, run the clc_mapper or clc_mapper_legacy program with the newly created contig sequences as references. The cas format file created by the mapping program contains this information.

If the -f option has been used, then a file containing features related to scaffolding will be generated. If the file name given as the argument to the -f option has a .agp suffix, an AGP format file is generated. The AGP format specification can be found online at https://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml.

If the file name given as the argument to the -f option has a .gff suffix, a GFF format file is generated. The columns of this file contain the following information:

Column 1: Name of contig
Column 2: Source program
Column 3: Annotation type (see below)
Column 4: Start position
Column 5: End position
Column 6: Score (see below)
Columns 7, 8 and 9: No meaning; present only to conform to the GFF format.
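The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the CLC tools: it assumes the file is tab-delimited (as GFF normally is) and that lines starting with # are comments. The names ScaffoldFeature and parse_feature_line are hypothetical, and the score is kept as text because its exact formatting in the file is not specified here.

```python
from typing import NamedTuple, Optional


class ScaffoldFeature(NamedTuple):
    """One row of the -f GFF output (columns 1 to 6 only)."""
    contig: str           # column 1: name of contig
    source: str           # column 2: source program
    annotation_type: str  # column 3: annotation type
    start: int            # column 4: start position
    end: int              # column 5: end position
    score: str            # column 6: score, kept as text (assumption)


def parse_feature_line(line: str) -> Optional[ScaffoldFeature]:
    """Parse one line; return None for blank lines and # comments."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")  # assumption: tab-delimited columns
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    # Columns 7, 8 and 9 carry no meaning and are ignored.
    return ScaffoldFeature(
        contig=fields[0],
        source=fields[1],
        annotation_type=fields[2],
        start=int(fields[3]),
        end=int(fields[4]),
        score=fields[5],
    )
```

To read a whole file, call parse_feature_line on each line and skip the None results.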

Further details about Annotation types (column 3)

There are three annotation types that can appear in the third column:

1) Alternatives Excluded: More than one path through the graph was possible in this region but evidence from paired data suggested the exclusion of one or more alternative routes in favor of the route chosen.

2) Contigs Joined: More than one route was possible through the graph such that an unambiguous choice of how to traverse the graph cannot be made. However, evidence from paired data supports one of these routes and, on this basis, this route is followed to the exclusion of the other(s).

3) Scaffold: The route through the graph is not clear but evidence from paired data supports the connection of two contigs. A single contig is then reported with N characters between the two connected regions. This entity is also known as a scaffold. The number of N characters represents the expected distance between the regions, based on the evidence from the paired data.
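Because scaffold gaps appear as runs of N characters in the contig sequence, they can also be located directly in the FASTA output. The following is a minimal sketch under that assumption; find_gap_runs is a hypothetical helper, not a CLC function. A run of N characters is not guaranteed to be a scaffold gap, so the results are best checked against the Scaffold entries in the feature file.

```python
import re
from typing import List, Tuple


def find_gap_runs(sequence: str, min_length: int = 1) -> List[Tuple[int, int, int]]:
    """Return (start, end, length) for each run of N characters.

    Positions are 1-based and inclusive. Runs shorter than
    min_length are skipped.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            # match.start() is 0-based; match.end() is exclusive,
            # so it already equals the 1-based inclusive end.
            runs.append((match.start() + 1, match.end(), length))
    return runs
```

For example, find_gap_runs("ACGTNNNNACGT") returns [(5, 8, 4)]: one run of four N characters covering positions 5 to 8.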

If scaffolding is not performed, the resulting GFF annotation file will still report any "Contigs Joined" and "Alternatives Excluded" optimizations, as these are still carried out in this case.

Further details about Scores (column 6)

For annotation type Scaffold, the size of the gap that has been estimated between scaffolded sections of the contig is reported in the score column.

For annotation type Alternatives Excluded, the score is reported as the (word size + 1). This value merely serves as a reminder that the region reported for this event is associated with the word size used for the assembly.

For annotation type Contigs Joined, the value in the score column is 0.
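The three score rules above can be summarized in a small lookup. This is a sketch only: interpret_score is a hypothetical helper, and the exact spelling of the annotation type in column 3 of a real file is an assumption, so the comparison ignores case and treats underscores as spaces.

```python
def interpret_score(annotation_type: str, score: int) -> str:
    """Describe what the score column means for a given annotation type."""
    # Assumption: type names may differ in case or use underscores.
    key = annotation_type.replace("_", " ").strip().lower()
    if key == "scaffold":
        # Score is the estimated gap size between scaffolded sections.
        return f"estimated gap size: {score}"
    if key == "alternatives excluded":
        # Score is reported as (word size + 1).
        return f"assembly word size: {score - 1}"
    if key == "contigs joined":
        # Score is always 0 and carries no information.
        return "no score information"
    raise ValueError(f"unknown annotation type: {annotation_type!r}")
```

For instance, a score of 24 on an Alternatives Excluded feature corresponds to an assembly word size of 23.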