De novo assembly

The de novo assembly algorithm of CLC Genomics Workbench supports a variety of data formats, including both short and long reads, and mixing of paired reads with different insert sizes and orientations.

The de novo assembly process has two stages:

  1. First, simple contig sequences are created using all the information that is in the read sequences. This is the actual de novo part of the process. These simple contig sequences do not contain any information about which reads the contigs are built from. This part is elaborated in How it works.
  2. Second, all the reads are mapped using the simple contig sequences as reference. This is done to show, for example, coverage levels along the contigs and to enable downstream analyses such as SNP detection and the creation of mapping reports. Note that a read aligning to a certain position on a contig does not mean that the information from this read was used to build the contig, because the mapping of the reads is a completely separate part of the algorithm.
If you wish to only have the simple contig sequences as output, this can be chosen when starting the de novo assembly (see De novo assembly parameters).
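
The separation between the two stages can be illustrated with a small toy sketch. This is not the assembler used by CLC Genomics Workbench (see How it works for that); it assumes a naive greedy overlap extension for stage 1 and exact-match mapping for stage 2, and the function names `merge_overlap`, `build_contig` and `map_reads` are hypothetical.

```python
def merge_overlap(a, b, min_overlap=3):
    """Merge b onto the right end of a if a suffix of a equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None


def build_contig(reads, min_overlap=3):
    """Stage 1 (toy): greedily extend the first read to the right.

    Returns the contig and the reads that contributed to it. A real
    assembler does not work this way; this only mimics the idea that
    the contig is a plain sequence built from read information.
    """
    contig = reads[0]
    used = [reads[0]]
    remaining = list(reads[1:])
    extended = True
    while extended:
        extended = False
        for read in remaining:
            merged = merge_overlap(contig, read, min_overlap)
            if merged is not None:
                contig = merged
                used.append(read)
                remaining.remove(read)
                extended = True
                break  # restart the scan with the longer contig
    return contig, used


def map_reads(contig, reads):
    """Stage 2 (toy): map every read by exact match and count coverage per base."""
    coverage = [0] * len(contig)
    for read in reads:
        pos = contig.find(read)
        if pos >= 0:
            for i in range(pos, pos + len(read)):
                coverage[i] += 1
    return coverage


reads = ["ACGTAC", "TACGGA", "GGATTC", "CGTACG"]
contig, used = build_contig(reads)
coverage = map_reads(contig, reads)
print(contig)    # ACGTACGGATTC
print(used)      # ['ACGTAC', 'TACGGA', 'GGATTC']
print(coverage)  # [1, 2, 2, 3, 3, 3, 3, 2, 2, 1, 1, 1]
```

In this toy run the read `CGTACG` contributes to the coverage in stage 2 even though it was never used to extend the contig in stage 1, which mirrors the note in step 2 above.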

The de novo assembler is not designed to make effective use of long mate-pair read information (insert size greater than 10 kb). Such data can be incorporated but may not improve the final results. If paired-end data are being assembled, including mate-pair information in the same assembly can sometimes lead to worse results. In such cases, we advise that the long mate-pair data be marked as single (non-paired) data before it is included in the assembly (see General notes on handling paired data).
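
As a rough planning aid, the advice above can be written down as a small check that runs before import. This is only a sketch under stated assumptions: the `ReadLibrary` class and `plan_import` function are hypothetical and not part of CLC Genomics Workbench, the 10,000 bp threshold is taken from the paragraph above, and the rule is applied pre-emptively, whereas the text recommends it only when combining the data actually gives worse results.

```python
from dataclasses import dataclass

# Threshold from the text above: insert sizes greater than 10 kb count as long mate pairs.
LONG_MATE_PAIR_BP = 10_000


@dataclass
class ReadLibrary:
    """Hypothetical description of one read data set."""
    name: str
    paired: bool
    max_insert_size: int = 0  # in bp; ignored for single reads


def plan_import(libraries):
    """Suggest 'paired' or 'single' handling for each library.

    Long mate-pair libraries are suggested as 'single' only when
    shorter paired data is assembled alongside them.
    """
    has_short_paired = any(
        lib.paired and lib.max_insert_size <= LONG_MATE_PAIR_BP
        for lib in libraries
    )
    plan = {}
    for lib in libraries:
        is_long_mate_pair = lib.paired and lib.max_insert_size > LONG_MATE_PAIR_BP
        if is_long_mate_pair and has_short_paired:
            plan[lib.name] = "single"
        else:
            plan[lib.name] = "paired" if lib.paired else "single"
    return plan


libraries = [
    ReadLibrary("paired_end_500bp", paired=True, max_insert_size=500),
    ReadLibrary("mate_pair_20kb", paired=True, max_insert_size=20_000),
    ReadLibrary("single_reads", paired=False),
]
plan = plan_import(libraries)
print(plan)
```

Whether treating the long mate-pair data as single reads actually helps depends on the data set, so comparing both assemblies is still advisable.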


