De novo assembly
The de novo assembly algorithm of CLC Genomics Workbench offers comprehensive support for a variety of data formats, including both short and long reads, and mixing of paired reads (both insert size and orientation).s
The de novo assembly process has two stages:
- First, simple contig sequences are created by using all the information that are in the read sequences. This is the actual de novo part of the process. These simple contig sequences do not contain any information about which reads the contigs are built from. This part is elaborated in How it works.
- Second, all the reads are mapped using the simple contig sequence as reference. This is done in order to show e.g. coverage levels along the contigs and enabling more downstream analysis like SNP detection and creating mapping reports. Note that although a read aligns to a certain position on the contig, it does not mean that the information from this read was used for building the contig, because the mapping of the reads is a completely separate part of the algorithm.
The de novo assembler is not designed to effectively use long mate-pair read information (insert size greater than 10 kp). Such data can be incorporated but may not lead to improvements in the final results. If paired end data are being assembled, the inclusion of mate-pair information in the same assembly can sometimes lead to worse results. In such cases, we advise that the long mate-pair data is marked as single (non-paired) data before including it in the assembly (see General notes on handling paired data).
Subsections
- How it works
- Resolve repeats using reads
- Automatic paired distance estimation
- Optimization of the graph using paired reads
- AGP export
- Bubble resolution
- Converting the graph to contig sequences
- Summary
- Randomness in the results
- SOLiD data support in de novo assembly
- De novo assembly parameters
- De novo assembly report