De Novo Assemble PacBio Reads

Please note that the Correct PacBio Reads and De Novo Assemble PacBio Reads tools are optimized for conventional PacBio data and readily support data generated with different generations of PacBio chemistry (sequencing reagents). However, these tools are not suitable for PacBio HiFi (circular consensus) reads or other data types. Moreover, the Correct PacBio Reads tool relies on certain methods that are the intellectual property of Pacific Biosciences. Using the Correct PacBio Reads tool or the predefined workflow PacBio De Novo Assembly Pipeline with any data other than data generated on a Pacific Biosciences instrument constitutes a violation of the end user license agreement that users of the CLC Genome Finishing Module agree to during installation.

SMRT sequencing technologies, as implemented by Pacific Biosciences™, have the potential to vastly improve the completeness of genome sequence assemblies, as read lengths often exceed the length of most repeats in the genome. A major obstacle is the high (10-15%) rate of sequencing errors in SMRT reads. A second obstacle is the presence of chimeric reads and sequences derived from untrimmed adapters, which can be hard to recognize given the rate of errors and truncations. However, because sequencing errors are mostly random and reads are randomly sampled across the genome, it is possible to (i) correct SMRT sequencing reads if coverage is sufficiently high and (ii) assemble the error-corrected reads into high-quality contigs.
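The reasoning that random errors can be corrected at sufficient coverage can be illustrated with a minimal Python sketch. It is not the method used by the Correct PacBio Reads tool: the function name majority_consensus and the toy reads are hypothetical, and the reads are assumed to be already aligned, of equal length, and to contain substitution errors only. Real SMRT reads also contain insertions and deletions, so they have to be aligned before any such vote can be taken.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the most common base in each column of pre-aligned reads.

    Hypothetical helper for illustration only; assumes all reads have the
    same length and differ from the true sequence by substitutions only.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")
    # zip(*reads) iterates over alignment columns; each column votes once.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


# Five toy reads covering the same region; three of them carry one random error.
noisy_reads = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 4
    "ACGTTGGA",  # error at position 7
    "TCGTTGCA",  # error at position 1
    "ACGTTGCA",
]
consensus = majority_consensus(noisy_reads)
print(consensus)
```

Because the errors fall at different positions in different reads, each erroneous base is outvoted by the correct bases in the other reads. With too few reads per position, the vote becomes unreliable, which is why sufficient coverage is a precondition.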

The Correct PacBio Reads tool performs the first of these two tasks: it takes raw PacBio reads as input and produces error-corrected reads as output. The De Novo Assemble PacBio Reads tool performs the second task: assembling the error-corrected reads into high-quality contigs. Both tools are designed for microbial genomes and small eukaryotic genomes (for example, C. elegans, with a genome of about 100 Mb).

Assembly of the error-corrected PacBio reads uses a de Bruijn graph-based approach [Pevzner et al., 2001], combined with a number of novel techniques to close gaps in the graph, correct discrepancies in the graph, and finally solve the graph. Using a de Bruijn graph rather than a string overlap graph, as used for example in PacBio's HGAP [Chin et al., 2013], results in an extremely fast and memory-efficient assembler.
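To make the de Bruijn graph idea concrete, the following Python sketch builds a textbook de Bruijn graph from a few error-free toy reads and follows its single unbranched path. It is an illustration under stated assumptions, not the assembler implemented in De Novo Assemble PacBio Reads: the function names build_de_bruijn and walk_unambiguous, the value k=4, and the reads are all hypothetical, and none of the gap-closing or discrepancy-correction techniques mentioned above are modelled.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read.

    Hypothetical helper: nodes are (k-1)-mers, and every k-mer seen in a
    read contributes one edge from its prefix to its suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in visited:  # stop instead of looping around a cycle
            break
        contig += successor[-1]
        visited.add(successor)
        node = successor
    return contig


# Three overlapping, error-free toy reads (assumed already error-corrected).
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

In this sketch the reads are never compared with each other: overlapping reads contribute the same k-mers, so the graph grows with the number of distinct k-mers rather than with the number of reads. An overlap-based approach must instead find overlaps between pairs of reads. Note also that the walk stops at any branch point, so a repeat longer than k-1 bases would need additional logic to be traversed.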