SMRT sequencing technologies, as implemented by Pacific Biosciences™, have the potential to vastly improve the completeness of genome sequence assemblies, as read lengths often exceed the length of most repeats in the genome. A major obstacle is the high (10-15%) rate of sequencing errors in SMRT reads. A second obstacle is the presence of chimeric reads and sequences derived from untrimmed adapters, which can be hard to recognize given the rate of errors and truncations. However, because sequencing errors are mostly random and reads are randomly sampled across the genome, it is possible to i) correct SMRT sequencing reads if coverage is sufficiently high and ii) assemble the error-corrected reads into high-quality contigs.
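The reasoning that random errors can be voted out at sufficient coverage can be illustrated with a small simulation. The sketch below is not the algorithm used by the Correct PacBio Reads tool; it is a toy model that assumes substitution-only errors and reads that are already perfectly aligned to one another. Real SMRT errors also include insertions and deletions, so an actual corrector has to align reads first. The function names, the 15% error rate, the coverage of 40 and the reference length of 200 are all illustrative assumptions.

```python
import random
from collections import Counter

BASES = "ACGT"

def add_substitution_errors(seq, error_rate, rng):
    """Return a copy of seq in which each base is swapped for a different
    base with probability error_rate (toy model: substitutions only)."""
    noisy = []
    for base in seq:
        if rng.random() < error_rate:
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

def majority_consensus(reads):
    """Per-column majority vote across equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice(BASES) for _ in range(200))

# 40 noisy copies of the same region, i.e. 40x coverage at 15% error.
reads = [add_substitution_errors(reference, 0.15, rng) for _ in range(40)]

consensus = majority_consensus(reads)
mismatches = sum(a != b for a, b in zip(consensus, reference))
print("mismatches between consensus and reference:", mismatches)
```

Because each error is independent and is spread over three alternative bases, the correct base wins the vote in every column with very high probability at this coverage, even though every individual read is expected to carry about 30 errors.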
The Correct PacBio Reads tool performs the first of these two tasks: it takes raw PacBio reads as input and produces error-corrected reads as output. The De Novo Assemble PacBio Reads tool performs the second task: assembling the error-corrected reads into high-quality contigs. Both tools are designed for microbial genomes and small eukaryotic genomes (for example, C. elegans, with a 100 Mb genome).
Assembly of the error-corrected PacBio reads uses a de Bruijn graph-based approach [Pevzner et al., 2001], extended with a number of novel techniques to close gaps in the graph, correct discrepancies in the graph, and finally solve the graph. Using a de Bruijn graph rather than a string overlap graph, as used for example in PacBio's HGAP [Chin et al., 2013], results in an extremely fast and memory-efficient assembler.
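For readers unfamiliar with de Bruijn graphs, the basic construction can be sketched in a few lines. This is a textbook illustration under simplifying assumptions, not the implementation of the De Novo Assemble PacBio Reads tool: it assumes error-free reads on a single strand and a repeat-free toy sequence, and it omits the gap closing, discrepancy correction and graph solving steps described above. The function names, the reads and the choice of k = 5 are made up for the example.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig from start while the current node has exactly one
    successor; stop at a branch, a dead end, or a revisited node."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # cycle guard
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Three overlapping, error-free reads from a 16-base toy sequence.
reads = ["ATGGCGTGCA", "GCGTGCAATC", "TGCAATCCGT"]
graph = build_de_bruijn(reads, k=5)
contig = walk_unambiguous(graph, "ATGG")
print(contig)
```

Note that the graph is built in a single pass over the k-mers of each read, without comparing reads against each other. Avoiding pairwise read overlaps is the usual reason de Bruijn graph assemblers are cheaper in time and memory than overlap-based ones.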