The clc_correct_pacbio_reads tool (beta)
The clc_correct_pacbio_reads tool performs error correction of PacBio reads.
IMPORTANT NOTICE: This tool relies on certain methods that are the intellectual property of Pacific Biosciences. Consequently, the use of this tool with any data other than data generated on a Pacific Biosciences instrument constitutes a violation of the end-user license agreement that users of the CLC Assembly Cell agree to during installation.
A typical PacBio run produces a wide range of different read lengths. All other things being equal, the longer a read is, the more useful it is for de novo assembly. The primary reason for this is the ability of long reads to span longer repeats and connect with more unique sequence surrounding the repeat, clearly delimiting that repeat region in the final assembly.
Raw PacBio reads exhibit a much higher rate of (random) errors than reads from short-read technologies such as Illumina. Left unaddressed, this added noise would confuse the assembler, leading to a poor assembly.
The clc_correct_pacbio_reads tool performs optional, but highly recommended, pre-processing of the raw reads that alleviates this problem. It takes the raw reads as input and produces a new FASTA file (see footnote 5.3) containing error-corrected versions of the long reads.
The error correction itself is a step-wise process. Somewhat simplified, it consists of three steps:
- Step 1. The raw reads are split into two classes: long reads (or seed reads) and short reads (or correction reads). The way the split is made is governed by the --fraction parameter. By default, the longest reads, corresponding to 30% of the total read length, become the seed reads.
- Step 2. The short reads are then mapped against the long reads. Typically, several short reads end up covering each position of the long read.
- Step 3. Finally, each long read is considered along with the corresponding short reads to form a consensus sequence by per-position majority vote. Because the noise in the raw sequences is close to random, much of that noise is eliminated by this process, and we can consider the consensus sequence a corrected version of the read.
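The split in Step 1 and the vote in Step 3 can be illustrated with a small Python sketch. This is only a toy model of the idea described above, not the tool's implementation: the function names `split_reads` and `consensus` and the `placements` format are hypothetical, the handling of the read that crosses the 30% boundary is an assumption, and the short reads are assumed to be already placed on the seed without gaps (a real mapper must also handle insertions and deletions).

```python
from collections import Counter


def split_reads(reads, fraction=0.3):
    """Step 1 (sketch): the longest reads become seed reads until they
    account for `fraction` of the total read length; the rest are short reads."""
    ordered = sorted(reads, key=len, reverse=True)
    budget = fraction * sum(len(r) for r in reads)
    seeds, used = [], 0
    for i, read in enumerate(ordered):
        if used >= budget:
            return seeds, ordered[i:]
        seeds.append(read)
        used += len(read)
    return seeds, []


def consensus(seed, placements):
    """Step 3 (sketch): per-position majority vote.

    `placements` is a list of (offset, short_read) pairs, i.e. gapless
    alignments of short reads onto the seed (an assumption of this sketch).
    The seed base itself counts as one vote at each position.
    """
    columns = [Counter({base: 1}) for base in seed]
    for offset, short in placements:
        for i, base in enumerate(short):
            pos = offset + i
            if 0 <= pos < len(seed):
                columns[pos][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)


# Toy data: the seed has one substitution error at index 4 (T instead of A).
seed = "ACGTTCGT"
placements = [(0, "ACGTA"), (2, "GTACG"), (3, "TACGT")]
corrected = consensus(seed, placements)
```

In this toy case three short reads vote for A at the erroneous position and outvote the single T from the seed, which is the sense in which near-random noise is averaged away.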
Footnotes
- 5.3: The corrected reads will not have quality scores, as it is unclear how to calculate these in any meaningful way. However, this is not a problem, as quality scores are not used in the subsequent assembly process.