QIAGEN Bioinformatics Manuals

An example including error-correction for PacBio reads

Say we start out having a BAM file with raw reads from a PacBio instrument, raw_reads.bam. First thing we need to do is extract the reads in FASTA format using samtools:

samtools fasta -0 raw_reads.fa raw_reads.bam

We now have the reads in the FASTA file raw_reads.fa and will no longer be needing the original BAM file.

Next, we error-correct the raw PacBio reads:

clc_correct_pacbio_reads -q raw_reads.fa -o corrected_reads.fa

We are now ready to start the assembly. We want to use the corrected reads (corrected_reads.fa) for the assembly itself and use the raw reads (raw_reads.fa) for polishing.

clc_assembler_long -q corrected_reads.fa -r raw_reads.fa -o contigs.fa

Let us assume that we have inspected the result in contigs.fa and are unhappy with the quality of the result; perhaps we would like to see fewer contigs or would like a total contig length closer to that of our expected genome length. We can then re-run the same assembly with slightly different parameters and opt to save the graph, in case we need to do more runs:

clc_assembler_long -q corrected_reads.fa -r raw_reads.fa -o contigs.fa
                   --min-coverage 4 --savegraph raw_reads.graph

Say we inspect the result again and are still not quite happy. Maybe we want to raise the anchor length because we expect there to be some unwanted noise in the graph. Because we have saved the graph, we can cut run-time significantly on this run by re-loading the saved graph:

clc_assembler_long -q corrected_reads.fa -r raw_reads.fa -o contigs.fa
                   --min-coverage 4 --loadgraph raw_reads.graph
                   --anchor-length 200

If the data is good and the sample genome is tractable, we should end up with a decent result after a few iterations of this procedure.

Browse the manual

An example including error-correction for PacBio reads