An example including error-correction for PacBio reads
Say we start out having a BAM file with raw reads from a PacBio instrument, raw_reads.bam
. First thing we need to do is extract the reads in FASTA format using samtools:
samtools fasta -0 raw_reads.fa raw_reads.bamWe now have the reads in the FASTA file
raw_reads.fa
and will no longer be needing the original BAM file.
Next, we error-correct the raw PacBio reads:
clc_correct_pacbio_reads -q raw_reads.fa -o corrected_reads.faWe are now ready to start the assembly. We want to use the corrected reads (
corrected_reads.fa
) for the assembly itself and use the raw reads (raw_reads.fa
) for polishing.
clc_assembler_long -q corrected_reads.fa -r raw_reads.fa -o contigs.fa
Let us assume that we have inspected the result in contigs.fa
and are unhappy with the quality of the result; perhaps we would like to see fewer contigs or would like a total contig length closer to that of our expected genome length. We can then re-run the same assembly with slightly different parameters and opt to save the graph, in case we need to do more runs:
clc_assembler_long -q corrected_reads.fa -r raw_reads.fa -o contigs.fa --min-coverage 4 --savegraph raw_reads.graphSay we inspect the result again and are still not quite happy. Maybe we want to raise the anchor length because we expect there to be some unwanted noise in the graph. Because we have saved the graph, we can cut run-time significantly on this run by re-loading the saved graph:
clc_assembler_long -q corrected_reads.fa -r raw_reads.fa -o contigs.fa --min-coverage 4 --loadgraph raw_reads.graph --anchor-length 200If the data is good and the sample genome is tractable, we should end up with a decent result after a few iterations of this procedure.