De Novo Assemble Small Genome
De Novo Assemble Small Genome facilitates assembly of a microbial genome from short next-generation sequencing reads with high and uniform coverage. It is based on the open source tool SPAdes [Prjibelski et al., 2020].
The tool is equivalent to running SPAdes v4.0.0 with the option --isolate. Paired reads with forward-reverse orientation are supplied as paired-end libraries; paired reads with reverse-forward orientation are supplied as high-quality mate-pair libraries; all other types of reads are supplied as one single-read library. For more details on the options used by SPAdes, please see https://ablab.github.io/spades/.
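For orientation, the following minimal Python sketch shows roughly what a comparable stand-alone SPAdes call could look like for one forward-reverse paired-end library. It only builds the argument list and does not run anything. The function name and file names are hypothetical, and the exact command line the tool constructs internally is not documented here; mate-pair and single-read libraries are left out of the sketch.

```python
def build_spades_isolate_command(forward_reads, reverse_reads, output_dir):
    """Build an argument list for a SPAdes isolate-mode run.

    Assumption: a single paired-end library with forward-reverse
    orientation, passed with the standard SPAdes options -1 and -2.
    """
    return [
        "spades.py",
        "--isolate",          # isolate mode, as used by this tool
        "-1", forward_reads,  # file with forward reads
        "-2", reverse_reads,  # file with reverse reads
        "-o", output_dir,     # directory for the assembly output
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = build_spades_isolate_command(
        "sample_R1.fastq", "sample_R2.fastq", "spades_out"
    )
    print(" ".join(command))
```

The list can be passed to `subprocess.run` if SPAdes is installed and on the PATH.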
The tool requires high and uniform coverage across the genome. High coverage means >50x, though lower values may work satisfactorily. You can estimate the coverage of your data as follows:
- Obtain an estimated size of the genome you intend to assemble.
- Obtain the total number of nucleotides in your input data by running QC for Sequencing Reads on your reads, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=QC_Sequencing_Reads.html.
- Divide the number of nucleotides in your input data by the estimated size of the genome.
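The estimate can be sketched in a few lines of Python. Coverage is the total number of nucleotides in the reads divided by the estimated genome size. The function names and the example numbers (a 5 Mbp genome and 400 million sequenced nucleotides) are illustrative assumptions; the 50x threshold is the guideline mentioned above.

```python
HIGH_COVERAGE_THRESHOLD = 50  # "high coverage" guideline from this section (>50x)


def estimate_coverage(total_nucleotides, genome_size):
    """Return the estimated average coverage (fold) of a read set."""
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of bases")
    return total_nucleotides / genome_size


def is_high_coverage(coverage):
    """True if the coverage exceeds the >50x guideline."""
    return coverage > HIGH_COVERAGE_THRESHOLD


if __name__ == "__main__":
    # Illustrative values: 400 million nucleotides, 5 Mbp genome.
    coverage = estimate_coverage(400_000_000, 5_000_000)
    print(f"Estimated coverage: {coverage:.0f}x")
    print("High coverage:", is_high_coverage(coverage))
```

With these example values the estimate is 80x, which is above the guideline. The result says nothing about whether coverage is uniform across the genome.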
For small genomes, De Novo Assemble Small Genome will typically produce a higher quality assembly than the De Novo Assembly tool from CLC Genomics Workbench (https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=De_Novo_Assembly.html), but at the cost of taking more time and using more memory. The CLC Genomics Workbench tool should be preferred when coverage is not high or not uniform, or when the sample does not consist solely of whole genome sequencing reads from a bacterial or viral isolate.
Before you begin the Assembly
- Input Data Quality: Good quality data is key to a successful assembly. We strongly recommend using the Trim Reads tool:
  - Trimming based on quality can reduce the number of sequencing errors that make their way to the assembler. This reduces the number of spurious words generated during an initial assembly phase, which in turn reduces the number of words that will need to be discarded in the graph building stage.
  - Trimming adapters from sequences is crucial for generating correct results. Adapter sequences remaining on sequences can lead to the assembler spending considerable time trying to join regions that are not biologically relevant. In other words, this can lead to the assembly taking a long time and yielding misleading results.
For requirements for the De Novo Assemble Metagenome tool, see (System requirements).