Best practices
The de novo assembler is not designed to effectively use long mate-pair read information (insert size greater than 10 kb). Such data can be incorporated but may not lead to improvements in the final results. If paired end data are being assembled, the inclusion of mate-pair information in the same assembly can sometimes lead to worse results. In such cases, we advise marking the long mate-pair data as single (non-paired) data before including it in the assembly (see General notes on handling paired data).
Before you begin the Assembly
- Input Data Quality: Good quality data is key to a successful assembly. We strongly recommend using the Trim Reads tool:
- Trimming based on Quality can reduce the number of sequencing errors input to the assembler. This reduces the number of spurious words generated during an initial assembly phase. This then lowers the overhead of dealing with the consequences of sequencing errors in later phases of the assembly. Specifically, this reduces the number of words that will need to be discarded in the graph building stage.
- Trimming Adapters from sequences is crucial for generating correct results. Adapter sequences remaining on sequences can lead to the assembler spending much time trying to join regions that are not biologically relevant. In other words, the assembly can take a long time and yield misleading results.
- Input Data Quantity: In the case of de novo assembly, more data does not always lead to a better result. Very high coverage in a location will increase the probability this location is seen as a sequencing error. This is undesirable because overlapping sequencing errors can result in poor assembly quality. We therefore recommend using data sets with an average read coverage below 100x.
If you expect the average coverage of your genome to be greater than 100x, you can use the Sample Reads tool to reduce coverage. To determine how many reads you need to sample to obtain a maximum average coverage of 100x, you can do the following calculation:
- Obtain an estimated size of the genome you intend to assemble.
- Multiply this genome size by 100. This value will be the total number of bases you should use as input for assembly.
- Divide the total input bases by the average length of your sequencing reads.
- Use this number as input for the number of reads to obtain as output from the Sample Reads tool.
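The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name and the example genome size and read length are hypothetical and are not part of the Sample Reads tool.

```python
import math


def reads_to_sample(genome_size_bp, avg_read_length_bp, target_coverage=100):
    """Number of reads that gives at most `target_coverage` average coverage."""
    if genome_size_bp <= 0 or avg_read_length_bp <= 0:
        raise ValueError("genome size and read length must be positive")
    # Total number of bases to use as input for the assembly.
    total_bases = genome_size_bp * target_coverage
    # Round down so the resulting coverage does not exceed the target.
    return math.floor(total_bases / avg_read_length_bp)


# Hypothetical example: an estimated 5 Mb genome and 150 bp reads.
print(reads_to_sample(5_000_000, 150))
```

The result is rounded down so that the sampled data stays at or below the chosen maximum coverage.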
Running the Assembly
The two parameters that can be adjusted to improve assembly quality are Word Size and Bubble Size.
The default values for these parameters can work reasonably well on a range of data sets, but we recommend that you choose and evaluate these values based on what you know about your data.
- Word Size: If you expect your data to contain long regions of high quality, a larger Word Size, such as a value above 30, is recommended. If your data has a higher error rate, as in cases when homopolymer errors are common, a Word Size below 30 is recommended. Whenever possible, the Word Size should be less than the expected number of bases between sequencing errors.
- Bubble Size: When adjusting Bubble Size, the repeat structure of your genome should be considered in conjunction with the sequence quality. If you do not expect a repetitive genome, you may wish to choose a larger Bubble Size to improve contiguity. If you anticipate more repeats, you may wish to use a smaller Bubble Size to reduce the possibility of collapsing repeat regions. In cases where the sequence quality is not high, a larger Bubble Size may make more sense for your data.
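As a rough aid for the Word Size guidance above, the sketch below relates a per-base error rate to the expected spacing between errors. It assumes errors are independent and uniformly distributed along the reads, which is a simplification (homopolymer errors, for example, cluster). The function names and the example error rate are hypothetical.

```python
def expected_bases_between_errors(per_base_error_rate):
    """Average number of bases between errors, assuming independent errors."""
    if not 0 < per_base_error_rate < 1:
        raise ValueError("error rate must be between 0 and 1 (exclusive)")
    return 1.0 / per_base_error_rate


def error_free_word_fraction(word_size, per_base_error_rate):
    """Fraction of words of length `word_size` expected to contain no error."""
    return (1.0 - per_base_error_rate) ** word_size


def word_size_fits(word_size, per_base_error_rate):
    """True if the word size is below the expected spacing between errors."""
    return word_size < expected_bases_between_errors(per_base_error_rate)


# Hypothetical example: a 1% per-base error rate and two candidate word sizes.
for k in (20, 30):
    print(k, word_size_fits(k, 0.01), round(error_free_word_fraction(k, 0.01), 3))
```

Under this simplified model, a smaller Word Size leaves a larger fraction of words free of errors, which is consistent with the recommendation to reduce Word Size for data with a higher error rate.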
However, comparing the results of multiple assemblies is often a challenge. For example, you may have one assembly with a large N50 and another with a larger total contig length. How do you decide which is better? Is the one with the larger contigs better, or the one with more total sequence? Ultimately, the answer to these questions will depend on the goal of your downstream analysis. To help with this comparison, we provide some basic guidelines in the sections below.
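The two metrics mentioned above, N50 and total contig length, can be computed from a list of contig lengths as in the following minimal sketch. It uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The contig lengths are made up for illustration.

```python
def total_length(contig_lengths):
    """Sum of all contig lengths in the assembly."""
    return sum(contig_lengths)


def n50(contig_lengths):
    """Contig length at which the longest contigs cover half the assembly."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    half = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Hypothetical assemblies: A has the larger N50, B has more total sequence.
assembly_a = [900, 500, 100]
assembly_b = [600, 600, 400, 300, 100]
for name, lengths in (("A", assembly_a), ("B", assembly_b)):
    print(name, "N50:", n50(lengths), "total:", total_length(lengths))
```

In this made-up example neither assembly is better on both metrics, which is the situation described above.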
Evaluating and Refining the Assembly
Three key points to look for in assessing assembly quality are contiguity, completeness, and correctness.
- Contiguity: How many contigs are there?
A high N50 and a low number of contigs, relative to your expected number of chromosomes, is ideal. If you aren't sure what N50 and contig number might be reasonable to expect, you could try to get an idea by looking at existing assemblies of a similar genome, should these exist. For an even better sense of what would be reasonable for your data, you could make comparisons to an assembly of a similar genome, assembled using a similar amount and type of data. If your assembly results include a large number of very small contigs, it may be that you set the minimum contig length filter too low. Very small contigs, particularly those of low coverage, can generally be ignored.
- Completeness: How much of the genome is captured in the assembly?
If a total genome length of 5 Mb is expected based on existing literature or similar genomes that have already been assembled, but the sum of all contig lengths is only 3.5 Mb, you may wish to reconsider your assembly parameters.
Two common reasons for an assembly output that is shorter than expected are:
- A Word Size that is higher than optimal for your data: A high Word Size will increase the probability of discarding words because they overlap with sequencing errors. If a word is seen only once, the unique word will be discarded, even if there exist many other words that are identical except for one base (e.g. a sequencing error). A discarded word will not be considered in constructing the assembly graph and will therefore be excluded from the assembly contig sequences.
- A Bubble Size that is higher than optimal for your data: A high Bubble Size will increase the probability that two similar sequences are classified as a repeat and thus collapsed into a single contig. It is sometimes possible to identify collapsed repeats by looking at the mapping of your reads to the assembled contigs. A collapsed repeat will show as a high peak of coverage in one location.
- Correctness: Do the contigs that have been assembled accurately represent the genome?
One key question in assessing correctness is whether the assembly is contaminated with any foreign organism sequence data. To check this, you could run a BLAST search using your assembled contigs as query sequences against a database containing possible contaminant species data. In addition to BLAST, checking the coverage can help to identify contaminant sequence data. The coverage of a contaminant contig is often different from that of the desired organism, so you can compare the potential contaminant contigs to the rest of the assembled contigs. To check for these types of coverage differences between contigs:
- Map the reads used as input for de novo assembly to your contigs (if you do not already have mapping output)
- Create a Detailed Mapping Report
- In the Result handling step of the wizard, check the option to Create separate table with statistics for each mapping
- Review the average coverage for each contig in this resulting table.
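Once the per-contig average coverage values are available, the review step can be supported by a small script such as the sketch below. It assumes the values have been exported from the table into a Python dictionary; the contig names, coverage values, and the 2-fold threshold are hypothetical choices for illustration, not recommendations from the tool.

```python
from statistics import median


def flag_coverage_outliers(avg_coverage_by_contig, fold=2.0):
    """Names of contigs whose coverage differs from the median by more than `fold`-fold."""
    if not avg_coverage_by_contig:
        return []
    med = median(avg_coverage_by_contig.values())
    flagged = []
    for name, coverage in avg_coverage_by_contig.items():
        # Flag contigs that are much more or much less covered than the typical contig.
        if coverage > med * fold or coverage < med / fold:
            flagged.append(name)
    return flagged


# Hypothetical per-contig average coverage values.
coverage = {
    "contig_1": 48.0,
    "contig_2": 52.0,
    "contig_3": 50.0,
    "contig_4": 210.0,
    "contig_5": 9.0,
}
print(flag_coverage_outliers(coverage))
```

A flagged contig is only a candidate for closer inspection. Unusually high coverage can also result from a collapsed repeat, as described under Completeness, so a BLAST search of the flagged contigs remains useful.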
Assessing the correctness of an assembly also involves checking for mis-assemblies, that is, making sure the assembler did not join segments of sequences that should not have been joined. This is more difficult. One option for identifying mis-assemblies is to try running the InDels and Structural Variants tool. If this tool identifies structural variation within the assembly, that could indicate an issue that should be investigated.
Post assembly improvements
If you are working with a smaller genome, the CLC Genome Finishing Module may be of interest to you. It has been developed to help finish small genomes, such as those of microbes, eukaryotic parasites, or fungi, in order to reduce the extensive workload associated with genome finishing and to facilitate as many steps in the procedure as possible. Request a free trial at https://www.qiagenbioinformatics.com/products/clc-genome-finishing-module/.