How it works
If a distance estimate for a data set is deemed unreliable, the estimate is ignored and replaced by the distance supplied by the user using the `-p' option for that data set. The `-e' option requires a file name argument ,which is used to output the result of the distance estimation for each dataset. The output is a tab-delimited file containing the estimated distances, if any, and a status code for each data set. The possible status codes are:- DISTANCE_ESTIMATED The distance interval was estimated and used for scaffolding.
- NO_DATA No or very few reads were mapped as paired reads.
- NOT_ENOUGH_DATA Not enough reads were mapped as paired reads to give a reliable distance estimate.
- NEGATIVE_DISTANCE The distance interval was in the negative range which is usually caused by either wrong orientation of the reads or paired-end contamination in a mate-pair data set.
- AMBIGIOUS_DISTANCE Several possible distance intervals were detected but there was not enough data to select the correct one.
- WRONG_DIRECTION The orientation of the reads was not set correctly.
Only distance estimates with the DISTANCE_ESTIMATED status code is used for the assembly. In general we do not recommend that the automatic paired distance estimation is used on mate-pair reads where the expected distance is larger than 10Kbp as the distance estimate will often either fail or be inaccurate. CLC bio's de novo assembly algorithm works by using de Bruijn graphs. This is similar to how most new de novo assembly algorithms work [Zerbino and Birney, 2008,Zerbino et al., 2009,Li et al., 2010,Gnerre et al., 2011]. The basic idea is to make a table of all sub-sequences of a certain length (called words) found in the reads. The words are relatively short, e.g. about 20 for small data sets and 27 for a large data set (the word size is determined automatically, see explanation below).
Given a word in the table, we can look up all the potential neighboring words (in all the examples here, word of length 16 are used) as shown in figure 5.1.
Figure 5.0: The word in the middle is 16 bases long, and it shares the 15 first bases with the backward neighboring word and the last 15 bases with the forward neighboring word.
Typically, only one of the backward neighbors and one of the forward neighbors will be present in the table. A graph can then be made where each node is a word that is present in the table and edges connect nodes that are neighbors. This is called a de Bruijn graph.
For genomic regions without repeats or sequencing errors, we get long linear stretches of connected nodes. We may choose to reduce such stretches of nodes with only one backward and one forward neighbor into nodes representing sub-sequences longer than the initial words.
Figure 5.2 shows an example where one node has two forward neighbors:
Figure 5.1: Three nodes connected, each sharing 15 bases with its neighboring node and ending with two forward neighbors.
After reduction, the three first nodes are merged, and the two sets of forward neighboring nodes are also merged as shown in figure 5.3.
Figure 5.2: The five nodes are compacted into three. Note that the first node is now 18 bases and the second nodes are each 17 bases.
So bifurcations in the graph leads to separate nodes. In this case we get a total of three nodes after the reduction. Note that neighboring nodes still have an overlap (in this case 15 nucleotides since the word length is 16).
Given this way of representing the de Bruijn graph for the reads, we can consider some different situations:
When we have a SNP or a sequencing error, we get a so-called bubble (this is explained in detail in Bubble resolution) as shown in figure 5.4.
Figure 5.3: A bubble caused by a heterozygous SNP or a sequencing error.
Here, the central position may be either a C or a G. If this was a sequencing error occurring only once, we would see that one path through the bubble will only be words seen a single time. On the other hand if this was a heterozygote SNP we would see both paths represented more or less equally. Thus, having information about how many times this particular word is seen in all the reads is very useful and this information is stored in the initial word table together with the words.
The most difficult problem for de novo assembly is repeats. Repeat regions in large genomes often get very complex: a repeat may be found thousands of times and part of one repeat may also be part of another repeat. Sometimes a repeat is longer than the read length (or the paired distance when pairs are available) and then it becomes impossible to resolve the repeat. This is simply because there is no information available about how to connect the nodes before the repeat to the nodes after the repeat.
In the simple example, if we have a repeat sequence that is present twice in the genome, we would get a graph as shown in figure 5.5.
Figure 5.4: The central node represents the repeat region that is represented twice in the genome. The neighboring nodes represent the flanking regions of this repeat in the genome.
Note that this repeat is 57 nucleotides long (the length of the sub-sequence in the central node above plus regions into the neighboring nodes where the sequences are identical). If the repeat had been shorter than 15 nucleotides, it would not have shown up as a repeat at all since the word length is 16. This is an argument for using long words in the word table. On the other hand, the longer the word, the more words from a read are affected by a sequencing error. Also, for each extra nucleotide in the words, we get one less word from each read. This is in particular an issue for very short reads. For example, if the read length is 35, we get 16 words out of each read if the word length is 20. If the word length is 25, we get only 11 words from each read.
To strike a balance, CLC bio's de novo assembler chooses a word length based on the amount of input data: the more data, the longer the word length. It is based on the following:
word size 12: 0 bp - 30000 bp word size 13: 30001 bp - 90002 bp word size 14: 90003 bp - 270008 bp word size 15: 270009 bp - 810026 bp word size 16: 810027 bp - 2430080 bp word size 17: 2430081 bp - 7290242 bp word size 18: 7290243 bp - 21870728 bp word size 19: 21870729 bp - 65612186 bp word size 20: 65612187 bp - 196836560 bp word size 21: 196836561 bp - 590509682 bp word size 22: 590509683 bp - 1771529048 bp word size 23: 1771529049 bp - 5314587146 bp word size 24: 5314587147 bp - 15943761440 bp word size 25: 15943761441 bp - 47831284322 bp word size 26: 47831284323 bp - 143493852968 bp word size 27: 143493852969 bp - 430481558906 bp word size 28: 430481558907 bp - 1291444676720 bp word size 29: 1291444676721 bp - 3874334030162 bp word size 30: 3874334030163 bp - 11623002090488 bp etc.This pattern (multiplying by 3) continues until word size of 64 which is the max. Please note that the range of word sizes is 12-24 on 32-bit computers and 12-64 on 64-bit computers. The word size can also be specified manually using the -w option. Using the -v (verbose) option, you can see the word size that is automatically calculated by the assembler
Subsections
- Resolve repeats using reads
- Automatic paired distance estimation
- Optimization of the graph using paired reads
- Bubble resolution
- Converting the graph to contig sequences
- Summary