SOLiD data support in de novo assembly
SOLiD sequencing is done in color space, where each color encodes a
pair of adjacent bases. When viewed in nucleotide space, this means
that a single sequencing error changes the remainder of the read.
An example read is shown in figure 5.17.
Figure 5.17: How an error in color space leads to a phase shift and subsequent problems for the rest of the read sequence
In this example, the color error means that from the error position onwards, C's become A's and A's become C's, and likewise G's and T's are swapped. There are three different types of color errors, and each one gives a different version of the rest of the read. Together with the correct reads, we may therefore see up to four different versions of the original genome sequence. So if SOLiD reads are simply treated as nucleotide sequences, we get four different contig sequences, with jumps from one to another every time there is a sequencing error.
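The effect described above can be reproduced with a small sketch. It assumes the standard SOLiD two-base encoding, in which each color is the XOR of the two-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names, the example read and the primer base are hypothetical and chosen only for illustration; this is not code from the clc_assembler program.

```python
# Two-bit base codes assumed for the standard SOLiD two-base encoding.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode(primer_base, bases):
    """Translate nucleotides to colors: each color is the XOR of two adjacent base codes."""
    colors = []
    prev = BASE_TO_CODE[primer_base]
    for base in bases:
        cur = BASE_TO_CODE[base]
        colors.append(prev ^ cur)
        prev = cur
    return colors


def decode(primer_base, colors):
    """Translate colors back to nucleotides, starting from the known primer base."""
    prev = BASE_TO_CODE[primer_base]
    bases = []
    for color in colors:
        prev ^= color  # each base depends on every color before it
        bases.append(CODE_TO_BASE[prev])
    return "".join(bases)


def with_color_error(colors, position, error):
    """Return a copy of colors where one color is changed by XOR with error (1, 2 or 3)."""
    corrupted = list(colors)
    corrupted[position] ^= error
    return corrupted


# Hypothetical example read and primer base.
read = "ACGGTCAGTA"
colors = encode("T", read)

print("correct:", decode("T", colors))
for error in (1, 2, 3):
    # One wrong color changes every base from that position to the end of the read.
    print("error", error, ":", decode("T", with_color_error(colors, 4, error)))
```

Each of the three error types leaves the bases before the error untouched and applies one fixed substitution to all bases after it, which is why up to four versions of the same sequence can appear when the reads are interpreted in nucleotide space.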
Thus, to fully accommodate SOLiD sequencing data, the special nature of
the technology has to be considered in every step of the assembly
algorithm. Furthermore, SOLiD reads are fairly short and often quite
error-prone. Due to these issues, we have chosen not to include SOLiD
support in the first algorithm steps, but only use the SOLiD data
where they have a large positive effect on the assembly process: when
applying paired information.
Thus, the clc_assembler program has a special option ("-p d") to
indicate that a certain data set
should be used only for its paired information. This option
should always be applied to SOLiD data. It is also useful for other
types of data sets with many errors, where the errors may confuse the
initial graph building more than the reads improve it. The paired
information in such data sets is still valuable and can be used with
this option.