Optimization of the graph using paired reads
When paired reads are available, we can use the paired information to resolve large repeat regions that are not spanned by individual reads, but are spanned by read pairs. Given a set of paired reads that align to two nodes connected by a repeat region, the repeat region may be resolved for those nodes if we can find a path connecting the nodes with a length corresponding to the paired read distance. However, such a path must be supported by a minimum of four sets of paired reads before the repeat is resolved.
If the repeat cannot be resolved, scaffolding is performed: paired read information is used to determine the distances between contigs and their orientation. Scaffolding is only considered between contigs with a minimum length of 120 to ensure that enough paired read information is available. Scaffolding uses an iterative greedy approach in which short gaps are closed first, thus increasing the paired read information available for closing the remaining gaps (see figure 28.11).
Figure 28.11: Performing iterative scaffolding of the shortest gaps allows long pairs to be optimally used. i1 shows three contigs with dashed arches indicating potential scaffolding. i2 is after the first iteration, when the shortest gap has been closed and the long potential scaffolding has been updated. i3 is the final result with three contigs in one scaffold.
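The iterative greedy idea can be sketched as follows. This is a simplified illustration, not the assembler's algorithm: contig orientation is ignored, and the data layout, the names `pooled_gap` and `scaffold`, and the `min_pairs` parameter are hypothetical. Only the minimum contig length of 120 and the "shortest gap first" strategy are taken from the text. The sketch shows how a long-range link (c1 to c3) is converted into extra evidence for the remaining gap once the short gap (c1 to c2) has been closed.

```python
from statistics import mean
from typing import Dict, List, Tuple

MIN_CONTIG_LENGTH = 120  # threshold stated in the text

# A scaffold is (contigs, gaps); gaps[i] lies between contigs[i] and contigs[i + 1].
Scaffold = Tuple[List[str], List[float]]
# links[(a, b)] lists per-pair gap estimates for "contig b follows contig a".
Links = Dict[Tuple[str, str], List[float]]


def pooled_gap(left: Scaffold, right: Scaffold, links: Links,
               length: Dict[str, int]) -> List[float]:
    """Gap estimates between the end of `left` and the start of `right`,
    pooled over all links between their member contigs."""
    (lc, lg), (rc, rg) = left, right
    estimates = []
    for i, a in enumerate(lc):
        # Sequence between the end of contig a and the end of `left`.
        tail = sum(length[c] for c in lc[i + 1:]) + sum(lg[i:])
        for j, b in enumerate(rc):
            # Sequence between the start of `right` and the start of contig b.
            head = sum(length[c] for c in rc[:j]) + sum(rg[:j])
            estimates.extend(d - tail - head for d in links.get((a, b), []))
    return estimates


def scaffold(length: Dict[str, int], links: Links,
             min_pairs: int = 1) -> List[Scaffold]:
    """Repeatedly join the two scaffolds separated by the shortest gap."""
    usable = {pair: est for pair, est in links.items()
              if all(length[c] >= MIN_CONTIG_LENGTH for c in pair)}
    scaffolds: List[Scaffold] = [([c], []) for c in length]
    while True:
        best = None
        for i, left in enumerate(scaffolds):
            for j, right in enumerate(scaffolds):
                if i == j:
                    continue
                est = pooled_gap(left, right, usable, length)
                if len(est) >= min_pairs:
                    gap = mean(est)
                    if best is None or gap < best[0]:
                        best = (gap, i, j)
        if best is None:
            return scaffolds
        gap, i, j = best
        left, right = scaffolds[i], scaffolds[j]
        merged = (left[0] + right[0], left[1] + [gap] + right[1])
        scaffolds = [s for k, s in enumerate(scaffolds) if k not in (i, j)]
        scaffolds.append(merged)


length = {"c1": 1000, "c2": 500, "c3": 800, "c4": 100}
links = {
    ("c1", "c2"): [40, 60],   # short gap, closed first
    ("c2", "c3"): [300],
    ("c1", "c3"): [900],      # long pairs spanning c2
    ("c3", "c4"): [10],       # ignored: c4 is shorter than 120
}
result = scaffold(length, links)
```

After the first join, the c1 to c3 estimate of 900 is reduced by the closed gap (50) and the length of c2 (500), giving a second estimate of 350 for the gap between c2 and c3 that is averaged with the direct estimate of 300.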
Contigs in the same scaffold are output as one large contig with Ns inserted in between. The number of Ns inserted corresponds to the estimated distance between the contigs, which is calculated from the paired read information. More precisely, for each set of paired reads spanning two contigs, a distance estimate is calculated based on the supplied distance between the reads. The average of these estimates is then used as the final distance estimate. A negative distance estimate is possible; this happens when the paired information indicates that the contigs overlap but for some reason could not be joined in the graph. Additional information about repeats resolved using paired reads and about scaffolded contigs is available as annotations on the contig sequences and as a summary in the report (see De novo assembly report).
For the GFF format there are three types of annotations:
- Scaffold refers to the estimated gap region between two contigs where Ns are inserted.
- Contigs joined refers to the join of two contigs connected by a repeat or another ambiguous structure in the graph which was resolved using paired reads.
- Alternatives excluded refers to the exclusion of a region in the graph using paired reads which resulted in a join of two contigs.
The AGP annotations describe the components which an assembly consists of. Currently we output two types of annotations in AGP format:
- Contig refers to a non-redundant sequence not containing any scaffolded regions.
- Scaffold refers to the estimated gap region between two contigs where Ns are inserted.
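The gap-size calculation behind the Scaffold regions can be sketched as follows. The function names and the geometry in `per_pair_gap` (subtracting the parts of the pair that lie on each contig from the supplied pair distance) are assumptions for illustration. The text only states that each pair yields an estimate based on the supplied distance and that the estimates are averaged. How a negative estimate is written to the output is not described in the text; here it is assumed to contribute no Ns.

```python
from statistics import mean
from typing import List


def per_pair_gap(pair_distance: int, left_tail: int, right_head: int) -> int:
    """Gap implied by one read pair (hypothetical geometry).

    left_tail:  bases from the first read's start to the end of the left contig.
    right_head: bases from the start of the right contig to the second read's end.
    """
    return pair_distance - left_tail - right_head


def estimate_gap(pair_estimates: List[float]) -> float:
    """Final distance estimate: the average of the per-pair estimates."""
    return mean(pair_estimates)


def join_with_ns(left: str, right: str, pair_estimates: List[float]) -> str:
    """Output two scaffolded contigs as one sequence with Ns in between."""
    gap = round(estimate_gap(pair_estimates))
    # Assumption: a zero or negative estimate (overlapping contigs) adds no Ns.
    return left + "N" * max(gap, 0) + right


joined = join_with_ns("ACGT", "TTGA", [3, 5, 4])
overlapping = join_with_ns("ACGT", "TTGA", [-2, -4])
```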