Automatic paired distance estimation

The default behavior of the de novo assembler is to use the paired distances provided by the user. If automatic paired distance estimation is enabled, the assembler will attempt to estimate the distance between paired reads. This is done by analysing the mapping of paired reads to the long unambiguous paths in the graph, which are created in the read optimization step described above. The distance estimation algorithm creates a histogram ($ H$) of the paired distances between reads in each set of paired reads (see figure 5.11). Each of these histograms is then used to estimate paired distances as described in the following.

We denote the average number of observations per bin in the histogram by $ H_{avg} = \dfrac{1}{\vert H\vert}\sum_{d} H(d)$, where $ H(d)$ is the number of observations (reads) with distance $ d$ and $ \vert H\vert$ is the number of bins in $ H$. The gradient of $ H$ at distance $ d$ is denoted $ H'(d)$. The following algorithm is then used to compute a distance interval for each histogram.

Figure 5.10: Histogram of paired distances where $ H_{avg}$ is indicated by the horizontal dashed line. There are two peaks: one at a negative distance and a larger one at a positive distance. The extended interval $ [k,l]$ for each peak is indicated by the vertical dotted lines.
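
The quantities defined above ($ H$, $ H_{avg}$ and $ H'(d)$) can be illustrated with a small Python sketch. This is not the assembler's implementation and does not reproduce its interval algorithm. The function names, the `bin_width` parameter, the choice to count empty bins between the smallest and largest observed distance, and the finite-difference gradient are all assumptions made for illustration.

```python
from collections import Counter


def build_histogram(distances, bin_width=1):
    """Return {bin_start: count} for a list of integer paired distances.

    Assumption: empty bins between the smallest and largest observed
    distance are kept, so they count towards |H|.
    """
    if not distances:
        raise ValueError("no paired distances observed")
    counts = Counter(d // bin_width for d in distances)
    first, last = min(counts), max(counts)
    return {b * bin_width: counts.get(b, 0) for b in range(first, last + 1)}


def histogram_average(hist):
    """H_avg: total number of observations divided by the number of bins."""
    return sum(hist.values()) / len(hist)


def gradient(hist):
    """Approximate H'(d) with a central difference (one-sided at the ends).

    Assumption: the manual does not say how the gradient is computed;
    a finite difference is one simple choice.
    """
    ds = sorted(hist)
    grad = {}
    for i, d in enumerate(ds):
        left = ds[max(i - 1, 0)]
        right = ds[min(i + 1, len(ds) - 1)]
        if right == left:  # histogram with a single bin
            grad[d] = 0.0
        else:
            grad[d] = (hist[right] - hist[left]) / (right - left)
    return grad


# Hypothetical observations: a small peak at a negative distance and a
# larger peak at a positive distance.
example_distances = [-2, -2, 3, 3, 3, 4]
H = build_histogram(example_distances)
H_avg = histogram_average(H)
H_prime = gradient(H)
```

With these hypothetical observations the histogram has seven bins (distances -2 to 4) holding six observations, so $ H_{avg} = 6/7$, and both peaks rise above that level.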

If a distance estimate for a data set is deemed unreliable, the estimate is ignored and replaced by the distance supplied by the user with the '-p' option for that data set. The '-e' option requires a file name argument, which is used to output the result of the distance estimation for each data set. The output is a tab-delimited file containing the estimated distances, if any, and a status code for each data set. The possible status codes are:

DISTANCE_ESTIMATED: The distance interval was estimated and used for scaffolding.

NO_DATA: No or very few reads were mapped as paired reads.

NOT_ENOUGH_DATA: Not enough reads were mapped as paired reads to give a reliable distance estimate.

NEGATIVE_DISTANCE: The distance interval was in the negative range, which is usually caused by either wrong orientation of the reads or paired-end contamination in a mate-pair data set.

AMBIGIOUS_DISTANCE: Several possible distance intervals were detected, but there was not enough data to select the correct one.

WRONG_DIRECTION: The orientation of the reads was not set correctly.

Only distance estimates with the DISTANCE_ESTIMATED status code are used for the assembly. In general, we do not recommend using automatic paired distance estimation on mate-pair reads where the expected distance is larger than 10 kbp, as the distance estimate will often either fail or be inaccurate.
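
The fallback rule described above can be summarised in a few lines of Python. This is a minimal sketch of the decision only, not code from the assembler. The function name `effective_distance` and the representation of a distance interval as a `(minimum, maximum)` tuple are hypothetical, and the layout of the '-e' output file is deliberately not assumed here.

```python
# The only status code for which the estimated distance is used.
USABLE_STATUS = "DISTANCE_ESTIMATED"


def effective_distance(status, estimated, user_supplied):
    """Pick the distance interval to use for one data set.

    status        -- status code reported for the data set
    estimated     -- estimated (min, max) interval, or None if there is none
    user_supplied -- (min, max) interval given by the user
    """
    if status == USABLE_STATUS and estimated is not None:
        return estimated
    # Any other status code means the estimate is ignored and the
    # user-supplied distance is used instead.
    return user_supplied


# Hypothetical values for two data sets.
used_a = effective_distance("DISTANCE_ESTIMATED", (180, 250), (100, 500))
used_b = effective_distance("NEGATIVE_DISTANCE", None, (2000, 4000))
```

Here `used_a` is the estimated interval, while `used_b` falls back to the user-supplied interval because the status code is not DISTANCE_ESTIMATED.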