Automatic paired distance estimation

The default behavior of the de novo assembler is to use the paired distances provided by the user. If automatic paired distance estimation is enabled, the assembler instead attempts to estimate the distance between paired reads. It does this by analysing how paired reads map to the long unambiguous paths in the graph, which are created in the read optimization step described above. The distance estimation algorithm creates a histogram ($H$) of the paired distances between reads for each set of paired reads (see figure 28.11). Each of these histograms is then used to estimate paired distances as described in the following.

We denote the average number of observations per bin in the histogram by $H_{avg} = \dfrac{1}{\vert H\vert}\sum_{d} H(d)$, where $H(d)$ is the number of observations (reads) with distance $d$ and $\vert H\vert$ is the number of bins in $H$. The gradient of $H$ at distance $d$ is denoted $H'(d)$. The following algorithm is then used to compute a distance interval for each histogram.
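To make the notation concrete, the Python sketch below builds a histogram from a list of observed paired distances and evaluates the average bin height and the gradient at a given distance. It illustrates only these definitions, not the interval algorithm itself. The function names, the bin width of 1, the choice to count empty bins inside the observed range towards the number of bins, and the central-difference approximation of the gradient are all assumptions made for illustration; the assembler's actual implementation is not documented here.

```python
from collections import Counter


def build_histogram(distances, bin_width=1):
    """Return {bin_start: count} for integer distances.

    Assumption: empty bins between the smallest and largest observed
    distance are kept with a count of 0, so they count towards |H|.
    """
    # Floor division also bins negative distances correctly.
    counts = Counter(d // bin_width * bin_width for d in distances)
    lo, hi = min(counts), max(counts)
    return {d: counts.get(d, 0) for d in range(lo, hi + bin_width, bin_width)}


def average_height(hist):
    """H_avg: the sum of H(d) over all bins divided by the number of bins |H|."""
    return sum(hist.values()) / len(hist)


def gradient(hist, d, bin_width=1):
    """Approximate H'(d) by a central difference (an assumed definition).

    Bins outside the histogram are treated as having 0 observations.
    """
    right = hist.get(d + bin_width, 0)
    left = hist.get(d - bin_width, 0)
    return (right - left) / (2 * bin_width)


# Hypothetical paired distances with a small negative peak and a larger positive peak.
distances = [-2, -2, -1, 3, 4, 4, 4, 5]
hist = build_histogram(distances)
h_avg = average_height(hist)
rising = gradient(hist, 3)   # positive on the left flank of the larger peak
falling = gradient(hist, 5)  # negative on the right flank of the larger peak
```

In this toy data the histogram has eight bins and eight observations, so the average bin height is 1; bins rising above that level are the peak candidates that the dashed line in figure 28.11 separates from the background.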

Figure 28.11: Histogram of paired distances where $H_{avg}$ is indicated by the horizontal dashed line. There are two peaks: one at a negative distance and a larger one at a positive distance. The extended interval $[k,l]$ for each peak is indicated by the vertical dotted lines.