## Automatic paired distance estimation

The default behavior of the de novo assembler is to use the paired distances provided by the user. If automatic paired distance estimation is enabled, the assembler will attempt to estimate the distance between paired reads. This is done by analysing the mapping of paired reads to the long unambiguous paths in the graph, which are created in the read optimization step described above. The distance estimation algorithm creates a histogram ($H$) of the paired distances between reads in each set of paired reads (see figure 5.10). Each of these histograms is then used to estimate paired distances as described in the following.
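As a minimal sketch of the histogram described above, the following Python snippet bins a list of signed paired distances. The function name `build_histogram`, the one-bin-per-distance resolution, and the dictionary representation are assumptions made for illustration; the assembler's internal representation is not documented here.

```python
from collections import Counter
from typing import Dict, Iterable


def build_histogram(distances: Iterable[int]) -> Dict[int, int]:
    """Count observations per paired distance (hypothetical helper).

    Distances are signed: a negative value means the reads of a pair
    mapped in the opposite order to the expected one. Every distance
    between the smallest and largest observation gets a bin, so empty
    bins are stored explicitly with a count of 0.
    """
    counts = Counter(distances)
    if not counts:
        return {}
    return {d: counts.get(d, 0) for d in range(min(counts), max(counts) + 1)}
```

Storing the empty bins explicitly keeps the number of bins well defined, which matters as soon as an average over bins is taken.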

We denote the average number of observations in the histogram $H_{avg} = \frac{1}{|H|} \sum_{d} H(d)$, where $H(d)$ is the number of observations (reads) with distance $d$ and $|H|$ is the number of bins in $H$. The gradient of $H$ at distance $d$ is denoted $H'(d)$. The following algorithm is then used to compute a distance interval for each histogram.

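• Identify peaks in $H$ as $\max_{i \le d \le j} H(d)$, where $[i,j]$ is any interval in $H$ in which $H(d)$ stays above a threshold derived from $H_{avg}$.
• For the two largest peaks found, expand the respective intervals $[i,j]$ to $[k,l]$, where $k \le i$ and $j \le l$ are points at which the gradient $H'$ is close to zero. I.e. we search for a point in both directions where the number of observations becomes stable. A window of size 5 is used to calculate $H'$ in this step.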
• Compute the total number of observations in each of the two expanded intervals.
• If only one peak was found, the corresponding interval is used as the distance estimate unless the peak was at a negative distance in which case no distance estimate is calculated.
• If two peaks were found and the interval for the largest peak contains less than 1% of all observations, the distance is not estimated.
• If two peaks were found and the interval for the largest peak contains fewer than 2X the observations of the smaller peak, the distance estimate is only computed if the range of distances is positive for the largest peak and negative for the smallest peak. If this is the case, the interval for the positive peak is used as the distance estimate.
• If two peaks were found and the interval for the largest peak contains at least 2X the observations of the smaller peak, the interval corresponding to the largest peak is used as the distance estimate.
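The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the assembler's implementation: the function names are hypothetical, the peak threshold and the gradient tolerance `eps` are left as parameters because their exact values are not given here, the gradient is approximated as the slope across a 5-bin window, the two largest peaks are ranked by height, and an interval counts as positive when its lower bound is at least 0 and as negative when its upper bound is below 0. The histogram is assumed to be a dictionary from distance to count with every bin present.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive [low, high] range of distances


def find_peak_intervals(hist: Dict[int, int], threshold: float) -> List[Interval]:
    """Return maximal runs of consecutive distances with count > threshold."""
    intervals: List[Interval] = []
    start = prev = None
    for d in sorted(hist):
        if hist[d] > threshold:
            if start is None or d != prev + 1:
                if start is not None:
                    intervals.append((start, prev))
                start = d
            prev = d
        elif start is not None:
            intervals.append((start, prev))
            start = None
    if start is not None:
        intervals.append((start, prev))
    return intervals


def gradient(hist: Dict[int, int], d: int, window: int = 5) -> float:
    """Slope of the histogram across a window centred on d (assumed form)."""
    half = window // 2
    return (hist.get(d + half, 0) - hist.get(d - half, 0)) / (2 * half)


def expand(hist: Dict[int, int], interval: Interval, eps: float) -> Interval:
    """Widen [i, j] to [k, l] until the gradient is within eps of zero."""
    lo, hi = interval
    d_min, d_max = min(hist), max(hist)
    while lo > d_min and abs(gradient(hist, lo)) > eps:
        lo -= 1
    while hi < d_max and abs(gradient(hist, hi)) > eps:
        hi += 1
    return lo, hi


def estimate_interval(hist: Dict[int, int], threshold: float,
                      eps: float) -> Optional[Interval]:
    """Apply the decision rules; None means no distance is estimated."""
    total = sum(hist.values())
    peaks = []
    for lo, hi in find_peak_intervals(hist, threshold):
        top = max(range(lo, hi + 1), key=lambda d: hist.get(d, 0))
        peaks.append((hist.get(top, 0), top, (lo, hi)))
    peaks.sort(reverse=True)  # tallest peak first
    peaks = peaks[:2]
    if not peaks:
        return None
    expanded = [expand(hist, iv, eps) for _, _, iv in peaks]
    counts = [sum(hist.get(d, 0) for d in range(lo, hi + 1))
              for lo, hi in expanded]
    if len(peaks) == 1:
        # A single peak is only accepted at a non-negative distance.
        return expanded[0] if peaks[0][1] >= 0 else None
    if counts[0] < 0.01 * total:
        return None
    if counts[0] >= 2 * counts[1]:
        return expanded[0]
    # Fewer than 2X: accept only a positive largest and negative smaller peak.
    largest_positive = expanded[0][0] >= 0
    smaller_negative = expanded[1][1] < 0
    return expanded[0] if largest_positive and smaller_negative else None
```

A caller would typically compute the average bin count as `sum(hist.values()) / len(hist)` and derive `threshold` from it. Returning `None` in the rejected cases leaves it to the caller to fall back to a user-supplied distance.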

Figure 5.10: Histogram of paired distances where $H_{avg}$ is indicated by the horizontal dashed line. There are two peaks: one is at a negative distance, while the other, larger peak is at a positive distance. The extended interval $[k,l]$ for each peak is indicated by the vertical dotted lines.

If a distance estimate for a data set is deemed unreliable, the estimate is ignored and replaced by the distance supplied by the user using the '-p' option for that data set. The '-e' option requires a file name argument, which is used to output the result of the distance estimation for each data set. The output is a tab-delimited file containing the estimated distances, if any, and a status code for each data set. The possible status codes are:

• DISTANCE_ESTIMATED: The distance interval was estimated and used for scaffolding.
• NO_DATA: No or very few reads were mapped as paired reads.
• NOT_ENOUGH_DATA: Not enough reads were mapped as paired reads to give a reliable distance estimate.
• NEGATIVE_DISTANCE: The distance interval was in the negative range, which is usually caused by either wrong orientation of the reads or paired-end contamination in a mate-pair data set.
• AMBIGIOUS_DISTANCE: Several possible distance intervals were detected but there was not enough data to select the correct one.
• WRONG_DIRECTION: The orientation of the reads was not set correctly.

Only distance estimates with the DISTANCE_ESTIMATED status code are used for the assembly. In general, we do not recommend using automatic paired distance estimation on mate-pair reads where the expected distance is larger than 10 kbp, as the distance estimate will often either fail or be inaccurate.
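The fallback rule described above can be summarised in a few lines of Python. This is only a sketch: the function name `choose_distance` is hypothetical, and both the estimated and the user-supplied distance are assumed to be (minimum, maximum) pairs.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # assumed (minimum, maximum) paired distance


def choose_distance(status: str,
                    estimated: Optional[Interval],
                    user_supplied: Interval) -> Interval:
    """Pick the distance interval used for one data set (hypothetical helper).

    The estimate is used only when its status code is DISTANCE_ESTIMATED;
    for every other status the user-supplied distance is kept.
    """
    if status == "DISTANCE_ESTIMATED" and estimated is not None:
        return estimated
    return user_supplied
```

For example, `choose_distance("NEGATIVE_DISTANCE", (-300, -150), (180, 250))` keeps the user-supplied interval, because only the DISTANCE_ESTIMATED status allows the estimate to replace it.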