QIAGEN Bioinformatics Manuals

Automatic paired distance estimation

The default behavior of the de novo assembler is to use the paired distances provided by the user. If the automatic paired distance estimation is enabled, the assembler will attempt to estimate the distance between paired reads. This is done by analysing the mapping of paired reads to the long unambiguous paths in the graph which are created in the read optimization step described above. The distance estimation algorithm creates a histogram ($H$) of the paired distances between reads in each set of paired reads (see figure 30.11). Each of these histograms is then used to estimate paired distances as described in the following.

We denote the average number of observations in the histogram $H_{avg} = \dfrac{1}{\vert H\vert}\sum_{d} H(d)$ where $H(d)$ is the number of observations (reads) with distance $d$ and $\vert H\vert$ is the number of bins in $H$. The gradient of $H$ at distance $d$ is denoted $H'(d)$. The following algorithm is then used to compute a distance interval for each histogram.
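The two quantities defined above can be illustrated with a minimal Python sketch. It is not the assembler's implementation. It assumes the histogram is stored as a dense list with one bin per distance, so that $\vert H\vert$ counts empty bins as well, and it assumes the gradient is taken as the slope between the two end bins of a centered window. The function names `average_observations` and `gradient` are hypothetical.

```python
def average_observations(hist):
    """H_avg: total number of observations divided by the number of bins |H|."""
    return sum(hist) / len(hist)


def gradient(hist, idx, window=5):
    """Approximate H'(d) at bin index idx as the slope across a centered window.

    Assumption: the slope is the difference between the window's end bins
    divided by the distance between them; the window is clipped at the edges.
    """
    half = window // 2
    lo = max(idx - half, 0)
    hi = min(idx + half, len(hist) - 1)
    if hi == lo:
        return 0.0
    return (hist[hi] - hist[lo]) / (hi - lo)


# Toy histogram: bin index 0 corresponds to the smallest observed distance.
hist = [0, 0, 2, 6, 10, 6, 2, 0, 0]
h_avg = average_observations(hist)   # 26 observations over 9 bins
slope_rising = gradient(hist, 3)     # (6 - 0) / 4 = 1.5
slope_at_peak = gradient(hist, 4)    # (2 - 2) / 4 = 0.0
```

With this definition the gradient is positive on the rising flank of a peak, close to zero at its top, and negative on the falling flank.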

Identify peaks in $H$ as $\max_{i\leq d \leq j}H(d)$ where $[i,j]$ is any interval in $H$ where $\lbrace H(d) \geq \dfrac{H_{avg}}{2} \vert i \leq d \leq j\rbrace$.
For the two largest peaks found, expand the respective intervals $[i,j]$ to $[k,l]$ where $H'(k) < 0.001 \wedge k \leq i \wedge H'(l) > -0.001 \wedge j \leq l$. I.e. we search for a point in both directions where the number of observations becomes stable. A window of size 5 is used to calculate $H'$ in this step.
Compute the total number of observations in each of the two expanded intervals.
If only one peak was found, the corresponding interval is used as the distance estimate unless the peak was at a negative distance in which case no distance estimate is calculated.
If two peaks were found and the interval for the largest peak contains less than 1% of all observations, the distance is not estimated.
If two peaks were found and the interval for the largest peak contains $<$ 2X observations compared to the smaller peak, the distance estimate is only computed if the range of distances is positive for the largest peak and negative for the smaller peak. If this is the case, the interval for the positive peak is used as the distance estimate.
If two peaks were found and the largest peak has $\geq$ 2X observations compared to the smaller peak, the interval corresponding to the largest peak is used as the distance estimate.
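The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the assembler's actual code. The histogram is assumed to be a non-empty dense list whose first bin corresponds to `min_distance`. The "largest" peak is assumed to be the tallest one. The gradient is assumed to be the slope between the end bins of a clipped window of size 5. A range is assumed to be positive when its lower bound is above zero and negative when its upper bound is below zero. All function and parameter names are hypothetical.

```python
def gradient(hist, idx, window=5):
    """Slope across a centered window, clipped at the histogram edges (assumption)."""
    half = window // 2
    lo, hi = max(idx - half, 0), min(idx + half, len(hist) - 1)
    return (hist[hi] - hist[lo]) / (hi - lo) if hi > lo else 0.0


def find_peaks(hist, h_avg):
    """Maximal runs [i, j] of bins with H(d) >= H_avg / 2, as (height, i, j)."""
    peaks, start = [], None
    # The -inf sentinel guarantees that a run reaching the last bin is closed.
    for idx, count in enumerate(list(hist) + [float("-inf")]):
        if count >= h_avg / 2:
            if start is None:
                start = idx
        elif start is not None:
            peaks.append((max(hist[start:idx]), start, idx - 1))
            start = None
    return peaks


def expand(hist, i, j, eps=0.001):
    """Grow [i, j] to [k, l] until H'(k) < eps and H'(l) > -eps."""
    k, l = i, j
    while k > 0 and gradient(hist, k) >= eps:
        k -= 1
    while l < len(hist) - 1 and gradient(hist, l) <= -eps:
        l += 1
    return k, l


def estimate_interval(hist, min_distance):
    """Return a (low, high) distance estimate, or None when none is made."""
    h_avg = sum(hist) / len(hist)
    top_two = sorted(find_peaks(hist, h_avg), reverse=True)[:2]  # tallest first
    if not top_two:
        return None
    summaries = []
    for height, i, j in top_two:
        k, l = expand(hist, i, j)
        mode = min_distance + hist.index(height, i, j + 1)  # distance at the peak
        observations = sum(hist[k:l + 1])
        summaries.append((observations, min_distance + k, min_distance + l, mode))
    obs, low, high, mode = summaries[0]
    if len(summaries) == 1:
        # Single peak: use it unless it sits at a negative distance.
        return None if mode < 0 else (low, high)
    small_obs, _, small_high, _ = summaries[1]
    if obs < 0.01 * sum(hist):
        return None
    if obs >= 2 * small_obs:
        return (low, high)
    # Fewer than 2X: require a positive largest range and a negative smaller one.
    return (low, high) if low > 0 and small_high < 0 else None


# A small negative peak and a larger positive peak; bin 0 is distance -10.
hist = [0, 0, 1, 4, 8, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 8, 20, 30, 20, 8, 2, 0, 0, 0]
interval = estimate_interval(hist, min_distance=-10)
```

In this toy input the positive peak holds 90 of the 108 observations, which is at least twice the 18 observations of the negative peak, so the expanded interval of the positive peak is returned. Because of the window of size 5, the expanded interval reaches a couple of bins beyond the point where the counts first drop to zero.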

Figure 30.11: Histogram of paired distances where $H_{avg}$ is indicated by the horizontal dashed line. There are two peaks: one is at a negative distance while the other, larger peak is at a positive distance. The extended interval $[k,l]$ for each peak is indicated by the vertical dotted lines.
