Mapping paired reads

At the bottom of the dialog shown in figure 25.2, you can specify how Paired reads should be handled. You can read more about how paired data is imported and handled in General notes on handling paired data. If the sequence list used as input contains paired reads, this option is enabled automatically; if it contains only single reads, the option is not applicable.

By default, the CLC Genomics Workbench automatically calculates the distance between the pairs. When this option is selected, the distance is estimated as follows:

  1. A sample of 100000 reads is extracted randomly from the full data set and mapped against the reference using a very wide distance interval.
  2. The distribution of distances between the paired reads is analyzed, and an appropriate distance interval is selected:
    • If fewer than 10000 reads map, a simple calculation is used: the minimum distance is set to one standard deviation below the average distance, and the maximum distance to one standard deviation above the average distance.
    • If more than 10000 reads map, a more sophisticated method is used, which examines the shape of the distribution and finds the boundaries of the peak.
  3. The full data set is mapped using this distance interval.
  4. The history of the result records the distance interval used.
The procedure above is run separately for each sequence list used as input, because the lists do not necessarily share the same library preparation and may have different distributions of paired distances. Figure 25.3 shows an example of the distribution of paired distances before and after the pair estimation.

Figure 25.3: To the left: mapping with a large paired distance interval. To the right: mapping with a narrower distance interval estimated by the workbench.
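
As an illustration of the simple calculation used when fewer than 10000 sampled reads map, the following Python sketch derives a distance interval from a list of observed paired distances. It is not the Workbench's own code: the function names are hypothetical, the use of the sample standard deviation is an assumption, and the peak-based method used for larger samples is deliberately left out because its details are not described here.

```python
import statistics

# Threshold taken from the description above; the behavior at exactly
# 10000 mapped reads is not specified, so treating it as "not simple"
# is an assumption of this sketch.
SIMPLE_METHOD_LIMIT = 10_000


def simple_distance_interval(distances):
    """Return (minimum, maximum) as the average distance minus and plus
    one standard deviation."""
    if len(distances) < 2:
        raise ValueError("at least two paired distances are required")
    average = statistics.mean(distances)
    # Sample standard deviation; whether the Workbench uses the sample
    # or the population form is an assumption.
    deviation = statistics.stdev(distances)
    return average - deviation, average + deviation


def estimate_distance_interval(distances):
    """Pick the estimation method based on how many sampled reads mapped."""
    if len(distances) < SIMPLE_METHOD_LIMIT:
        return simple_distance_interval(distances)
    # The peak-boundary method is not specified in this section.
    raise NotImplementedError("peak-based estimation is not sketched here")


if __name__ == "__main__":
    observed = [180, 200, 220]  # made-up distances for illustration only
    print(estimate_distance_interval(observed))
```

With the three made-up distances above, the average is 200 and the sample standard deviation is 20, so the resulting interval runs from 180 to 220.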

If the automatic detection of pairs is not checked, the mapper will use the information about minimum and maximum distance recorded on the input sequence lists (see General notes on handling paired data).

We recommend checking the Detailed mapping report to verify that the reported paired distances show a well-defined distribution and that not too many pairs are broken.

When a paired distance interval is set, the following approach is used for determining the placement of read pairs: