Map Bisulfite Reads to Reference

Setting up the Map Bisulfite Reads to Reference tool is very similar to the usual read mapping, with a few differences. In particular,

Please note that, because two versions of the reference sequence (C->T and G->A - converted) have to be indexed and used simultaneously for each read, the RAM requirements for the bisulfite mapper are double than those needed for a regular mapping against a reference sequence of the same size.

To start the read mapping:

        Tools | Epigenomics Analysis (Image epigenomics) | Bisulfite Sequencing (Image bisulfite_folder_closed_16_n_p)| Map Bisulfite Reads to Reference (Image map_bisulfite_reads_to_reference_16_n_p)

Selecting reads and directionality

In the first dialog, select the sequences or sequence lists containing the sequencing data (figure 37.15). Please note that reads longer than 100,000 bases are not supported.

Image referenceassembly_bs1
Figure 37.15: Specifying the reads as input. You can also choose to work in batch.

When the sequences are selected, click Next to specify what type of protocol you have used (directional or not).

A directional protocol yields reads from both strands of the original sample-DNA. The first read of a pair (or every read for single-end sequencing) is known to be sequenced either from the original-top (OT) or the original-bottom (OB) strand. The second read of a pair is sequenced from a complementary strand, either ctOT or ctOB. At the time of writing, examples of directional protocols include:

In a non-directional protocol, the first read of a pair may come from any of the four strands: OT, OB, ctOT or ctOB. Examples include:

If you are uncertain about the directionality of your protocol, contact the protocol vendor. Note that it is sometimes possible to infer the directionality by looking at the reads: in the absence of methylation, a directional protocol will have few or no Cs in the first read of each pair. We do not recommend however using this approach.

Selecting the reference

When the sequences and directionality are selected, click Next, and you will see the dialog shown in figure 37.16.

Image referenceassembly_bs2
Figure 37.16: Specifying the reference sequences and masking.

Click the Browse and select element (Image browse) button to select either single sequences, a list of sequences or a sequence track as reference. Note the following constraints:

Reference masking

The next part of the dialog lets you mask the reference. Masking means that selected regions of the reference are ignored during read mapping. Reads will not be mapped to these regions, but the full reference is still included in the output.

Masking can be useful when reads are expected to originate only from specific regions, for example when working with targeted sequencing data. However, masking should be used with care. If reads originate outside the selected regions, they may be mapped to less suitable locations, which can affect downstream analyses such as variant detection.

Masking large numbers of regions, such as repetitive sequences, is generally not recommended. Repeats are handled automatically during mapping, and masking them may reduce performance and lead to incorrect read placement.

To mask a reference using regions defined in a masking track, choose:

Then click the Browse (Image browse) button to select a track for masking.

If your regions are stored as sequence annotations, they can be converted to a track.

Mapping parameters

Clicking Next leads to the parameters for the read mapping (see figure 37.17).

Image referenceassembly_bs3
Figure 37.17: Setting parameters for the bisulfite read mapping.

The first parameter allows the mismatch cost to be adjusted:

After setting the mismatch cost you need to choose between linear gap cost and affine gap cost, and depending on the model you chose, you need to set two different sets of parameters that control how gaps in the read mapping are penalized.

The score of a match between the read and the reference is usually set to 1. Adjusting the cost parameters above can improve the mapping quality, e.g. when the read error rate is high or the reference is expected to differ significantly from the sequenced organism. For example, if the reads are expected to contain many insertions and/or deletions, it can be a good idea to lower the insertion and deletion costs to allow more of such errors. However, one should also consider the possible drawbacks when adjusting these settings. For example, reducing the insertion and deletion cost increases the risk of mapping reads to the wrong positions in the reference.

Image mapper_unaligned_end
Figure 37.18: An alignment of a read where a region of 35bp at the start of the read is unaligned while the remaining 57 nucleotides matches the reference.

Figure 37.18 shows an example using linear gap cost where the read mapper is unable to map a region in a read due to insertions in the read and mismatches between the read and the reference. The aligned region of the read has a total of 57 matching nucleotides which result in an alignment score of 57 which is optimal when using the default cost for mismatches and insertions/deletions (2 and 3 respectively). If the mapper had aligned the remaining 35bp of the read as shown in figure 37.19 using the default scoring scheme, the score would become: (26+1+3+57)*1 - 5*2 - 8*3 = 53

In this case, the alignment shown in Figure 37.18 is optimal since it has the highest score. However, if either the cost of deletions or mismatches were reduced by one, the score of the alignment shown in figure 37.19 would become 61 and 58, respectively, and thus make it optimal.

Image mapper_aligned_end
Figure 37.19: An alignment of a read containing a region with several mismatches and deletions. By reducing the default cost of either mismatches or deletions the read mapper can make an alignment that spans the full length of the read.

Once the optimal alignment of the read is found, based on the cost parameters described above, a filtering process determines whether this match is good enough for the read to be included in the output. The filtering threshold is determined by two factors:

Mapping paired reads

CLC Genomics Workbench offers as the default choice to automatically calculate the distance between the pairs. If this is selected, the distance is estimated in the following way:

  1. A sample of 200,000 reads is extracted randomly from the full data set and mapped against the reference using a very wide distance interval.
  2. The distribution of distances between the paired reads is analyzed using a method that investigates the shape of the distribution and finds the boundaries of the peak.
  3. The full sample is mapped using this distance interval.
  4. The history (Image history_16_n_p) of the result records the distance interval used.

The above procedure will be run for each sequence list used as input, assuming that they do not necessarily share the same library preparation and could have different distributions of paired distances. Figure 37.21 shows an example of the distribution of intervals with and without automatic pair distance interval estimation.

Image before_and_after_pair_estimation
Figure 37.21: To the left: mapping with a narrower distance interval estimated by the workbench. To the right: mapping with a large paired distance interval (note the large right tail of the distribution).

Sometimes the automatic estimation of the distance between the pairs may return a warning "Few reads mapped as pairs so pair distance might not be accurate". This message indicates that the paired distance was chosen to spans all uniquely mapped reads. If in doubt, you may want to disable the option to automatically estimate paired distances and instead manually specify minimum and maximum distances between pairs on the input sequence list.

If the automatic detection of paired distances is not checked, the mapper will use the information about minimum and maximum distance recorded on the input sequence lists.

When a paired distance interval is set, the following approach is used for determining the placement of read pairs:

Non-specific matches

At the bottom of the dialog, you can specify how Non-specific matches should be treated. The concept of Non-specific matches refers to a situation where a read aligns at more than one position with an equally good score. In this case you have two options:

Note that a read is only considered non-specific when the read matches equally well at several alignment positions. If there are e.g. two possible alignment positions and one of them is a perfect match and the other involves a mismatch, the read is placed at the position with the perfect match and it is not marked as a non-specific match.

For paired data, reads are only considered non-specific matches if the entire pair could be mapped elsewhere with equal scores for both reads, or if the pair is broken in which case a read can be categorized as non-specific in the same way as single reads (see Mapping paired reads).

When looking at the mapping, the default color for non-specific matches is yellow.

Gap placement

In the case of insertions or deletions in homopolymeric or repetitive regions, the precise placement of the insertion or deletion cannot be determined from the data. An example is shown in figure 37.22.

Image gap_placement_65
Figure 37.22: Three A's in the reference (top) have been replaced by two A's in the reads (shown in red). The gap is placed towards the 5' end, but could have been placed towards the 3' end with an equally good mapping score for the read.

In this example, three A's in the reference (top) have been replaced by two A's in the reads (shown in red). The gap is placed towards the 5' end (left side), but could have been placed towards the 3' end with an equally good mapping score for the read as shown in figure 37.23.

Image gap_placement_60
Figure 37.23: Three A's in the reference (top) have been replaced by two A's in the reads (shown in red). The gap is placed towards the 3' end, but could have been placed towards the 5' end with an equally good mapping score for the read.

Since either way of placing the gap is arbitrary, the goal of the mapper is to place the gaps consistently at the same side for all reads.

Bisulfite read mapping result handling

()

Click Next lets you choose how the output of the mapping should be reported. There are two independent output options available that can be (de-)activated in both cases:

The main output is a reads track. A reads track is very "lean" (i.e. with respect to memory requirements) since it only contains the reads themselves. Additional information about the reference, consensus sequence or annotations can be added and viewed alongside in the context of a Track List later (by adding, for example, a reference and/or annotation track, respectively). This kind of output is useful when working with tracks in general and especially for resequencing purposes this is recommended.

Note that the tool will output an empty read mapping and report when nothing mapped, and empty unmapped reads if everything mapped.

Image bisulfite-mapping
Figure 37.24: A typical directional shotgun BS-seq mapping, together with the base-level methylation calling feature track on top.

Figure 37.24 illustrates the view of a typical directional shotgun BS-seq mapping. As with any read mapping view, the color of the reads follows the usual CLC convention, that is green/red for forward/reverse reads, and dark/pale blue for paired reads. Independent of this orientation property, each read or read pair has an 'invisible' property indicating if it came from the original top (OT), or original bottom (OB) strand. However, if the BS-sequencing protocol is truly 100%-directional, the orientation in the mapping and the OT/OB origin of reads/read pairs will be concordant.

In this figure, blocks of reads from the original top strand are marked with squiggly brackets on the left, while the rest are from the original bottom strand. In a mapping, they can be distinguished by a pattern of highlighted mismatches to reference. OT reads will have mismatches predominantly on 'T', due to C->T conversion; OB reads will have a pattern of mismatches on 'A' symbols, corresponding to G->A conversion as can be seen on a reverse-complementary strand. When methyl-C occurs in a sample, there will be a match in the reads, instead of an expected mismatch.

In this figure, there were two positions where such events occurred, both on the original bottom strand, and both supported by a single read only. 'G' symbols in those reads are shown in red boxes. The reverse direction of an arrowhead on a base-level methylation track also reflects the OB-position of a methylation event.

Note also that it appears that there may be a G/T heterozygous SNP (C/A in the OB strand) in the second position. While such occurrences may lead to underestimation of true methylation levels of cytosine alleles in heterozygous SNPs, our current tool does not attempt to compensate for such eventualities.

To understand and interpret BS-sequencing and mapping better, it may be helpful to examine the position (marked with a red asterisk) in between the two detected methylation events. There appears to be an additional A/C heterozygous SNP with C's in reads from OT strand fully converted to T's, i.e., showing no evidence of methylation at that heterozygous position.