Extract Reads

The Extract Reads tool extracts reads from read mappings, providing the extracted reads either in a track-based read mapping (reads track) or in a sequence list. Either all reads or a subset of reads can be extracted. Subsets can be specified based on location relative to a set of annotations and/or based on specified characteristics of the reads.

To launch Extract Reads, go to:

        Toolbox | Utility Tools | Extract Reads

Select a stand-alone read mapping or reads track as input (figure 35.5). To run the tool on multiple mappings, check the Batch checkbox.

Image extractreads_based_on_overlaps_step2
Figure 35.5: Select a read mapping as input. Here, a reads track has been selected.

The next wizard step (figure 35.6) is relevant if you wish to extract reads based on their location relative to a set of annotations or locations in an RNA-Seq statistical comparison track. When one or more tracks are provided, only reads that overlap one or more of the annotated regions will be extracted.

The reference genome for mappings and annotation tracks must be compatible to use this functionality. This means they contain the same number of chromosomes of corresponding lengths. This functionality can be used even if stand-alone read mappings were input, as long as the reference sequences of the mapping and overlap tracks are compatible.
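The compatibility rule above can be pictured with a minimal Python sketch. It is not the tool's implementation: the function name `references_compatible` is hypothetical, and it assumes each reference is represented simply as a list of chromosome lengths in a fixed order.

```python
from typing import Sequence


def references_compatible(mapping_lengths: Sequence[int],
                          track_lengths: Sequence[int]) -> bool:
    """Illustrative check (assumption: chromosomes are compared in order).

    Two references count as compatible here when they have the same number
    of chromosomes and corresponding chromosomes have the same length.
    """
    return list(mapping_lengths) == list(track_lengths)


# Same number of chromosomes, same lengths: compatible.
print(references_compatible([1000, 2500], [1000, 2500]))  # True
# Second chromosome differs in length: not compatible.
print(references_compatible([1000, 2500], [1000, 2600]))  # False
```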

Check the option "Only include reads within the intervals" if only reads that are fully within an annotated region should be extracted. That is, reads overlapping the boundaries of these regions will not be included. The effect of enabling this option is illustrated in figure 35.7.

Image extractreads_based_on_overlaps_step3
Figure 35.6: Specify overlap tracks to only extract reads mapped to particular areas of the reference genome.

Image extractreads_based_on_overlaps_output
Figure 35.7: A track list illustrating the effect of including only reads fully within annotated intervals, or including these as well as those that overlap the boundaries. Top: The read mapping used as input. Middle: The output when "Only include reads within intervals" was selected. Bottom: The output when "Only include reads within intervals" was left unchecked.
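The difference between the two behaviors can also be expressed as a short Python sketch. It is an illustration only, under stated assumptions: reads and annotated regions are modeled as half-open `(start, end)` intervals on a single chromosome, and the names `overlaps`, `is_within` and `extract_reads` are hypothetical, not part of the tool.

```python
from typing import List, Sequence, Tuple

# Assumption: half-open (start, end) coordinates on one chromosome.
Interval = Tuple[int, int]


def overlaps(read: Interval, region: Interval) -> bool:
    """True if the read shares at least one position with the region."""
    return read[0] < region[1] and region[0] < read[1]


def is_within(read: Interval, region: Interval) -> bool:
    """True if the read lies entirely inside the region."""
    return region[0] <= read[0] and read[1] <= region[1]


def extract_reads(reads: Sequence[Interval],
                  regions: Sequence[Interval],
                  only_within: bool = False) -> List[Interval]:
    """Keep reads that overlap (or, optionally, lie fully within) any region."""
    test = is_within if only_within else overlaps
    return [read for read in reads
            if any(test(read, region) for region in regions)]


regions = [(100, 200)]
reads = [(90, 120), (110, 150), (190, 210), (250, 300)]

# Option unchecked: boundary-crossing reads are kept as well.
print(extract_reads(reads, regions))                    # [(90, 120), (110, 150), (190, 210)]
# Option checked: only the read fully inside the region is kept.
print(extract_reads(reads, regions, only_within=True))  # [(110, 150)]
```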

In the next wizard step, the nature of the reads to extract can be specified. All the options are enabled by default (figure 35.8).

Image extractreads_based_on_overlaps_step1
Figure 35.8: Options to include or exclude specific types of reads.

Match specificity
  • Include specific matches: Reads that mapped best to just a single position of the reference genome.
  • Include non-specific matches: Reads that have multiple equally good alignments to the reference. These are the reads colored yellow by default in read mappings.

Alignment quality
  • Include perfectly aligned reads: Reads where the full read is perfectly aligned to the reference genome (or consensus sequence for de novo assemblies). Reads that extend beyond the end of a contig are not considered perfectly aligned because part of the read does not match the reference.
  • Include reads with less than perfect alignment: Reads with mismatches, insertions or deletions, or with unaligned ends.

Spliced status
  • Include spliced reads: Reads mapped across an intron.
  • Include non spliced reads: Reads not mapped across an intron.

Paired status
  • Include intact paired reads: Paired reads mapped within the paired distance specified.
  • Include reads from broken pairs: Paired reads where only one of the reads mapped, either because only one read in the pair matched the reference, or because the distance or relative orientation of its mate was wrong.
  • Include single reads: Reads marked as single reads (as opposed to paired reads). Reads from broken pairs are not included in this category. Reads marked as single reads after trimming paired sequence lists are included in this category.
  • Only include matching read(s) of read pairs: If only the forward or only the reverse read of a read pair matches the criteria, then only the matching read is included, as a read from a broken pair. For example, if the forward read is inside the overlap region but the reverse read is not, then only the forward read is included, as a broken pair read. When both the forward and reverse reads are inside the overlap region, the full paired read is included. Note that some tools ignore reads from broken pairs by default.
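The behavior described for "Only include matching read(s) of read pairs" can be summarized as a small decision function. This is a sketch of the enabled option only, written in Python for illustration; the function name `select_from_pair` and the string labels it returns are hypothetical and do not correspond to anything in the tool.

```python
from typing import List, Tuple


def select_from_pair(forward_matches: bool,
                     reverse_matches: bool) -> Tuple[List[str], str]:
    """Decide which reads of a pair to extract when the option is enabled.

    Returns the reads to keep and how they are reported (illustrative labels).
    """
    if forward_matches and reverse_matches:
        # Both reads meet the criteria: the full pair is kept.
        return ["forward", "reverse"], "intact pair"
    if forward_matches:
        # Only the forward read meets the criteria: kept as a broken pair.
        return ["forward"], "broken pair"
    if reverse_matches:
        # Only the reverse read meets the criteria: kept as a broken pair.
        return ["reverse"], "broken pair"
    # Neither read meets the criteria: nothing is extracted.
    return [], "not extracted"


print(select_from_pair(True, False))  # (['forward'], 'broken pair')
print(select_from_pair(True, True))   # (['forward', 'reverse'], 'intact pair')
```

Because single reads produced this way are reported as broken pairs, downstream tools that ignore broken pairs by default may skip them unless configured otherwise.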

In the last wizard step, the output type is selected. The options are to output reads tracks or sequence lists.

Reads in read mappings are colored according to their characteristics. The default color scheme is described in Coloring of mapped reads.