The Transcript Discovery algorithm

The Transcript discovery tool takes Large Gap Read Mappings tracks and optionally gene and transcript annotations as input, and produces gene and transcript track annotations as output. The annotations are generated by examining the read mapping and identifying likely regions of genes, their exons and splice sites, and, for each gene region, a set of transcript annotations that explain the observed exons and splice sites in this region.

Events

Alignments from a read mapping are by default filtered away if they are multi-matches, or duplicates. The definition of a duplicate match is that the start and end positions of the match and its sequence are identical to a previous match. Each non-filtered alignment in a read mapping is then converted to one or two events.

An event can be defined as information about the location of splice junctions obtained from e.g. a single read, two overlapping reads, or a previously annotated transcripts when annotation tracks are supplied during the tool set up. Events can be described as:

For short read sequencing data, events are most commonly unspliced, in which case they will have a single interval region with start and end positions. For these unspliced events, the start and end position are only upper and lower bounds on the location of a splice junction.

In case a spliced event is found, all spliced reads are realigned around the splice site. The realignment realigns 15bp of read before and after the splice junction (2 x 15 = 30bp in total) against a 60bp region of the reference. Note that the algorithm:

After realignment, there may be 0, 1 or many possible ways of placing the splice junction:

The strandedness of the transcript (whether it is forward or reverse on the reference genome) that a given event supports can be determined from the splice signatures of the event. This allows the noise to be filtered by discarding events that support a mixture of plus and minus strands. For example, two splice junctions from a single-end read may disagree on the strand, or the splice junctions of each read in a pair may disagree.

There is special handling for non-overlapping paired end reads. In these cases each read is separately turned into an event as described above and information about the strand is shared between the two events. If one event supports the plus strand, but the other does not allow the strand to be determined, then we infer that the other is also on the plus strand. If the two events support different strands, both are discarded.

Regions

The parameter "Gene merging distance" defines regions in the mappings that contain events close enough to belong to the same gene. Regions of coverage are identified by iterating through events while calculating the distance between the current and the next event. If this is less than the "Gene merging distance" value, the current region is extended. If not, a new region is started.

First round of merging events

The algorithm merges events that unequivocally support the same splice sites. Two events are strictly overlapping if

  1. the events overlap
  2. all introns and exons of the events in the overlapping region are the same
  3. the non-overlapping parts of the events do not extend across any splice site positions of any other events in the coverage region. This last requirement ensures that we do not merge an event with another event in cases where there are more events supporting different splice sites with which it could be merged.
Two events can only be merged if they are on the same strand, share exactly the same splice boundaries, and disagree with all other events about the location of exactly the same splice boundaries. This also guarantees that it is not possible to merge a spliced event with an unspliced event. The resulting merged event has a region that is the union of the two input events. Events cannot be merged if they overlap different splice boundaries on unknown strand.

Correct end- and splice-points

Many events will have end regions that are slightly off, mostly because the first few bases of the intron and the start of the following exon are identical (leading to exonic sequence being accidentally mapped to the intron), or because the events really should have been mapped with a gap but the read was too short on one side of the gap for the mapper to discover this. To correct this we identify start and end open positions whose bounds might be due to over-alignment. For example, if an exon starts at position <30, but a splice site is known at position 35, then the exon start will be corrected to <35. Correction is only applied to open positions within 8bp of a splice junction whose position is known exactly. Additionally, events with uncertain positions (available only if the option "Ignore reads with unknown splice signatures" is checked off) are corrected if their range of uncertainty overlaps a splice junction whose position is known exactly.

Regions on a minus strand are preferentially corrected by known splice sites on that strand. They are never corrected by splice sites on the plus strand. A similar statement holds for the plus strand. Regions whose strand is unknown are corrected twice: once assuming they are on a plus strand and once assuming they are on minus strand. The correction with most evidence is returned on the given strand.

Correction is careful to distinguish between the splice junction boundaries at the start and end of an intron. For example, if an exon starts at position <30, it can be corrected if an intron spans the range (35,50), but not if the intron spans the range (0,35). This is because the first case could arise due to over-alignment, but the second case cannot. Similarly, uncertain positions (only defined by a range) may only be corrected by splice junction boundaries that are at the same side of an intron as they are, or if their range of uncertainty overlaps a splice junction whose position is known exactly.

Merge/Filter cascade

The fixing of end and splice points above alters the set of events within a coverage region, and typically causes events that could not be merged previously to now be mergeable. To further reduce the number of events, we carry out a second merging of events that unequivocally support the same splice sites.

Afterwards, a series of hard filters is applied to events. The order of the filters is important, because after one filter has been applied, events may become mergeable, and the merged event may then pass a filter that the original events would fail.

Filters and merges happen in the following order:

  1. Ignore chimeric reads
  2. Ignore reads with unknown splice signatures
  3. Merge
  4. Minimum spliced reads
  5. Merge
  6. Assign to strand: Unspliced events with undefined strand are assigned to the plus or minus strand if events on that strand overlap in their exons, and do not overlap in their introns. If an event can be assigned to both plus and minus, it is kept as undefined.
  7. Merge
  8. Minimum unspliced reads. Note that since we have just performed a merge, removing these unspliced events does not result in changes that allow new merges to be made according to the merge requirements (3).

Split in genes

If gene annotations have been provided, they are used for the first time at this point in the process: Known Events (i.e., events made from previously annotated transcripts) are assigned to regions that correspond to the "known" gene to which they belong. These known genes/regions ultimately predict the genes to which the unknown events belong.

If, after this process, there are events left that have not been assigned to a gene, then they are assigned to new genes as follows:

  1. Assigning events with an undefined strand to a strand:
    • If all the events of known strand within the minimum distance between genes are on either plus or minus, then that strand is used.
    • If both plus or minus strand events lie within the minimum distance, the event is assigned to the strand with the event of most similar coverage to itself.
    • if there are no nearby event of known strand, but this event lies between the first and last events on the plus strand, and does not lie between the first and last events on the minus strand, then it is assigned to the plus strand (and vice versa).
    • Otherwise the event will be assigned to the strand supported by the most reads, or the minus strand in the event of a tie.
  2. Splitting events by strand and coverage: The events are split into two regions, one for each strand. Each of these is again split up if possible.

A second filter cascade

A series of hard filters is again applied to events. Three of these remove events based on the average coverage of the regions that contain them.

  1. Minimum spliced coverage ratio (default set at 0.05)
  2. Merge
  3. Minimum unspliced coverage ratio (default set at 0.05)
  4. Filtering of unspliced events that are low coverage (defined as <0.25 x maximum coverage of any event in their region) and that lie strictly before/after the first/last spliced event in that region. By strictly before/after, we mean that the event being filtered ends before/starts after the spliced event.
  5. Removal of unspliced events that do not overlap with the exons of any spliced event, and lie completely within the intron of one or more spliced events.

From events and regions to genes

Events are joined in a so-called 'Splice Graph' where each path through the graph combines a series of events into a possible isoform. The algorithm enumerates up to 10 000 possible isoforms from each graph, which usually exhausts all possibilities. The EM algorithm from the RNA-Seq Analysis tool is then used to determine the set of isoforms and their abundances that best explain the observed reads (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=_EM_estimation_algorithm.html).

Now that events have been associated to genes, genes (and their associated transcripts) are filtered according the following configurable criteria:

Transcripts with abundance that is not greater than 0.05 are additionally removed as likely noise. The abundances are not renormalized, to be able to distinguish between genes that have a few transcripts with high abundance, and genes with many transcripts with lower abundance and a lot of other noise transcripts. The abundance is reported as the relative confidence, see Tracks.

The Transcript Discovery tool returns a gene track that contains all "known" genes (i.e. those supplied as input) plus any "unknown" genes (i.e. those that we have predicted). Similarly the transcript track contains all the "known" input transcripts plus any "unknown" transcripts.