Introduction
The Transcript Discovery plugin discovers transcripts in two steps. First, RNA-Seq sequencing reads are mapped to a genomic reference, allowing large gaps (for introns). Second, transcripts are inferred from the resulting read mappings. Note that the Transcript Discovery tool has also been tested to work well with read mappings from other alignment tools, including STAR, TopHat2, GSNAP and HISAT2.
The detection of novel transcripts from short-read sequencing data is only possible with low precision and sensitivity. Therefore, these tools focus on improving existing annotations for non-model eukaryotic species, updating an annotation based on RNA-Seq data, and/or generating transcript and gene tracks to serve as a common reference for differential expression analysis using the RNA-Seq Analysis tool.
Best practices
The proposed workflow for using the Transcript Discovery plugin in combination with the existing RNA-Seq tool in CLC Genomics Workbench is:
- Run the Large Gap Read Mapping tool using all your RNA-Seq reads and a genomic reference sequence.
- Run the Transcript Discovery tool on the resulting read mapping to predict transcripts and genes.
- Inspect the results and, if necessary, re-run the Transcript Discovery tool with refined settings until it produces the desired result.
- Use the Predicted gene and Predicted Transcript tracks in the existing RNA-Seq tool in the Workbench.
To run an experiment with multiple replicates and tissues, it is possible to supply several Large Gap Read Mappings at once to the tool. These are then processed as one data set. However, you should note that:
- Supplying multiple samples increases coverage, which typically leads to the detection of more low-expression genes.
- Supplying multiple samples where the transcriptome differs markedly between samples may lead to a loss of precision. For example, if Large Gap Read Mappings Track A supports transcript A, and Large Gap Read Mappings Track B supports transcript B, the algorithm may instead call transcript C - which might be a hybrid of A and B - because it does not understand that the reads it sees come from two samples.
- Running two samples sequentially in the order "Sample A" and "Sample B" will give a different set of transcripts than the one obtained when running them in the order "Sample B" and "Sample A".
- Running two samples sequentially in the order "Sample A" and "Sample B" will give a different set of transcripts than the one obtained when supplying "Sample A" and "Sample B" together.
For these reasons, we recommend running all replicates of the same condition together, and running different conditions sequentially.
For example, if you have 4 "leaf" samples and 4 "root" samples from a plant, run the tool on all 4 "leaf" samples and provide the output transcript track as input for the next invocation of the tool with the 4 "root" samples. Afterwards, remap the samples separately using RNA-Seq Analysis, and prune away any annotations that have little or no expression in all the conditions. This pruning can be necessary if, for example, the "leaf" data does not support a long transcript, so a short one is predicted, while the long transcript is unambiguously present in the "root" data. Revisiting the "leaf" data once the long transcript is known might show that the long transcript is a good fit there too. The original short transcript might then be pruned away.
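The recommendation and the leaf/root example above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a call into the Workbench: the function names (plan_runs, prune_annotations), the sample names, the expression values and the min_expression threshold are all hypothetical, and a suitable expression cut-off depends on your data.

```python
from collections import defaultdict


def plan_runs(samples):
    """Group replicates by condition and chain the conditions sequentially.

    `samples` is a list of (sample_name, condition) pairs. Each planned run
    lists the read mappings to supply together, plus the earlier condition
    (if any) whose predicted transcript track is passed on as input.
    """
    by_condition = defaultdict(list)
    for name, condition in samples:
        by_condition[condition].append(name)

    runs = []
    for index, (condition, names) in enumerate(by_condition.items()):
        previous = runs[index - 1]["condition"] if index else None
        runs.append({
            "condition": condition,
            "read_mappings": names,
            "annotations_from": previous,
        })
    return runs


def prune_annotations(expression, min_expression=1.0):
    """Keep transcripts that reach min_expression in at least one condition.

    `expression` maps transcript name -> {condition: expression value}.
    The threshold and its units are assumptions made for this sketch.
    """
    return sorted(
        transcript
        for transcript, per_condition in expression.items()
        if max(per_condition.values()) >= min_expression
    )


# Hypothetical data: 4 leaf and 4 root replicates.
samples = [(f"leaf_{i}", "leaf") for i in range(1, 5)]
samples += [(f"root_{i}", "root") for i in range(1, 5)]
runs = plan_runs(samples)

# Hypothetical expression values obtained after remapping each condition.
expression = {
    "short_transcript": {"leaf": 0.2, "root": 0.0},
    "long_transcript": {"leaf": 8.5, "root": 12.1},
}
kept = prune_annotations(expression)
```

Here the plan contains two runs: the four leaf mappings together, then the four root mappings with the leaf predictions as input. Only long_transcript survives pruning, because short_transcript stays below the assumed threshold in both conditions.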
Known limitations
The Transcript Discovery tool has the following known limitations:
- The Large Gap Read Mapping tool can only align reads when at least 10% of the read maps without splicing. This requirement means that reads spanning more than 10 exons are less likely to be mapped.
- Alternative transcript isoforms that are a strict subset of existing transcripts (i.e. they differ only in the positions/exons of their transcription start site (TSS) and transcription end site (TES), but share all intervening exons) cannot be distinguished. Only the longest transcript will be reported in these cases.
- Transcripts spanning the origin of circular chromosomes will be reported as two disconnected transcripts: one at the start of the chromosome and one at the end.
- If the predictions generated by the Transcript Discovery tool are supplied as annotations, and a new round of prediction is performed on the same input read mapping, then a small number of novel transcripts and genes will still be identified. This is because the set of known annotations can affect which events are filtered, and lead to small changes in the predicted genes and transcripts.
- When used with short read data, all tools that attempt to recover full length transcripts are likely to produce many false positives - typically at least 50% for human RNA-Seq data [Hayer et al., 2015].
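The first limitation in the list above can be made concrete with a little arithmetic. The sketch below assumes that "at least 10% of the read maps without splicing" means that the longest contiguous, unspliced segment of the read covers at least 10% of its length; this interpretation, the function name and the segment lengths are assumptions made for illustration. Under that reading, a read split into n segments always has one segment covering at least 1/n of the read, so reads spanning up to 10 exons always qualify, while reads spanning more than 10 exons may or may not.

```python
def meets_unspliced_requirement(segment_lengths, min_percent=10):
    """Return True if the longest unspliced segment covers at least
    min_percent of the read.

    `segment_lengths` holds the length in bases of each exon-spanning
    segment of one read. Integer arithmetic avoids rounding issues at
    the boundary.
    """
    read_length = sum(segment_lengths)
    return max(segment_lengths) * 100 >= min_percent * read_length


# Hypothetical 150-base reads.
three_exons = [20, 100, 30]           # longest segment is 100 of 150 bases
ten_even_exons = [15] * 10            # each segment is exactly 10% of the read
eleven_even_exons = [14] * 10 + [10]  # longest segment is 14 of 150 bases
eleven_uneven_exons = [30] + [12] * 10  # longest segment is 30 of 150 bases

results = {
    "three_exons": meets_unspliced_requirement(three_exons),
    "ten_even_exons": meets_unspliced_requirement(ten_even_exons),
    "eleven_even_exons": meets_unspliced_requirement(eleven_even_exons),
    "eleven_uneven_exons": meets_unspliced_requirement(eleven_uneven_exons),
}
```

Of these four reads, only the one split evenly over eleven exons fails the check, which matches the statement that such reads are less likely, rather than impossible, to be mapped.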