Specifying reads and reference
To start the RNA-Seq analysis, go to:
Toolbox | Transcriptomics Analysis | RNA-Seq Analysis
This opens a dialog where you select the sequencing reads. Note that you need to import the sequencing data into the Workbench before it can be used for analysis. Importing read data is described in Import Sequencing Data.
If you have several samples that you wish to analyze independently and compare afterwards, you can run the analysis in batch mode.
Click Next when the sequencing data are listed on the right-hand side of the dialog.
You are now presented with the dialog shown in figure 28.4.
Figure 28.4: Defining a reference genome for RNA-Seq.
At the top, there are three options concerning how the reference sequences are annotated.
- Genome annotated with genes and transcripts. This is the only option where splicing is taken into account. When this option is selected, both a Gene and an mRNA track should be provided in the boxes below. The mRNA annotations are used to define how the transcripts are spliced (as shown in figure 28.1). The reference sequence, gene, and mRNA tracks are provided with the CLC Cancer Research Workbench and can be downloaded using the Data Management function found in the top right corner of the Workbench (see Download and configure reference data).
- Genome annotated with genes only. This option should be used in situations where you are not interested in transcript-level expression. When this option is selected, a Gene track should be provided in the box below.
- One reference sequence per transcript. This option is suitable for situations where the reference is a list of sequences. Each sequence in the list will be treated as a "transcript" and expression values are calculated for each sequence. This option is most often used if the reference is a product of a de novo assembly of RNA-Seq data. When this option is selected, only the reference sequence should be provided, either as a sequence track or a sequence list.
Two further options control which regions of the reference the reads are mapped to:
- Map to gene regions only (fast). This option will ignore all inter-genic regions in the reference. Since only genes are considered, this option is also significantly faster than the alternative option. The effect of restricting the mapping to genes only is that any reads coming from genes or transcripts that are not part of the annotations will either be unmapped or map to another transcript with a similar sequence (e.g. a pseudo-gene). For poorly annotated references, it is possible to improve the annotations using the Transcript Discovery plugin, which is freely available for download in the Plugin Manager (see Installing plugins).
- Also map to inter-genic regions. This option will include the inter-genic regions as well. Please note that reads that map outside genes are counted as intergenic hits only and thus do not contribute to the expression values (see footnote 28.1). If a read maps equally well to a gene and to an inter-genic region, the read will be placed in the gene.
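The placement rule for reads that map equally well to a gene and to an inter-genic region can be illustrated with a small sketch. This is not the Workbench's implementation: the `Hit` class, the `place_read` function, and the integer alignment score are hypothetical names introduced only to show the tie-breaking idea.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    """Hypothetical candidate mapping location for one read."""
    region: str  # "gene" or "intergenic"
    score: int   # assumed alignment score; higher is better


def place_read(hits: Sequence[Hit]) -> Optional[Hit]:
    """Pick the best-scoring hit; on a score tie, prefer the gene hit."""
    if not hits:
        return None  # the read stays unmapped
    # Sort key: score first, then True (gene) beats False (intergenic).
    return max(hits, key=lambda h: (h.score, h.region == "gene"))


tie = place_read([Hit("intergenic", 50), Hit("gene", 50)])
better_intergenic = place_read([Hit("intergenic", 60), Hit("gene", 50)])
print(tie.region)                # gene
print(better_intergenic.region)  # intergenic
```

In this sketch, a read is counted as an intergenic hit only when its inter-genic location scores strictly better than any gene location.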
Footnotes
- 28.1: The reads will indirectly affect the RPKM expression values, because they are counted in the total number of mapped reads, which is used to calculate RPKM (see Definition of RPKM).
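The indirect effect described in this footnote can be seen in a short worked example. The sketch below assumes the commonly used RPKM formula (reads mapped to the gene, divided by the gene's exon length in kilobases and by the total mapped reads in millions); see Definition of RPKM for the exact definition used by the Workbench. The function name and all read counts are made up for illustration.

```python
def rpkm(gene_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads (assumed formula)."""
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return gene_reads / (kilobases * millions_mapped)


# Hypothetical sample: the reads assigned to the gene do not change,
# only the total number of mapped reads does.
gene_reads = 500
exon_length_bp = 2_000
genic_reads = 8_000_000
intergenic_reads = 2_000_000

without_intergenic = rpkm(gene_reads, exon_length_bp, genic_reads)
with_intergenic = rpkm(gene_reads, exon_length_bp, genic_reads + intergenic_reads)

print(without_intergenic)  # 31.25
print(with_intergenic)     # 25.0
```

Because the intergenic reads enlarge the denominator, the same gene receives a lower RPKM when inter-genic mapping is enabled, even though its own read count is unchanged.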