Specifying reads and reference
To start the RNA-Seq analysis, go to:
Toolbox | RNA-Seq Analysis | RNA-Seq Analysis
This opens a dialog where you select the sequencing reads. Note that you need to import the sequencing data into the Workbench before it can be used for analysis. Importing read data is described in Import Sequencing Data.
If you have several samples that you wish to analyze independently and compare afterwards, you can run the analysis in batch mode.
Click Next when the sequencing data are listed on the right-hand side of the dialog.
You are now presented with the dialog shown in figure 29.4.
Figure 29.4: Defining a reference genome for RNA-Seq.
At the top, there are three options concerning how the reference sequences are annotated.
- Genome annotated with genes and transcripts. This is the only option where splicing is taken into account. When this option is selected, both a Gene and an mRNA track should be provided in the boxes below. The mRNA annotations are used to define how the transcripts are spliced (as shown in figure 29.2). The reference sequence, gene, and mRNA tracks are provided with the Biomedical Genomics Workbench and can be downloaded using the Data Management function found in the top right corner of the Workbench (see Download and configure reference data).
When using this option, expression values, RPKM and TPM are calculated based on the lengths of the transcripts provided by the mRNA track. If a gene's transcript annotation is absent from the mRNA track, all values for that gene will be set to 0 unless the option "Calculate expression for genes without transcript" is checked in a later dialog.
- Genome annotated with genes only. This option should be used in situations where you are not interested in transcript-level expression. When this option is selected, a Gene track should be provided in the box below.
When using this option, expression values, RPKM and TPM are calculated based on the lengths of the genes provided by the Gene track.
- One reference sequence per transcript. This option is suitable for situations where the reference is a list of sequences. Each sequence in the list will be treated as a "transcript" and expression values are calculated for each sequence. This option is most often used if the reference is a product of a de novo assembly of RNA-Seq data. When this option is selected, only the reference sequence should be provided, either as a sequence track or a sequence list. Expression values, RPKM and TPM are calculated based on the lengths of sequences from the sequence track or sequence list.
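All three options above scale RPKM and TPM by a feature length (transcript, gene, or reference sequence). The following minimal Python sketch shows how that length enters the commonly used textbook definitions of RPKM and TPM. It is an illustration only: the function names, the sample counts, and the choice of total mapped reads as the normalizing quantity are assumptions, and the Workbench's exact calculation may differ in details not covered in this section.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of feature per million mapped reads.

    counts  -- dict mapping feature name to number of mapped reads
    lengths -- dict mapping feature name to feature length in bases
    Assumption: the total is the sum of reads over the listed features.
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {
        name: counts[name] * 1e9 / (lengths[name] * total)
        for name in counts
    }


def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    denom = sum(rates.values())
    if denom == 0:
        return {name: 0.0 for name in counts}
    return {name: rates[name] * 1e6 / denom for name in counts}


# Hypothetical read counts and feature lengths, for illustration only.
counts = {"geneA": 100, "geneB": 300, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

In this made-up example geneB has three times the reads of geneA but is also three times as long, so both receive the same RPKM and the same TPM. This is why the source of the lengths (mRNA track, Gene track, or sequence list) matters for the reported values.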
At the bottom of the dialog you can choose between these two options:
- Do not use spike-in controls.
- Use spike-in controls. In this case, provide a spike-in control file in the field at the bottom of the dialog. During analysis, the spike-in data is added to the references, but all traces of having used spike-ins are removed from the output tracks. The spike-in quality control results are available only in the output report, so make sure that the option to output a report is checked in the last wizard step.
To learn how to import spike-in control files, see Import RNA Spike-in.