Specifying reads and reference

To start the RNA-Seq analysis, go to:

        Toolbox | RNA-Seq Analysis | RNA-Seq Analysis

This opens a dialog where you select the sequencing reads. Note that you need to import the sequencing data into the Workbench before it can be used for analysis. Importing read data is described in Import Sequencing Data.

If you have several samples that you wish to analyze independently and compare afterwards, you can run the analysis in batch mode.

Click Next when the sequencing data are listed on the right-hand side of the dialog.

You are now presented with the dialog shown in figure 28.4.

Image mrna_seq_step2-genomics
Figure 28.4: Defining a reference genome for RNA-Seq.

At the top, there are three options concerning how the reference sequences are annotated.


Tightly packed genes and genes in operons

For annotated references containing genes located very close to each other (including operon structures), only reads that map completely within a gene's boundaries are counted towards the expression value for that gene. If any part of a read maps outside a given gene's boundaries, the read is considered intergenic and is not counted towards the expression value. For tightly packed genes, especially where non-coding 5' regions are not included in the gene annotation, this can be too conservative: when a gene is shorter than the read length, a read cannot fit entirely within the gene, so some granularity may be lost. That is, reads mapping to short genes might not be counted at all.
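
The counting rule described above can be illustrated with a minimal Python sketch. This is not the Workbench's implementation: the names Gene, assign_read and count_reads, the example coordinates, and the use of inclusive start and end positions are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene annotation with inclusive coordinates."""
    name: str
    start: int
    end: int


def assign_read(read_start, read_end, genes):
    """Return the name of the gene that fully contains the read.

    A read that extends past a gene boundary is treated as intergenic,
    which is represented here by None.
    """
    for gene in genes:
        if gene.start <= read_start and read_end <= gene.end:
            return gene.name
    return None


def count_reads(reads, genes):
    """Count, per gene, the reads that map completely within it."""
    counts = {gene.name: 0 for gene in genes}
    for read_start, read_end in reads:
        name = assign_read(read_start, read_end, genes)
        if name is not None:
            counts[name] += 1
    return counts


# Illustrative data: geneB (41 bp) is shorter than the 50 bp reads.
genes = [Gene("geneA", 100, 400), Gene("geneB", 410, 450)]
reads = [
    (120, 169),  # entirely inside geneA: counted
    (380, 429),  # crosses the end of geneA: intergenic
    (405, 454),  # covers all of geneB but extends past it: intergenic
]
counts = count_reads(reads, genes)
print(counts)
```

In this example the third read is 50 bp long while geneB spans only 41 bp, so no read of that length can fit within the gene and geneB receives no counts.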

If this situation arises in your data, you can do the following:

  • Use the option "One reference per transcript" in the "Select reference" wizard, and input a list of transcript sequences instead of a track. A list of sequences can be generated from a mRNA track (or a gene track for bacteria if no mRNA track is available) using the Extract Annotations tool (see Extract Annotations).

  • In cases where the input reads are paired-end, choose the option "Count paired reads as two" in the Expression level options dialog. This ensures that each read of the pair is counted towards the expression of the gene with which it overlaps. (By default, paired reads that map to different genes are not counted.)

This strategy is equivalent to the "Map to gene regions only (fast)" option that was available in Workbench versions released before February 2017.
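
The effect of the "Count paired reads as two" option mentioned above can be sketched in Python as follows. This is a simplified illustration, not the Workbench's implementation: the function count_pairs and its parameter are hypothetical, each mate is assumed to have already been assigned to a gene (or to None if intergenic), and the handling of pairs with one unassigned mate in the default mode is an assumption.

```python
from collections import Counter


def count_pairs(pair_assignments, count_paired_reads_as_two=False):
    """Count expression per gene from pre-assigned read pairs.

    pair_assignments holds one (gene_of_mate_1, gene_of_mate_2) tuple
    per read pair; None marks a mate that was not assigned to a gene.
    """
    counts = Counter()
    for gene_1, gene_2 in pair_assignments:
        if count_paired_reads_as_two:
            # Each mate contributes separately to the gene it was assigned to.
            for gene in (gene_1, gene_2):
                if gene is not None:
                    counts[gene] += 1
        elif gene_1 is not None and gene_1 == gene_2:
            # Default: the pair counts once, and only if both mates agree.
            # Assumption: a pair with one unassigned mate is also skipped.
            counts[gene_1] += 1
    return counts


pairs = [
    ("geneA", "geneA"),  # both mates in the same gene
    ("geneA", "geneB"),  # mates in different genes
    ("geneB", None),     # one mate intergenic
]
default_counts = count_pairs(pairs)
as_two_counts = count_pairs(pairs, count_paired_reads_as_two=True)
print(dict(default_counts))
print(dict(as_two_counts))
```

With the default setting only the first pair is counted, whereas counting paired reads as two lets the mates of the second and third pairs contribute to geneA and geneB individually.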

At the bottom of the dialog you can choose between these two options:

If spike-ins have been used, the quality control results are shown in the output report. When using spike-ins, make sure that the option to output a report is checked.

To learn how to import spike-in control files, see Import RNA Spike-in.