Calculating expression values from RNA-Seq

When the reference has been defined, click Next and you are presented with the dialog shown in figure 27.8.

Figure 27.8: Defining how expression values should be calculated.

These parameters determine the way expression values are counted. Some background information on how paired reads are handled is useful before describing the parameters.

Paired reads in RNA-Seq
The CLC Genomics Workbench supports the direct use of paired data for RNA-Seq. A combination of single reads and paired reads can also be used. There are three major advantages of using paired data:

You can read more about how paired data are imported and handled in General notes on handling paired data.

When counting the mapped reads to generate expression values, the CLC Genomics Workbench needs to decide how to handle paired reads. The standard behavior is to count fragments: if two reads map as a pair, the pair is counted as one. If the pair is broken (because the reads map outside the estimated pair distance or map in the wrong orientation), neither of the reads is counted. The reasoning is that something is not right in this case: the transcripts may not be represented correctly on the reference, or there may be errors in the data. In general, more confidence is placed in an intact pair. If a combination of paired and single reads is used, "true" single reads also count as one (single reads that come from broken pairs do not count).

In some situations it may be too strict to disregard broken pairs, for example when there is a high degree of variation compared to the reference or when the reference lacks comprehensive transcript annotations. When the Count paired reads as two option is checked, both intact and broken pairs are counted as two. For broken pairs, this means that each read is counted as one. Reads that were single reads in the input are still counted as one. Note that this approach does not correctly represent the abundance of sequenced fragments, since the two reads of a pair derive from the same fragment, whereas a fragment sequenced with single reads gives rise to only one read.
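
The two counting modes described above can be summarized in a short Python sketch. This is only an illustration of the rules as stated in this section, not the Workbench's implementation: the MappedUnit class, its fields, and the count_units function are hypothetical names, and the sketch assumes that both reads of a broken pair were mapped.

```python
from dataclasses import dataclass


@dataclass
class MappedUnit:
    """One input unit: a single read or a read pair (hypothetical structure)."""
    is_pair: bool        # True if the input was sequenced as a pair
    intact: bool = True  # for pairs: mapped within pair distance and in the right orientation


def count_units(units, count_paired_reads_as_two=False):
    """Return the number of counts contributed by the mapped units."""
    total = 0
    for unit in units:
        if not unit.is_pair:
            total += 1   # "true" single reads always count as one
        elif count_paired_reads_as_two:
            total += 2   # intact or broken pair: each read counts as one
        elif unit.intact:
            total += 1   # default: an intact pair is one fragment
        # default mode: broken pairs contribute nothing
    return total


# One intact pair, one broken pair, one true single read
units = [MappedUnit(True, True), MappedUnit(True, False), MappedUnit(False)]
default_count = count_units(units)
two_count = count_units(units, count_paired_reads_as_two=True)
print(default_count, two_count)
```

With the default behavior the broken pair is ignored, whereas with Count paired reads as two it contributes two counts, which is why the second mode no longer reflects fragment abundance.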

When looking at the mappings, reads from broken pairs have a darker color than reads that are intact pairs or originally single reads.

Expression value
The expression values are created on two levels as two separate result files: one for genes and one for transcripts (if the "Genome annotated with genes and transcripts" option is selected in figure 27.4). The content of the result files is described in RNA-Seq results.

The Expression value parameter describes how expression per gene or transcript can be defined in different ways on both levels:

Please note that all values are present in the output. The Expression value in this dialog is solely used to inform the Workbench about which expression value should be applied when using the result in downstream analysis.

For genes without annotated transcripts, the RPKM cannot be calculated, since the total length of all exons is needed. When the Calculate RPKM for genes without transcripts option is checked, the length of the gene is used in place of an "exon length". If the option is not checked, no RPKM value is reported for those genes.
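
As a rough illustration of the fallback described above, the following Python sketch uses the commonly cited RPKM definition (reads per kilobase of exon model per million mapped reads). The function names, parameters, and the exact choice of "total mapped reads" in the denominator are assumptions made for this example; this section does not state precisely how the Workbench computes these quantities.

```python
def rpkm(reads, length_bp, total_mapped_reads):
    """Commonly used RPKM definition: reads / (length in kb * mapped reads in millions)."""
    return reads / ((length_bp / 1_000) * (total_mapped_reads / 1_000_000))


def gene_rpkm(reads, total_mapped_reads, exon_length_bp=None,
              gene_length_bp=None, rpkm_for_genes_without_transcripts=False):
    """Return RPKM for a gene, or None when no length is available.

    exon_length_bp is None for genes without annotated transcripts
    (hypothetical convention for this sketch).
    """
    if exon_length_bp is not None:
        return rpkm(reads, exon_length_bp, total_mapped_reads)
    if rpkm_for_genes_without_transcripts and gene_length_bp is not None:
        # Fallback: the gene length stands in for the "exon length"
        return rpkm(reads, gene_length_bp, total_mapped_reads)
    return None  # option unchecked: no RPKM reported


with_exons = gene_rpkm(500, 10_000_000, exon_length_bp=2_000)
no_option = gene_rpkm(500, 10_000_000, gene_length_bp=4_000)
with_option = gene_rpkm(500, 10_000_000, gene_length_bp=4_000,
                        rpkm_for_genes_without_transcripts=True)
print(with_exons, no_option, with_option)
```

Because a gene is usually longer than the sum of its exons, an RPKM based on gene length will tend to be lower than one based on exon length for the same read count.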

Genes in Operons
It should be noted that genes located very close to each other, such as those in operon structures, can sometimes be assigned erroneous expression values. This is because if any part of an RNA-Seq read (even a single nucleotide) maps outside the gene region, the read is labelled 'intergenic' and is not used in the calculation of the gene's expression value. This also holds true when a read maps straight across two different genes. Due to the structure of operons, where several genes are transcribed in the same mRNA transcript and are therefore located directly alongside each other, it is likely that some RNA-Seq reads will map across the boundary of two different genes. In this case, the expression value of these genes will be underestimated, because only reads that are contained within a single gene are considered in the calculation of its expression value.
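
The containment rule described above can be illustrated with a small Python sketch. The assign_read function, the gene names, and the inclusive coordinate convention are hypothetical and serve only to show why reads spanning a gene boundary do not contribute to either gene.

```python
def assign_read(read_start, read_end, genes):
    """Return the gene that fully contains the read, else 'intergenic'.

    genes is a list of (name, start, end) tuples with inclusive coordinates
    (an assumed convention for this sketch).
    """
    for name, gene_start, gene_end in genes:
        if gene_start <= read_start and read_end <= gene_end:
            return name
    return "intergenic"


# Two adjacent genes, as in an operon
operon = [("geneA", 100, 500), ("geneB", 501, 900)]

inside = assign_read(120, 170, operon)     # fully within geneA
spanning = assign_read(480, 530, operon)   # crosses the geneA/geneB boundary
overhang = assign_read(860, 901, operon)   # one nucleotide past the end of geneB
print(inside, spanning, overhang)
```

The spanning read overlaps both genes but is counted for neither, which is the source of the underestimation in operons.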



Footnotes

... variant.27.2
Note that the CLC Genomics Workbench only calculates the expression of the transcripts already annotated on the reference.