Expression settings

When the reference has been defined, click Next and you are presented with the dialog shown in figure 31.39.

Figure 31.42: Defining how expression values should be calculated.

These parameters determine the way expression values are counted.

Strand setting

If a strand-specific protocol has been used for read generation, the user should choose the corresponding setting. This allows reads to be assigned to the right gene in cases where overlapping genes are located on different strands. Without a strand-specific protocol, this would not be possible (see [Parkhomchuk et al., 2009]). Note that when RNA-Seq is run with a strand setting other than 'Both', only pairs in forward-reverse orientation are used, which means that mate pairs are not supported.
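The effect of the strand setting on overlapping genes can be illustrated with a minimal sketch. This is not the Workbench's implementation: the `Gene` class, the `candidate_genes` function and the `strand_specific` flag are hypothetical names introduced for illustration, coordinates are assumed to be inclusive, and the read strand is assumed to have already been translated to the strand of the originating transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # inclusive (assumption for this sketch)
    end: int     # inclusive (assumption for this sketch)
    strand: str  # "+" or "-"


def candidate_genes(read_start, read_end, read_strand, genes, strand_specific):
    """Return the names of genes a read could be assigned to.

    Hypothetical helper: with strand_specific=False (comparable to 'Both'),
    any overlapping gene is a candidate; with strand_specific=True, only
    overlapping genes on the read's strand are candidates.
    """
    hits = []
    for gene in genes:
        overlaps = read_start <= gene.end and read_end >= gene.start
        if not overlaps:
            continue
        if strand_specific and gene.strand != read_strand:
            continue
        hits.append(gene.name)
    return hits


# Two overlapping genes on opposite strands.
genes = [Gene("geneA", 100, 500, "+"), Gene("geneB", 400, 900, "-")]

# A read in the overlap region, originating from the "+" strand.
unstranded = candidate_genes(420, 470, "+", genes, strand_specific=False)
stranded = candidate_genes(420, 470, "+", genes, strand_specific=True)
```

In this sketch the read is ambiguous between `geneA` and `geneB` without strand information, while the strand-specific case leaves only `geneA` as a candidate.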

Library type setting

Paired reads counting scheme

The CLC Genomics Workbench supports the direct use of paired data for RNA-Seq. A combination of single reads and paired reads can also be used. There are three major advantages of using paired data:

You can read more about how paired data are imported and handled in General notes on handling paired data.

When counting the mapped reads to generate expression values, the CLC Genomics Workbench needs to be told how to handle the counting of paired reads that map as an intact pair or as a broken pair.

Single mapped reads are always given a count of one.

The default behavior of the CLC Genomics Workbench is to count fragments (FPKM) rather than individual reads for intact pairs. That is, an intact pair is given a count of one ('Count paired reads as two' is not checked). Neither member of a broken pair is counted ('Ignore broken pairs' is checked). The reasoning is that when reads map as a broken pair, it is an indication that something is not right. For example, the transcripts may not be represented correctly on the reference, or there may be errors in the data. In general, more confidence can be placed in an intact pair representing transcription within the sample. If a combination of paired and single reads is input into the analysis, single reads that map are given a count of one. This is different from reads that were input into the analysis as part of a pair but whose partner did not map.

In some situations it may be too strict to disregard broken pairs. This could be the case where there is a high degree of variation in the sample compared to the reference or where the reference lacks comprehensive transcript annotations. By checking 'Count paired reads as two', mapped 'reads' (RPKM) rather than mapped 'fragments' (FPKM) are counted. The two reads in an intact pair are each counted as one mapped read (so an intact pair contributes a total count of two), and mapped members of broken pairs each get a count of one ('Ignore broken pairs' is disabled).

For specific protocols generating chimeric reads, it is desirable to count intact pairs as one ('Count paired reads as two' is not checked) and to also count the mapped members of broken pairs ('Ignore broken pairs' is not checked). When this option is chosen, it is not possible to detect gene fusions.

Note that when the default behavior is not used, the value does not correctly represent the abundance of fragments being sequenced: the two reads of a pair derive from the same fragment and can increase the value by two, whereas a fragment sequenced with single reads gives rise to only one read and increases the value by one.

Regardless of the chosen option, the value will be given in a column called "RPKM" for all subsequent analyses.
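The three counting schemes described above can be summarized in a short sketch. The function `mapped_count` and its parameter names are hypothetical and only mirror the two checkboxes in the dialog; this is an illustration of the counting rules as stated in this section, not the Workbench's own code.

```python
def mapped_count(n_intact_pairs, n_broken_mapped_reads, n_single_reads,
                 count_paired_as_two=False, ignore_broken_pairs=True):
    """Count contributed by a set of mapped reads (hypothetical helper).

    n_intact_pairs        -- number of pairs mapped as intact pairs
    n_broken_mapped_reads -- number of mapped reads that belong to broken pairs
    n_single_reads        -- number of mapped reads input as single reads
    """
    if count_paired_as_two:
        # As described above, 'Ignore broken pairs' is disabled in this mode,
        # so mapped members of broken pairs are always counted.
        ignore_broken_pairs = False

    intact = n_intact_pairs * (2 if count_paired_as_two else 1)
    broken = 0 if ignore_broken_pairs else n_broken_mapped_reads
    # Single mapped reads are always given a count of one.
    return intact + broken + n_single_reads


# 10 intact pairs, 4 mapped reads from broken pairs, 3 mapped single reads.
default_count = mapped_count(10, 4, 3)                               # fragments
reads_count = mapped_count(10, 4, 3, count_paired_as_two=True)       # reads
chimeric_count = mapped_count(10, 4, 3, ignore_broken_pairs=False)   # chimeric protocols
```

With these inputs the default scheme yields 13, counting paired reads as two yields 27, and the scheme for chimeric-read protocols yields 17.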

Expression value

Please note that reads that map outside genes are counted as intergenic hits only and thus do not contribute to the expression values. If a read maps equally well to a gene and to an intergenic region, the read will be placed in the gene.

The expression values are created on two levels as two separate result files: one for genes and one for transcripts (if the "Genome annotated with genes and transcripts" option is selected in figure 31.35). The content of the result files is described in RNA-Seq results.

The Expression value parameter describes how expression per gene or transcript can be defined in different ways on both levels:

Please note that all values are present in the output. The choice of expression value only affects how Expression Tracks are visualized in the track view. The results of downstream analyses are not affected by this choice, as the most appropriate expression value is automatically selected for the analysis being performed: for detection of differential expression this is the "Total counts" value, and for the other tools this is a normalized and transformed version of the "Total counts" as described below.

Calculate expression for genes without transcripts

For genes without annotated transcripts, the RPKM cannot be calculated since the total length of all exons is needed. When the Calculate expression for genes without transcripts option is checked, the length of the gene is used in place of an "exon length". If the option is not checked, no RPKM value is reported for those genes.

Definition of RPKM

RPKM, Reads Per Kilobase of exon model per Million mapped reads, is defined in this way [Mortazavi et al., 2008]:

$\displaystyle \mathrm{RPKM} = \frac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (KB)}}. $

For prokaryotic genes and other non-exon based regions, the calculation is performed in this way:

$\displaystyle \mathrm{RPKM} = \frac{\text{total gene reads}}{\text{mapped reads (millions)} \times \text{gene length (KB)}}. $

Total exon reads
This value can be found in the column with header Total exon reads in the expression track. This is the number of reads that have been mapped to exons (either within an exon or at the exon junction). When the reference genome is annotated with gene and transcript annotations, the mRNA track defines the exons, and the total exon reads are the reads mapped to all transcripts for that gene. When only genes are used, each gene in the gene track is considered an exon. When an un-annotated sequence list is used, each sequence is considered an exon.
Exon length
This is the number in the column with the header Exon length in the expression track, divided by 1000. It is calculated as the sum of the lengths of all exons (see definition of exon above). Each exon is included only once in this sum, even if it is present in more than one annotated transcript for the gene. Partly overlapping exons each count with their full length, even though they share the same region.
Mapped reads
The sum of all mapped reads as listed in the RNA-Seq analysis report. If paired reads were used in the mapping, mapped fragments are counted here instead of reads, unless the Count paired reads as two option was selected. For more information on how expression is calculated in this case, see Calculating expression values from RNA-Seq.
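As a worked illustration of the RPKM definition and the exon length rule above, the following minimal sketch computes both for a toy gene. The function names `total_exon_length` and `rpkm` are hypothetical, and exon coordinates are assumed to be inclusive (start, end) positions in base pairs; neither is taken from the Workbench itself.

```python
def total_exon_length(exons):
    """Sum of exon lengths in base pairs (hypothetical helper).

    exons is a list of (start, end) tuples with inclusive coordinates
    (an assumption of this sketch). An identical exon shared by several
    transcripts is included only once; partly overlapping exons each
    count with their full length.
    """
    return sum(end - start + 1 for start, end in set(exons))


def rpkm(total_exon_reads, mapped_reads, exon_length_bp):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    mapped_reads_millions = mapped_reads / 1_000_000
    exon_length_kb = exon_length_bp / 1_000
    return total_exon_reads / (mapped_reads_millions * exon_length_kb)


# Two transcripts share the exon (101, 300); a third exon (251, 550)
# partly overlaps it and still counts with its full length.
exons = [(101, 300), (101, 300), (251, 550)]
length_bp = total_exon_length(exons)          # 200 + 300 base pairs

value = rpkm(total_exon_reads=700, mapped_reads=2_000_000,
             exon_length_bp=length_bp)
```

Here the exon length is 500 bp (0.5 KB), so 700 exon reads out of 2 million mapped reads correspond to 700 / (2 × 0.5) = 700 RPKM. For a gene without exon structure, the same `rpkm` function would be called with the gene length in place of the exon length.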

Output settings

Click Next and you are presented with the dialog shown in figure 31.40. The parameter in this dialog is only enabled when paired data are used.

Figure 31.43: Fusion gene table settings.

The Minimum read count fusion gene table parameter ensures that only combinations of genes supported by at least this number of read pairs are included. The default value is 5, which means that at least 5 pairs need to connect two genes for the combination to be reported in the result (see Gene fusion reporting).
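The effect of this threshold can be sketched as a simple filter. The function `fusion_gene_table` and its input format (one tuple of two gene names per read pair connecting two genes) are hypothetical, and treating a gene combination as unordered is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def fusion_gene_table(pair_links, min_read_count=5):
    """Keep gene combinations supported by enough read pairs (hypothetical helper).

    pair_links is a list of (gene_a, gene_b) tuples, one per read pair whose
    two reads map to different genes. Combinations are treated as unordered
    (an assumption of this sketch).
    """
    counts = Counter(tuple(sorted(link)) for link in pair_links)
    return {genes: n for genes, n in counts.items() if n >= min_read_count}


# Six pairs connect G1 and G2 (in either order); three connect G3 and G4.
links = [("G1", "G2")] * 4 + [("G2", "G1")] * 2 + [("G3", "G4")] * 3
table = fusion_gene_table(links)                     # default threshold of 5
relaxed = fusion_gene_table(links, min_read_count=3)
```

With the default threshold only the G1/G2 combination is retained; lowering the threshold to 3 also retains G3/G4.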



Footnotes

... variant.31.1
Note that the CLC Genomics Workbench only calculates the expression of the transcripts already annotated on the reference.