RNA-Seq normalization
Many tools in the RNA-Seq folder compare samples based on their read counts. This section gives a brief overview of the normalization these tools use to make read counts from different samples more comparable.
Since the sequencing depth might differ between samples, a per-sample library size normalization must be performed before samples can be compared. Two such normalizations are supported: TMM normalization, and Housekeeping gene normalization.
For all relevant tools included in the RNA-Seq folder, either the TMM normalization is automatically applied, or an option is provided to choose between TMM normalization and Housekeeping gene normalization.
Per-sample library size normalization produces a single number for each sample that can be used to weight the count data from that sample. The tools Differential Expression for RNA-Seq and Differential Expression in Two Groups use this number in their statistical model: for each sample, the library size normalization factor is the corresponding per-sample term described in The GLM model.
Other tools, such as PCA for RNA-Seq, Create Heat Map for RNA-Seq, and Create Expression Browser, do not have a statistical model. These tools therefore perform further transformations to generate normalized counts, such as logCPM and Z-Score normalization.
TMM normalization
The following tools automatically perform library size normalization using the TMM (trimmed mean of M values) method of [Robinson and Oshlack, 2010] (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25): PCA for RNA-Seq, Create Heat Map for RNA-Seq, and Create Expression Browser. Additionally, TMM normalization is an option in the tools Differential Expression for RNA-Seq and Differential Expression in Two Groups.
TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed. Therefore, it is important not to make subsets of the count data before doing statistical analysis or visualization, as this can lead to differences being normalized away.
Housekeeping gene normalization
Housekeeping gene normalization is available as an alternative to TMM normalization in the tools Differential Expression for RNA-Seq and Differential Expression in Two Groups.
Housekeeping genes can either be specified directly, or the most suitable subset of a short list of genes can be selected using the GeNorm algorithm of [Vandesompele et al., 2002] (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-7-research0034).
Once a set of housekeeping genes has been chosen, the normalization factor for a sample is the natural logarithm of the geometric mean of the expressions of the genes for that sample.
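As a minimal sketch of the calculation described above, the following Python snippet computes the natural logarithm of the geometric mean for one sample. The function name housekeeping_factor and the input values are hypothetical and chosen only for illustration; this is not the Workbench's own implementation. It assumes all housekeeping gene expressions are strictly positive.

```python
import math


def housekeeping_factor(expressions):
    """Return ln(geometric mean) of the housekeeping gene expressions of one sample.

    Hypothetical helper for illustration; assumes strictly positive values.
    """
    if not expressions or any(x <= 0 for x in expressions):
        raise ValueError("expected a non-empty list of positive expressions")
    # ln(geometric mean) is the arithmetic mean of the natural logarithms.
    return sum(math.log(x) for x in expressions) / len(expressions)


# Made-up expressions of two housekeeping genes in a single sample.
factor = housekeeping_factor([1, 3])
print(factor)
```

Working with the mean of the logarithms avoids multiplying many expression values together, which could overflow for long gene lists.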
We recommend the use of housekeeping genes rather than TMM when working with Targeted RNA Panels, or in situations where the TMM assumption that most genes are not differentially expressed does not hold.
It is not possible to view housekeeping gene normalized expressions within the Workbench. However, these values are relatively easy to calculate. To do so:
- Export the raw expression values. It is recommended to use the Create Expression Browser tool to create a single table of all the samples, then to export this to a table format, such as .xlsx or .csv, choosing to Export table as currently shown. For more details on export options see Export of tables.
- For each sample, find the geometric mean of the housekeeping gene expressions.
- The housekeeping gene normalized expression of a gene is 1000 * raw expression / geometric mean. The factor of 1000 is arbitrary: any number can be used, provided the normalized expressions have a size that is easy to work with.
As an example of this procedure, consider the following expressions, where HKG1 and HKG2 are housekeeping genes:
Gene | Sample1 | Sample2 | Sample3 |
HKG1 | 1 | 3 | 2 |
HKG2 | 3 | 3 | 2 |
Gene1 | 8 | 6 | 2 |
Gene2 | 4 | 4 | 4 |
Gene3 | 1 | 0 | 1 |
The geometric means of the housekeeping gene expressions are (the value for Sample1 is the square root of 3, shown rounded):
Sample1 | Sample2 | Sample3 |
1.732 | 3 | 2 |
The normalized expressions are then:
Gene | Sample1 | Sample2 | Sample3 |
Gene1 | 4619 | 2000 | 1000 |
Gene2 | 2309 | 1333 | 2000 |
Gene3 | 577 | 0 | 500 |
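The worked example above can be reproduced with a short Python script. This is only a sketch of the manual procedure using the values from the tables; the variable names and the rounding to whole numbers are choices made here for illustration.

```python
import math

# Raw expressions from the example table, one value per sample.
housekeeping = {"HKG1": [1, 3, 2], "HKG2": [3, 3, 2]}
genes = {"Gene1": [8, 6, 2], "Gene2": [4, 4, 4], "Gene3": [1, 0, 1]}
SCALE = 1000  # arbitrary scaling factor, as noted in the text
N_SAMPLES = 3


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# One geometric mean per sample, taken over the housekeeping genes.
geo_means = [
    geometric_mean([row[i] for row in housekeeping.values()])
    for i in range(N_SAMPLES)
]

# Normalized expression = SCALE * raw expression / geometric mean.
normalized = {
    gene: [round(SCALE * count / gm) for count, gm in zip(counts, geo_means)]
    for gene, counts in genes.items()
}

for gene, values in normalized.items():
    print(gene, values)
# Gene1 [4619, 2000, 1000]
# Gene2 [2309, 1333, 2000]
# Gene3 [577, 0, 500]
```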
logCPM
For the tools PCA for RNA-Seq, Create Heat Map for RNA-Seq, and Create Expression Browser, additional normalization is performed: after TMM factors are calculated for each sample, we calculate the TMM-adjusted log CPM counts (similar to the edgeR approach [Robinson et al., 2010]):
- We add a prior to the raw counts. This prior is 1.0 by default, but is scaled based on the library size as scaled_prior = prior*library_size/average_library_size.
- The library sizes are also adjusted by adding 2.0 times the prior to them (for explanation, see https://support.bioconductor.org/p/76300/).
- The logCPM is now calculated as log2(adjusted_count * 1E6 / adjusted_library_size), where adjusted_count is the raw count with the prior added and adjusted_library_size is the adjusted library size.
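The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Workbench's code: the function name log_cpm is hypothetical, the library sizes passed in are assumed to be already TMM-adjusted, and the library size adjustment is assumed to use the scaled prior (as edgeR does).

```python
import math


def log_cpm(counts, library_sizes, prior=1.0):
    """Sketch of a prior-adjusted logCPM calculation.

    counts: one list of raw counts per sample.
    library_sizes: one library size per sample (assumed TMM-adjusted).
    prior: prior count, 1.0 by default as in the text.
    """
    average_library_size = sum(library_sizes) / len(library_sizes)
    result = []
    for sample_counts, library_size in zip(counts, library_sizes):
        # Scale the prior in proportion to the sample's library size.
        scaled_prior = prior * library_size / average_library_size
        # Assumption: the scaled prior is what gets added (twice) to the library size.
        adjusted_library_size = library_size + 2.0 * scaled_prior
        result.append([
            math.log2((count + scaled_prior) * 1e6 / adjusted_library_size)
            for count in sample_counts
        ])
    return result


# Made-up counts for two samples with three genes each.
example = log_cpm([[10, 0, 90], [40, 10, 150]], [100, 200])
print(example)
```

Because of the prior, a raw count of zero still yields a finite logCPM value, and the scaling keeps the effect of the prior comparable between samples with different library sizes.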
Z-Score normalization
For the tools PCA for RNA-Seq and Create Heat Map for RNA-Seq we perform a final cross-sample normalization. For each row (gene/transcript), a Gaussian normalization (Z-Score normalization) is applied: data is shifted and scaled so that the mean is zero, and the standard deviation one.
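A row-wise Z-Score normalization can be sketched in Python as follows. The function name z_score_rows is hypothetical. Two details are assumptions made here, since the text does not specify them: the population standard deviation is used (dividing by the number of samples), and rows with no variation are set to zero.

```python
import math


def z_score_rows(matrix):
    """Shift and scale each row (gene/transcript) to mean 0 and standard deviation 1."""
    result = []
    for row in matrix:
        mean = sum(row) / len(row)
        # Assumption: population variance (divide by n, not n - 1).
        variance = sum((x - mean) ** 2 for x in row) / len(row)
        sd = math.sqrt(variance)
        if sd == 0:
            # Assumption: a constant row carries no signal and maps to zeros.
            result.append([0.0] * len(row))
        else:
            result.append([(x - mean) / sd for x in row])
    return result


# Made-up logCPM values: two genes across three samples.
z = z_score_rows([[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]])
print(z)
```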