RNA-Seq normalization
Many RNA-Seq tools compare samples based on their read counts. This section provides a brief overview of the normalization used by these tools so as to make read counts from different samples more comparable to each other.
Since the sequencing depth might differ between samples, a per-sample library size normalization must be performed before samples can be compared. Two such normalizations are supported: TMM normalization, and Housekeeping gene normalization.
For all relevant tools included in the RNA-Seq folder, either the TMM normalization is automatically applied, or an option is provided to choose between TMM normalization and Housekeeping gene normalization.
Per-sample library size normalization produces a single number for each sample that can be used to weight the counts data from that sample. The tools Differential Expression for RNA-Seq and Differential Expression in Two Groups use this number in their statistical model: the per-sample library size normalization factor enters the model as described in The GLM model.
Other tools do not use a statistical model and instead generate normalized counts.
PCA for RNA-Seq, Create Sample Level Heat Map for RNA-Seq and Create Feature Level Heat Map for RNA-Seq automatically perform TMM normalization ([Robinson and Oshlack, 2010]), followed by logCPM and Z-Score normalization.
Create Expression Browser automatically performs TMM normalization, followed by logCPM.
Differential Expression for RNA-Seq and Differential Expression in Two Groups can perform TMM or housekeeping gene normalization.
TMM Normalization
For TMM normalization, a TMM factor is computed by comparing the samples against a reference sample. The reference is the sample that has the count-per-million upper quartile closest to the mean upper quartile.
TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed. Therefore, it is important not to make subsets of the count data before doing statistical analysis or visualization, as this can lead to differences being normalized away.
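The choice of reference sample described above can be sketched in a few lines of Python. This is only an illustration of the selection rule, not the Workbench implementation: the function names and the example counts are hypothetical, the quartile is computed with the standard library's "inclusive" interpolation, and details such as how genes with zero counts are handled are assumptions that may differ in practice.

```python
from statistics import mean, quantiles

def upper_quartile_cpm(counts):
    """Upper quartile (75th percentile) of one sample's counts-per-million values."""
    library_size = sum(counts)
    cpm = [c * 1e6 / library_size for c in counts]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(cpm, n=4, method="inclusive")[2]

def pick_reference_sample(samples):
    """Return the sample whose CPM upper quartile is closest to the mean upper quartile."""
    uq = {name: upper_quartile_cpm(counts) for name, counts in samples.items()}
    target = mean(list(uq.values()))
    return min(uq, key=lambda name: abs(uq[name] - target))

# Hypothetical raw counts for five genes in three samples.
samples = {
    "Sample1": [10, 20, 30, 40, 100],
    "Sample2": [12, 18, 33, 45, 92],
    "Sample3": [5, 10, 15, 20, 150],
}
reference = pick_reference_sample(samples)
print(reference)
```

Because the reference is chosen from the distribution of all genes in each sample, removing genes before normalization changes the upper quartiles and can change which sample is selected.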
Housekeeping gene normalization
Housekeeping genes can either be specified directly, or the most suitable subset of a short list of genes can be selected using the GeNorm algorithm of [Vandesompele et al., 2002] (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-7-research0034).
Once a set of housekeeping genes has been chosen, the normalization factor for a sample is the natural logarithm of the geometric mean of the expressions of the genes for that sample.
We recommend the use of housekeeping genes rather than TMM when working with Targeted RNA Panels, or in situations where the TMM assumption that most genes are not differentially expressed does not hold.
It is not possible to view housekeeping gene normalized expressions within the Workbench. However, these values are relatively easy to calculate. To do so:
- Export the raw expression values. It is recommended to use the Create Expression Browser tool to create a single table of all the samples, and then to export this to a table format, such as .xlsx or .csv, choosing to Export table as currently shown. For more details on export options see Export of tables.
- For each sample, find the geometric mean of the housekeeping gene expressions.
- The housekeeping gene normalized expression of a gene is 1000 * raw expression / geometric mean, using the geometric mean of the same sample. The factor of 1000 is arbitrary: any number can be used that gives normalized expressions of a convenient size.
As an example of this procedure, consider the following expressions, where HKG1 and HKG2 are housekeeping genes:
Gene | Sample1 | Sample2 | Sample3 |
HKG1 | 1 | 3 | 2 |
HKG2 | 3 | 3 | 2 |
Gene1 | 8 | 6 | 2 |
Gene2 | 4 | 4 | 4 |
Gene3 | 1 | 0 | 1 |
The geometric means of the housekeeping gene expressions are as follows (for Sample1, sqrt(1 * 3) is approximately 1.732):
Sample1 | Sample2 | Sample3 |
1.732 | 3 | 2 |
The normalized expressions are then:
Gene | Sample1 | Sample2 | Sample3 |
Gene1 | 4619 | 2000 | 1000 |
Gene2 | 2309 | 1333 | 2000 |
Gene3 | 577 | 0 | 500 |
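The worked example above can be reproduced with a short script, for instance after exporting the raw expressions to a table. The sketch below uses only the Python standard library. The function and variable names are illustrative, rounding to whole numbers is a presentation choice made to match the table, and the sketch assumes that no housekeeping gene has an expression of zero (which would give a geometric mean of zero).

```python
from math import log, prod

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return prod(values) ** (1.0 / len(values))

def housekeeping_geometric_means(expressions, housekeeping_genes):
    """Per-sample geometric mean of the housekeeping gene expressions."""
    n_samples = len(next(iter(expressions.values())))
    return [
        geometric_mean([expressions[gene][s] for gene in housekeeping_genes])
        for s in range(n_samples)
    ]

def housekeeping_normalize(expressions, housekeeping_genes, scale=1000):
    """Return scale * raw expression / geometric mean for every non-housekeeping gene."""
    geo_means = housekeeping_geometric_means(expressions, housekeeping_genes)
    return {
        gene: [round(scale * value / gm) for value, gm in zip(values, geo_means)]
        for gene, values in expressions.items()
        if gene not in housekeeping_genes
    }

# Raw expressions from the example table; one list entry per sample.
expressions = {
    "HKG1": [1, 3, 2],
    "HKG2": [3, 3, 2],
    "Gene1": [8, 6, 2],
    "Gene2": [4, 4, 4],
    "Gene3": [1, 0, 1],
}
housekeeping = ["HKG1", "HKG2"]

geo_means = housekeeping_geometric_means(expressions, housekeeping)
# Natural logarithm of each geometric mean, i.e. the per-sample normalization factor.
log_factors = [log(gm) for gm in geo_means]
normalized = housekeeping_normalize(expressions, housekeeping)
for gene, values in normalized.items():
    print(gene, values)
```

Running the script prints the same normalized expressions as the table above.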
logCPM
TMM-adjusted logCPM counts (similar to the edgeR approach [Robinson et al., 2010]) are calculated as follows:
- We add a prior to the raw counts. This prior is 1.0 by default, but is scaled based on the library size as
  scaled_prior = prior*library_size/average_library_size
- The library sizes are also adjusted by adding a factor of 2.0 times the prior to them (for explanation, see https://support.bioconductor.org/p/76300/).
- The logCPM is now calculated as
  log2(adjusted_count * 1E6 / adjusted_library_size)
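The logCPM steps above can be sketched as follows. This is a minimal illustration rather than the Workbench code, and it rests on two assumptions that the text does not state explicitly: the library sizes passed in are taken to be already TMM-adjusted, and the amount added to each library size is 2.0 times the scaled prior (as in edgeR). The function name and the example counts are hypothetical.

```python
from math import log2

def log_cpm(counts_by_sample, library_sizes, prior=1.0):
    """Compute logCPM per sample from raw counts and (TMM-adjusted) library sizes."""
    average_library_size = sum(library_sizes.values()) / len(library_sizes)
    result = {}
    for sample, counts in counts_by_sample.items():
        library_size = library_sizes[sample]
        # Scale the prior so that larger libraries receive a proportionally larger prior.
        scaled_prior = prior * library_size / average_library_size
        # Assumption: the library size is increased by twice the scaled prior.
        adjusted_library_size = library_size + 2.0 * scaled_prior
        result[sample] = [
            log2((count + scaled_prior) * 1e6 / adjusted_library_size)
            for count in counts
        ]
    return result

# Hypothetical counts for three genes; library sizes are supplied separately
# because they normally cover all genes, not only those shown here.
counts_by_sample = {"A": [0, 10, 500], "B": [0, 30, 1500]}
library_sizes = {"A": 1_000_000, "B": 3_000_000}
values = log_cpm(counts_by_sample, library_sizes)
```

Because of the prior, a raw count of zero gives a finite logCPM, and scaling the prior by library size means that zero counts map to nearly the same value in every sample regardless of sequencing depth.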
Z-Score normalization
Z-Score normalization performs a final cross-sample normalization. For each row (gene/transcript), a Gaussian normalization (Z-Score normalization) is applied: the values in the row are shifted and scaled so that the row has a mean of zero and a standard deviation of one.
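As an illustration, row-wise Z-Score normalization can be written in a few lines. The sketch below assumes the sample standard deviation (dividing by n - 1) and returns zeros for a row with no variation. The text does not specify either choice, so both are assumptions, and the names and values are hypothetical.

```python
from statistics import mean, stdev

def z_score_rows(matrix):
    """Shift and scale each row (gene/transcript) to mean zero and standard deviation one."""
    normalized = {}
    for feature, values in matrix.items():
        centre = mean(values)
        spread = stdev(values)  # assumption: sample standard deviation (n - 1)
        if spread == 0:
            # A constant row carries no cross-sample signal; report zeros.
            normalized[feature] = [0.0 for _ in values]
        else:
            normalized[feature] = [(v - centre) / spread for v in values]
    return normalized

# Hypothetical logCPM values for two genes across three samples.
log_cpm_values = {"GeneA": [1.0, 2.0, 3.0], "GeneB": [5.0, 5.0, 5.0]}
z = z_score_rows(log_cpm_values)
```

After this step, values are comparable across genes with very different expression levels, which is why it is applied before PCA and heat map creation.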