RNA-Seq normalization

Many RNA-Seq tools compare samples based on their read counts. This section gives a brief overview of the normalization these tools apply to make read counts from different samples comparable.

Since the sequencing depth might differ between samples, a per-sample library size normalization must be performed before samples can be compared. Two such normalizations are supported: TMM normalization and housekeeping gene normalization.

Each relevant tool in the RNA-Seq folder either applies TMM normalization automatically or provides an option to choose between TMM normalization and housekeeping gene normalization.

Per-sample library size normalization produces a single number for each sample that can be used to weight the counts data from that sample. The tools Differential Expression for RNA-Seq and Differential Expression in Two Groups use this number in their statistical model: for sample $i$, the library size normalization factor is the $\mathrm{constant_i}$ described in The GLM model.

Other tools do not use a statistical model and instead generate normalized counts.

PCA for RNA-Seq, Create Sample Level Heat Map for RNA-Seq and Create Feature Level Heat Map for RNA-Seq automatically perform TMM normalization ([Robinson and Oshlack, 2010]), followed by logCPM and Z-Score normalization.

Create Expression Browser automatically performs TMM normalization, followed by logCPM.

Differential Expression for RNA-Seq and Differential Expression in Two Groups can perform TMM or housekeeping gene normalization.

TMM normalization

For TMM normalization, a TMM factor is computed by comparing the samples against a reference sample. The reference is the sample that has the count-per-million upper quartile closest to the mean upper quartile.
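
The reference selection described above can be sketched in a few lines of Python. This is only an illustration, not the Workbench's implementation: the function name `pick_reference_sample`, the genes-by-samples layout of `counts`, the use of raw column sums as library sizes, and NumPy's default percentile interpolation are all assumptions made for the example.

```python
import numpy as np


def pick_reference_sample(counts):
    """Return the column index of the sample whose counts-per-million
    upper quartile is closest to the mean upper quartile of all samples.

    `counts` is assumed to be a genes x samples array of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)            # assumption: raw column sums
    cpm = counts / library_sizes * 1e6            # counts per million, per sample
    upper_quartiles = np.percentile(cpm, 75, axis=0)
    distance_to_mean = np.abs(upper_quartiles - upper_quartiles.mean())
    return int(np.argmin(distance_to_mean))


# Hypothetical toy data: 4 genes (rows) x 3 samples (columns).
counts = np.array([
    [10, 20, 15],
    [0, 5, 2],
    [30, 60, 45],
    [8, 40, 12],
])
reference = pick_reference_sample(counts)
print("reference sample index:", reference)
```

Only the choice of reference is shown here; the computation of the TMM factors themselves is described in [Robinson and Oshlack, 2010].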

TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed. Therefore, it is important not to make subsets of the count data before doing statistical analysis or visualization, as this can lead to differences being normalized away.

Housekeeping gene normalization

Housekeeping genes can either be specified directly, or the most suitable subset of a short list of genes can be selected using the GeNorm algorithm of [Vandesompele et al., 2002] (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-7-research0034).

Once a set of housekeeping genes has been chosen, the normalization factor for a sample is the natural logarithm of the geometric mean of the expressions of the genes for that sample.
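
Written as a formula, with symbols introduced here only for illustration ($H$ for the chosen set of housekeeping genes, $x_{g,i}$ for the expression of gene $g$ in sample $i$, and $f_i$ for the normalization factor of sample $i$), the sentence above reads:

$$ f_i = \ln\left[\left(\prod_{g \in H} x_{g,i}\right)^{1/|H|}\right] = \frac{1}{|H|} \sum_{g \in H} \ln x_{g,i} $$

The second form shows that the factor is simply the mean of the log expressions of the housekeeping genes, which requires every housekeeping gene to have a non-zero expression in the sample.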

We recommend the use of housekeeping genes rather than TMM when working with Targeted RNA Panels, or in situations where the TMM assumption that most genes are not differentially expressed does not hold.

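It is not possible to view housekeeping gene normalized expressions within the Workbench. However, these values are relatively easy to calculate. To do so, compute the geometric mean of the housekeeping gene expressions for each sample, then divide every expression in that sample by this geometric mean. In the example that follows, the result is additionally multiplied by 1000.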

As an example of this procedure, consider the following expressions, where HKG1 and HKG2 are housekeeping genes:


Gene    Sample1  Sample2  Sample3
HKG1    1        3        2
HKG2    3        3        2
Gene1   8        6        2
Gene2   4        4        4
Gene3   1        0        1

The geometric mean of the housekeeping gene expressions is:


Sample1      Sample2  Sample3
$\sqrt{3}$   3        2

The normalized expressions are then:


Gene    Sample1  Sample2  Sample3
Gene1   4619     2000     1000
Gene2   2309     1333     2000
Gene3   577      0        500
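
The worked example can be reproduced with a short Python sketch. The variable names are illustrative, and the scaling by 1000 and the rounding to whole numbers are inferred from the values in the table rather than taken from a documented formula.

```python
import numpy as np

# Expressions from the example: rows are genes, columns are Sample1..Sample3.
genes = ["HKG1", "HKG2", "Gene1", "Gene2", "Gene3"]
expressions = np.array([
    [1, 3, 2],   # HKG1
    [3, 3, 2],   # HKG2
    [8, 6, 2],   # Gene1
    [4, 4, 4],   # Gene2
    [1, 0, 1],   # Gene3
], dtype=float)
housekeeping_rows = [0, 1]

# Geometric mean per sample, computed as exp(mean(log(x))).
# Assumes all housekeeping expressions are greater than zero.
geometric_means = np.exp(np.log(expressions[housekeeping_rows]).mean(axis=0))

# Divide each sample by its geometric mean; the factor 1000 is inferred
# from the example table.
normalized = expressions[2:] / geometric_means * 1000
rounded = np.rint(normalized).astype(int)

for gene, row in zip(genes[2:], rounded):
    print(gene, *row)
```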

logCPM

TMM-adjusted logCPM values, similar to the edgeR approach ([Robinson et al., 2010]), are calculated as follows:

  1. We add a prior to the raw counts. This prior is 1.0 by default, but is scaled by library size: scaled_prior = prior * library_size / average_library_size.
  2. The library sizes are also adjusted by adding 2.0 times the prior to them (for an explanation, see https://support.bioconductor.org/p/76300/).
  3. The logCPM is then calculated as log2(adjusted_count * 1E6 / adjusted_library_size).
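
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Workbench's code: the function name `log_cpm` is hypothetical, `library_sizes` is assumed to already hold the TMM-adjusted library size of each sample, and step 2 is assumed to add 2.0 times the scaled prior, as edgeR does.

```python
import numpy as np


def log_cpm(counts, library_sizes, prior=1.0):
    """Compute logCPM values for a genes x samples array of raw counts.

    `library_sizes` is assumed to contain one TMM-adjusted library size
    per sample (column).
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = np.asarray(library_sizes, dtype=float)

    # Step 1: scale the prior by library size and add it to the counts.
    scaled_prior = prior * library_sizes / library_sizes.mean()
    adjusted_counts = counts + scaled_prior

    # Step 2: adjust the library sizes.
    # Assumption: the *scaled* prior is used here, following edgeR.
    adjusted_library_sizes = library_sizes + 2.0 * scaled_prior

    # Step 3: log2 counts per million.
    return np.log2(adjusted_counts * 1e6 / adjusted_library_sizes)


# Hypothetical data: sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([
    [10, 20],
    [0, 0],
    [500, 1000],
])
values = log_cpm(counts, library_sizes=[1e6, 2e6])
print(values)
```

Because the prior is scaled with library size, two samples that differ only in sequencing depth receive identical logCPM values, and zero counts map to a finite value instead of negative infinity.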

Z-Score normalization

Z-Score normalization performs a final cross-sample normalization. For each row (gene/transcript), a Gaussian normalization (Z-Score normalization) is applied: the values are shifted and scaled so that each row has a mean of zero and a standard deviation of one.
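
A minimal Python sketch of the row-wise Z-Score step is shown below. The function name `z_score_rows` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say whether the sample or population standard deviation is used.

```python
import numpy as np


def z_score_rows(values, ddof=1):
    """Shift and scale each row (gene/transcript) to mean 0 and
    standard deviation 1 across samples.

    ddof=1 (sample standard deviation) is an assumption; pass ddof=0
    for the population standard deviation.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=1, keepdims=True)
    std = values.std(axis=1, ddof=ddof, keepdims=True)
    return (values - mean) / std


# Hypothetical logCPM values: 2 genes (rows) x 3 samples (columns).
log_cpm_values = np.array([
    [1.0, 2.0, 3.0],
    [10.0, 10.0, 40.0],
])
z = z_score_rows(log_cpm_values)
print(z)
```

Rows with identical values in every sample have a standard deviation of zero and cannot be scaled this way; the sketch does not handle that case.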