RNA-Seq normalization

Many tools in the RNA-Seq folder compare samples based on their read counts. This section gives a brief overview of the normalization these tools use to make read counts from different samples comparable.

Because sequencing depth can differ between samples, a per-sample library size normalization must be performed before samples can be compared. Two such normalizations are supported: TMM normalization and Housekeeping gene normalization.

For all relevant tools included in the RNA-Seq folder, either the TMM normalization is automatically applied, or an option is provided to choose between TMM normalization and Housekeeping gene normalization.

Per-sample library size normalization produces a single number for each sample that can be used to weight the count data from that sample. The tools Differential Expression for RNA-Seq and Differential Expression in Two Groups use this number in their statistical model: for sample $i$, the library size normalization factor is the $\mathrm{constant_i}$ described in The GLM model.

Other tools, such as PCA for RNA-Seq, Create Heat Map for RNA-Seq, and Create Expression Browser, do not have a statistical model. These tools therefore perform further transformations to generate normalized counts, such as logCPM and Z-Score normalization.

TMM normalization

The following tools automatically perform library size normalization using the TMM (trimmed mean of M values) method of [Robinson and Oshlack, 2010]:

Additionally, TMM normalization is an option in the following tools:

For TMM normalization, a TMM factor is computed by comparing the samples against a reference sample. The reference is the sample that has the count-per-million upper quartile closest to the mean upper quartile.
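
The choice of reference sample described above can be illustrated with a short sketch. It is not the Workbench implementation: the function names, the input layout (a dictionary of raw counts per sample), and the linear-interpolation definition of the upper quartile are assumptions made for illustration only. The sketch covers only the selection of the reference, not the computation of the TMM factors themselves.

```python
def upper_quartile(values):
    """75th percentile, using linear interpolation between order statistics
    (an assumed quantile definition; other definitions exist)."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def pick_reference_sample(counts):
    """Return the sample whose CPM upper quartile is closest to the mean
    upper quartile over all samples.

    counts: dict mapping sample name -> list of raw counts, one per gene
            (hypothetical input layout).
    """
    uq = {}
    for sample, values in counts.items():
        library_size = sum(values)
        cpm = [v * 1e6 / library_size for v in values]
        uq[sample] = upper_quartile(cpm)
    mean_uq = sum(uq.values()) / len(uq)
    return min(uq, key=lambda s: abs(uq[s] - mean_uq))


# Made-up counts for three samples and four genes.
example_counts = {
    "A": [10, 20, 30, 40],
    "B": [5, 5, 10, 80],
    "C": [25, 25, 25, 25],
}
reference = pick_reference_sample(example_counts)
```

With these made-up counts, the upper quartiles are 325000, 275000 and 250000 CPM, so sample B lies closest to their mean and is chosen as the reference.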

TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed. Therefore, it is important not to make subsets of the count data before doing statistical analysis or visualization, as this can lead to differences being normalized away.

Housekeeping gene normalization

Housekeeping gene normalization is available as an alternative to TMM normalization in the tools Differential Expression for RNA-Seq and Differential Expression in Two Groups.

Housekeeping genes can either be specified directly, or the most suitable subset of a short list of genes can be selected using the GeNorm algorithm of [Vandesompele et al., 2002] (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-7-research0034).

Once a set of housekeeping genes has been chosen, the normalization factor for a sample is the geometric mean of the expressions of the genes for that sample.

We recommend the use of housekeeping genes rather than TMM when working with Targeted RNA Panels, or in situations where the TMM assumption that most genes are not differentially expressed does not hold.

It is not possible to view housekeeping gene normalized expressions within the Workbench. However, these values are relatively easy to calculate. To do so:

As an example of this procedure, consider the following expressions, where HKG1 and HKG2 are housekeeping genes:


Gene   Sample1  Sample2  Sample3
HKG1   1        3        2
HKG2   3        3        2
Gene1  8        6        2
Gene2  4        4        4
Gene3  1        0        1

The geometric mean of the housekeeping gene expressions is:


Sample1     Sample2  Sample3
$\sqrt{3}$  3        2

The normalized expressions are then:


Gene   Sample1  Sample2  Sample3
Gene1  4619     2000     1000
Gene2  2309     1333     2000
Gene3  577      0        500
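
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch rather than the Workbench's own code: the function names are hypothetical, and the scale factor of 1000 is an assumption inferred from the tabulated values (each value equals the expression divided by the geometric mean, multiplied by 1000 and rounded); the text itself does not state it.

```python
import math

# Expressions from the example table; HKG1 and HKG2 are the housekeeping genes.
expressions = {
    "HKG1": [1, 3, 2],
    "HKG2": [3, 3, 2],
    "Gene1": [8, 6, 2],
    "Gene2": [4, 4, 4],
    "Gene3": [1, 0, 1],
}
housekeeping = ["HKG1", "HKG2"]


def geometric_means(expressions, housekeeping):
    """Per-sample geometric mean of the housekeeping gene expressions."""
    n_samples = len(next(iter(expressions.values())))
    return [
        math.prod(expressions[g][i] for g in housekeeping) ** (1 / len(housekeeping))
        for i in range(n_samples)
    ]


def normalize(expressions, housekeeping, scale=1000):
    """Divide each expression by its sample's geometric mean.

    scale=1000 is an assumption inferred from the example table.
    """
    factors = geometric_means(expressions, housekeeping)
    return {
        gene: [round(scale * v / f) for v, f in zip(values, factors)]
        for gene, values in expressions.items()
        if gene not in housekeeping
    }


factors = geometric_means(expressions, housekeeping)
normalized = normalize(expressions, housekeeping)
```

Here `factors` holds the values from the geometric mean table (approximately 1.732, 3 and 2), and `normalized` holds the rows of the normalized expression table.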

logCPM

For the tools PCA for RNA-Seq, Create Heat Map for RNA-Seq, and Create Expression Browser, additional normalization is performed: after TMM factors are calculated for each sample, we calculate the TMM-adjusted logCPM counts (similar to the edgeR approach [Robinson et al., 2010]):

  1. We add a prior to the raw counts. This prior is 1.0 by default, but is scaled based on the library size as scaled_prior = prior*library_size/average_library_size.
  2. The library sizes are also adjusted by adding 2.0 times the prior to them (for explanation, see https://support.bioconductor.org/p/76300/).
  3. The logCPM is then calculated as log2(adjusted_count * 1E6 / adjusted_library_size).
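
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the Workbench implementation: the function name and input layout are hypothetical, the library size is taken to be the sum of raw counts multiplied by the sample's TMM factor, and the prior added to the library size in step 2 is taken to be the scaled prior (as in edgeR).

```python
import math


def log_cpm(counts, tmm_factors, prior=1.0):
    """TMM-adjusted logCPM for each sample.

    counts:      list of per-sample lists of raw gene counts (hypothetical layout)
    tmm_factors: one TMM factor per sample
    prior:       unscaled prior count, 1.0 by default
    """
    # Assumption: effective library size = raw library size * TMM factor.
    library_sizes = [sum(sample) * f for sample, f in zip(counts, tmm_factors)]
    average_library_size = sum(library_sizes) / len(library_sizes)

    result = []
    for sample, library_size in zip(counts, library_sizes):
        # Step 1: scale the prior by the relative library size.
        scaled_prior = prior * library_size / average_library_size
        # Step 2: adjust the library size by twice the (scaled) prior.
        adjusted_library_size = library_size + 2.0 * scaled_prior
        # Step 3: log2 of the adjusted counts per million.
        result.append([
            math.log2((c + scaled_prior) * 1e6 / adjusted_library_size)
            for c in sample
        ])
    return result


# Made-up counts: two samples, three genes, TMM factors of 1.0.
values = log_cpm([[0, 10, 90], [50, 50, 100]], [1.0, 1.0])
```

Because the prior is added before taking the logarithm, genes with a raw count of zero still receive a finite logCPM value.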

Z-Score normalization

For the tools PCA for RNA-Seq and Create Heat Map for RNA-Seq, we perform a final cross-sample normalization. For each row (gene/transcript), a Gaussian normalization (Z-Score normalization) is applied: the data is shifted and scaled so that the mean is zero and the standard deviation is one.
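
A row-wise Z-Score normalization can be sketched as below. The function name is hypothetical, and two details are assumptions not specified in the text: the sample standard deviation (n - 1 denominator) is used, and rows with zero variance are mapped to all zeros.

```python
import statistics


def z_score_rows(matrix):
    """Shift and scale each row (gene/transcript) to mean 0 and standard deviation 1.

    matrix: list of rows, each a list of values across samples
            (for example logCPM values).
    """
    normalized = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.stdev(row)  # assumption: sample standard deviation
        if sd == 0:
            # Assumption: a constant row carries no signal, so return zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(v - mean) / sd for v in row])
    return normalized


# Made-up values: the first row varies across samples, the second is constant.
z = z_score_rows([[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]])
```

After this step every non-constant row has the same scale, so genes with high and low absolute expression contribute comparably to a heat map or PCA.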