Since the sequencing depth might differ between samples, a per-sample library size normalization must be performed before samples can be compared. The tools in the RNA-Seq folder apply this normalization automatically.
For the RNA-Seq tools that compare samples (PCA for RNA-Seq, Create Heat Map for RNA-Seq, Differential Expression for RNA-Seq and Create Expression Browser), library size normalization is automatically performed using the TMM (trimmed mean of M values) method of [Robinson and Oshlack, 2010] (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25). Library sizes are then used as part of the per-sample normalization. TMM normalization is the normalization used in edgeR [Robinson et al., 2010].
TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed. Therefore, it is important not to make subsets of the count data before doing statistical analysis or visualization, as this can lead to differences being normalized away.
After TMM factors are calculated for each sample, we calculate the TMM-adjusted log CPM counts (similar to the edgeR approach):
- We add a prior to the raw counts. This prior is 1.0 by default, but is scaled based on the library size as
scaled_prior = prior*library_size/average_library_size.
- The library sizes are also adjusted by adding 2.0 times the prior to them (for an explanation, see https://support.bioconductor.org/p/76300/).
- The logCPM is now calculated as
log2(adjusted_count * 1E6 / adjusted_library_size).
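The steps above can be sketched in a few lines of Python. This is only an illustration, not the code used by the tools. It rests on several assumptions borrowed from the edgeR convention: the library size entering the formulas is taken to be the raw library size multiplied by the TMM factor, the scaled prior is what gets added to each count, and 2.0 times the scaled prior is what gets added to the library size. The function name and its parameters are hypothetical.

```python
import math

def tmm_adjusted_log_cpm(counts, library_sizes, tmm_factors, prior=1.0):
    """Return log2 CPM values; counts[g][s] is the raw count of gene g in sample s."""
    # Assumption: the effective library size is raw size times TMM factor.
    effective = [n * f for n, f in zip(library_sizes, tmm_factors)]
    average = sum(effective) / len(effective)

    # scaled_prior = prior * library_size / average_library_size
    scaled_priors = [prior * n / average for n in effective]

    # Assumption: 2.0 times the *scaled* prior is added to each library size.
    adjusted_sizes = [n + 2.0 * p for n, p in zip(effective, scaled_priors)]

    # log2(adjusted_count * 1E6 / adjusted_library_size)
    return [
        [math.log2((c + p) * 1e6 / n)
         for c, p, n in zip(row, scaled_priors, adjusted_sizes)]
        for row in counts
    ]

# Two samples; the second is sequenced twice as deeply as the first.
log_cpm = tmm_adjusted_log_cpm(
    counts=[[10, 20], [0, 5]],
    library_sizes=[1000, 2000],
    tmm_factors=[1.0, 1.0],
)
```

Because the prior is scaled with the library size, a gene with proportionally the same count in both samples (10 of 1000 and 20 of 2000) gets the same logCPM in both, and a count of zero still yields a finite value.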
Finally, we perform cross-sample normalization. For each row (gene/transcript), a Gaussian normalization (Z-Score normalization) is applied: the values are shifted and scaled so that the row has a mean of zero and a standard deviation of one.
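As a minimal sketch of this row-wise Z-Score step, the following Python function standardizes each row independently. The function name is hypothetical. Two choices here are assumptions rather than documented behavior: the population standard deviation is used (the text does not say whether the population or sample estimate applies), and rows with no variation are mapped to zeros to avoid division by zero.

```python
import statistics

def z_score_rows(log_cpm):
    """Standardize each row (gene/transcript) to mean 0 and standard deviation 1."""
    normalized = []
    for row in log_cpm:
        mean = statistics.fmean(row)
        # Assumption: population standard deviation (divide by n, not n - 1).
        sd = statistics.pstdev(row)
        if sd == 0:
            # Assumption: a constant row carries no signal, so return zeros.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([(value - mean) / sd for value in row])
    return normalized

# One varying row and one constant row.
z = z_score_rows([[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]])
```

After this step, values are comparable across rows in terms of deviation from each gene's own mean, but the absolute expression level of a gene is no longer visible.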