QIAGEN Bioinformatics Manuals

Hypergeometric Tests on Annotations

The first approach to using annotations to extract biological information is the hypergeometric annotation test. This test measures the extent to which the annotation categories of features in a smaller gene list, 'A', are over- or under-represented relative to those of the features in a larger gene list, 'B', of which 'A' is a sub-list. Gene list B is often the features of the full experiment, possibly with features that are thought to represent only noise filtered away. Gene list A is a sub-experiment of the full experiment in which most features have been filtered away and only those that seem of interest are kept. Typically, gene list A will consist of candidate differentially expressed genes. This could be the gene list obtained by carrying out a statistical analysis on the experiment and keeping only those features with FDR-corrected p-values < 0.05 and a fold change larger than 2 in absolute value. The hypergeometric test procedure implemented is similar to the unconditional GOstats test of [Falcon and Gentleman, 2007].

Tools | Microarray Analysis | Annotation Test | Hypergeometric Tests on Annotations

This will show a dialog where you can select the two experiments: the larger experiment (e.g. the original experiment including the full list of features) and a sub-experiment (see Creating sub-experiment from selection for how to create one).

Click Next. This will display the dialog shown in figure 35.50.

Figure 35.50: Parameters for performing a hypergeometric test on annotations.

At the top, you select which annotation to use for testing. You can select from all the annotations available on the experiment, but only a few of them are likely to be biologically relevant. Once you have selected an annotation, the number of features carrying this annotation is shown below.

Annotations are typically given at the gene level. Often a gene is represented by more than one feature in an experiment. If this is not taken into account it may lead to a biased result. The standard way to deal with this is to reduce the set of features considered, so that each gene is represented only once. In the next step, Remove duplicates, you can choose the basis on which the feature set will be reduced:

Using gene identifier.
Keep feature with:
- Highest IQR. The feature with the highest interquartile range (IQR) is kept.
- Highest value. The feature with the highest expression value is kept.

First you specify which annotation you want to use as gene identifier. Once you have selected this, the number of features carrying this annotation is shown below. Next you specify which feature you want to keep for each gene. This may be either the feature with the highest interquartile range or the feature with the highest expression value.
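The duplicate-removal step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Workbench's implementation: the function names and the input layout are hypothetical, the IQR is computed with the default method of `statistics.quantiles` (other quartile definitions give slightly different numbers), and "highest value" is assumed to mean the largest single value of a feature across samples.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range (Q3 - Q1) using the default quantile method."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def remove_duplicates(features, keep="iqr"):
    """Keep one feature per gene identifier.

    features: dict mapping feature id -> (gene id, list of values).
    keep: "iqr" keeps the feature with the highest IQR,
          "max" keeps the feature with the highest value (assumed: max over samples).
    Returns a dict mapping gene id -> kept feature id.
    """
    best = {}
    for feature_id, (gene_id, values) in features.items():
        score = iqr(values) if keep == "iqr" else max(values)
        if gene_id not in best or score > best[gene_id][1]:
            best[gene_id] = (feature_id, score)
    return {gene_id: feature_id for gene_id, (feature_id, _) in best.items()}

# Hypothetical data: two features (probes) for GENE_A, one for GENE_B.
features = {
    "probe_1": ("GENE_A", [9.0, 9.5, 10.0, 10.5]),  # high values, low spread
    "probe_2": ("GENE_A", [1.0, 3.0, 6.0, 9.0]),    # lower values, high spread
    "probe_3": ("GENE_B", [2.0, 2.5, 3.0, 4.0]),
}

by_iqr = remove_duplicates(features, keep="iqr")
by_max = remove_duplicates(features, keep="max")
print(by_iqr)
print(by_max)
```

In this toy data the two criteria pick different features for GENE_A, which shows why the choice of criterion can change the feature set that enters the test.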

At the bottom, you can select which values to analyze (see Selecting transformed and normalized values for analysis). Only features that have a numerical value assigned to them will be used for the analysis. That is, any feature which has a value of plus infinity, minus infinity or NaN will not be included in the feature list taken into the test. Thus, the choice of value at this step can affect the features that are taken forward into the test in two ways:

If there are features with values of plus infinity, minus infinity or NaN, those features will not be taken forward into the test. This can be a consideration when choosing transformed values, where the mathematical manipulations involved may lead to such values.
If you choose to remove duplicates, the value type you select here is the one used to determine the highest IQR or highest value, and thereby which feature is taken forward into the test.
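The first of these effects can be illustrated with a short Python sketch. It assumes, for the sake of the example, that a feature is dropped as soon as any one of its values is plus infinity, minus infinity or NaN; the function name and the data are hypothetical.

```python
import math

def drop_non_finite(values_by_feature):
    """Keep only features whose values are all finite numbers."""
    return {
        feature: values
        for feature, values in values_by_feature.items()
        if all(math.isfinite(v) for v in values)
    }

# Hypothetical transformed values. The -inf entry stands in for the
# logarithm of a zero value, the NaN entry for an undefined result.
transformed = {
    "feature_1": [2.0, 3.5, 1.2],
    "feature_2": [float("-inf"), 0.4, 1.1],
    "feature_3": [float("nan"), 2.2, 0.9],
    "feature_4": [0.0, 0.7, 5.1],
}

kept = drop_non_finite(transformed)
print(sorted(kept))
```

Only feature_1 and feature_4 remain, so the number of features entering the test can be smaller than the number shown for the chosen annotation.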

Click on Finish to launch the analysis.

The final number of features used for the test is reported in the history view of the test results.

Result of hypergeometric tests on annotations

The result of performing hypergeometric tests on annotations using GO biological process is shown in figure 35.51.

Figure 35.51: The result of testing on GO biological process.

The table shows the following information:

Category. This is the identifier for the category.
Description. This is the description belonging to the category. Both of these are simply extracted from the annotations.
Full set. The number of features in the original experiment (not the subset) with this category. Note that this is after removal of duplicates.
In subset. The number of features in the subset with this category. Note that this is after removal of duplicates.
Expected in subset. The number of features we would expect to find with this annotation category in the subset if the subset were a random draw from the full set.
Observed - expected. 'In subset' minus 'Expected in subset'.
p-value. The tail probability of the hypergeometric distribution. This is the value used for sorting the table.

Categories with small p-values are over-represented among the features in the subset relative to the full set.
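To make the relationship between the table columns concrete, the following Python sketch computes 'Expected in subset', 'Observed - expected' and a p-value for a single category. It is a minimal sketch under assumptions: the p-value is taken to be the upper tail of the hypergeometric distribution (the probability of seeing at least the observed count), which matches the statement that small p-values indicate over-representation. The function names and the numbers are hypothetical.

```python
from math import comb

def expected_in_subset(full_size, full_with_cat, subset_size):
    """Expected count if the subset were a random draw from the full set."""
    return subset_size * full_with_cat / full_size

def upper_tail_p_value(full_size, full_with_cat, subset_size, subset_with_cat):
    """P(X >= subset_with_cat) for X ~ Hypergeometric.

    full_size:       features in the full set (after duplicate removal)
    full_with_cat:   features in the full set carrying the category ('Full set')
    subset_size:     features in the subset
    subset_with_cat: features in the subset carrying the category ('In subset')
    """
    total = comb(full_size, subset_size)
    largest = min(full_with_cat, subset_size)
    tail = sum(
        comb(full_with_cat, k) * comb(full_size - full_with_cat, subset_size - k)
        for k in range(subset_with_cat, largest + 1)
    )
    return tail / total

# Hypothetical example: 20 features in the full set, 5 carry the category;
# the subset has 4 features, 2 of which carry the category.
expected = expected_in_subset(20, 5, 4)
observed_minus_expected = 2 - expected
p_value = upper_tail_p_value(20, 5, 4, 2)

print(expected, observed_minus_expected, round(p_value, 4))
```

Here one feature with the category would be expected by chance, two are observed, and the upper-tail probability is 1205/4845, roughly 0.25, so this toy category would not stand out.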

GO terms are organized in a hierarchical structure. For example, the term "GO:0033151 V(D)J recombination" from the Gene Ontology [Ashburner et al., 2000; The Gene Ontology Consortium, 2019] (https://geneontology.org/) is a descendant of "GO:0006259 DNA metabolic process".

When testing for the significance of a particular GO term, all features linked to descendant GO terms are included in the test. As a result, the number of genes reported for a GO term in the output table can be higher than the number of genes linked directly to that term.
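The effect of including descendant terms can be sketched in Python. This is a toy illustration, not the Workbench's implementation: the hierarchy is reduced to the two terms mentioned above with intermediate terms omitted, and the feature names and function names are hypothetical.

```python
def descendants(term, children):
    """Return all terms below `term` in the hierarchy (children, grandchildren, ...)."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def features_for_term(term, children, direct_annotations):
    """Features linked to `term` directly or through any descendant term."""
    features = set()
    for t in {term} | descendants(term, children):
        features |= direct_annotations.get(t, set())
    return features

# Toy hierarchy: intermediate terms between the two GO terms are left out.
children = {"GO:0006259": ["GO:0033151"]}

# Hypothetical direct annotations of features to GO terms.
direct_annotations = {
    "GO:0006259": {"feature_1"},
    "GO:0033151": {"feature_2", "feature_3"},
}

counted = features_for_term("GO:0006259", children, direct_annotations)
print(sorted(counted))
```

Only one feature is annotated directly with GO:0006259, but three are counted for it once the features of its descendant term are included.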

Due to the hierarchical structure, GO terms are not independent of one another, and the p-values provided in the table should be interpreted with caution.
