CLC Manuals - clcsupport.com

GO enrichment analysis

This tool can be used to investigate whether candidate variants, or rather the genes they affect, share a common functional role. For example, to find out what distinguishes zebu cattle from bison and taurine cattle, first filter the variants found in zebu down to the zebu-specific ones, then run the GO enrichment test for biological process. The test may show, for instance, that more variants than expected fall in immune response genes. These genes can then be investigated further.

For this, you need a GO association file, which includes gene names and associated Gene Ontology terms. You can download that from the Gene Ontology web site for different species (http://www.geneontology.org/GO.downloads.annotations.shtml). Find Bos taurus on the list and double-click on "annotations" (see figure 26.53). Import the downloaded annotations into the CLC Genomics Workbench using "Standard Import".

Image GO_download_GOA
Figure 26.53: Download the GO Annotations from Bos taurus by double-clicking on "annotations" and import the downloaded annotations into the CLC Genomics Workbench using "Standard Import".
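To check what a downloaded association file contains before importing it, the gene-to-term mapping can be read outside the Workbench. The following is a minimal sketch, assuming the download is in the tab-separated GAF layout (comment lines start with "!", column 3 holds the gene symbol, column 5 the GO ID, and column 9 the ontology aspect). The function name and the two example genes are made up for illustration and are not part of the Workbench.

```python
import io
from collections import defaultdict

# Made-up two-gene example in GAF layout (an assumption about the downloaded file).
EXAMPLE_GAF = (
    "!gaf-version: 2.2\n"
    "UniProtKB\tP00001\tGENE_A\t\tGO:0006955\tPMID:0\tIEA\t\tP\t\t\tprotein\ttaxon:9913\t20200101\tUniProt\n"
    "UniProtKB\tP00002\tGENE_B\t\tGO:0006955\tPMID:0\tIEA\t\tP\t\t\tprotein\ttaxon:9913\t20200101\tUniProt\n"
    "UniProtKB\tP00002\tGENE_B\t\tGO:0005634\tPMID:0\tIEA\t\tC\t\t\tprotein\ttaxon:9913\t20200101\tUniProt\n"
)

def read_go_associations(handle, aspect="P"):
    """Return {gene symbol: set of GO IDs} for one ontology aspect.

    aspect: "P" = biological process, "F" = molecular function,
            "C" = cellular component.
    """
    gene_to_terms = defaultdict(set)
    for line in handle:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete annotation line
        symbol, go_id, line_aspect = fields[2], fields[4], fields[8]
        if line_aspect == aspect:
            gene_to_terms[symbol].add(go_id)
    return dict(gene_to_terms)

if __name__ == "__main__":
    associations = read_go_associations(io.StringIO(EXAMPLE_GAF), aspect="P")
    for gene, terms in sorted(associations.items()):
        print(gene, sorted(terms))
```

For a real file, pass an open file handle instead of the in-memory example.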

However, it is better to use a file with only the top-level GO terms annotated (GO slim). For some species you can get that directly or you can create your own via the QuickGO tool (http://www.ebi.ac.uk/QuickGO/GMultiTerm).

To run the analysis go to the toolbox:

Toolbox | Resequencing Analysis | Functional Consequences | GO Enrichment Analysis

When you run the GO Enrichment Analysis, you have to specify three things: the GO association file, a gene track, and the ontology (cellular component, biological process or molecular function) you would like to test for (see figure 26.54).

Image GO_enrichment_step2
Figure 26.54: The GO enrichment settings.

The analysis starts by associating each variant in the input variant track with genes in the gene track, based on overlap with the gene annotations. A variant track can be created with the CLC Genomics Workbench variant callers (Quality-based variant detection, InDels and Structural Variation, and Probabilistic variant detection).
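The overlap step can be illustrated with a small sketch. This is not the Workbench's implementation: it assumes variants and genes are given as simple tuples with 1-based inclusive coordinates, and the function name and example data are hypothetical.

```python
def genes_hit_by_variants(variants, genes):
    """Return the names of genes overlapped by at least one variant.

    variants: iterable of (chromosome, start, end)
    genes:    iterable of (chromosome, start, end, gene_name)
    Coordinates are assumed to be 1-based and inclusive.
    """
    hit = set()
    for v_chrom, v_start, v_end in variants:
        for g_chrom, g_start, g_end, name in genes:
            # Two closed intervals overlap when each starts before the other ends.
            if v_chrom == g_chrom and v_start <= g_end and g_start <= v_end:
                hit.add(name)
    return hit

if __name__ == "__main__":
    # Made-up example data.
    genes = [("chr1", 100, 500, "GENE_A"), ("chr1", 900, 1200, "GENE_B")]
    variants = [("chr1", 450, 450), ("chr1", 700, 702), ("chr2", 950, 950)]
    print(sorted(genes_hit_by_variants(variants, genes)))
```

The nested loop is quadratic and only suitable for small inputs. For whole-genome data an interval index would be the usual choice.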

Next, the Workbench tries to match gene names from the gene (annotation) track with the gene names in the GO association file. A gene (annotation) track can be created (see ). Please be aware that the same gene name definition should be used in both files.
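Because the match depends on both files using the same gene naming, it can be useful to check how many names they share before running the analysis. The sketch below is one way to do that outside the Workbench. The function name is hypothetical, and the case-insensitive check is only a hint that naming conventions differ; it makes no claim about how the Workbench compares names.

```python
def name_match_report(track_genes, association_genes):
    """Summarize how well gene names in the gene track match the association file."""
    track = set(track_genes)
    assoc = set(association_genes)
    matched = track & assoc
    unmatched = track - assoc
    # Names that would match if letter case were ignored.
    assoc_upper = {name.upper() for name in assoc}
    case_only = {name for name in unmatched if name.upper() in assoc_upper}
    fraction = len(matched) / len(track) if track else 0.0
    return {
        "matched": matched,
        "unmatched": unmatched,
        "case_only": case_only,
        "fraction_matched": fraction,
    }

if __name__ == "__main__":
    # Made-up gene names.
    report = name_match_report(["GENE_A", "Gene_B", "GENE_C"], ["GENE_A", "GENE_B"])
    print(report)
```

A low matched fraction suggests that the two files use different identifiers (for example symbols in one and database accessions in the other).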

Based on this, the Workbench finds GO terms that are over-represented among the genes hit by variants. A hypergeometric test is applied to each GO term: it asks whether the term is annotated to more genes in the given gene set than would be expected for a randomly selected set of genes of the same size.
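The hypergeometric test for a single GO term can be written down in a few lines. This is a minimal sketch of the standard one-sided test, not the Workbench's code. The parameter names are chosen here for illustration, and which genes count as the background population (N) is an assumption the user has to make; the manual does not state how the Workbench defines it.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k).

    N: number of genes in the background population
    K: number of background genes annotated with the GO term
    n: number of genes in the gene set (genes hit by variants)
    k: number of genes in the gene set annotated with the GO term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # number of ways to draw n genes from N
    # Sum the probabilities of drawing k, k+1, ..., min(K, n) annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total

if __name__ == "__main__":
    # Made-up numbers: 10 genes in total, 4 with the term, 3 genes hit, 2 of them with the term.
    print(hypergeom_enrichment_p(N=10, K=4, n=3, k=2))
```

Exact integer arithmetic via `math.comb` avoids rounding problems for small inputs; for very large gene counts a statistics library would be faster.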

The result is a table with GO terms and the calculated p-value for the candidate variants, and a new variant file annotated with GO terms and the corresponding p-values. The p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed if there were no real enrichment. A small p-value therefore means that the observed over-representation is unlikely to have arisen by chance.

Image GOEnrichmentResults
Figure 26.55: The GO enrichment results.
