Clustering of features and samples
Hierarchical clustering clusters taxa by the similarity of their taxonomic profiles over the set of samples, and samples by the similarity of taxonomic composition over the set of features (taxa). Each clustering has a tree structure that is generated as follows:
- Letting each taxon or sample be a cluster.
- Calculating pairwise distances between all clusters.
- Joining the two closest clusters into one new cluster.
- Repeating the previous two steps until there is only one cluster left (which contains all the taxa or samples).
In the resulting tree, the length of each branch reflects the distance between the clusters it joins.
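The steps above can be sketched in a few lines of Python. This is only an illustration of the general procedure, not the implementation used by the tool: the Euclidean distance and the average linkage are assumptions chosen for the example, and the function names (`euclidean`, `average_linkage`, `hierarchical_clustering`) are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two profiles (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters (assumed linkage)."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(profiles):
    """Return the merges as (cluster_a, cluster_b, distance), in the order they occur."""
    # Let each taxon or sample be its own cluster (clusters hold row indices).
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Calculate pairwise distances between all current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Join the two closest clusters into one new cluster.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Four made-up profiles: two close pairs that lie far from each other.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = hierarchical_clustering(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

The distance recorded for each merge is what a tree drawing would use as the branch length: the two close pairs are joined first at small distances, and the final merge of the two pairs happens at a much larger one.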
To create a heat map:
Toolbox | Microbial Genomics Module | Metagenomics | Abundance Analysis | Create Heat Map for Abundance Table
Select an abundance table with two or more samples as input (i.e., an OTU table, a merged abundance table, or a functional profiling table) and click Next.
Specify a distance measure and a cluster linkage (figure 7.13). The distance measure is used to specify how distances between two taxa or samples should be calculated. The cluster linkage specifies how the distance between two clusters, each consisting of a number of taxa or samples, should be calculated. Learn more about how distances and clusters are calculated at http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Clustering_features_samples.html.
Figure 7.13: Select an abundance table.
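To make the difference between the two settings concrete, the sketch below separates them: a distance measure compares two individual profiles, while a linkage turns the pairwise distances between the members of two clusters into a single cluster-to-cluster distance. The measures (Euclidean, Manhattan) and linkages (single, complete, average) shown here are common choices used for illustration. This section does not list the options offered in the dialog, so treat the names as assumptions; `cluster_distance` is a hypothetical helper.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_distance(c1, c2, distance, linkage):
    """Distance between two clusters, each given as a list of profiles."""
    pairwise = [distance(p, q) for p in c1 for q in c2]
    if linkage == "single":    # closest pair of members
        return min(pairwise)
    if linkage == "complete":  # most distant pair of members
        return max(pairwise)
    if linkage == "average":   # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


# Two made-up clusters with two profiles each.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [3.0, 6.0]]

for name in ("single", "complete", "average"):
    print(name, cluster_distance(cluster_a, cluster_b, manhattan, name))
```

With the same data and the same distance measure, the three linkages give different cluster distances, which is why the choice of linkage can change the shape of the resulting tree.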
After having selected the distance measure, set up the feature filtering options (figure 7.14).
Figure 7.14: Set filtering options.
Genomes usually contain too many features to allow for a meaningful visualization. Clustering hundreds of thousands of features is also very time-consuming. We therefore recommend reducing the number of features before clustering, using one of the available filter options:
- No filtering: Keeps all features.
- Fixed number of features:
  - Fixed number of features: The given number of features with the highest coefficient of variation (the ratio of the standard deviation to the mean) are kept.
  - Minimum counts in at least one sample: Only features with more than this number of counts in at least one sample will be taken into account. Notice that the counts are raw, un-normalized values.
- Abundance table: Specify a subset of an abundance table if you only want to display the heat map for that particular subset. Note that creating the heat map directly from the subset abundance table cannot ensure proper normalization of the data. It is therefore recommended to use the original abundance table as input and filter using this option.
- Specify features: Keeps a set of features, as specified by plain text, i.e., a list of feature names. Any white-space characters, as well as "," and ";" are accepted as separators.
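The filter options described above can be illustrated with a small sketch. It is not the tool's implementation: the table layout (a dictionary mapping feature names to raw counts per sample), the function names, and the use of the sample standard deviation in the coefficient of variation are all assumptions made for the example.

```python
import re
import statistics


def coefficient_of_variation(counts):
    """Standard deviation divided by the mean (sample standard deviation assumed)."""
    mean = statistics.mean(counts)
    if mean == 0:
        return 0.0  # avoid division by zero for all-zero features
    return statistics.stdev(counts) / mean


def keep_top_by_cv(table, n):
    """Names of the n features with the highest coefficient of variation."""
    ranked = sorted(
        table,
        key=lambda name: coefficient_of_variation(table[name]),
        reverse=True,
    )
    return ranked[:n]


def keep_min_count(table, threshold):
    """Names of features with more than `threshold` raw counts in at least one sample."""
    return [name for name, counts in table.items() if max(counts) > threshold]


def parse_feature_names(text):
    """Split a plain-text list of names on white space, ',' and ';'."""
    return [token for token in re.split(r"[\s,;]+", text) if token]


# Made-up abundance table: feature name -> raw counts in three samples.
table = {
    "taxon_A": [10, 10, 10],  # no variation across samples
    "taxon_B": [0, 50, 100],  # highly variable
    "taxon_C": [1, 2, 3],     # variable, but low counts
}

print(keep_top_by_cv(table, 2))
print(keep_min_count(table, 5))
print(parse_feature_names("taxon_A, taxon_B;taxon_C\ntaxon_D"))
```

Note that the two numeric filters favor different features: a constant but abundant feature such as `taxon_A` passes the minimum count filter yet ranks last by coefficient of variation, while a low-count but variable feature such as `taxon_C` does the opposite.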