QIAGEN Bioinformatics Manuals

Hierarchical Clustering of Samples

A hierarchical clustering of samples is a tree representation of their relative similarity.

The tree structure is generated by

1. letting each sample be a cluster
2. calculating pairwise distances between all clusters
3. joining the two closest clusters into one new cluster
4. iterating 2-3 until there is only one cluster left (which will contain all samples).
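
The steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used by the tool: it assumes Euclidean distance between samples and average linkage between clusters, and the function names (`euclidean`, `cluster_distance`, `hierarchical_clustering`) are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    # Euclidean distance between two samples (assumed distance measure).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_distance(c1, c2, samples):
    # Average linkage (assumed): mean distance over all cross-cluster pairs.
    dists = [euclidean(samples[i], samples[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def hierarchical_clustering(samples):
    """Return the merges as a list of (cluster_a, cluster_b, distance)."""
    # Step 1: each sample starts as its own cluster (a tuple of sample indices).
    clusters = [(i,) for i in range(len(samples))]
    merges = []
    # Step 4: repeat steps 2-3 until only one cluster is left.
    while len(clusters) > 1:
        # Step 2: pairwise distances between all current clusters,
        # keeping the closest pair.
        d, a, b = min(
            ((cluster_distance(a, b, samples), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        # Step 3: join the two closest clusters into one new cluster.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Four made-up samples with two expression values each.
example_samples = [[0, 0], [0, 1], [5, 5], [5, 6.5]]
for a, b, d in hierarchical_clustering(example_samples):
    print(a, b, round(d, 3))
```

The recorded merge distances are what a tree drawing would use as branch lengths. The sketch recomputes all pairwise cluster distances in every iteration, which is simple to read but slow for many samples.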

The tree is drawn so that the distances between clusters are reflected by the lengths of the branches in the tree. Thus, samples with expression profiles that closely resemble each other have short distances between them, while those that are more different are placed further apart.

(See [Eisen et al., 1998] for a classic example of applying a hierarchical clustering algorithm in microarray analysis. That example clusters features rather than samples.)

To start the clustering:

Tools | Expression Analysis | Quality Control | Hierarchical Clustering of Samples

Select a number of samples or an experiment and click Next.

This will display a dialog as shown in figure 26.32. The hierarchical clustering algorithm requires that you specify a distance measure and a cluster linkage. The distance measure specifies how the distance between two samples is calculated. The cluster linkage specifies how the distance between two clusters, each consisting of a number of samples, is calculated.

Figure 26.32: Parameters for hierarchical clustering of samples.

There are three kinds of distance measures:

Euclidean distance. The length of the segment connecting two points. If $u=(u_1,u_2,\dots, u_n)$ and $v=(v_1,v_2,\dots, v_n)$, then the Euclidean distance between $u$ and $v$ is

$\displaystyle \vert u-v\vert = \sqrt{\sum_{i=1}^n (u_i-v_i)^2}.$
Manhattan distance. The distance between two points measured along axes at right angles. If $u=(u_1,u_2,\dots, u_n)$ and $v=(v_1,v_2,\dots, v_n)$, then the Manhattan distance between $u$ and $v$ is

$\displaystyle \vert u-v\vert = \sum_{i=1}^n \vert u_i-v_i\vert.$
1 - Pearson correlation. The Pearson correlation coefficient between two elements $x=(x_1,x_2,\dots, x_n)$ and $y=(y_1,y_2,\dots, y_n)$ is defined as

$\displaystyle r = \frac{1}{n-1}\sum_{i=1}^n \left( \frac{x_i-\overline{x}}{s_x} \right) \cdot \left( \frac{y_i-\overline{y}}{s_y} \right)$
where $\overline{x}$/$\overline{y}$ and $s_x$/$s_y$ are the average and sample standard deviation, respectively, of the values in $x$/$y$.
The Pearson correlation coefficient ranges from -1 to 1, with high absolute values indicating strong correlation, and values near 0 suggesting little to no relationship between the elements.
Using 1 - | Pearson correlation | as the distance measure ensures that highly correlated elements have a shorter distance, while elements with low correlation are farther apart.
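
As an illustration, the three distance measures can be written out directly from the formulas above. This is a sketch using only the Python standard library, and the function names are hypothetical. The Pearson-based distance follows the 1 - |r| form described in the text, which is an assumption about how the tool computes it.

```python
import math
from statistics import mean, stdev


def euclidean_distance(u, v):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    # Sum of the absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson_correlation(x, y):
    # r = 1/(n-1) * sum of products of standardized values.
    # stdev() is the sample standard deviation (n - 1 denominator);
    # a constant vector has stdev 0 and raises ZeroDivisionError here.
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)


def pearson_distance(x, y):
    # Assumption: distance is 1 - |r|, as described in the text.
    return 1 - abs(pearson_correlation(x, y))


print(euclidean_distance([0, 0], [3, 4]))
print(manhattan_distance([0, 0], [3, 4]))
print(pearson_distance([1, 2, 3], [6, 4, 2]))
```

With the 1 - |r| form, perfectly anti-correlated profiles such as `[1, 2, 3]` and `[6, 4, 2]` get a distance of approximately 0, the same as perfectly correlated ones.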

The distance between two clusters is determined using one of the following linkage types:

Single linkage. The distance between the two closest elements in the two clusters.
Average linkage. The average distance between elements in the first cluster and elements in the second cluster.
Complete linkage. The distance between the two farthest elements in the two clusters.
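
The three linkage types differ only in how the pairwise distances between elements of the two clusters are summarized. The sketch below illustrates this. The function names are hypothetical, `dist` stands for any of the distance measures above, and Manhattan distance is used in the example only because it gives round numbers.

```python
from itertools import product


def single_linkage(c1, c2, dist):
    # Distance between the two closest elements of the two clusters.
    return min(dist(a, b) for a, b in product(c1, c2))


def average_linkage(c1, c2, dist):
    # Average distance over all pairs with one element from each cluster.
    dists = [dist(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)


def complete_linkage(c1, c2, dist):
    # Distance between the two farthest elements of the two clusters.
    return max(dist(a, b) for a, b in product(c1, c2))


def manhattan(u, v):
    # Example distance measure for the demonstration below.
    return sum(abs(a - b) for a, b in zip(u, v))


# Two made-up clusters of two samples each.
cluster_1 = [(0, 0), (1, 0)]
cluster_2 = [(3, 0), (5, 0)]
print(single_linkage(cluster_1, cluster_2, manhattan))
print(average_linkage(cluster_1, cluster_2, manhattan))
print(complete_linkage(cluster_1, cluster_2, manhattan))
```

For these two clusters the pairwise distances are 3, 5, 2 and 4, so single linkage gives 2, average linkage 3.5 and complete linkage 5.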

At the bottom, you can select which values to cluster (see Selecting transformed and normalized values for analysis).

Click on Finish to launch the analysis.

Note: To be run on a server, the tool has to be included in a workflow, and the results will be displayed in a stand-alone new heat map rather than added to the input experiment table.

