UMAP for Single Cell
Uniform Manifold Approximation and Projection, UMAP, is a general purpose algorithm for visualizing high dimensional data in 2D or 3D [McInnes et al., 2018]. In the CLC Single Cell Analysis Module, it is one of two ways of constructing a Dimensionality Reduction Plot (), with the other being tSNE. The choice between tSNE and UMAP is purely visual - it has no effect on downstream analysis. Therefore it is recommended to use the tool that produces the visualization you prefer.
The UMAP for Single Cell tool can be found in the Toolbox here:
Toolbox | Single Cell Analysis () | Dimensionality Reduction () | UMAP for Single Cell ()
The tool takes an Expression Matrix () / (), or a Peak Count Matrix (), or both types of matrix as input. Note that when both types of matrices are provided, only cells that are in common to both matrices are used.
UMAP for Single Cell offers options to run dimensionality reduction or feature selection prior to the UMAP algorithm. For details on these options, please see Feature selection and dimensionality reduction. The following additional options are available:
- Produce 3D plot. Perform a second UMAP calculation in three dimensions. As this involves a full re-calculation of UMAP in a higher dimension, the runtime is approximately doubled when this option is selected. Note that the 3D plot has special system requirements, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=System_requirements.html.
- Distance measure. UMAP works on a k-nearest neighbor graph, and the distance measure is used to find the `nearest' neighbors. The `1-Pearson correlation' distance is less sensitive to changes in the scale of expression between cells than Euclidean distance (for example, if one cell has exactly twice the expression of another for each gene, the `1 - Pearson correlation' distance is 0 while the Euclidean distance may be very large) and may be better at finding more distant neighbors. Conversely, Euclidean distance may provide higher resolution for distinguishing similar cell types.
- Neighborhood size. The number of cells `k' used in the k-nearest neighbor graph. This determines the granularity of the visualization. Smaller values will recover more local structure, but will lose the `big picture'. Larger values may average out fine structure.
- Random seed. The algorithm contains a random component determined by the seed. This means that each value of the seed leads to a slightly different visualization.
- Minimum distance. The effective minimum distance between embedded points. Smaller values result in tighter clusters.
- Spread. The effective scale of embedded points. Smaller values result in tighter clusters.
- Epochs. The algorithm works by repeating the same steps a predefined number of times. There is, unfortunately, no good rule for determining how many iterations are appropriate. More iterations do no harm, but too few iterations may lead to clusters of cells failing to separate. Doubling the number of iterations approximately doubles the runtime.
An example output is shown in figure 16.1.
Figure 16.1: A UMAP visualization of data from [MacParland et al., 2018].
Tuning the visualization
Although reducing Spread and Minimum distance give tighter clusters, they do so in different ways. Therefore it can be useful to try changing both parameters. An example of this is given in figures 16.2-16.4.
Figure 16.2: UMAP with Minimum distance = 0.3 and Spread = 1. This is the same plot as in figure 16.1, but with clusters overlaid.
Figure 16.3: UMAP with Minimum distance = 1 and Spread = 1. The overall structure of the clusters is the same as in figure 16.2, but the points are more separated.
Figure 16.4: UMAP with Minimum distance = 0.3 and Spread = 10. Both points and clusters are more separated than in figure 16.2. Whether this is desirable is likely to depend on the application. For example, it is easier to see that the dark blue, light blue and orange clusters are different cell types, which may help with cell type annotation, but their proximity in the other figures may have indicated a shared developmental lineage, which it is not possible to see here. Note that other clusters, such as the red cluster, are now also split in two, compared to the other figures.