Feature selection and PCA

Several tools provide two options for reducing the amount of data they use: feature selection by choosing highly variable genes (HVGs), and principal component analysis (PCA). Both options speed up calculations and reduce noise.

Highly variable genes (HVGs)

Not all genes are equally informative when clustering or visualizing cells. For example, housekeeping genes, whose expression levels are approximately constant across different cell types, are not informative for distinguishing between cell types. It is therefore often possible to get qualitatively the same results from an analysis by only using genes whose expression levels are highly variable across cells.

In order to use HVGs, data must first have been normalized by Normalize Single Cell Data. The option Use highly variable genes is not selected by default, but may be appropriate when speed is a priority, or when results using all genes appear unsatisfactory. When the option is selected, the Number of highly variable genes to use must be specified. Values in the range $1000$-$5000$ are typically sufficient to capture most of the variation in most data sets. Setting this value too low may exclude genes that are only weakly informative, such as those that have small fold changes in rare cell types.

When a tool is run, the log will contain estimates of the amount of signal and noise removed by choosing a certain number of HVGs (figure 6.1), which may help when choosing an appropriate value.

Figure 6.1: An example of information provided in a tool log. Here, using $1000$ HVGs reduced the total amount of variation in the data. However, the majority of the removed variation was estimated to be noise ($75.1\%$ of the original variation) and only a small amount of signal was lost ($16.7 - 15.8 = 0.9\%$ of the original variation). For more details on variation estimates, see Calculation of estimated biological variation.

Genes are selected to be HVGs according to the variance of their normalized values, from highest variance to lowest variance. Genes with variance $\leq 1$ are never selected, as such variance is consistent with random noise. As a result, the number of HVGs used in an analysis may be lower than the number specified.
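
The selection rule described above can be sketched in a few lines. This is an illustration only, not the tool's implementation: the function name `select_hvgs`, the cells-by-genes matrix layout, and the use of the population variance are assumptions, and the normalization performed by Normalize Single Cell Data is not reproduced here.

```python
import numpy as np

def select_hvgs(normalized, n_hvgs):
    """Return column indices of the selected genes, highest variance first.

    normalized: array of shape (cells, genes) holding already-normalized values
                (hypothetical input; the tool's normalization is not shown).
    n_hvgs:     requested number of highly variable genes.
    """
    variances = normalized.var(axis=0)
    # Rank genes from highest to lowest variance.
    order = np.argsort(variances)[::-1]
    # Genes with variance <= 1 are never selected.
    order = order[variances[order] > 1.0]
    # Fewer genes than requested may remain after the threshold.
    return order[:n_hvgs]

# Toy data: 4 cells (rows) and 4 genes (columns).
toy = np.array([
    [0.0, 5.0, 1.0, 0.0],
    [4.0, 5.0, 1.5, 0.0],
    [0.0, 5.0, 1.0, 10.0],
    [4.0, 5.0, 1.5, 10.0],
])
selected = select_hvgs(toy, n_hvgs=3)
```

In this toy example three genes are requested, but only two have variance above the threshold, so only two are returned.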

Note that using HVGs in one part of an analysis does not limit the number of genes available in downstream steps. For example, after constructing a visualization with HVGs, it is still possible to visualize the expression of all genes.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) projects data into a lower dimensional space while preserving as much variation as possible. In most circumstances it is recommended to select Use PCA, as it provides a substantial increase in speed without affecting accuracy. Exceptions might include analysis of targeted expression data, where the expression of only a few hundred genes is measured.

Not all PCA dimensions are equal: the first dimension contains more variation than any other, and each subsequent dimension contains less variation than the previous one. For this reason, it often makes little difference to results whether PCA dimensions is set to $50$ or $500$, but large differences can be observed if too few PCA dimensions are used. Values in the range $20$-$50$ are suitable for most applications. If the data have been normalized by Normalize Single Cell Data, the log will contain estimates of the amount of biological variation in the data, which can be compared to the amount of variation captured by the chosen number of PCA dimensions (figure 6.2). For details on how biological variation is estimated, see Calculation of estimated biological variation.

Figure 6.2: An example of information provided in a tool log. Here, using $20$ PCA dimensions captured $16.0\%$ of variation in the data. This is comparable with the estimated amount of biological variation in the data.
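
The diminishing contribution of later PCA dimensions can be illustrated on synthetic data. The sketch below is not taken from the tool: the helper `variance_captured` and the simulated data (5 strong latent factors plus noise) are assumptions chosen for illustration, and an exact SVD is used for simplicity.

```python
import numpy as np

def variance_captured(data, n_dims):
    """Fraction of total variance captured by the first n_dims PCA dimensions."""
    centered = data - data.mean(axis=0)
    # Squared singular values are proportional to the variance per dimension,
    # and are returned in decreasing order.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    component_variance = singular_values ** 2
    return component_variance[:n_dims].sum() / component_variance.sum()

rng = np.random.default_rng(0)
# Synthetic data: 200 cells, 100 genes, 5 strong latent factors plus noise.
signal = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100)) * 3.0
data = signal + rng.normal(size=(200, 100))

captured = {k: variance_captured(data, k) for k in (2, 5, 20, 50)}
for k, fraction in captured.items():
    print(f"{k:3d} dimensions capture {100 * fraction:.1f}% of the variation")
```

With data of this kind, moving from 2 to 5 dimensions adds far more captured variation than moving from 20 to 50, which mirrors the observation that too few dimensions can change results while additional later dimensions rarely do.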

PCA is performed using an implementation of Algorithm 971 [Li et al., 2017]. This is an extremely fast and accurate algorithm for finding the first PCA dimensions, but its accuracy decreases for higher dimensions. For this reason, it is advised to keep the number of PCA dimensions small compared to the number of expressed genes.
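
Algorithm 971 belongs to the family of randomized PCA methods. The general idea of that family can be sketched as follows. This is a generic randomized range-finder with power iterations, not the implementation used by the tools; the function name `randomized_pca` and the parameters `oversample` and `n_iter` are assumptions made for this example.

```python
import numpy as np

def randomized_pca(data, n_dims, oversample=10, n_iter=2, seed=0):
    """Approximate the leading n_dims PCA dimensions of data (cells x genes).

    Returns the approximate singular values and the component vectors.
    """
    rng = np.random.default_rng(seed)
    centered = data - data.mean(axis=0)
    # Project onto a random subspace slightly larger than requested.
    sketch_size = min(n_dims + oversample, min(centered.shape))
    projection = rng.normal(size=(centered.shape[1], sketch_size))
    q, _ = np.linalg.qr(centered @ projection)
    # Power iterations pull the subspace towards the leading dimensions.
    for _ in range(n_iter):
        q, _ = np.linalg.qr(centered.T @ q)
        q, _ = np.linalg.qr(centered @ q)
    # Exact SVD of the much smaller projected matrix.
    _, singular_values, vt = np.linalg.svd(q.T @ centered, full_matrices=False)
    return singular_values[:n_dims], vt[:n_dims]

rng = np.random.default_rng(1)
# Synthetic data: 300 cells, 200 genes, 8 strong latent factors plus noise.
data = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 200)) * 2.0
data += rng.normal(size=(300, 200))

approx, components = randomized_pca(data, n_dims=5)
exact = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)[:5]
relative_error = np.abs(approx - exact) / exact
```

In schemes of this kind, the leading dimensions are typically recovered most accurately, and the approximation tends to degrade for dimensions close to the size of the random subspace.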

When data have been normalized by Normalize Single Cell Data, it is additionally possible to Automatically select PCA dimensions. This chooses the number of dimensions, up to a maximum of $50$, that contains the same amount of variation as the estimated biological variation. An example log is shown in figure 6.3.

Figure 6.3: An example of information provided in a tool log when selecting PCA dimensions automatically. Here, using $50$ PCA dimensions captured $49.4\%$ of variation in the data, which was lower than the estimated biological (i.e. non-noise) variation in the data. At least $262$ PCA dimensions are required to capture all $68.2\%$ of the variation estimated to be biological. However, the estimates are upper bounds and in practice $50$ dimensions is likely to be sufficient.
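
The behavior of automatic selection can be approximated by the sketch below. It assumes that the rule is "take the smallest number of leading dimensions whose cumulative variation reaches the estimated biological variation, capped at 50"; the exact rule used by the tools is not documented here, and the function name `auto_select_dims` and the toy variances are hypothetical.

```python
import numpy as np

def auto_select_dims(component_variance, biological_fraction, max_dims=50):
    """Smallest number of leading dimensions whose cumulative share of the
    total variance reaches biological_fraction, capped at max_dims."""
    cumulative = np.cumsum(component_variance) / np.sum(component_variance)
    # Position of the first dimension at which the target is reached.
    reached = np.searchsorted(cumulative, biological_fraction) + 1
    return int(min(reached, max_dims, len(component_variance)))

# Toy variances per PCA dimension, in decreasing order: four strong
# dimensions followed by a long tail of weak ones.
variances = np.array([30.0, 20.0, 10.0, 5.0] + [0.35] * 100)

few_needed = auto_select_dims(variances, biological_fraction=0.55)
capped = auto_select_dims(variances, biological_fraction=0.90)
```

In the first call the target is reached after three dimensions. In the second call more than 50 dimensions would be needed, so the cap applies, which corresponds to the situation shown in figure 6.3.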

Combining HVGs and PCA

It is possible to use HVGs and PCA together. When this is done, HVGs are selected and then PCA is run only on the HVGs. Note that, because using HVGs already removes a lot of noise, the log may show that even a relatively large number of PCA dimensions is insufficient to capture all the estimated biological variation. It may be worth experimenting with increasing the number of PCA dimensions slightly to check whether this has an impact on the results.
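
The order of operations when both options are used can be sketched as follows. As before, this is an illustration under assumptions rather than the tool's code: the function name `hvg_then_pca` is hypothetical, the input is assumed to be an already-normalized cells-by-genes matrix, and an exact SVD stands in for the faster algorithm used in practice.

```python
import numpy as np

def hvg_then_pca(normalized, n_hvgs, n_dims):
    """Select HVGs by variance, then run PCA on the selected genes only."""
    # Step 1: rank genes by variance and drop those with variance <= 1.
    variances = normalized.var(axis=0)
    order = np.argsort(variances)[::-1]
    hvgs = order[variances[order] > 1.0][:n_hvgs]
    # Step 2: PCA restricted to the HVG columns.
    subset = normalized[:, hvgs]
    centered = subset - subset.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of each cell in the first n_dims PCA dimensions.
    scores = u[:, :n_dims] * s[:n_dims]
    return hvgs, scores

rng = np.random.default_rng(0)
# Synthetic data: 100 cells, 30 genes; only the first 10 genes are variable.
scales = np.array([3.0] * 10 + [0.5] * 20)
normalized = rng.normal(size=(100, 30)) * scales

hvgs, scores = hvg_then_pca(normalized, n_hvgs=20, n_dims=5)
```

Here 20 HVGs are requested but only the 10 variable genes pass the variance threshold, and PCA is then computed from those 10 genes alone.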


