Interpreting the output of Train Cell Type Classifier

Train Cell Type Classifier produces the following outputs:

The report has up to 4 sections depending on whether validation data or an existing classifier were provided.

Input data cell types

The input data are the matrix and clusters from which cells are added to the classifier. They are distinct from the validation data. The training data consists of the subset of the input data that is added to the classifier, together with the data already present in the existing Cell Type Classifier, if one is used.

The first table in this section lists the cell types in the input that have the exact same name as a term in the QIAGEN Cell Ontology. The second table lists the remaining input cell types.

When both tables have entries, it is recommended to check for spelling mistakes or redundancy. For example, in figure 8.5, some cells are annotated with the misspelling "T lymhpocytes", and others are annotated as "perithelial cells", which is a synonym of the term "pericytes" from the first table. The classifier will have attempted to learn all four types separately, which will likely harm performance.

Image train_cell_types_report_input
Figure 8.5: The "Input data cell types" section of the report. In this case, section 1.2 contains cell types that are spelling mistakes and synonyms.
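
The split into two tables, and a first pass at spotting misspellings, can be sketched in Python. This is only an illustration: the term list below is a tiny hypothetical stand-in for the QIAGEN Cell Ontology, and the function names are assumptions, not part of the tool.

```python
import difflib

# Hypothetical stand-in for the ontology; the real term list is far larger.
ONTOLOGY_TERMS = ["T lymphocytes", "pericytes", "endothelial cells"]


def split_cell_types(input_types, ontology_terms):
    """Separate input cell types into exact ontology matches and the rest."""
    terms = set(ontology_terms)
    exact = sorted(t for t in input_types if t in terms)
    other = sorted(t for t in input_types if t not in terms)
    return exact, other


def suggest_terms(name, ontology_terms):
    """Return ontology terms whose spelling is close to `name`, best first."""
    return difflib.get_close_matches(name, ontology_terms, n=3, cutoff=0.6)


input_types = ["T lymphocytes", "T lymhpocytes", "pericytes", "perithelial cells"]
exact, other = split_cell_types(input_types, ONTOLOGY_TERMS)
for name in other:
    print(name, "->", suggest_terms(name, ONTOLOGY_TERMS))
```

String similarity is only useful for misspellings. A synonym such as "perithelial cells" cannot be linked to "pericytes" this way and may instead be matched to an unrelated term with a similar spelling, so synonyms need a curated synonym list or manual review.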

The tables have the following columns:

Predictions of cell types that are already in the classifier and that do not have a very small number of cells (e.g. $<20$) are likely to be more accurate than predictions of new cell types with few cells.

Validation data cell types

This section is only present when validation data is supplied to the tool.

Where possible, a performance assessment of the new classifier is made for each cell type in the validation data.

When no assessment is possible, a table lists the affected cell types and the reasons why assessment is not possible. The reasons are:

The Performance summary for validation data cell types table lists the remaining cell types in alphabetical order. Performance is measured based on the classifier's prediction of the cell type for each cell; no cells are left unlabeled. This corresponds to the "Cell type (all)" category of the Cell Clusters element produced by the Predict Cell Types tool.

Three columns are always present:

When no existing classifier is provided, the following column is shown:

When an existing classifier is provided, the following additional columns are shown:

Correct ($\%$) is calculated as the number of cells that are correctly predicted with the respective cell type out of the total number of cells that are annotated with that cell type. When multiple validation matrices are used, each cell is weighted equally, so matrices with more cells have more influence. Note that this allows an arbitrary weighting of the validation matrices by choosing subsets of cells in the desired proportions.
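
The effect of equal per-cell weighting can be illustrated with a minimal Python sketch. The function name, the data layout (one list of (annotated, predicted) pairs per validation matrix), and the cell counts are assumptions made for illustration; they are not taken from the tool.

```python
def pooled_correct_percent(matrices, cell_type):
    """Percentage of cells annotated as `cell_type` that are predicted as
    `cell_type`, pooling all cells from all matrices with equal weight.

    Returns None if no cell is annotated with `cell_type`.
    """
    total = 0
    correct = 0
    for cells in matrices:
        for annotated, predicted in cells:
            if annotated == cell_type:
                total += 1
                if predicted == cell_type:
                    correct += 1
    if total == 0:
        return None
    return 100.0 * correct / total


# Hypothetical validation data: a large matrix and a small matrix.
matrix_a = [("T cells", "T cells")] * 81 + [("T cells", "B cells")] * 9
matrix_b = [("T cells", "T cells")] * 5 + [("T cells", "B cells")] * 5

pooled = pooled_correct_percent([matrix_a, matrix_b], "T cells")
per_matrix = [pooled_correct_percent([m], "T cells") for m in (matrix_a, matrix_b)]
print(pooled, per_matrix)
```

In this example the pooled value is 86%, much closer to the 90% of the large matrix than to the 50% of the small one. An unweighted mean of the two per-matrix values would give 70%.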

Note that large apparent regressions in performance may be spurious if the number of cells in the validation data is very low. For example, if there are 5 cells, a $ 20\%$ regression indicates that only a single additional cell was predicted incorrectly.
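
The arithmetic behind this caveat can be checked with a short Python sketch. The function name and arguments are hypothetical and serve only to illustrate the calculation.

```python
def regression_points(n_cells, correct_existing, correct_new):
    """Drop in correct predictions, in percentage points, between the
    existing and the newly trained classifier for one cell type."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return 100.0 * (correct_existing - correct_new) / n_cells


# With 5 cells, a single additional misprediction is a 20-point regression.
print(regression_points(5, correct_existing=5, correct_new=4))
# With 500 cells, the same single cell barely registers.
print(regression_points(500, correct_existing=500, correct_new=499))
```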

Regressions for cell types not/in input data

These tables are only produced when both validation data and an existing classifier are provided. They list cell types in alphabetical order and contain a row for each matrix in the Regressed matrices (#) column of the Performance summary for validation data cell types table.

For each matrix, the additional $ \%$ of incorrect predictions is listed, if:

These are divided into three categories, depending on the relationship between the validation and predicted cell type. Direct relationships describe whether two cell types are more or less specific descriptions of the same type. They are found by mapping the two cell types to the QIAGEN Cell Ontology via a list of known synonyms.
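
How a direct relationship might be determined can be sketched in Python. Everything here is an assumption for illustration: the tiny ontology, the synonym map, the single-parent simplification (real ontologies often allow several parents per term), and the labels "same" and "other" are not taken from the tool.

```python
# Hypothetical toy ontology: each term maps to its direct parent (None = root).
PARENT = {
    "cells": None,
    "endothelial cells": "cells",
    "ovarian vascular surface endothelial cells": "endothelial cells",
    "pericytes": "cells",
}

# Hypothetical synonym list used to map names onto ontology terms.
SYNONYMS = {"perithelial cells": "pericytes"}


def canonical(name):
    """Map a cell type name to its ontology term via the synonym list."""
    return SYNONYMS.get(name, name)


def ancestors(term):
    """All terms that are less specific descriptions of `term`."""
    found = set()
    parent = PARENT.get(term)
    while parent is not None:
        found.add(parent)
        parent = PARENT.get(parent)
    return found


def relationship(validation_type, predicted_type):
    """Describe the predicted type relative to the validation type."""
    v, p = canonical(validation_type), canonical(predicted_type)
    if v == p:
        return "same"
    if p in ancestors(v):
        return "less specific"
    if v in ancestors(p):
        return "more specific"
    return "other"


print(relationship("endothelial cells", "ovarian vascular surface endothelial cells"))
```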

Ideally, no cell types should be listed in the Less specific and More specific categories for the Regressions for cell types in input data table. This is because the input data includes both the validation and predicted cell type, so the use of the validation cell type instead of the predicted cell type was deliberate. Such cases always merit investigation.

The presence of cell types in the Less specific category is always a cause for concern. It suggests that the newly trained classifier has lost some of the existing classifier's ability to predict cells specifically.

The presence of cell types in the More specific category can be benign. It suggests that the newly trained classifier has gained the ability to predict cells specifically. Care should be taken to ensure that this explanation is plausible. For example, the more specific cell type may have just been added to the classifier and/or may have been absent in the matrix for which the regression occurred. If the more specific cell type is localized to a particular tissue, e.g. "ovarian vascular surface endothelial cells" instead of "endothelial cells", then it can be checked whether the validation matrix is expected to include cells from that tissue and whether the classifier contains a more appropriate cell type that was not predicted.