Features used for training and prediction
An expression matrix has an associated set of features. This set is specified either as a gene or transcript track when importing a matrix (see Import Expression Matrix), or as a gene track when creating a matrix by mapping reads with Single Cell RNA-Seq Analysis (see Single Cell RNA-Seq Analysis).
As feature expression is used for training a classifier and for predicting the cell types of new cells, it is important that the features used for training, validating and predicting are compatible. The two sets of features are mapped against each other to find matching features.
To do this, the IDs of the features are used. If fewer than the required percentage of the features are found to match, several mappings are created:
- Both feature sets are mapped to three standard gene annotation databases: Ensembl, Entrez and HGNC, and an internal mapping between these databases is used to then match the features from the two sets.
- Features are mapped by name.
The mapping that yields the largest percentage of matching features from the classifier is used. If this percentage is below the required minimum, the two feature sets are incompatible and the tool fails with a relevant warning message.
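The selection logic described above can be sketched in Python. This is only an illustration and not the tool's actual implementation: the `Feature` type, the function names, the `cross_reference` dictionary (standing in for the internal mapping between Ensembl, Entrez and HGNC) and both threshold parameters are hypothetical. The sketch also assumes that the direct ID mapping remains a candidate when the alternative mappings are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    id: str    # identifier from the gene or transcript track
    name: str  # feature (gene) name


def fraction_matched(classifier_keys, data_keys):
    """Fraction of distinct classifier keys that also occur among the data keys."""
    classifier_keys = set(classifier_keys)
    if not classifier_keys:
        return 0.0
    return len(classifier_keys & set(data_keys)) / len(classifier_keys)


def choose_mapping(classifier, data, id_threshold, fail_threshold, cross_reference):
    """Return (mapping name, matched fraction) for the best mapping.

    id_threshold and fail_threshold are hypothetical parameters; the values
    used by the tool are not assumed here.
    """
    by_id = fraction_matched((f.id for f in classifier), (f.id for f in data))
    if by_id >= id_threshold:
        return "id", by_id

    # Too few direct ID matches: build the alternative mappings.
    candidates = {
        "id": by_id,
        # Translate both sets to a shared identifier via the cross-reference.
        "database": fraction_matched(
            (cross_reference.get(f.id, f.id) for f in classifier),
            (cross_reference.get(f.id, f.id) for f in data),
        ),
        "name": fraction_matched(
            (f.name for f in classifier), (f.name for f in data)
        ),
    }
    best = max(candidates, key=candidates.get)
    if candidates[best] < fail_threshold:
        raise ValueError("Feature sets are incompatible: too few matching features")
    return best, candidates[best]


# Example: the classifier uses Ensembl-style IDs, the data uses Entrez-style IDs.
classifier = [Feature("ENSG1", "TP53"), Feature("ENSG2", "BRCA1"),
              Feature("ENSG3", "MYC"), Feature("ENSG4", "EGFR")]
data = [Feature("7157", "TP53"), Feature("672", "brca1"),
        Feature("4609", "MYC"), Feature("1956", "EGFR")]
cross_reference = {"ENSG1": "HGNC:1", "7157": "HGNC:1",
                   "ENSG2": "HGNC:2", "672": "HGNC:2",
                   "ENSG3": "HGNC:3", "4609": "HGNC:3",
                   "ENSG4": "HGNC:4", "1956": "HGNC:4"}

# The thresholds 0.9 and 0.5 are illustrative values only.
result = choose_mapping(classifier, data, 0.9, 0.5, cross_reference)
print(result)
```

In this example no IDs match directly and one name differs in capitalization, so the database mapping gives the largest fraction of matching classifier features and is selected.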
Two pre-trained cell type classifiers are available through the Reference Data Manager (see Reference data management), one for human and one for mouse. These classifiers have been trained on a subset of genes that are protein coding and are found in both the Ensembl and Entrez gene annotation databases. Therefore, these classifiers should be compatible with most data sets.