The doublet calling algorithm
The algorithm for doublet calling contains the following steps.
Doublet simulation
The input expression data is first normalized as log(1 + scaled expression). Scaling is performed such that every barcode has the same total expression. This normalization procedure is very simple, but sufficient for doublet calling. Note that it differs from the normalization described in Normalize Single Cell Data.
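As an illustration, the normalization of a single barcode can be sketched as follows. This is a minimal sketch, not the tool's implementation: the function name `normalize_barcode` is hypothetical, and because the total that barcodes are scaled to is not given here, it is exposed as the hypothetical parameter `target_total`.

```python
import math

def normalize_barcode(counts, target_total):
    """Scale one barcode's counts to sum to target_total, then apply log(1 + x).

    target_total is an assumed parameter; the value used by the tool is not stated here.
    """
    total = sum(counts)
    if total == 0:
        # A barcode without any expression cannot be scaled; it stays all zeros.
        return [0.0 for _ in counts]
    scale = target_total / total
    return [math.log1p(c * scale) for c in counts]

# Illustrative values only: three genes and an arbitrary target total of 100.
normalized = normalize_barcode([1, 3, 6], target_total=100.0)
```

After scaling, barcodes sequenced at different depths become comparable, and the logarithm dampens the influence of very highly expressed genes.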
The dimension of the data is then reduced by projecting it into principal component (PC) space. See Feature selection and dimensionality reduction for more details. Note that feature selection is not used here.
Heterotypic doublets are then simulated: each doublet is obtained by averaging the expression of two random barcodes that are sufficiently different from each other. To determine this, a k-nearest neighbor graph is calculated, and two barcodes are considered sufficiently different if neither is found in the other's neighborhood. The value of k is set from 'Neighborhood size (%)'. Note that simulation might fail if this value is set too high, as few or no pairs of barcodes then lie outside each other's neighborhoods.
Simulated doublets are normalized and projected into the PC space.
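The simulation step can be sketched as below. This is a simplified illustration under stated assumptions, not the tool's implementation: the function names, the brute-force neighbor search, the conversion of 'Neighborhood size (%)' to k as a percentage of the number of barcodes, and the `max_tries` limit are all hypothetical. Normalization and projection of the simulated doublets are left out.

```python
import math
import random

def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def simulate_doublets(expression, embedding, n_doublets, neighborhood_pct,
                      seed=0, max_tries=10000):
    """Average the expression of random barcode pairs lying outside each other's neighborhoods."""
    rng = random.Random(seed)
    n = len(expression)
    # Assumption: k is the given percentage of the number of barcodes.
    k = max(1, round(n * neighborhood_pct / 100))
    neighbors = knn_indices(embedding, k)
    doublets = []
    tries = 0
    while len(doublets) < n_doublets:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("no sufficiently different barcode pairs found")
        i, j = rng.sample(range(n), 2)
        if j in neighbors[i] or i in neighbors[j]:
            continue  # too similar: the pair would not give a heterotypic doublet
        doublets.append([(a + b) / 2 for a, b in zip(expression[i], expression[j])])
    return doublets

# Illustrative data: two well-separated groups of two barcodes each.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
expression = [[2, 0], [4, 0], [0, 2], [0, 4]]
doublets = simulate_doublets(expression, embedding, n_doublets=5, neighborhood_pct=25)
```

With a neighborhood of 25%, each barcode's only neighbor is the other member of its group, so every accepted pair mixes the two groups. With a neighborhood of 100%, every pair is rejected and the sketch raises an error, which mirrors how simulation can fail when the neighborhood is too large.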
Doublet features calculation
A k-nearest neighbor graph is calculated for all input barcodes and simulated doublets using a pre-defined set of values for k. For each input barcode and simulated doublet, the following doublet features are calculated:
- Is the nearest neighbor a simulated doublet?
- Distance to the nearest neighbor.
- Ratio between the distance to the nearest simulated doublet and the distance to the nearest input barcode.
- For each value of k, the percentage of neighbors that are simulated doublets.
- For each value of k, the sum of the distances to the neighbors that are simulated doublets, divided by the sum of the distances to all neighbors.
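The features listed above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and a brute-force neighbor search; the function name `doublet_features` and the feature names are hypothetical, and the pre-defined values of k used by the tool are not given here, so they are passed in as `k_values`. The sketch assumes every point has at least one other simulated doublet and one other input barcode to compare against.

```python
import math

def doublet_features(points, is_simulated, k_values):
    """Compute the doublet features for every point (input barcodes and simulated doublets)."""
    rows = []
    for i, p in enumerate(points):
        # All other points as (distance, is simulated doublet), nearest first.
        ranked = sorted(
            (math.dist(p, q), is_simulated[j])
            for j, q in enumerate(points) if j != i
        )
        nearest_dist, nearest_is_sim = ranked[0]
        nearest_sim = min(d for d, sim in ranked if sim)
        nearest_real = min(d for d, sim in ranked if not sim)
        row = {
            "nearest_is_simulated": nearest_is_sim,
            "nearest_distance": nearest_dist,
            "distance_ratio": nearest_sim / nearest_real if nearest_real > 0 else math.inf,
        }
        for k in k_values:
            top = ranked[:k]
            sim_dists = [d for d, sim in top if sim]
            total = sum(d for d, _ in top)
            row[f"pct_simulated_k{k}"] = 100.0 * len(sim_dists) / len(top)
            row[f"distance_share_k{k}"] = sum(sim_dists) / total if total > 0 else 0.0
        rows.append(row)
    return rows

# Illustrative one-dimensional data: two input barcodes followed by two simulated doublets.
points = [(0.0,), (1.0,), (3.0,), (4.0,)]
is_simulated = [False, False, True, True]
features = doublet_features(points, is_simulated, k_values=[1, 2])
```

In this example the first input barcode has another input barcode as its nearest neighbor, while the first simulated doublet is closest to the other simulated doublet, so the two points receive clearly different feature values.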
Doublet classification
A Support Vector Machine (SVM) binary classifier is trained using the doublet features from above. Training is performed iteratively:
- In the first iteration, all input barcodes are used in the training data as singlets.
- In subsequent iterations, the input barcodes that are most likely to be doublets are removed from the training data. The number of barcodes removed is determined by the 'Expected doublets (%)' option.
- Each model's performance is evaluated by the number of incorrect predictions it makes. There are three kinds of incorrect predictions:
  - how many simulated doublets are predicted as singlets;
  - how many input barcodes used in the training data are predicted as doublets (as these were assumed to be singlets);
  - how many input barcodes not used in the training data are predicted as singlets (as these were assumed to be doublets).
- Training ends when the input barcodes most likely to be doublets do not change or the performance of the model does not improve after a number of iterations.
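Counting the three kinds of incorrect predictions can be sketched as below. The function name `count_incorrect` and the dictionary keys are hypothetical, and how the tool combines the three counts into a single performance measure is not specified here, so the sketch simply reports them separately.

```python
def count_incorrect(predicted_doublet, is_simulated, in_training):
    """Count the three kinds of incorrect predictions made by one model.

    All three arguments are equally long lists of booleans, one entry per point.
    in_training is only consulted for input barcodes.
    """
    errors = {
        "simulated_as_singlet": 0,
        "training_barcode_as_doublet": 0,
        "removed_barcode_as_singlet": 0,
    }
    for pred, sim, trained in zip(predicted_doublet, is_simulated, in_training):
        if sim:
            if not pred:
                errors["simulated_as_singlet"] += 1
        elif trained:
            # Input barcodes kept in the training data were assumed to be singlets.
            if pred:
                errors["training_barcode_as_doublet"] += 1
        else:
            # Input barcodes removed from the training data were assumed to be doublets.
            if not pred:
                errors["removed_barcode_as_singlet"] += 1
    return errors

# Illustrative example: two simulated doublets followed by three input barcodes.
errors = count_incorrect(
    predicted_doublet=[True, False, True, False, False],
    is_simulated=[True, True, False, False, False],
    in_training=[True, True, True, True, False],
)
```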
Doublets are predicted using the model with the best performance. The SVM produces a doublet score, where a positive value indicates a doublet. A doublet score threshold is then calculated such that the number of input barcodes with a doublet score above the threshold falls in the interval given by 'Expected doublets (%)' ± 'Correction margin (%)', and the threshold is as close to 0 as possible. All input barcodes with a doublet score above this threshold are removed as doublets.
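The choice of threshold can be illustrated with the sketch below. It rests on several assumptions that are not taken from the tool: the function name `choose_threshold` is hypothetical, the interval is taken to be the expected percentage plus or minus the correction margin, and only 0 and the observed scores are tried as candidate thresholds, which is a simplification.

```python
def choose_threshold(scores, expected_pct, margin_pct):
    """Pick the candidate threshold closest to 0 whose doublet count lies in the allowed interval.

    Returns None if no candidate gives an acceptable number of doublets.
    """
    n = len(scores)
    low = n * max(0.0, expected_pct - margin_pct) / 100
    high = n * (expected_pct + margin_pct) / 100
    # Simplification: only 0 and the observed scores are considered as thresholds.
    candidates = [0.0] + sorted(set(scores))
    valid = [t for t in candidates if low <= sum(s > t for s in scores) <= high]
    if not valid:
        return None
    return min(valid, key=abs)

# Illustrative scores for ten input barcodes; five of them are positive.
scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 0.9, 1.5, 2.0]
threshold = choose_threshold(scores, expected_pct=20, margin_pct=10)
removed = [s for s in scores if s > threshold]
```

Here a threshold of 0 would remove five of the ten barcodes, which is more than the allowed one to three, so the threshold moves up to the nearest value that brings the count into the interval.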