Doublet calling
The algorithm for doublet calling contains the following steps.
Doublet simulation
The input expression data is first normalized as log(1 + scaled expression), where the expression is scaled such that the total expression per barcode is 10000. This normalization procedure is very simple, but sufficient for doublet calling. Note that it is different from the normalization described in Normalize Single Cell Data.
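The normalization above can be illustrated with a minimal sketch. The function name `normalize_counts`, the use of the natural logarithm, and the handling of all-zero barcodes are assumptions made for illustration; the text does not specify them.

```python
import math

def normalize_counts(counts, target_total=10000.0):
    """Scale each barcode to target_total, then apply log(1 + x).

    counts: one list of raw expression values per barcode.
    The natural logarithm is an assumption; the text only says "log".
    """
    normalized = []
    for barcode in counts:
        total = sum(barcode)
        # Assumption: a barcode with no expression stays all zeros.
        scale = target_total / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in barcode])
    return normalized

# Two barcodes with three features each.
example = normalize_counts([[1, 3, 0], [10, 0, 10]])
```

After scaling, every barcode has the same total (10000) before the logarithm is applied, so barcodes with different sequencing depth become comparable.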
The dimension of the data is then reduced by projecting it into PC space. See Feature selection and dimensionality reduction for more details. Note that feature selection is not used here.
Heterotypic doublets are then simulated: each doublet is obtained by averaging the expression of two random barcodes that are sufficiently different from each other. To decide this, a k-nearest neighbor graph is calculated, and two barcodes are considered sufficiently different if they are not found within each other's neighborhoods. The value of k is set from `Neighborhood size (%)'. Note that simulation might fail if this value is set too high.
Simulated doublets are normalized and projected into the PC space.
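The simulation step can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the names `knn_sets` and `simulate_doublets`, the Euclidean distance, the conversion of `Neighborhood size (%)' into k as a percentage of the number of barcodes, and the `max_tries` limit are all hypothetical. A pair is rejected when either barcode lies in the other's neighborhood, which is one reading of the rule above.

```python
import math
import random

def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def simulate_doublets(expression, embedding, neighborhood_pct, n_doublets,
                      seed=0, max_tries=10000):
    """Average the expression of random barcode pairs that are not neighbors."""
    rng = random.Random(seed)
    n = len(expression)
    # Assumption: k is the given percentage of the number of barcodes.
    k = max(1, round(n * neighborhood_pct / 100))
    neighbors = knn_sets(embedding, k)
    doublets = []
    tries = 0
    while len(doublets) < n_doublets:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("not enough dissimilar pairs; neighborhood may be too large")
        a, b = rng.sample(range(n), 2)
        if b in neighbors[a] or a in neighbors[b]:
            continue  # too similar: would not be a heterotypic doublet
        doublets.append([(x + y) / 2 for x, y in zip(expression[a], expression[b])])
    return doublets

# Two well-separated groups of four barcodes; coordinates double as expression here.
cells = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
doublets = simulate_doublets(cells, cells, neighborhood_pct=40, n_doublets=5)
```

With a neighborhood of 40%, each barcode's neighborhood is exactly its own group, so every simulated doublet mixes the two groups. With a neighborhood of 100%, no pair qualifies and the sketch raises an error, which mirrors the note that simulation might fail when the neighborhood is too large.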
Doublet features calculation
A k-nearest neighbor graph is calculated for all input barcodes and simulated doublets using a pre-defined set of values for k. For each input barcode and simulated doublet, the following doublet features are calculated:
- Is the nearest neighbor a simulated doublet?
- Distance to the nearest neighbor.
- Ratio between the distance to the nearest simulated doublet and the distance to the nearest input barcode.
- For each value of k, the percentage of neighbors that are simulated doublets.
- For each value of k, the sum of the distances to the neighbors that are simulated doublets, divided by the sum of the distances to all neighbors.
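The features listed above can be computed along these lines. The function name `doublet_features`, the dictionary keys, and the Euclidean distance are assumptions for illustration. The sketch also assumes that all points are distinct and that both kinds of points are present, so no division by zero occurs.

```python
import math

def doublet_features(index, points, is_simulated, k_values):
    """Compute the listed doublet features for points[index].

    points: coordinates of input barcodes and simulated doublets together.
    is_simulated: True where the corresponding point is a simulated doublet.
    """
    # All other points as (distance, index), nearest first.
    others = sorted((math.dist(points[index], p), j)
                    for j, p in enumerate(points) if j != index)
    nearest_dist, nearest_idx = others[0]
    nearest_sim = min(d for d, j in others if is_simulated[j])
    nearest_real = min(d for d, j in others if not is_simulated[j])
    features = {
        "nearest_is_doublet": is_simulated[nearest_idx],
        "nearest_distance": nearest_dist,
        "doublet_to_barcode_ratio": nearest_sim / nearest_real,
    }
    for k in k_values:
        top = others[:k]
        sim_dists = [d for d, j in top if is_simulated[j]]
        features[f"pct_doublets_k{k}"] = 100.0 * len(sim_dists) / len(top)
        features[f"dist_fraction_k{k}"] = sum(sim_dists) / sum(d for d, _ in top)
    return features

# Four points on a line; the second and fourth are simulated doublets.
points = [(0.0,), (1.0,), (2.0,), (4.0,)]
is_simulated = [False, True, False, True]
features = doublet_features(0, points, is_simulated, k_values=[2, 3])
```

For the first point, the nearest neighbor is a simulated doublet at distance 1, and the nearest input barcode is at distance 2, so the ratio is 0.5.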
Doublet classification
A Support Vector Machine (SVM) binary classifier is trained using the doublet features from above. Training is performed iteratively:
- In the first iteration, all input barcodes are used in the training data as singlets.
- In the subsequent iterations, input barcodes that are predicted as doublets are removed from the training data.
- The model's performance from each iteration is evaluated by the number of incorrect predictions it makes. There are three kinds of incorrect predictions:
  - how many simulated doublets are predicted as singlets;
  - how many input barcodes used in the training data are predicted as doublets (as these were assumed to be singlets);
  - how many input barcodes not used in the training data are predicted as singlets (as these were assumed to be doublets).
- Training ends when the input barcodes that are predicted as doublets do not change or the performance of the model does not improve after a number of iterations.
- Doublets are finally predicted using the model with the best performance. All input barcodes that are predicted as doublets are removed.
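The three kinds of incorrect predictions used to evaluate each iteration can be counted as in the sketch below. The function name `count_incorrect` and the three boolean lists are hypothetical, and the SVM itself is not shown. Simulated doublets are assumed to always be part of the training data.

```python
def count_incorrect(predicted_doublet, is_simulated, in_training):
    """Count the three kinds of incorrect predictions for one iteration.

    predicted_doublet: True where the model predicts a doublet.
    is_simulated: True for simulated doublets, False for input barcodes.
    in_training: True where an input barcode was used as a singlet in training.
    """
    missed_simulated = 0         # simulated doublets predicted as singlets
    training_called_doublet = 0  # training barcodes predicted as doublets
    excluded_called_singlet = 0  # excluded barcodes predicted as singlets
    for pred, sim, train in zip(predicted_doublet, is_simulated, in_training):
        if sim:
            if not pred:
                missed_simulated += 1
        elif train:
            if pred:
                training_called_doublet += 1
        elif not pred:
            excluded_called_singlet += 1
    return missed_simulated, training_called_doublet, excluded_called_singlet

# Two simulated doublets followed by four input barcodes.
errors = count_incorrect(
    predicted_doublet=[True, False, True, False, False, True],
    is_simulated=[True, True, False, False, False, False],
    in_training=[True, True, True, True, False, False],
)
```

In this example there is one error of each kind. The total of the three counts is what would be compared between iterations to pick the best model.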
The SVM produces a doublet score where a positive value indicates a doublet. Input barcodes are predicted as doublets as follows:
- `Specify expected doublets' is unchecked: Input barcodes that have a positive score.
- `Specify expected doublets' is checked: Input barcodes are sorted according to the score and barcodes with the highest scores are assigned as doublets.
  - During training, `Expected doublets (%)' determines how many barcodes are predicted as doublets.
  - For the final prediction, a doublet score threshold is calculated such that the number of input barcodes with a doublet score above this threshold falls in the interval given by `Expected doublets (%)' ± `Correction margin (%)', and the threshold is as close to 0 as possible.
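The final prediction with `Specify expected doublets' checked can be sketched as below. The function name `select_doublets` and the rounding of the interval bounds are assumptions. The sketch relies on the observation that the number of barcodes above a threshold can only decrease as the threshold increases. Keeping the threshold as close to 0 as possible is therefore the same as keeping the number of doublets as close as possible to the number of positive scores, provided there are no tied scores.

```python
import math

def select_doublets(scores, expected_pct, margin_pct):
    """Return the indices of input barcodes called as doublets.

    The number of doublets is forced into the interval
    expected_pct +/- margin_pct (as a percentage of all barcodes),
    while staying as close as possible to the number of positive scores.
    """
    n = len(scores)
    # Assumption: bounds are rounded inwards to whole barcodes.
    low = max(0, math.ceil(n * (expected_pct - margin_pct) / 100))
    high = min(n, math.floor(n * (expected_pct + margin_pct) / 100))
    n_positive = sum(1 for s in scores if s > 0)
    # Clamping moves the implied threshold as little as possible away from 0.
    n_doublets = min(max(n_positive, low), high)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_doublets])

scores = [2.0, 1.5, 0.5, 0.2, -0.1, -0.3, -1.0, -1.2, -2.0, -3.0]
too_many_positive = select_doublets(scores, expected_pct=20, margin_pct=10)
within_interval = select_doublets(scores, expected_pct=50, margin_pct=10)
too_few_positive = select_doublets(scores, expected_pct=70, margin_pct=10)
```

Four of the ten example scores are positive. When the interval allows 1 to 3 doublets, only the top three are kept; when it allows 4 to 6, the threshold stays at 0; when it requires 6 to 8, the two highest negative scores are included as well.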