The doublet calling algorithm

The algorithm for doublet calling contains the following steps.

Doublet simulation

The input expression data is first normalized using log(1 + scaled expression), where scaling is performed such that the total expression per barcode is 10,000. This normalization procedure is very simple, but sufficient for doublet calling. Note that it is different from the normalization described in Normalize Single Cell Data.
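The normalization above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name `normalize_counts`, the use of the natural logarithm, and the handling of barcodes with zero total expression are assumptions.

```python
import numpy as np

def normalize_counts(counts, target_total=10_000):
    """Scale each barcode (row) to target_total, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: barcodes with zero total expression are left as zeros.
    safe_totals = np.where(totals > 0, totals, 1.0)
    scaled = counts / safe_totals * target_total
    # Assumption: natural logarithm; the text only says "log".
    return np.log1p(scaled)

# Rows are barcodes, columns are features.
counts = np.array([[1, 3, 0], [10, 0, 10]])
normalized = normalize_counts(counts)
```

Undoing the logarithm with `np.expm1(normalized)` gives rows that each sum to 10,000, which is a quick way to check the scaling.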

The dimension of the data is then reduced by projecting it into PC space. See Feature selection and PCA for more details. Note that feature selection is not used here.
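As a rough illustration of projecting data into PC space, the sketch below fits principal components with a singular value decomposition and then reuses the same centering and loadings to project further data. The helper names `fit_pca` and `project`, the number of components, and the use of plain centered PCA are assumptions; the procedure actually used is the one described in Feature selection and PCA.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return the feature means and the top principal axes of data."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(data, mean, components):
    """Project data into the PC space defined by mean and components."""
    return (data - mean) @ components.T

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))          # 30 barcodes, 5 features
mean, components = fit_pca(data, n_components=2)
pcs = project(data, mean, components)    # shape (30, 2)
```

Keeping `mean` and `components` separate from the projection step makes it possible to place new data, such as simulated doublets, into the same PC space as the input barcodes.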

Heterotypic doublets are then simulated: one doublet is obtained by averaging the expression of two random barcodes that are sufficiently different from each other. For this, a k-nearest neighbor graph is calculated, and two barcodes are considered sufficiently different if they are not found within each other's neighborhoods. The value of k is set from 'Neighborhood size (%)'. Note that simulation might fail if this value is set too high, because few or no pairs of barcodes then lie outside each other's neighborhoods.

Simulated doublets are normalized and projected into the PC space.
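The simulation step can be sketched as below. This is a hedged illustration under several assumptions: k is taken as the given percentage of the number of barcodes, "not found within each other's neighborhoods" is read as neither barcode being among the other's k nearest neighbors, neighbors are found by brute-force Euclidean distance in PC space, and the names `simulate_doublets`, `neighborhood_pct`, and `max_tries` are hypothetical.

```python
import numpy as np

def simulate_doublets(expr, pcs, n_doublets, neighborhood_pct=5.0,
                      seed=0, max_tries=10_000):
    """Average the expression of random barcode pairs that are not neighbors."""
    rng = np.random.default_rng(seed)
    n = pcs.shape[0]
    # Assumption: k is the given percentage of the number of barcodes.
    k = max(1, int(round(n * neighborhood_pct / 100.0)))
    if k >= n - 1:
        raise ValueError("neighborhood covers all barcodes; no pairs are left")

    # Brute-force k-nearest neighbors in PC space (fine for small examples).
    dist = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    is_neighbor = np.zeros((n, n), dtype=bool)
    is_neighbor[np.arange(n)[:, None], neighbors] = True

    doublets = []
    tries = 0
    while len(doublets) < n_doublets:
        if tries >= max_tries:
            raise RuntimeError("simulation failed: too few dissimilar pairs")
        tries += 1
        i, j = rng.choice(n, size=2, replace=False)
        # Reject pairs where either barcode is in the other's neighborhood.
        if is_neighbor[i, j] or is_neighbor[j, i]:
            continue
        doublets.append((expr[i] + expr[j]) / 2.0)
    return np.vstack(doublets)

# Two well-separated groups of 20 barcodes each, with two features.
rng = np.random.default_rng(1)
pcs = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
expr = np.vstack([np.tile([10.0, 0.0], (20, 1)), np.tile([0.0, 10.0], (20, 1))])
doublets = simulate_doublets(expr, pcs, n_doublets=5, neighborhood_pct=25.0)
```

The `max_tries` guard mirrors the note that simulation might fail when the neighborhood size is too large: as k grows, fewer pairs pass the rejection test.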

Doublet features calculation

A k-nearest neighbor graph is calculated for all input barcodes and simulated doublets using a pre-defined set of values for k. For each input barcode and simulated doublet, the following doublet features are calculated:

Doublet classification

A Support Vector Machine (SVM) binary classifier is trained using the doublet features from above. Training is performed iteratively:

Doublets are predicted using the model with the best performance. The SVM produces a doublet score, where a positive value indicates a doublet. A doublet score threshold is then calculated such that the percentage of input barcodes with a doublet score above the threshold falls in the interval given by 'Expected doublets (%)' ± 'Correction margin (%)', and such that the threshold is as close to 0 as possible. All input barcodes with a doublet score above this threshold are removed as doublets.
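The threshold selection can be illustrated with the sketch below. It is an approximation under stated assumptions: only 0 and the observed scores are considered as candidate thresholds, `None` is returned when no candidate satisfies the interval, and the function name `doublet_threshold` and its parameters are hypothetical.

```python
import numpy as np

def doublet_threshold(scores, expected_pct, margin_pct):
    """Pick the candidate threshold closest to 0 that keeps the doublet
    percentage within expected_pct +/- margin_pct."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    lo = (expected_pct - margin_pct) / 100.0
    hi = (expected_pct + margin_pct) / 100.0

    # Assumption: candidate thresholds are 0 and the observed scores.
    candidates = np.unique(np.concatenate(([0.0], scores)))
    # Number of scores strictly above each candidate threshold.
    n_above = n - np.searchsorted(scores, candidates, side="right")
    frac_above = n_above / n

    ok = (frac_above >= lo) & (frac_above <= hi)
    if not ok.any():
        return None
    valid = candidates[ok]
    return float(valid[np.argmin(np.abs(valid))])

scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.3, 0.8, 1.2, 2.0]
threshold = doublet_threshold(scores, expected_pct=25, margin_pct=10)
is_doublet = np.asarray(scores) > threshold
```

In this example half of the scores are positive, which exceeds the allowed 15% to 35%, so the threshold moves above 0 until the percentage of barcodes above it fits the interval. When the percentage of positive scores already lies within the interval, the threshold stays at 0.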