The cell calling algorithm
The algorithm for cell calling is based on EmptyDrops [Lun et al., 2019]:
- The ambient RNA profile is estimated from the barcodes with a low number of reads. The expression values from these barcodes are added together, and a proportion vector for the ambient profile is obtained using the Good-Turing algorithm [Gale and Sampson, 1995].
- If Specify minimum number of reads for barcodes to be retained is disabled (see Empty droplets filter), the knee is automatically estimated from the log-log rank plot (figure 5.5). The knee estimation differs from the one used in EmptyDrops. Instead, the knee is identified on smoothed data using an adaptation of the [Satopaa et al., 2011] algorithm implemented in https://github.com/mariolpantunes/ml.
- Barcodes with fewer reads than the automatically inferred knee, or than a manually specified threshold, are tested for significant deviations from the ambient profile. For each barcode, the probability of obtaining its expression profile from the ambient profile is calculated. A p-value is then obtained by comparing this probability with the probabilities of barcodes simulated from the ambient profile with the same total number of reads.
- FDR correction is applied to the p-values for barcodes that are not part of the ambient profile.
- Barcodes with FDR-corrected p-values below the value provided in FDR threshold (see Empty droplets filter) are retained as non-empty droplets.