The cell calling algorithm
Barcodes with a high number of reads are always retained as cell-containing droplets. If the option Specify minimum number of reads for barcodes to be retained is disabled (see Empty droplets filter), the automatically estimated knee is used to detect such barcodes. The knee is identified from the smoothed log-log rank data (figure 5.9), after the barcodes considered to contain only ambient RNA have been removed. An adaptation of the algorithm of [Satopaa et al., 2011], as implemented in https://github.com/mariolpantunes/ml, is used.
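The idea behind knee detection on a log-log rank curve can be illustrated with a minimal sketch. This is not the implementation used by the tool: it omits the smoothing step, assumes that ambient-only barcodes have already been removed, and reduces the approach of [Satopaa et al., 2011] to its core idea of picking the point that lies furthest above the diagonal of the normalized curve. The function name `find_knee` is hypothetical.

```python
import math

def find_knee(counts):
    """Return (rank_index, read_count) of the knee in a barcode rank curve.

    Simplified illustration: no smoothing, all counts must be positive.
    """
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 2 or ranked[0] == ranked[-1]:
        raise ValueError("need at least two barcodes with different counts")
    # Log-log coordinates: x = log10(rank), y = log10(number of reads).
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    x_span = xs[-1] - xs[0]
    y_min, y_span = ys[-1], ys[0] - ys[-1]
    best_index, best_distance = 0, -math.inf
    for i, (x, y) in enumerate(zip(xs, ys)):
        x_norm = (x - xs[0]) / x_span
        y_norm = (y - y_min) / y_span
        # Height above the diagonal joining (0, 1) and (1, 0).
        distance = x_norm + y_norm - 1.0
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index, ranked[best_index]
```

In this sketch, barcodes ranked at or before the returned index would be treated as having a high number of reads.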
The algorithm for testing whether barcodes with an intermediate number of reads contain cells is based on EmptyDrops [Lun et al., 2019]:
- The ambient RNA profile is estimated from the barcodes with a low number of reads. The expression counts from these barcodes are summed, and a proportion vector for the ambient profile is obtained using the Good-Turing algorithm [Gale and Sampson, 1995].
- Barcodes with an intermediate number of reads are tested for significant deviations from the ambient profile. For each barcode, the probability of obtaining its expression profile from the ambient profile is calculated. A p-value is then obtained by comparing this probability with the probabilities of barcodes simulated from the ambient profile with the same total number of reads.
- FDR correction is applied to the p-values for barcodes that are not part of the ambient profile.
- Barcodes with FDR-corrected p-values below the value provided in FDR threshold (see Empty droplets filter) are retained as non-empty droplets.
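The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: add-one smoothing stands in for the Good-Turing algorithm, a plain multinomial likelihood is used as the probability of a profile, and the FDR correction is assumed to be Benjamini-Hochberg. All function names are hypothetical.

```python
import math
import random

def ambient_proportions(ambient_counts, pseudo=1.0):
    """Sum low-read barcodes per gene and return a proportion vector.

    Add-one smoothing replaces Good-Turing here (an assumption).
    """
    n_genes = len(ambient_counts[0])
    totals = [sum(b[g] for b in ambient_counts) + pseudo for g in range(n_genes)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def log_multinomial(counts, props):
    """Log-probability of a count vector under a multinomial model."""
    log_p = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, props):
        log_p += c * math.log(p) - math.lgamma(c + 1)
    return log_p

def monte_carlo_pvalue(counts, props, n_sim, rng):
    """Fraction of simulated ambient barcodes at least as unlikely as observed."""
    observed = log_multinomial(counts, props)
    total = sum(counts)
    genes = range(len(props))
    as_extreme = 0
    for _ in range(n_sim):
        simulated = [0] * len(props)
        # Draw the same total number of reads from the ambient profile.
        for g in rng.choices(genes, weights=props, k=total):
            simulated[g] += 1
        if log_multinomial(simulated, props) <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_sim + 1)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, a barcode would be retained when its adjusted p-value falls below the chosen FDR threshold. Note that the smallest achievable p-value is 1 / (n_sim + 1), so the number of simulations limits how small a threshold can be meaningfully applied.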