The Single Cell ATAC-Seq Analysis algorithm
As a first step, Single Cell ATAC-Seq Analysis de-duplicates reads if they have the same "logical" start positions and the same cell barcode. De-duplication is necessary because otherwise duplicated reads can pile up at a position and look like peaks. One read is kept from each set of duplicates. Ambiguously mapping reads and reads that are not in pairs are also discarded. All subsequent analysis is performed on this de-duplicated read mapping, which is also one of the tool's outputs.
Peaks are then called using the CLC Shape-based Peak Caller [Strino and Lappe, 2016]. As the peak caller does not explicitly depend on read coverage, there is no merging of nearby peaks to rescue regions where multiple small peaks would not meet a coverage threshold. This means that more peaks will typically be called than in approaches that merge peaks, and peaks will more often contain just 0, 1 or 2 reads per cell as expected.
Counting reads per peak and cell
It is standard practice to correct the read start site by +4bp for forward reads and -5bp for reverse reads such that the reads start at the center of the transcription factor binding site [Buenrostro et al., 2013]. Reads are counted for a peak if the corrected read start for either read in a pair is contained within the peak.
Finding transcription factors
The sequence of each peak is scanned for transcription factor binding motifs from the HOCOMOCO v11 Human Core Collection (i.e. those with quality A, B or C) [Kulakovskiy et al., 2018] using the SPRY-SARUS library. It is not possible to provide a custom motif database.
HOCOMOCO provides two types of matrices: mononucleotide, which scores positions individually, and dinucleotide, which scores two positions at a time. The use of dinucleotide matrices is preferred because their matches are more precise. For example, a mononucleotide matrix "AG(C/G)(T/C)A" might match the sequences "AGCTA", "AGGTA", "AGCCA", and "AGGCA". The equivalent dinucleotide matrix might know that the C at position 3 is always followed by a T, whereas the G is always followed by a C and so the valid matching sequences are "AGCTA" and "AGGCA". We use dinucelotide matrices when they are available, and otherwise fall back to the mononucleotide matrix.
Although only human HOCOMOCO motifs are searched, there is considerable orthology between species, such that results can be informative for other species.
Motifs are reported if they meet a score threshold corresponding to a p-value of 0.0001. All motifs exceeding this threshold are output by Single Cell ATAC-Seq Analysis in an annotation track.
A footprinting algorithm is used to detect "valleys" within the peaks that suggest the presence of transcription factor binding. The intuition is that the Tn5 transposase cannot cut the DNA at a position where a transcription factor is bound, and so the peak signal is lower where binding occurs. A "footprint" score is calculated at each position within a peak and used to filter away transcription factor binding sites for which there is little evidence of binding. The reported transcription factors for a peak are those with a footprint score above the calculated threshold.
The footprinting algorithm used is a Java re-implementation of the ATAcorrect, scoreBigWig and BINDetect (for one sample) modules of TOBIAS [Bentsen et al., 2020]. Briefly, the insertion bias of the Tn5 transposase is learned from the cut sites in the mapped reads. The observed cut sites are then corrected for this bias. The footprinting score is calculated over a window. Better scores are obtained if the number of corrected cut sites is high in the flanks of the window and low in the center. A footprint score threshold is calculated from a background of randomly sampled positions in peaks, under the assumption that most positions in peaks do not have bound transcription factors.
There are some minor differences compared to the TOBIAS v0.12.10 implementation. The most notable is that the determination of the footprinting score threshold starts with a simple Gaussian fit as opposed to a 2-component Gaussian mixture model. This is because peak calling is performed as part of Single Cell ATAC-Seq Analysis and so there is no need to model peaks that are unobserved in the input.
Finding nearby genes
Nearby genes for a peak are likely to be regulated by that peak. Genes are assigned to peaks as follows:
- If the peak center is within -1000nt to +100nt of a gene's transcription start site (TSS), then the gene is a promoter gene for that peak. There may be multiple promoter genes for one peak. Transcription start sites are here defined as being the first nucleotide in any of the supplied mRNA or gene annotations.
- If no promoter gene is found:
- we look for a TSS within -200kb to +200kb of the peak center. The closest gene (or genes) within that range are distal genes for the peak. There is usually only one closest TSS.
- we also look for genes and transcripts overlapping the peak center. Any such genes are distal genes for the peak. There may be many such genes, but usually there are none or one.
Note that it is possible to precisely control which genes and transcripts are used for finding nearby genes by providing custom gene and mRNA tracks to Single Cell ATAC-Seq Analysis.
The distinction between promoter and distal genes is only used in the Single Cell ATAC-Seq Analysis report, and on export for compatibility with third party tools. It cannot be viewed in the Peak Count Matrix.