Interpreting the output of Annotate Single Cell Reads
The primary outputs of Annotate Single Cell Reads are Sequence Lists of `annotated reads'. These contain just the `Sequence' part of the read structure and are suitable for use in several tools as described in Annotate Single Cell Reads. Note that the output reads are sorted by cell barcode, UMI and, if used, hashtag, so they will appear shuffled compared to the input.
For each input, another sequence list of unmatched reads can be produced. This contains reads that do not match the provided read structure. In most cases this list will contain reads that are too short according to the configured read structure. If longer reads are present, it may be worth checking that the read structure includes a variable-length part.
The Annotate Single Cell Reads report
The report includes summary statistics, a barcode ranks plot, and plots of the distributions of different nucleotides at each position in the cell barcode, UMI and hashtag, if present.
The summary statistics section shows the number of input and output reads together with the number of distinct cell barcodes (figure 6.5). If hashtags are used, the number of distinct hashtags is also shown in the summary.
Typically the number of distinct barcodes is large because it counts every observed barcode, including those that arise from sequencing error. The number of cells can be approximated by the location of a sharp fall in the corresponding barcode ranks plot, which ranks the barcodes in decreasing order of the number of reads (figure 6.6).
Figure 6.5: Summary statistics for data where R1 is discarded after the cell barcode and UMI have been extracted. In this example, 193 767 838 reads are present in the input. After discarding R1, 193 767 838 / 2 = 96 883 919 reads are present in the output. There are 1 968 804 distinct cell barcodes.
Figure 6.6: The barcode ranks plot for the data shown in figure 6.5. A sharp transition from an average of >10000 reads to <100 reads per barcode is seen at x = 10000, suggesting that there are approximately 10000 cells in the data.
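The idea behind the barcode ranks plot can be illustrated with a minimal Python sketch. This is not how the tool computes the plot; the function names `barcode_ranks` and `estimate_cells` are hypothetical, and the `largest drop between consecutive ranks' rule is a deliberately crude assumption chosen for illustration. Real data are noisier, and the location of the fall is usually best judged from the plot itself.

```python
import math
from collections import Counter


def barcode_ranks(barcodes):
    """Return the read count of each distinct barcode, in decreasing order.

    `barcodes` is an iterable with one cell barcode string per read.
    """
    return sorted(Counter(barcodes).values(), reverse=True)


def estimate_cells(ranked_counts):
    """Return the rank after which the largest fold-drop in read count occurs.

    Assumption for illustration only: the 'sharp fall' is taken to be the
    single largest drop in log10(read count) between consecutive ranks.
    """
    best_rank, best_drop = 0, 0.0
    for i in range(len(ranked_counts) - 1):
        drop = math.log10(ranked_counts[i]) - math.log10(ranked_counts[i + 1])
        if drop > best_drop:
            best_rank, best_drop = i + 1, drop
    return best_rank


# Toy input: three well-covered barcodes and three that look like errors.
reads = (["AAAA"] * 1000 + ["CCCC"] * 900 + ["GGGG"] * 800
         + ["TTTT"] * 3 + ["ACGT"] * 2 + ["TGCA"] * 1)
ranks = barcode_ranks(reads)
print(ranks)
print(estimate_cells(ranks))
```

In the toy input, the read count falls from hundreds to single digits after the third barcode, so the sketch reports three cells, in the same way that the fall at x = 10000 in figure 6.6 suggests approximately 10000 cells.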
The plots of the distributions of different nucleotides at each position of the cell barcode/UMI/hashtag are made using all the `annotated reads'.
The distributions for cell barcodes and UMIs are both expected to be roughly uniform, as described below. However, the distribution for hashtags depends on the number of distinct hashtags present in the data. The more hashtags, the more uniform the distribution should be. On the other hand, if only a few hashtags are expected, for example when the hashtag represents the sample, the distribution is more likely to be skewed and should reflect the known expected hashtags.
For simplicity, the remainder of this section will talk about `barcodes', but the description is equally true for UMIs.
Typically, barcodes are randomly generated, or else designed to be very different from each other, such that all nucleotides are observed at each barcode position in approximately equal amounts. Errors may be detected when the barcode plots do not show this behavior, such as in figure 6.7, where position 1 in the barcode is mostly `A', position 2 is mostly `A', position 3 is mostly `G', etc. It appears that one barcode contains almost all the reads in the sample. In this case, the cell barcode part of the read structure has been misconfigured to read an adapter with sequence `AAGCAGTGGT'. The same plot with the correct read structure is shown in figure 6.8.
Figure 6.7: Nucleotide distribution plot for a misconfigured barcode. One barcode with sequence `AAGCAGTGGT' is present in most of the reads. In this case the barcode was misconfigured to be part of an adapter.
Figure 6.8: Nucleotide distribution plot for the same data as in figure 6.7. All nucleotides are seen at all positions of the barcode with comparable frequencies, except at position 1. This dataset is from a 96-well protocol where the barcodes for each well are known in advance. In this case, it was possible to verify that the nucleotide distribution at position 1 should be skewed.
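The per-position check described above can also be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the function names `position_frequencies` and `skewed_positions` and the 0.8 threshold are assumptions, and all barcodes are assumed to have the same length.

```python
from collections import Counter


def position_frequencies(barcodes):
    """Return one dict per barcode position, mapping each base to its fraction.

    Assumes every barcode has the same length. Symbols other than A, C, G
    and T (for example N) count towards the total but are not reported.
    """
    length = len(barcodes[0])
    freqs = []
    for pos in range(length):
        counts = Counter(bc[pos] for bc in barcodes)
        total = sum(counts.values())
        freqs.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return freqs


def skewed_positions(freqs, threshold=0.8):
    """Return the 1-based positions where a single base exceeds `threshold`.

    The threshold is an arbitrary value chosen for illustration.
    """
    return [i + 1 for i, f in enumerate(freqs) if max(f.values()) > threshold]


# Toy input resembling figure 6.7: most 'barcodes' are the adapter sequence.
misconfigured = ["AAGCAGTGGT"] * 9 + ["CTTAGCACAG"]
# Toy input in which every base is equally common at every position.
balanced = ["ACGT", "CGTA", "GTAC", "TACG"]

print(skewed_positions(position_frequencies(misconfigured)))
print(skewed_positions(position_frequencies(balanced)))
```

With the misconfigured input, every position is flagged because one sequence dominates, whereas the balanced input flags none. A flagged position is not necessarily an error: as in figure 6.8, a skew can be legitimate when the set of expected barcodes or hashtags is small and known in advance.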