The output of Annotate Single Cell Reads

Annotate Single Cell Reads produces the following outputs:

The annotated reads

The 'annotated reads' contain just the 'Sequence' part of the read structure for the input reads that matched the configured options. These reads are suitable for use in several tools (see Annotate Single Cell Reads). Note that the output reads are sorted by their cell barcode, UMI and hashtag (if used), so they will appear shuffled compared to the input.

The table view of the sequence list (see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Table_view_sequence_lists.html) contains the annotations added to the reads.

The barcode in the 'Cell barcode' column is formed by 'Cell barcode' tags in the read structure and/or the barcode extracted from names, joined by a '-'.
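As a rough illustration of how such a joined barcode is composed, the following Python sketch joins barcode components with '-'. The function name `join_cell_barcode` and the order of the components are assumptions made for this example; the order used by the tool is not stated here.

```python
def join_cell_barcode(components):
    """Join the non-empty barcode components with '-'.

    `components` is a hypothetical list holding the 'Cell barcode' tags
    from the read structure and/or a barcode extracted from the read name.
    """
    return "-".join(component for component in components if component)


# Two components, for example one tag from the read structure and one
# barcode extracted from the read name (order assumed for illustration).
two_part = join_cell_barcode(["ACGTACGT", "TTGCA"])

# A single component is returned unchanged.
one_part = join_cell_barcode(["ACGTACGT"])
```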

For barcodes that are corrected (see Barcode correction) and/or translated (see 'Multiome ATAC' and 'Parse Biosciences' in Read structure), the final barcode is listed under 'Cell barcode', and the uncorrected and untranslated barcode under 'Original barcode'.

The unmatched reads

The 'unmatched reads' contain the input reads that did not match the configured options.

The report

The report includes the following sections:

Summary

This section contains information about the number of:

  • Input, annotated and unmatched reads.
  • Distinct cell barcodes, as well as hashtags, if used. If multiple barcode components are defined, the total number of barcodes reflects the final joined cell barcodes.

Barcode correction

This section is present if barcode correction was used. It contains, for each barcode component, the number of barcodes that were:

  • Identified.
  • Corrected.
  • Discarded, if a whitelist was used.
  • Retained.
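To make the four categories concrete, here is a minimal Python sketch that tallies them for one barcode component. The correction rule is an assumption made for this example and is not taken from the tool: a barcode found in the whitelist is kept as is, a barcode with exactly one whitelist entry within one mismatch is corrected to it, and anything else is discarded. The relationship between the categories (for instance, that corrected barcodes also count as retained) is likewise only one plausible reading.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def tally_barcodes(observed, whitelist, max_distance=1):
    """Tally outcomes for the distinct barcodes in `observed`.

    Assumed rule (not the tool's documented algorithm): keep whitelisted
    barcodes, correct a barcode that has exactly one whitelist entry within
    `max_distance` mismatches, and discard everything else.
    """
    tally = Counter()
    final = {}
    for barcode in sorted(set(observed)):
        tally["identified"] += 1
        if barcode in whitelist:
            final[barcode] = barcode
            tally["retained"] += 1
            continue
        candidates = [
            entry for entry in whitelist
            if len(entry) == len(barcode)
            and hamming(entry, barcode) <= max_distance
        ]
        if len(candidates) == 1:
            final[barcode] = candidates[0]
            tally["corrected"] += 1
            tally["retained"] += 1
        else:
            tally["discarded"] += 1
    return final, tally


whitelist = {"AAAA", "CCCC", "GGGG"}
observed = ["AAAA", "AAAT", "CCCC", "TTTT", "AAAA"]
final, tally = tally_barcodes(observed, whitelist)
```

In this toy input, 'AAAT' is one mismatch away from 'AAAA' only and is corrected, while 'TTTT' has no close whitelist entry and is discarded.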

Barcode ranks

A plot ranking the cell barcodes in decreasing order of the number of reads (figure 6.4). The number of cells present in the data can be approximated by the location of a sharp fall in the plot.

Figure 6.4: A barcode ranks plot. A sharp transition from an average of slightly fewer than 10,000 reads to fewer than 100 reads per barcode is seen at x = 5,000, suggesting that there are approximately 5,000 cells in the data.
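The idea behind reading a cell count off the barcode ranks plot can be sketched in Python as follows. This is an illustrative heuristic only, not the method used by the tool: it ranks barcodes by read count and reports the rank after which the largest drop in log10(read count) occurs. The function names and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def barcode_ranks(barcodes):
    """Return the read counts per barcode in decreasing order."""
    return sorted(Counter(barcodes).values(), reverse=True)


def estimate_cell_count(ranked_counts):
    """Return the rank after which log10(read count) drops the most.

    A simple stand-in for locating the 'sharp fall' by eye; real data
    usually needs smoothing or a more robust knee-detection method.
    """
    if len(ranked_counts) < 2:
        return len(ranked_counts)
    drops = [
        math.log10(ranked_counts[i]) - math.log10(ranked_counts[i + 1])
        for i in range(len(ranked_counts) - 1)
    ]
    return drops.index(max(drops)) + 1


# Toy data: three barcodes with many reads, three with only a few.
reads = (
    ["AAAA"] * 900 + ["GGGG"] * 850 + ["CCCC"] * 800
    + ["TTTT"] * 5 + ["ACGT"] * 3 + ["TGCA"] * 2
)
ranked = barcode_ranks(reads)
estimated_cells = estimate_cell_count(ranked)
```

On this toy input the largest drop lies between the third and fourth ranked barcode, so the estimate is 3.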

Nucleotide counts

Plots showing the distributions of different nucleotides at each position for the tags extracted using the Read structure (cell barcode, UMI, hashtag) for the 'annotated reads'.
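The counts behind such a plot can be approximated with a few lines of Python. This is a sketch for illustration; the function name `nucleotide_counts` and the example tags are assumptions, and the tool's own implementation is not described here.

```python
from collections import Counter


def nucleotide_counts(tags):
    """Count the nucleotides seen at each position across equal-length tags."""
    if not tags:
        return []
    length = len(tags[0])
    counts = [Counter() for _ in range(length)]
    for tag in tags:
        if len(tag) != length:
            raise ValueError("all tags must have the same length")
        for position, nucleotide in enumerate(tag):
            counts[position][nucleotide] += 1
    return counts


# Hypothetical tags, e.g. cell barcodes extracted from three reads.
tags = ["ACGT", "AGGT", "TCGA"]
per_position = nucleotide_counts(tags)
```

Each entry of `per_position` holds the counts for one tag position (index 0 is the first position), which is the information a nucleotide counts plot displays as one column.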

The distributions for cell barcodes and UMIs are both expected to be roughly uniform, while the distribution for hashtags varies depending on the number of distinct hashtags in the data. The distribution becomes more uniform with more hashtags, but if only a few hashtags are expected, for example when the hashtag represents the sample, the distribution is likely to be skewed and should reflect the known expected hashtags.

For simplicity, the remainder of this section will talk about 'barcodes', but the description is equally true for UMIs.

Typically, barcodes are randomly generated, or else designed to be very different from each other, such that all nucleotides are observed at each barcode position, and in approximately equal amounts. Errors may be detected when the barcode plots do not show this behavior, such as in figure 6.5, where position 1 in the barcode is mostly 'A', position 2 is mostly 'A', position 3 is mostly 'G', etc. It appears that one barcode accounts for almost all the reads in the sample. In this case, the cell barcode part of the read structure has been misconfigured to read an adapter with sequence 'AAGCAGTGGT'. The same plot with the correct read structure is shown in figure 6.6.

Figure 6.5: Nucleotide counts plot for a misconfigured barcode. One barcode with sequence 'AAGCAGTGGT' is present in most of the reads. In this case the barcode was misconfigured to be part of an adapter.

Figure 6.6: Nucleotide counts plot for the same data as in figure 6.5. All nucleotides are seen at all positions of the barcode with comparable frequencies, except at position 1. This dataset is from a 96-well protocol where the barcodes for each well are known in advance, and the skewed nucleotide distribution at position 1 is as expected.
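The kind of check done by eye on these plots can also be sketched in Python: flag every position where a single nucleotide accounts for most of the tags. The function name `dominant_nucleotides`, the 0.8 threshold and the example data are assumptions chosen for illustration, not values used by the tool.

```python
from collections import Counter


def dominant_nucleotides(tags, threshold=0.8):
    """Return (position, nucleotide, fraction) for skewed positions.

    Positions are 1-based. A position is reported when its most common
    nucleotide makes up at least `threshold` of all equal-length tags.
    """
    flagged = []
    for position, column in enumerate(zip(*tags), start=1):
        nucleotide, count = Counter(column).most_common(1)[0]
        fraction = count / len(column)
        if fraction >= threshold:
            flagged.append((position, nucleotide, fraction))
    return flagged


# Misconfigured case: nearly every 'barcode' is the adapter sequence.
bad_tags = ["AAGCAGTGGT"] * 9 + ["CTTAGCATCA"]
bad_flags = dominant_nucleotides(bad_tags)
consensus = "".join(nucleotide for _, nucleotide, _ in bad_flags)

# Balanced case: every nucleotide appears equally often at each position.
good_tags = ["ACGT", "CGTA", "GTAC", "TACG"]
good_flags = dominant_nucleotides(good_tags)
```

In the misconfigured example every position is flagged and the dominant nucleotides spell out the adapter sequence, while the balanced example flags nothing. A flagged position is not necessarily an error: as in figure 6.6, a skew can be expected when the set of barcodes is known in advance.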