Classify Long Read Amplicons

Classify Long Read Amplicons classifies long-read, single-end amplicon sequencing data and was inspired by [Curry et al., 2022]. Reads are mapped to an amplicon reference database of choice and are then assigned the most likely taxonomy, based on probabilities estimated over a series of expectation maximization rounds. The tool can be used for both error-prone and high-accuracy reads.

Note: the tool has not been tested with error-prone PacBio (PacBio CLR) reads, but there is no reason to suspect that these reads are incompatible with the tool. Should you experience issues, please contact the QIAGEN Bioinformatics Support team at ts-bioinformatics@qiagen.com.

For tool memory requirements, see System requirements. Comprehensive reference databases are needed for reliable species-level resolution, but large databases also require more memory. For example, the RefSeq Prokaryotic 16S database (see Download Amplicon-Based Reference Database) should not require more than 16 GB of RAM, but due to its limited number of sequences, only genus-level resolution is to be expected. The tool does not deduplicate the provided reference database, so make sure that databases are non-redundant in order to save memory and runtime.

The underlying algorithm works in five overall steps (illustrative code sketches of steps 2-5 follow the list):

  1. Mapping reads to references. Reads are mapped to the provided reference database. At this stage, each read is assigned one primary alignment based on the mapping parameters, but up to 50 secondary alignments are retained for the following expectation maximization step.
  2. Calculate error model. An error model describing the probability of each alignment type is calculated from all primary alignments. The alignment types are mismatch, insertion, deletion, and softclip.
  3. Initialize alignment probabilities. For each read-to-taxonomy pair, the probability of the alignment is calculated based on the error model. Depending on the composition of the given reference database, a read may map to multiple references with the same taxonomy, i.e., multiple identical read-to-taxonomy pairs exist. In that case, the highest alignment probability among them is used.
    In this step, the initial taxonomy probabilities are also set, under the assumption that all taxonomies in the reference database are equally likely.
  4. Expectation maximization. The algorithm loops through multiple rounds of expectation maximization. In each round, the probability that each read originates from each taxonomy is calculated for every read-to-taxonomy pair, using the alignment and taxonomy probabilities. The taxonomy probabilities are then updated, and the log-likelihood of the estimate is calculated. This loop continues until the increase in log-likelihood compared to the previous iteration falls below the threshold (0.01).
  5. Abundance threshold cut-off and reassignment. When the log-likelihood increase falls below the threshold, a final round of expectation maximization is entered. Here, taxonomies falling below the set minimum abundance threshold are removed, and their assigned abundance is redistributed to the most likely taxonomies among those retained.
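To make step 2 concrete, the following Python sketch shows one plausible way such an error model could be tallied. It is a simplified illustration, not the tool's implementation: the per-alignment event counts in `primary_alignments` are hypothetical (in practice they would be parsed from the read mapping, e.g. from CIGAR strings), and the inclusion of a "match" event so that the frequencies normalize to per-base probabilities is an assumption.

```python
from collections import Counter

# Hypothetical event counts per primary alignment (step 1 output); in
# practice these would be extracted from the read mapping.
primary_alignments = [
    {"match": 950, "mismatch": 30, "insertion": 10, "deletion": 8, "softclip": 2},
    {"match": 910, "mismatch": 55, "insertion": 20, "deletion": 12, "softclip": 3},
]

totals = Counter()
for alignment in primary_alignments:
    totals.update(alignment)               # sum event counts over all alignments

total_events = sum(totals.values())
# Probability of each alignment type, estimated from all primary alignments.
error_model = {event: count / total_events for event, count in totals.items()}
print(error_model)                         # e.g. {'mismatch': 0.0425, ...}
```

An alignment probability for step 3 could then, for instance, be derived by multiplying the event probabilities across the events observed in a given alignment.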
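Steps 3-5 can likewise be sketched as a small expectation maximization loop. This is a minimal illustration under stated assumptions, not the tool's implementation: `align_prob` is assumed to already hold one alignment probability per read-to-taxonomy pair (the highest, as described in step 3), and `min_abundance` is a hypothetical stand-in for the tool's minimum abundance threshold parameter.

```python
import math
from collections import defaultdict


def em_round(align_prob, abundance, n_reads):
    """One EM round: E-step responsibilities, M-step abundance update.
    Returns the updated abundances and the log-likelihood of the estimate."""
    per_read = defaultdict(dict)                 # read -> {taxon: p * prior}
    for (read, taxon), p in align_prob.items():
        per_read[read][taxon] = p * abundance[taxon]
    log_likelihood = 0.0
    new_abundance = defaultdict(float)
    for probs in per_read.values():
        total = sum(probs.values())
        log_likelihood += math.log(total)
        for taxon, p in probs.items():           # normalize: P(taxon | read)
            new_abundance[taxon] += p / total
    return {t: v / n_reads for t, v in new_abundance.items()}, log_likelihood


def classify(align_prob, min_abundance=1e-4, tol=0.01):
    """align_prob maps (read, taxon) -> alignment probability (step 3)."""
    reads = {r for r, _ in align_prob}
    taxa = {t for _, t in align_prob}
    abundance = {t: 1.0 / len(taxa) for t in taxa}   # uniform start (step 3)

    prev_ll = -math.inf                              # step 4: EM loop
    while True:
        abundance, ll = em_round(align_prob, abundance, len(reads))
        if ll - prev_ll < tol:                       # gain below threshold
            break
        prev_ll = ll

    # Step 5: remove low-abundance taxonomies, then run one final EM round
    # so their abundance is reassigned to the most likely retained taxonomies.
    kept = {t for t, v in abundance.items() if v >= min_abundance}
    pruned = {(r, t): p for (r, t), p in align_prob.items() if t in kept}
    abundance = {t: abundance[t] for t in kept}
    abundance, _ = em_round(pruned, abundance, len({r for r, _ in pruned}))
    return abundance


if __name__ == "__main__":
    # Toy input: two reads, two candidate taxonomies.
    align_prob = {("read1", "taxonA"): 0.9, ("read1", "taxonB"): 0.3,
                  ("read2", "taxonA"): 0.8}
    print(classify(align_prob, min_abundance=0.05))
```

In the toy example, "taxonB" is only weakly supported by one read, so its estimated abundance falls below the cut-off and its mass is reassigned to "taxonA" in the final round.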



Subsections