Bin Pangenomes by Taxonomy
This tool assigns contigs, and the reads from which they were assembled, to bins together with other contigs of presumably closely related taxonomy. To do so, it uses a microbial reference (genome) sequence database whose sequences carry taxonomic information. In addition, to separate contigs of plasmid origin from those of genomic origin, the Bin Pangenomes by Taxonomy tool takes a plasmid database as input.
Binning occurs in 5 consecutive steps:
- Obtain taxonomic information for reads
- Obtain plasmid information for reads
- Map reads to contigs
- Assign taxonomic and plasmid labels to contigs
- Group and filter contigs according to labels (Contig purity)
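Steps 3 and 4 can be pictured as a simple tally: once each read has a label and a contig it maps to, the labels are counted per contig. The sketch below is only an illustration of that idea, not the tool's actual implementation; the function name, the dictionary layout, and the "unclassified" fallback label are assumptions made for the example.

```python
from collections import Counter, defaultdict


def tally_labels_per_contig(read_to_contig, read_to_label):
    """Count, for every contig, how many mapped reads carry each label.

    Both arguments are hypothetical in-memory dictionaries keyed by read ID;
    the tool's internal data structures are not documented here.
    """
    counts = defaultdict(Counter)
    for read_id, contig_id in read_to_contig.items():
        # Reads without a taxonomic hit get an assumed placeholder label.
        label = read_to_label.get(read_id, "unclassified")
        counts[contig_id][label] += 1
    return counts


# Toy data: three reads map to contig_1, one read maps to contig_2.
read_to_contig = {"r1": "contig_1", "r2": "contig_1", "r3": "contig_1", "r4": "contig_2"}
read_to_label = {"r1": "Escherichia", "r2": "Escherichia", "r3": "Salmonella"}

counts = tally_labels_per_contig(read_to_contig, read_to_label)
for contig_id, label_counts in counts.items():
    print(contig_id, dict(label_counts))
```

The per-contig counts produced this way are the kind of input a purity filter (step 5) would need.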
The Bin Pangenomes by Taxonomy tool takes one or more single or paired-end read files as input (figure 6.1).
The tool is designed to work on contigs that were assembled from the same set of reads used as input, using the De Novo Assembly Metagenome tool (as in the workflow; see QC, Assemble and Bin Pangenomes). The desired minimum contig length can also be specified here (figure 6.2).
Figure 6.2: Select the references and specify the parameters needed for running the tool.
As reference databases, one or two Taxonomic Profiling index files can be provided (figure 6.2):
- the file provided as "References" is used to find taxonomic information for the reads
- the file provided as "Plasmid reference" (once the "Find plasmid information" option is checked) is used to distinguish genomic reads from plasmid reads.
Both references can be obtained with the Download Microbial Reference Database tool (see Download Microbial Reference Database), and the indexes are built with the Create Taxonomic Profiling Index tool (see Create Taxonomic Profiling Index).
Depending on the dataset, it may be necessary to adapt the contig purity settings. "Maximum level" is the highest level in the taxonomic tree at which a contig may be assigned, and "Minimum purity" is the fraction of the reads mapping to a contig that must agree on a taxon for the contig to be considered part of a bin. For example, if Maximum level = Genus, Minimum purity = 0.8, and 512 reads map to a given contig, then 0.8 * 512 = 409.6, so at least 410 reads need to have the same Genus level taxonomy for the contig to become part of the respective bin. If more precise taxonomic information (e.g., at Species level) is available with the requested minimum purity, that information is used instead.
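The purity rule above can be written down as a short calculation. The following is a minimal sketch under stated assumptions, not the tool's implementation: the rank list, the function names, and the per-rank count dictionary are hypothetical, and the sketch assumes that the most specific rank meeting the threshold wins, as the paragraph above suggests.

```python
import math
from collections import Counter
from fractions import Fraction

# Assumed rank order, most specific first (illustrative subset).
RANKS = ["Species", "Genus", "Family", "Order", "Class", "Phylum"]


def required_reads(total_reads, min_purity):
    """Smallest whole number of reads that reaches the minimum purity."""
    # Fraction(str(...)) avoids floating-point rounding, e.g. 0.8 -> 4/5.
    return math.ceil(Fraction(str(min_purity)) * total_reads)


def pick_bin(rank_counts, total_reads, max_level="Genus", min_purity=0.8):
    """Return (rank, taxon) for the most specific rank that is pure enough.

    rank_counts maps a rank name to a Counter of taxon -> number of reads.
    Ranks above max_level are never considered; None means "impure contig".
    """
    needed = required_reads(total_reads, min_purity)
    for rank in RANKS[: RANKS.index(max_level) + 1]:
        counts = rank_counts.get(rank)
        if not counts:
            continue
        taxon, n_reads = counts.most_common(1)[0]
        if n_reads >= needed:
            return rank, taxon
    return None


# Worked example from the text: 512 reads, purity 0.8 -> 410 reads needed.
rank_counts = {
    "Species": Counter({"Escherichia coli": 300, "Escherichia albertii": 130}),
    "Genus": Counter({"Escherichia": 430, "Salmonella": 82}),
}
print(required_reads(512, 0.8))
print(pick_bin(rank_counts, total_reads=512))
```

In this toy input no single species reaches 410 reads, so the contig is placed at Genus level; had one species reached the threshold, the Species label would be returned instead.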
In the "Result handling" dialog, you can specify whether to collect the read mappings and of which kind (all, or only those for impure contigs), whether to collect the ignored reads (reads not mapping to contigs), and whether to output a quality report for the bins (figure 6.3).
Figure 6.3: Select the references and specify the parameters needed for running the tool.
The standard output of the Bin Pangenomes by Taxonomy tool consists of a list of (binned) contigs and one sequence list per input reads file (or two for paired reads). Each sequence is labeled according to its most probable origin and the bin it ended up in; the bin annotation is stored as an "Assembly ID" annotation so that it works seamlessly with other tools. In addition, a column called "isPlasmid" holds a true/false label indicating whether the contig or read was mapped to a plasmid (true) or to a genome (false). The tool can also output a Taxonomy binning report.
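To illustrate how the "Assembly ID" and "isPlasmid" labels can be used downstream, the sketch below groups labeled contigs into bins and separates plasmid from genomic sequences. It assumes the annotations are available as plain Python dictionaries; that representation, the "name" key, and the function name are hypothetical and do not describe an export format of the tool.

```python
from collections import defaultdict


def group_by_bin(records):
    """Group sequence names by bin, split into genomic and plasmid lists.

    Each record is an assumed dictionary with the keys "name",
    "Assembly ID" (the bin label) and "isPlasmid" (True/False).
    """
    bins = defaultdict(lambda: {"genomic": [], "plasmid": []})
    for rec in records:
        kind = "plasmid" if rec["isPlasmid"] else "genomic"
        bins[rec["Assembly ID"]][kind].append(rec["name"])
    return bins


# Toy records; the names and bin labels are made up for the example.
records = [
    {"name": "contig_1", "Assembly ID": "Escherichia", "isPlasmid": False},
    {"name": "contig_2", "Assembly ID": "Escherichia", "isPlasmid": True},
    {"name": "contig_3", "Assembly ID": "Salmonella", "isPlasmid": False},
]

bins = group_by_bin(records)
for bin_id, members in bins.items():
    print(bin_id, members)
```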