Taxonomic Profiling

The Taxonomic Profiling tool is designed to determine which known organisms are present in a whole shotgun metagenomic sample, and how abundant they are. To this end, the tool maps each input read to a reference database of complete genomes. If a host organism genome is provided, the mapping phase disregards reads that it deems to have originated from the host. Paired reads that cannot map as an intact pair are also dismissed. After mapping, the tool performs qualification (assigning each read to a taxon in the database if a match is found), then quantification of the abundance of each qualified taxon, and finally compiles the results in an abundance table.

The purpose of the qualification phase is to determine whether a particular taxon is represented in a sample. The qualification is based on the confidence score that a reference did not receive its reads by pure chance. Any taxon with a confidence score < 0.995 will be ignored and the reads assigned to it will be reassigned to its closest qualified ancestor.
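The reassignment rule described above can be sketched in a few lines of Python. This is only a toy illustration, not the tool's implementation: the taxon names, confidence scores, and parent links are made up, the function name `reassign_unqualified` is hypothetical, and counting reads with no qualified ancestor as unassigned is an assumption.

```python
# Toy illustration only: all names and numbers below are hypothetical.
CONFIDENCE_THRESHOLD = 0.995  # threshold stated in the text


def reassign_unqualified(read_counts, confidence, parent):
    """Move reads from unqualified taxa to their closest qualified ancestor.

    read_counts: taxon -> number of reads assigned to it
    confidence:  taxon -> confidence score
    parent:      taxon -> parent taxon (None at the root)
    """
    qualified = {t for t, score in confidence.items()
                 if score >= CONFIDENCE_THRESHOLD}
    result = {t: 0 for t in qualified}
    unassigned = 0  # assumption: reads with no qualified ancestor end up here
    for taxon, count in read_counts.items():
        current = taxon
        # Walk up the taxonomy until a qualified taxon is reached.
        while current is not None and current not in qualified:
            current = parent.get(current)
        if current is None:
            unassigned += count
        else:
            result[current] += count
    return result, unassigned


read_counts = {"species_a": 120, "species_b": 3, "genus_x": 10}
confidence = {"species_a": 0.999, "species_b": 0.60, "genus_x": 0.9999}
parent = {"species_a": "genus_x", "species_b": "genus_x", "genus_x": None}

counts, unassigned = reassign_unqualified(read_counts, confidence, parent)
print(counts, unassigned)
```

In this example, species_b does not reach the threshold, so its 3 reads are added to those of its parent genus_x.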

The purpose of the quantification phase is to estimate how abundant the qualified taxa are. The estimate is based on the number of reads assigned to each taxon. For data sets with varying read length, the abundance values are adjusted to correct for a skewed read distribution between taxa.

Adjusted abundance = (abundance in nucleotides) / (mean read length)

For data sets where all reads have similar length, the adjusted abundance equals the read count assigned to a taxon.
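As a small worked example of the adjustment, the sketch below assumes that "abundance in nucleotides" is the total number of bases in the reads assigned to a taxon, and that "mean read length" is the mean over all assigned reads in the data set. Both interpretations are assumptions, and the function name and read lengths are made up for illustration.

```python
def adjusted_abundance(read_lengths_by_taxon):
    """Return taxon -> (total bases assigned) / (mean read length).

    Assumption: the mean read length is taken over all assigned reads.
    """
    all_lengths = [n for lengths in read_lengths_by_taxon.values()
                   for n in lengths]
    mean_read_length = sum(all_lengths) / len(all_lengths)
    return {taxon: sum(lengths) / mean_read_length
            for taxon, lengths in read_lengths_by_taxon.items()}


# Varying read length: taxon_b received longer reads than taxon_a.
varying = adjusted_abundance({
    "taxon_a": [100, 100],
    "taxon_b": [200, 200, 200, 200],
})

# Uniform read length: the adjusted abundance is just the read count.
uniform = adjusted_abundance({"taxon_a": [150, 150, 150]})

print(varying)
print(uniform)
```

With varying lengths, taxon_a scores below its raw read count of 2 and taxon_b above its raw count of 4, because taxon_b's reads cover more nucleotides. With uniform lengths, the value reduces to the read count, in line with the statement above.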

The detection limit of the tool is now controlled by the single read mapping parameter "Minimum seed length". Increasing this value gives higher precision in the taxa called (fewer false positives), while lowering it gives more taxa called at the cost of precision (more false positives).

To run the Taxonomic Profiling tool, go to Metagenomics | Taxonomic Analysis | Taxonomic Profiling.

You can select only one read file to analyze (figure 6.6). If the sample to be analyzed is split up into several files, the files need to be merged with the Create Sequence List tool before running the Taxonomic Profiling tool.

Figure 6.6: Select the reads.

In the "Parameters" dialog, provide the reference database to which the reads will be mapped (figure 6.7). It is also possible to "Filter host reads", in which case you must specify the host genome (for example Homo sapiens hg38 in the case of human microbiota). The reference database can be obtained with the Download Microbial Reference Database tool (see Download Microbial Reference Database), and both indexes are built with the Create Taxonomic Profiling Index tool (see Create Taxonomic Profiling Index).

Figure 6.7: Set the parameters for taxonomic profiling.

The read mapping parameters used in the taxonomic profiler are the standard read mapping parameters, except for the Minimum seed length, which may be specified explicitly here. The Minimum seed length defines the minimum seed (word) size, i.e., the length of a perfect match, for a position in the reference to be considered a valid candidate when matching the read. If no seed longer than this length can be found in the database, the read is considered "unmapped". Increasing the Minimum seed length will give higher precision in the results, while lowering it will give a higher recall rate but with more possible false positives.
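To make the seed idea concrete, the toy sketch below checks whether a read shares an exact match of a given length with a reference sequence. It is not the mapper's algorithm: the function name `has_seed` and the sequences are made up, and treating the parameter as an inclusive minimum (a seed of exactly that length counts) is an assumption.

```python
def has_seed(read, reference, min_seed_length):
    """Return True if read and reference share an exact match of at least
    min_seed_length bases (toy check, forward strand only)."""
    if min_seed_length <= 0:
        raise ValueError("min_seed_length must be positive")
    k = min_seed_length
    # Any shared match of length >= k contains a shared match of length k,
    # so comparing k-length words is sufficient.
    reference_words = {reference[i:i + k]
                       for i in range(len(reference) - k + 1)}
    return any(read[i:i + k] in reference_words
               for i in range(len(read) - k + 1))


reference = "ACGTACGTTAGC"
read = "TTACGTTAGG"

# The longest exact match between the two sequences is "TACGTTAG" (8 bases).
mapped_at_8 = has_seed(read, reference, 8)
mapped_at_9 = has_seed(read, reference, 9)
print(mapped_at_8, mapped_at_9)
```

Raising the threshold from 8 to 9 turns this read from a mapping candidate into an "unmapped" read, which illustrates the trade-off: stricter seeds discard more reads but leave fewer spurious matches.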

Finally, checking the option "Auto-detect paired distances" will generate an estimate of the paired distance range in an additional section of the report output by the tool.

In the last dialog, choose from the different output options and Open or Save the results (figure 6.8).

Figure 6.8: Tool output options.

By default, the tool generates an abundance table as well as a report listing the taxa and their abundances. You can choose to output additional files, such as sequence lists of the reads matching the reference database, the reads matching the host, and the unclassified reads.