Taxonomic Profiling

The Taxonomic Profiling tool is designed to determine which known organisms are present in a whole shotgun metagenomic sample, and how abundant they are. To this end, the tool maps each input read to a reference database of complete genomes. If a host organism genome is provided, the mapping phase discards reads that it deems to have originated from the host. Paired reads that cannot be mapped as an intact pair are also discarded. After mapping, the tool performs qualification (by assigning the read to a taxon in the database if a match is found) and quantification of the abundance of each qualified taxon, and finally compiles the results in an abundance table.

The purpose of the qualification phase is to determine whether a particular taxon is represented in a sample. Qualification is based on a confidence score expressing how certain it is that a reference did not receive its reads by pure chance. Any taxon with a confidence score below 0.995 is ignored, and the reads assigned to it are reassigned to its closest qualified ancestor.
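The reassignment rule described above can be illustrated with a minimal sketch. It is not the tool's implementation: the function name, the dictionary inputs and the "unassigned" bucket are hypothetical, and the confidence scores are taken as given rather than computed. Only the 0.995 cutoff comes from the text.

```python
CONFIDENCE_THRESHOLD = 0.995  # cutoff stated in the text


def reassign_unqualified(read_counts, confidence, parent):
    """Move reads from unqualified taxa to their closest qualified ancestor.

    read_counts: dict mapping taxon -> number of reads assigned directly to it
    confidence:  dict mapping taxon -> confidence score (hypothetical input)
    parent:      dict mapping taxon -> parent taxon (the root maps to None)
    """
    qualified = {t for t, score in confidence.items()
                 if score >= CONFIDENCE_THRESHOLD}
    result = {}
    for taxon, count in read_counts.items():
        target = taxon
        # Walk up the taxonomy until a qualified taxon is reached.
        while target is not None and target not in qualified:
            target = parent.get(target)
        if target is None:
            # Assumption: reads with no qualified ancestor are kept apart.
            target = "unassigned"
        result[target] = result.get(target, 0) + count
    return result


# Illustrative, made-up values.
read_counts = {"E. coli": 900, "E. albertii": 3, "Escherichia": 50}
confidence = {"E. coli": 0.9999, "E. albertii": 0.60, "Escherichia": 0.999}
parent = {"E. coli": "Escherichia", "E. albertii": "Escherichia",
          "Escherichia": None}
print(reassign_unqualified(read_counts, confidence, parent))
```

In this made-up example, the three reads of the unqualified species are moved up to its genus, which is qualified.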

The purpose of the quantification phase is to estimate how abundant the qualified taxa are. The estimate for each taxon is based on the number of reads assigned to that taxon.
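As a rough illustration of read-count-based quantification, the sketch below converts per-taxon read counts into fractions of all assigned reads. Expressing abundance as a simple fraction is an assumption made for this example, and the function name is hypothetical; the text only states that abundance is based on the number of assigned reads.

```python
def relative_abundance(read_counts):
    """Return each taxon's share of the total number of assigned reads.

    read_counts: dict mapping taxon -> number of reads assigned to it
    """
    total = sum(read_counts.values())
    if total == 0:
        # No reads assigned at all: report zero abundance everywhere.
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: count / total for taxon, count in read_counts.items()}


# Illustrative, made-up counts.
print(relative_abundance({"Escherichia coli": 300, "Bacteroides fragilis": 100}))
```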

The detection limit of the tool is now controlled by a single read mapping parameter, "Minimum seed length". Increasing this value gives higher precision in the taxa called (a larger share of true positives), while lowering it gives more taxa called at the cost of precision (more false positives).

To run the Taxonomic Profiling tool, go to Metagenomics | Taxonomic Analysis | Taxonomic Profiling.

You can select only one read file to analyze (figure 6.6). If the sample to be analyzed is split up into several files, the files need to be merged with the Create Sequence List tool before running the Taxonomic Profiling tool.

Image taxpro_1
Figure 6.6: Select the reads.

In the "Parameters" dialog, provide the reference database you will use to map the reads (figure 6.7). Note that the first time you run the tool with a given database, the analysis will take longer because the tool indexes and caches the database, as indicated by the warning message in the dialog. Subsequent runs will be faster, because the workbench can reuse the index generated the first time around.

Image taxpro_2
Figure 6.7: Set the parameters.

In that dialog, you can choose to "Filter host reads". You must then specify the host genome (for example, the Homo sapiens hg38 genome in the case of human microbiota).

The read mapping parameters used in the taxonomic profiler are the standard read mapping parameters (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Mapping_parameters.html) except for the Minimum seed length, which may be specified explicitly here. The Minimum seed length defines the minimum seed (word) size, i.e., the perfect match length, for a position in the reference to be considered a valid candidate when matching the read. If no seed longer than this length can be found in the database, the read is considered "unmapped". Increasing the Minimum seed length will give higher precision in the results, while lowering it will give a higher recall rate but with more possible false positives.
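To make the effect of the seed length concrete, the following sketch tests whether a read shares a perfect match of a given length with a reference sequence. It only illustrates the idea of a seed and is not the workbench's mapping algorithm: the function name and sequences are made up, the threshold is treated as "at least this long", and the reverse strand is ignored.

```python
def has_seed(read, reference, min_seed_length):
    """Return True if read and reference share an exact match of at least
    min_seed_length bases (forward strand only, for simplicity)."""
    if min_seed_length <= 0:
        raise ValueError("min_seed_length must be positive")
    k = min_seed_length
    if len(read) < k or len(reference) < k:
        return False
    # Any shared exact match of length >= k contains a shared k-mer,
    # so comparing k-mers is sufficient.
    reference_kmers = {reference[i:i + k]
                       for i in range(len(reference) - k + 1)}
    return any(read[i:i + k] in reference_kmers
               for i in range(len(read) - k + 1))


# Made-up sequences: the longest shared perfect match is 9 bases.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TTTTAGCCGATTTT"
print(has_seed(read, reference, 8))   # seed requirement met
print(has_seed(read, reference, 10))  # read would count as unmapped
```

With the shorter seed the read is a mapping candidate, and with the longer seed it is not. This mirrors the trade-off described above: a longer seed is stricter, while a shorter seed admits more reads, including potentially spurious ones.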

Finally, checking the option "Auto-detect paired distances" will generate an estimate of the paired distance range in an additional section of the report output by the tool.

In the last dialog, choose from the different output options and Open or Save the results (figure 6.8).

Image taxpro_3
Figure 6.8: Workflow output options.

By default, the tool generates an abundance table as well as a report with a list of the taxa and their abundances. You can choose to output additional files, such as sequence lists of the reads matching the reference database and of those matching the host, as well as the unclassified reads.


