Taxonomic Profiling

Taxonomic profiling provides insight into the taxonomic composition of whole metagenome samples and estimates the relative abundance of the detected taxa.

Reads are mapped to a reference genome database and are assigned to a reference genome or higher taxonomy level based on their mapping quality score, i.e. the confidence that the read is correctly mapped:

If a host genome is provided, reads that map better to this are removed. For paired reads, when a read pair is broken, either because only one read in the pair matches, or because the distance or relative orientation is wrong, both reads are discarded.

After mapping, the tool performs qualification and quantification of the abundance of each qualified taxon, and finally compiles the results into an abundance table.

The purpose of the qualification phase is to determine whether a particular taxon is represented in a sample. The qualification is based on a confidence score that a reference did not receive its reads by pure chance. Any taxon with a confidence score < 0.995 will be ignored and the reads assigned to it will be reassigned to its closest qualified ancestor. By construction the confidence score is very close to 1.0 except on the Kingdom level of the taxonomy, thus it is not reported.

The purpose of the quantification phase is to estimate how abundant the qualified taxa are. It is based on the number of reads assigned to that taxon. For data sets with varying read length, the abundance values may optionally be adjusted to correct for a skewed read distribution between taxa (see below).

The detection limit of the tool is now controlled by the single read mapping parameter "Minimum seed length". Increasing this value will give higher precision in the taxa called (true positives), while lowering it will give more taxa called at the cost of precision (more false positives).

To run the Taxonomic Profiling tool, go to

        Toolbox | Microbial Genomics Module (Image mgm_folder_closed_flat_16_h_p) | Metagenomics (Image wma_folder_open_flat_16_n_p) | Taxonomic Analysis (Image taxonomic_analysis_folder_16_n_p) | Taxonomic Profiling (Image taxonomic_profiler_16_n_p)

You can select only one read file to analyze (figure 6.6). If the sample to be analyzed is split up into several files, the files need to be merged with the Create Sequence List tool before running the Taxonomic Profiling tool.

Image taxpro_1
Figure 6.6: Select the reads.

In the "Parameters" dialog, provide the reference database you will use to map the reads (figure 6.7). It is also possible to "Filter host reads". You must then specify the host genome as a taxonomic profiling index (for example Homo sapiens GRCh38 in the case of human microbiota). More generally, the "Host genome index" acts as a background database for reads that are expected to be present in your sample but unimportant to your analysis and may hence obscure the final abundances due to misassignments. For example, if the interest is to only see the bacterial fraction of a metagenome, a viral/fungal background database may be supplied as "host genome" to remove viral/fungal reads from the analysis. The reference database can be obtained by using the Download Curated Microbial Reference Database tool (Download Curated Microbial Reference Database) or Download Custom Microbial Reference Database tool (Download Custom Microbial Reference Database). The host index and (if using the custom downloader) the microbial reference index are built with the Create Taxonomic Profiling Index tool (Create Taxonomic Profiling Index).

Image taxpro_2
Figure 6.7: Set the parameters for taxonomic profiling.

The read mapping parameters used in Taxonomic Profiling are the standard read mapping parameters (see except for the Minimum seed length, which may be specified explicitly here. The Minimum seed length defines the minimum seed (word) size, i.e., perfect match-length, for a position in the reference to be considered a valid candidate when matching the read. If no seed longer than this length can be found in the database, the read is considered "unmapped". Increasing the Minimum seed length will giver higher precision in the results, while lowering it will give a higher recall rate but with more possible false positives.

The option "Auto-detect paired distances" will generate an estimate of the paired distance range in an additional section of the report output by the tool and the estimated distances are used in the qualification phase, i.e. both reads of a read pair are required to map to the same reference sequence within the estimated distance to be assigned to the corresponding taxon.

When enabled, the option "Adjust read count abundances" will weight the reads assigned to a taxon by the number of nucleotides mapped to that taxon, otherwise raw read counts are reported. Specifically, the adjusted abundance is calculated as

Adjusted abundance = (abundance in nucleotides) / (average mapped read length),

where the denominator is the average length of all mapped reads. For data sets where all reads have similar length, the adjusted abundance is very similar to the read count assigned to a taxon. Note that occasionally, this weighting may lead to zero reads assigned to a qualified taxon if there are only few, shorter than average reads assigned to this taxon.

In the last dialog, choose from the different output options and Open or Save the results (figure 6.8).

Image taxpro_3
Figure 6.8: Tool output options.

The tool will generate by default an abundance table as well as a report with a list of the taxa and their abundances. You can choose to output additional files such as a sequence list of the reads matching the reference database and those matching the host, as well as the unclassified reads.