Taxonomic Profiling
Taxonomic profiling provides insight into the taxonomic composition of whole metagenome samples and estimates the relative abundance of the detected taxa.
Reads are mapped to a reference genome database and are assigned to a reference genome or higher taxonomy level based on their mapping quality score, i.e. the confidence that the read is correctly mapped:
- Reads that map to only one reference location are assigned to that genome.
- Reads that map best to one reference location, but where other almost as good alternatives are found, are assigned to the taxonomy one level up from the best-match genome.
- Reads that map equally well to more than one reference location are assigned to the lowest common ancestor.
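The three assignment rules above can be sketched as follows. This is only an illustration, not the tool's implementation: the toy taxonomy in `PARENT`, the function names, and the `near_tie_margin` threshold for "almost as good" are all hypothetical, and it assumes that a read whose best hit is clearly better than every alternative is treated like a uniquely mapped read.

```python
# Hypothetical toy taxonomy (child -> parent); all names are illustrative only.
PARENT = {
    "genome_A": "species_X",
    "genome_B": "species_X",
    "genome_C": "species_Y",
    "species_X": "genus_G",
    "species_Y": "genus_G",
    "genus_G": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the deepest taxon shared by the lineages of all given taxa."""
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))
    # Walking up one lineage, the first shared node is the deepest one.
    return next(node for node in lineage(taxa[0]) if node in shared)


def assign_read(hits, near_tie_margin=2):
    """Assign a read given {genome: mapping score}.

    near_tie_margin is an assumed cutoff for "almost as good"; the real
    criterion is not specified in the text above.
    """
    if not hits:
        return None
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    best_genome, best_score = ranked[0]
    if len(ranked) == 1:
        return best_genome  # unique mapping: assign to that genome
    tied = [genome for genome, score in ranked if score == best_score]
    if len(tied) > 1:
        return lowest_common_ancestor(tied)  # equally good: assign to the LCA
    if best_score - ranked[1][1] <= near_tie_margin:
        return PARENT.get(best_genome, best_genome)  # near tie: one level up
    return best_genome  # assumption: a clearly best hit counts as unique
```

For example, under these assumptions a read with hits `{"genome_A": 40, "genome_C": 40}` is assigned to `genus_G`, the lowest common ancestor of the two genomes in the toy taxonomy.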
If a host genome is provided, reads are mapped individually to both the reference genome database and the host genome. Reads that map to both are assigned to the match with the higher mapping score, and reads that map better to the host genome are filtered out.
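A minimal sketch of the host comparison, assuming each read's best mapping score against the reference database and against the host is already available as a dictionary. The function name and input layout are hypothetical, and keeping a read on an exact tie is an assumption, since the text only says that reads mapping better to the host are removed.

```python
def filter_host_reads(reference_scores, host_scores):
    """Split reads into (kept, filtered) lists of read IDs.

    Both inputs map read ID -> best mapping score (hypothetical layout).
    A read is filtered only if its host score is strictly higher.
    """
    kept, filtered = [], []
    for read_id, ref_score in reference_scores.items():
        host_score = host_scores.get(read_id)  # None if the read did not map to the host
        if host_score is not None and host_score > ref_score:
            filtered.append(read_id)
        else:
            kept.append(read_id)  # assumption: ties stay with the reference match
    return kept, filtered
```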
For paired reads, a read pair is considered broken when only one read in the pair matches, or when the distance between the reads or their relative orientation is wrong. Both reads of a broken pair are discarded.
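The broken-pair check could look roughly like the sketch below. The `Alignment` record, the forward-reverse orientation, the requirement that both mates hit the same reference, and the distance limits are all assumptions made for illustration; the actual criteria depend on the library type and the import settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Hypothetical record for one mapped read."""
    reference: str
    start: int
    end: int
    forward: bool


def pair_is_intact(mate1: Optional[Alignment], mate2: Optional[Alignment],
                   min_dist: int = 0, max_dist: int = 1000) -> bool:
    """Return False if the pair is broken; both reads would then be discarded."""
    if mate1 is None or mate2 is None:
        return False  # only one read in the pair matched
    if mate1.reference != mate2.reference:
        return False  # assumption: mates must map to the same reference
    if mate1.forward == mate2.forward:
        return False  # wrong relative orientation
    fwd, rev = (mate1, mate2) if mate1.forward else (mate2, mate1)
    if fwd.start > rev.start:
        return False  # mates point away from each other
    span = rev.end - fwd.start  # outer distance covered by the pair
    return min_dist <= span <= max_dist
```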
Following mapping of reads, qualification and quantification steps refine the results:
- Qualification. Determines whether a particular taxon is represented in a sample. The calculation is based on a confidence score that reflects whether a reference sequence was assigned reads by pure chance. Any taxon with a confidence score < 0.995 is ignored, and its reads are reassigned to its closest qualified ancestor. By construction, the confidence score is very close to 1.0 except at the Kingdom level of the taxonomy, so it is not reported.
- Quantification. Calculates the abundance of qualified taxa based on the number of assigned reads.
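The two refinement steps can be sketched together as shown here. The confidence scores are taken as given, because the text does not describe how they are computed. The function name, the input dictionaries, the handling of taxa without a score, and the use of a simple read fraction as relative abundance are assumptions for illustration.

```python
def qualify_and_quantify(read_counts, confidence, parent, threshold=0.995):
    """Reassign reads from unqualified taxa and compute relative abundances.

    read_counts: taxon -> number of assigned reads
    confidence:  taxon -> confidence score (assumed to be precomputed)
    parent:      taxon -> parent taxon (hypothetical taxonomy layout)
    """
    qualified_counts = {}
    for taxon, count in read_counts.items():
        target = taxon
        # Walk up the taxonomy until a taxon passes the confidence threshold.
        # Assumption: a taxon without a score is treated as unqualified.
        while target is not None and confidence.get(target, 0.0) < threshold:
            target = parent.get(target)
        if target is None:
            continue  # assumption: reads with no qualified ancestor are dropped
        qualified_counts[target] = qualified_counts.get(target, 0) + count
    total = sum(qualified_counts.values())
    # Assumption: relative abundance is the fraction of qualified reads.
    abundance = {t: c / total for t, c in qualified_counts.items()} if total else {}
    return qualified_counts, abundance
```

With counts of 60, 5, and 35 reads for two species and their shared genus, where the second species scores 0.9, that species is ignored and its 5 reads move up to the genus, giving 60 and 40 reads respectively.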
For data sets with varying read length, the abundance values may optionally be adjusted to correct for a skewed read distribution between taxa; see "Adjust for read length variation" in Taxonomic Profiling parameters.