Clustering of Operational Taxonomic Units

The majority of microbial species present in the human body, or indeed anywhere in the environment, have never been isolated, cultured or sequenced, largely because the necessary growth conditions cannot be reproduced in the lab. A great deal of organismal and functional novelty therefore remains to be discovered. Two central questions in microbial community analysis are: which microbial species are present in a sample from a given habitat, and at what frequencies?

Microbiome analysis uses molecular techniques and sequencing technology to comprehensively retrieve specific regions of microbial genomic DNA that are useful for taxonomic identification. For bacteria, the most widely used regions are parts of the 16S rRNA gene. In a microbiome analysis workflow, total genomic DNA is extracted from the sample(s) of interest, a region of the 16S gene is PCR amplified, and the resulting amplicon is sequenced on a next-generation sequencing (NGS) instrument. The bioinformatics task is then to assign taxonomy to the reads and tally their occurrences.

Because bacterial taxonomy is incomplete and NGS reads contain sequencing errors, a common approach is to cluster reads at some level of similarity into pseudospecies called Operational Taxonomic Units (OTUs): all reads within, for example, 97% similarity of each other are clustered together and represented by a single sequence. PCR amplification can also introduce artefacts in the form of chimeric sequences, where template swapping produces a sequence derived from two or more parental templates. These chimeras can be identified during the clustering step.
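The clustering idea described above can be illustrated with a minimal Python sketch. It is not the algorithm used by any particular tool: it assumes a simple greedy, abundance-sorted strategy, and it measures similarity as one minus the edit distance divided by the longer sequence length, whereas production software typically relies on proper sequence alignment and fast heuristics. The function names (edit_distance, identity, cluster_otus) are hypothetical and chosen for illustration only, and chimera detection is not covered.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Illustrative similarity: 1 - edit distance / length of the longer read."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_otus(reads, threshold=0.97):
    """Greedy clustering; returns {representative sequence: total read count}."""
    # Dereplicate, then visit unique sequences from most to least abundant
    # (ties are broken alphabetically so the result is deterministic).
    unique = sorted(Counter(reads).items(), key=lambda kv: (-kv[1], kv[0]))
    otus = {}
    for seq, count in unique:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += count  # join the first OTU that is close enough
                break
        else:
            otus[seq] = count  # no match: this sequence founds a new OTU
    return otus


if __name__ == "__main__":
    # Toy reads of length 100; the variant differs from seq_a at two positions.
    seq_a = "ACGT" * 25
    variant = seq_a[:10] + "T" + seq_a[11:50] + "A" + seq_a[51:]
    seq_b = "TTGCA" * 20
    for representative, n in cluster_otus([seq_a] * 5 + [variant] * 2 + [seq_b] * 3).items():
        print(representative[:12] + "...", n)
```

Visiting sequences in order of decreasing abundance is a design choice in this sketch: abundant sequences are more likely to be error-free and so make reasonable OTU representatives, while rarer, error-containing reads are absorbed into them.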


