Introduction to whole metagenomics analysis (beta)

Classical metagenomics analysis focussed on amplicon sequencing studies targeting the small-subunit ribosomal RNA (16S) locus. Although 16S is both a taxonomically and phylogenetically informative marker, amplicon sequencing typically only provides insight into the taxonomic composition of the microbial community. Moreover, the resolution of these studies is limited, and it is cumbersome to directly resolve the biological functions associated with these taxa using this approach [Sharpton, 2014].

Rather than focusing on a single locus, the shotgun genomic sequencing of entire communities has become a viable alternative thanks to the decreasing costs of sequencing protocols. Although it is still more expensive, this approach is applicable to samples of uncultured microbiota and avoids some of the limitations of 16S amplicon sequencing.

The possibility of resolving the biological functions associated with complex communities is very exciting and has applications in areas ranging from biomedicine to biochemistry and biotechnology. Arriving at a deeper functional understanding, however, requires metagenomic data of increasing volume and complexity.

This poses additional challenges for bioinformatics algorithms, which must cope with the amount and diversity of the resulting data. Especially for samples with high biodiversity, functional assignment becomes challenging, partly because of the higher rate of closely related species and uncharacterised organisms with hitherto elusive functions. Furthermore, it is far from straightforward to capture the term "biological function" in a structured way that is accessible to both human experts and bioinformatics algorithms.

Two of the most widely used definitions of biological function are available in the form of the Gene Ontology (GO) and Pfam databases. While GO is a hierarchy of higher-level functional categories, Pfam (Protein families) classifies proteins into families of related proteins with similar function (see, for example, "An introduction to the Pfam protein families database", http://pid.nci.nih.gov/2011/110913/full/pid.2011.3.shtml for further information).

Several tools are available for functional analysis. Starting from a whole metagenome shotgun sequencing dataset in the form of reads, the first step is to assemble the reads using the De Novo Assemble Metagenome tool (see 15.2). The resulting contigs can then be annotated with coding sequences (CDS) using the third-party MetaGeneMark plugin. Given a set of contigs with CDS annotations, the Annotate CDS with Best BLAST Hit (see 15.3) and Annotate CDS with Pfam Domains (see 15.4) tools can be used to annotate all CDS in the annotated contigs with BLAST hits or with Pfam protein families and GO terms, respectively. The database needed for GO annotation can be downloaded using the Download GO Database tool (see 15.1), the Pfam database can be downloaded using the built-in Download Pfam Database tool, and BLAST databases can be downloaded or created using the built-in Download BLAST Databases and Create BLAST Database tools.

Once the contigs are annotated with Pfam annotations, GO terms and/or BLAST hits, the next step will often be to map the original reads back to the annotated contigs, using the built-in Map Reads to Reference tool, in order to assess the abundance of the functional annotations. This last step is performed using the Build Functional Profile tool (see 15.5).
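The idea of assessing the abundance of functional annotations from mapped reads can be illustrated with a minimal Python sketch. It is not the implementation of the Build Functional Profile tool; it only assumes that each CDS carries a list of functional terms and that a read count per CDS is available from the read mapping. The function name, the input dictionaries and the example counts are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def build_functional_profile(cds_annotations, cds_read_counts):
    """Sum mapped read counts over all CDS that share a functional term.

    cds_annotations: dict mapping a CDS id to a list of functional terms
                     (for example Pfam accessions or GO terms).
    cds_read_counts: dict mapping a CDS id to the number of reads mapped to it.
    Both inputs are hypothetical, simplified stand-ins for annotated contigs
    and a read mapping.
    """
    profile = defaultdict(int)
    for cds_id, terms in cds_annotations.items():
        count = cds_read_counts.get(cds_id, 0)  # unmapped CDS contribute 0
        for term in terms:
            profile[term] += count
    return dict(profile)


# Illustrative input: three CDS, one of them without any annotation.
cds_annotations = {
    "cds_1": ["PF00005"],
    "cds_2": ["PF00005", "PF00072"],
    "cds_3": [],
}
cds_read_counts = {"cds_1": 120, "cds_2": 30, "cds_3": 55}

profile = build_functional_profile(cds_annotations, cds_read_counts)
print(profile)  # {'PF00005': 150, 'PF00072': 30}
```

In this simplified model, a CDS with several annotations contributes its full read count to each of them, and reads mapped to unannotated CDS are ignored. The actual tool may normalise or weight counts differently.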

All tools described above should be run independently for individual samples (or batched), resulting in a functional profile for each sample. A set of functional profiles can then be joined using the Merge Abundance Tables tool (see 15.6). The functional profile of multiple samples can now be visualized and compared as described in Section 3.6.3.
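Conceptually, joining per-sample functional profiles amounts to aligning them on the union of all functional terms. The following Python sketch shows one way this could look. It is an assumption-laden illustration, not the behaviour of the Merge Abundance Tables tool: the function name, the sample names and the choice to fill missing terms with 0 are all hypothetical.

```python
def merge_abundance_tables(profiles):
    """Join per-sample profiles into one table keyed by functional term.

    profiles: dict mapping a sample name to a {term: abundance} dict.
    Returns the sorted sample names and a dict mapping each term to a list
    of abundances, one per sample, in the same order as the sample names.
    Terms absent from a sample are filled with 0 (an assumption of this sketch).
    """
    samples = sorted(profiles)
    terms = sorted({term for profile in profiles.values() for term in profile})
    table = {
        term: [profiles[sample].get(term, 0) for sample in samples]
        for term in terms
    }
    return samples, table


# Illustrative per-sample profiles with partially overlapping terms.
profiles = {
    "sample_A": {"PF00005": 150, "PF00072": 30},
    "sample_B": {"PF00005": 90, "PF00133": 12},
}

samples, table = merge_abundance_tables(profiles)
print(samples)
for term, abundances in table.items():
    print(term, abundances)
```

Filling absent terms with 0 keeps every row the same length, which makes the merged table straightforward to compare across samples.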

Please note that the functionality of the whole-metagenomics analysis plugin described in this section is in beta. As this is still a very active research area, the software is also under active development and subject to change without notice.