Download Curated Microbial Reference Database
The Download Curated Microbial Reference Database tool downloads selected reference databases as single sequence lists and/or taxonomic profiling indices, with the annotations required by the tools in the Typing and Epidemiology and Metagenomics sections of the Microbial Genomics Module.
To run the tool, go to:
Toolbox | Microbial Genomics Module | Metagenomics | Databases | Taxonomic Analyses | Download Curated Microbial Reference Database
In the first window (figure 17.1), select the database you wish to download.
Figure 17.1: Select the database and output format
You can choose between several databases:
- QMI-PTDB Genus: QIAGEN Microbial Insights - Prokaryotic Taxonomy Database is a microbial reference database for taxonomic profiling of bacteria and archaea. The database represents all genera with a varying number of species per genus.
Genome sequences and annotations are from the NCBI Reference Sequence Database (RefSeq; https://www.ncbi.nlm.nih.gov/refseq/) and have been annotated with taxonomy from the Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org).
The database was created by selecting one representative genome per species, and subsequently reducing the relative number of species per genus to meet the desired database size. For the reduction, higher assembly status, lower number of contigs, and longer total length were prioritized. All genomes marked as "reference genome" were retained, as were species commonly included in microbial reference standards.
When running Taxonomic Profiling with the QMI-PTDB Genus database, 32GB of memory is required.
- QMI-PTDB Family: QIAGEN Microbial Insights - Prokaryotic Taxonomy Database is a microbial reference database for taxonomic profiling of bacteria and archaea. The database represents all families with a varying number of genera per family.
Genome sequences and annotations are from the NCBI Reference Sequence Database (RefSeq; https://www.ncbi.nlm.nih.gov/refseq/) and have been annotated with taxonomy from the Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org).
The database was created by selecting one representative genome per genus, and subsequently reducing the relative number of genera per family to meet the desired database size. For the reduction, higher assembly status, lower number of contigs, and longer total length were prioritized. All genomes marked as "reference genome" were retained, as were species commonly included in microbial reference standards.
When running Taxonomic Profiling with the QMI-PTDB Family database, 16GB of memory is recommended.
- Unified Human Gastrointestinal Genome (UHGG): A database for taxonomic and functional profiling of human gut samples, curated and hosted by EMBL-EBI [Almeida et al., 2021]. The database includes metagenome-assembled genomes from human gut samples.
- Unclustered Reference Viral DataBase (U-RVDB): An unclustered reference viral database for virus detection [Goodacre et al., 2018]. The database includes curated viral, virus-related, and virus-like nucleotide sequences; bacterial viruses are excluded.
- Clustered Reference Viral DataBase (C-RVDB): A clustered reference viral database for virus detection. Viral entries are clustered at 98% sequence identity by CD-HIT-EST.
- ViraCura™ HPV REF: A curated database of Human Papillomavirus reference strains. It contains unmodified viral reference genomes and associated record information from NCBI databases.
- ViraCura™ HPV VAR: A curated database of Human Papillomavirus variants of reference strains. It contains unmodified viral reference genomes and associated record information from NCBI databases.
- ViraCura™ ANIMAL PV: A curated database of Animal Papillomavirus. It contains unmodified viral reference genomes and associated record information from NCBI databases.
- MPXV: A curated database of Monkeypox virus reference strains. It contains unmodified viral reference genomes and associated record information from NCBI databases, as well as metadata and customized taxonomic nomenclature.
- MOCOVA: A curated database of Monkeypox outgroup reference strains (Molluscum contagiosum, Cowpox, Variola, and Vaccinia). It contains unmodified viral reference genomes and associated record information from NCBI databases, as well as metadata and customized taxonomic nomenclature.
You can then choose to download the database as an annotated sequence list and/or as a taxonomic profiling index.
The Curated Microbial Reference Databases are optimized for balanced representation across the taxonomy, i.e. the oversampling of some branches of the taxonomy is removed by using representative sequences. As a consequence, some assemblies may not be particularly good assemblies, yet they are included because they constitute the best current representative of the given branch in the taxonomy. For this optimized database you can choose to download the 22g database, or the 16g version, which contains a smaller number of assemblies so that the Taxonomic Profiling tool can run on a laptop computer with 16GB of main memory.
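The idea of keeping one representative per species can be illustrated with a minimal Python sketch. It is not the procedure used to build the databases; it only ranks candidate assemblies by the criteria named above (assembly status, number of contigs, total length). The `Assembly` record, the accession names, the numeric ranking of assembly levels, and the choice to rank genomes flagged as reference genomes first are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed ordering of assembly levels; a higher rank is preferred.
ASSEMBLY_RANK = {"Complete Genome": 3, "Chromosome": 2, "Scaffold": 1, "Contig": 0}


@dataclass(frozen=True)
class Assembly:
    """Hypothetical record describing one candidate genome assembly."""
    accession: str
    species: str
    assembly_level: str
    n_contigs: int
    total_length: int
    is_reference: bool = False


def priority(a: Assembly) -> tuple:
    """Sort key: a larger tuple means a more preferred assembly."""
    return (
        a.is_reference,                            # assumption: reference genomes rank first
        ASSEMBLY_RANK.get(a.assembly_level, -1),   # higher assembly status
        -a.n_contigs,                              # fewer contigs
        a.total_length,                            # longer total length
    )


def pick_representatives(assemblies):
    """Return a dict mapping each species to its most preferred assembly."""
    best = {}
    for a in assemblies:
        current = best.get(a.species)
        if current is None or priority(a) > priority(current):
            best[a.species] = a
    return best


# Made-up candidates; accessions and sizes are placeholders, not real records.
candidates = [
    Assembly("ASM_A", "Escherichia coli", "Scaffold", 120, 5_100_000),
    Assembly("ASM_B", "Escherichia coli", "Complete Genome", 1, 4_900_000),
    Assembly("ASM_C", "Bacillus subtilis", "Contig", 40, 4_200_000),
    Assembly("ASM_D", "Bacillus subtilis", "Contig", 12, 4_100_000),
]

reps = pick_representatives(candidates)
for species, assembly in sorted(reps.items()):
    print(species, assembly.accession)
```

In this sketch the criteria are compared in strict order, so total length only breaks ties between assemblies with the same status and contig count. The weighting actually used for the curated databases may differ.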
Note: Some of the databases offered are derived works, licensed under a Creative Commons Attribution-ShareAlike (CC BY-SA) license. We offer free access to these databases without requiring a CLC product license. They can be downloaded using the CLC Genomics Workbench in viewing mode with the Microbial Genomics Module installed. The downloaded files can then be exported to non-proprietary formats using the freely available viewing mode of the CLC Genomics Workbench.