Download Microbial Reference Database
The Download Microbial Reference Database tool downloads selected references from GenBank and RefSeq, and outputs a single sequence list with all the annotations needed for taxonomic profiling (i.e., assembly IDs).
To run the tool, go to:
Databases | Taxonomic Analyses | Download Microbial Reference Database
In the first window (figure 20.1), select the source of the database you wish to generate.
Figure 20.1: Select the references you want to download.
By choosing "Select curated database", you can download a database that is optimized for balanced taxonomic representation, i.e., the oversampling of some branches of the taxonomy is removed by using representative sequences. As a consequence, some of the included assemblies may not be of particularly high quality, but they are included because they constitute the best current representative of the given branch in the taxonomy. For this optimized database you can choose to download the full database, or a version that is optimized for running the Taxonomic Profiling tool on a laptop computer with 16 GB of main memory. The two versions of the curated database contain the same assemblies, but the version adapted for systems with 16 GB of main memory does not contain contig sequences shorter than 250,000 bp.
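The effect of the 250,000 bp cutoff can be illustrated with a minimal Python sketch. This is not the tool's implementation; the function name `filter_contigs` and the toy assembly are hypothetical and only show which contigs would remain when sequences shorter than the threshold are dropped.

```python
# Minimal sketch, not the tool's actual code: drop contigs shorter than a cutoff.
MIN_CONTIG_LENGTH = 250_000  # threshold stated for the 16 GB version


def filter_contigs(contigs, min_length=MIN_CONTIG_LENGTH):
    """Keep contigs whose sequence is at least min_length bp long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


# Hypothetical toy assembly: contig name -> sequence
assembly = {
    "contig_1": "A" * 300_000,
    "contig_2": "C" * 249_999,  # shorter than the cutoff, so it is removed
    "contig_3": "G" * 250_000,  # exactly at the cutoff, so it is kept
}

kept = filter_contigs(assembly)
print(sorted(kept))
```

Note that the assembly itself stays in the database; only its short contigs are left out, which is why both versions of the curated database contain the same assemblies.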
If you select "Create custom reference database", then the section "Customize Database" will become active. Select one of the three options:
- Include all: The database will contain both genomic and plasmid sequences.
- Include only plasmids: The database will contain only plasmid sequences.
- Exclude all plasmids: The database will not contain any plasmid sequences.
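The three options can be thought of as a simple filter on whether a sequence is a plasmid. The following Python sketch is an illustration under that assumption; the `PlasmidMode` enum, the `select_sequences` function and the example records are hypothetical and do not correspond to anything exposed by the tool.

```python
from enum import Enum


class PlasmidMode(Enum):
    # Labels mirror the three options in the "Customize Database" section.
    INCLUDE_ALL = "Include all"
    ONLY_PLASMIDS = "Include only plasmids"
    EXCLUDE_PLASMIDS = "Exclude all plasmids"


def select_sequences(records, mode):
    """Filter (name, is_plasmid) tuples according to the chosen mode."""
    if mode is PlasmidMode.INCLUDE_ALL:
        return list(records)
    if mode is PlasmidMode.ONLY_PLASMIDS:
        return [r for r in records if r[1]]
    if mode is PlasmidMode.EXCLUDE_PLASMIDS:
        return [r for r in records if not r[1]]
    raise ValueError(f"Unknown mode: {mode!r}")


# Hypothetical records: (sequence name, is_plasmid)
records = [("chromosome", False), ("plasmid_p1", True), ("plasmid_p2", True)]

for mode in PlasmidMode:
    names = [name for name, _ in select_sequences(records, mode)]
    print(f"{mode.value}: {names}")
```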
Click Next to select the type of references you want in your custom database (figure 20.2).
Figure 20.2: Choose the type of reference you want for your custom database.
You can choose from:
- Prokaryotes: Bacteria and/or Archaea
- Eukaryotes: Fungi and/or Protozoa
- Virus. Note that choosing this option will download both virus and bacterial assemblies. Viruses are identified by their BioProject ID, but this ID also refers to bacterial assemblies that were sequenced together with the virus. Filtering the table on taxonomy will allow you to see only viruses.
- Provide a list of GenBank accession numbers in the white field, or
- Browse your computer for a file with accession numbers, or
- Browse the Navigation Area of the workbench for a sequence list. The corresponding references will be appended to the downloaded sequence list automatically.
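If you maintain your accession numbers in a text file, it can help to clean the list before providing it to the tool. The Python sketch below assumes a plain-text file with one accession per line; this layout, the `#` comment convention and the function name `read_accessions` are assumptions made for illustration, not a format specified by the tool. The two accessions are used only as examples.

```python
import io


def read_accessions(handle):
    """Return unique accessions in first-seen order.

    Assumes one accession per line; blank lines and lines starting
    with '#' are skipped (an assumed convention, not a tool requirement).
    """
    seen = set()
    accessions = []
    for line in handle:
        token = line.strip()
        if not token or token.startswith("#"):
            continue
        if token not in seen:
            seen.add(token)
            accessions.append(token)
    return accessions


# An in-memory stand-in for a file on disk, with a duplicate entry
example_file = io.StringIO(
    "# example accession list\n"
    "U00096.3\n"
    "\n"
    "AL009126.3\n"
    "U00096.3\n"
)

accessions = read_accessions(example_file)
print(accessions)
```

For a real file, `open("accessions.txt")` (a hypothetical path) can be passed in place of the `io.StringIO` object.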
The time it will take to download the data (such as assembly summaries, genome report) depends on how many databases are downloaded and the bandwidth of your internet connection. No sequence data is downloaded at this point.
The tool will open a table called a Database builder (figure 20.3) from which you can design your own database. Several functions help you filter and sort the table to extract the information relevant to your project.
Figure 20.3: Output table from the Download Microbial Reference Database tool.
- Use the "Quick selection" button to quickly select predefined subsets for download:
  - Single scaffold complete genomes in RefSeq
  - Complete genomes in RefSeq
  - All complete genomes
- Aggregate the table to a specified taxonomic group using the drop-down menu in the "Data" palette of the side panel. Use the category "Name" to de-aggregate the table.
- Use filter(s) to keep only the rows you are interested in, and click on the button "Include all" to create a database with the remaining rows.
- Alternatively, click on "Include all" first to check all rows, set one or several filters, and use the button "Exclude all" on the remaining rows. Clear the filter(s) by clicking on the red buttons next to each filter set. The rows that were hidden by the filter(s) in the second step should still be checked.
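The two selection strategies in the last two items can lead to the same result. The Python sketch below models the table as a list of rows with an "included" flag and assumes that "Include all" and "Exclude all" act only on the rows currently visible after filtering. The row contents, column names and helper functions are hypothetical.

```python
# Hypothetical table rows; "included" models the checkbox state.
rows = [
    {"name": "Escherichia coli", "level": "Complete Genome", "included": False},
    {"name": "Bacillus subtilis", "level": "Scaffold", "included": False},
    {"name": "Vibrio cholerae", "level": "Complete Genome", "included": False},
]


def apply_filter(rows, predicate):
    """Return the rows that remain visible under a filter."""
    return [r for r in rows if predicate(r)]


def set_included(visible_rows, value):
    """Model "Include all" (True) or "Exclude all" (False) on visible rows."""
    for r in visible_rows:
        r["included"] = value


def included_names(rows):
    return [r["name"] for r in rows if r["included"]]


# Strategy 1: filter first, then "Include all" on the remaining rows.
visible = apply_filter(rows, lambda r: r["level"] == "Complete Genome")
set_included(visible, True)
strategy_1 = included_names(rows)

# Reset all checkboxes before trying the second strategy.
set_included(rows, False)

# Strategy 2: "Include all" first, filter for unwanted rows, then "Exclude all".
set_included(rows, True)
visible = apply_filter(rows, lambda r: r["level"] == "Scaffold")
set_included(visible, False)
strategy_2 = included_names(rows)  # filter cleared: hidden rows stay checked

print(strategy_1)
print(strategy_2)
```

The first strategy is convenient when it is easy to describe what you want to keep; the second when it is easier to describe what you want to remove.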
Once the table contains all desired rows, click Download selection. Close to the button, you can check how many references are selected and see an estimate of the total size of the selection.
The dialog shown in figure 20.4 allows you to include all annotation tracks (note that these are not needed for taxonomic profiling applications), as well as to set an additional filter "Minimum contig length" (except if you have selected the curated database). It also warns about the memory and disk space required to later run the Taxonomic Profiling tool with the database you are about to download.
Figure 20.4: The Download selection wizard.