Download Genomes
Download Genomes provides access to genomes and associated reference data, such as annotations and known variants, from public repositories (figure 11.2; see footnote 11.1).
To access Download Genomes, open the Reference Data Manager by clicking on the References button in the top Toolbar, or go to the Utilities menu and select Manage Reference Data. Then click on the Download Genomes tab at the top left.
Click on an organism name to select it. Information about the data available for download for that organism is then shown in the right-hand panel, including a name describing the data, the data provider, a version (if available), and the size (figure 11.2).
Figure 11.2: Under the Download Genomes tab, reference sequences and associated reference data for a variety of organisms can be downloaded from public repositories.
Searching for data available under the Download Genomes tab
Use the search field under the top toolbar to search for terms in the names of organisms, resources and resource providers. To search for just an exact term, put the term in quotes.
The results include the name of the element or set the term was found in, followed in brackets by the tab it is listed under. Hover the cursor over a hit to see what aspect of the result matched the search term (figure 11.3). Double-click on a search result to open it.
Figure 11.3: When the Download Genomes tab is selected, terms entered in the search field are searched for in the names of organisms, resources and resource providers. Hovering the cursor over a hit reveals a tooltip with information about the match.
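The search behavior described above can be pictured with a small sketch. This is not the Workbench's implementation: the record fields, the sample entries, and the exact matching rules (case-insensitive substring match for unquoted terms, whole-word match for quoted terms) are assumptions made purely for illustration.

```python
import re

# Hypothetical records; the field names and entries are illustrative only.
RECORDS = [
    {"organism": "Homo sapiens", "resource": "Sequence", "provider": "Ensembl"},
    {"organism": "Mus musculus", "resource": "Genes and transcripts", "provider": "Ensembl"},
    {"organism": "Homo sapiens", "resource": "Variants", "provider": "UCSC"},
]


def search(query, records):
    """Return (record, matched_field) pairs for a query.

    Assumed semantics: a quoted query must match a whole word, an
    unquoted query matches any substring. Both ignore case.
    """
    query = query.strip()
    exact = len(query) >= 2 and query[0] == '"' and query[-1] == '"'
    term = query.strip('"')
    if exact:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    else:
        pattern = re.compile(re.escape(term), re.IGNORECASE)

    hits = []
    for record in records:
        for field, value in record.items():
            if pattern.search(value):
                # The matched field plays the role of the tooltip text.
                hits.append((record, field))
    return hits


for record, field in search("ucsc", RECORDS):
    print(record["organism"], "matched on", field)
```

Returning the matched field alongside each hit mirrors the tooltip shown when hovering over a result.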
Downloading resources
To download resources, select the data types of interest and click on the Download button. "Sequence" is always selected.
When reference data is stored on a CLC Server with grid nodes, the grid preset to use to download data can be specified via a drop-down menu to the left of the Download button (figure 11.4).
Figure 11.4: When the "On Server" option is selected and grid presets are configured on the CLC Server, a grid preset to use for the download can be selected.
After download, each file is imported and stored under the CLC_References File System Location. A folder for each downloaded set is created under the "Genomes" folder. Its name contains the species name and the date of the download.
Previous downloads of data for the selected organism are listed in the right-hand panel of the Reference Data Manager under "Previous downloads".
To delete downloaded data, select the entries in this list and then click on the Delete button. When reference data is stored on a CLC Server, you need to be logged in from the Workbench as an administrative user to delete reference data.
Note: Most data is supplied as a compressed text file. After download, each file is decompressed and the data is imported. CLC data is compressed by default, but the size of the compressed data after import will generally be different to the size reported for the original data file.
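To illustrate why the reported and stored sizes can differ, the sketch below compresses some made-up text, decompresses it, and compresses it again with different settings. It uses Python's standard gzip module only as a stand-in: the sample data is invented, and the compression used for CLC data is not assumed to be gzip.

```python
import gzip

# Made-up, repetitive annotation-like text standing in for a downloaded file.
original = ("chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1\n" * 1000).encode("ascii")

# What a data provider might serve: a compressed text file.
downloaded = gzip.compress(original)

# What an importer reads after decompression.
decompressed = gzip.decompress(downloaded)

# Storing the data compressed again, here with a different compression level.
recompressed = gzip.compress(decompressed, compresslevel=1)

sizes = {
    "downloaded": len(downloaded),
    "decompressed": len(decompressed),
    "recompressed": len(recompressed),
}
print(sizes)
```

The three sizes describe the same content, so the size listed for a download is only a guide to the space the imported data will occupy.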
Notes about particular data types
When GFF3 files are imported, a track is created for each feature type present in the file (see GFF3 format). In addition, an (RNA) track and a (Gene) track are created. The (RNA) track contains entries for all "RNA" type annotations, that is, all the children of "mature_transcript", which is the parent of "mRNA", which in turn is the parent of "NSD_transcript". The (Gene) track contains genes and gene-like annotation types, such as ncRNA_gene, plastid_gene, and tRNA_gene. Because they cover these broader sets of annotations, the (RNA) and (Gene) tracks can be particularly useful for some types of analyses, e.g. RNA-Seq.
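The idea of one track per feature type, plus a combined gene track, can be sketched as follows. This is an illustration of the grouping only, not the Workbench's importer: the sample GFF3 lines are invented, and the set of gene-like types is a small subset based on the examples named above, with "gene" itself assumed to belong to it.

```python
from collections import defaultdict


def group_by_feature_type(gff3_text):
    """Group GFF3 feature lines by column 3, the feature type."""
    groups = defaultdict(list)
    for line in gff3_text.splitlines():
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(fields) != 9:
            continue
        groups[fields[2]].append(line)
    return dict(groups)


# Invented example data.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=tx1",
    "chr1\tsrc\texon\t500\t900\t.\t+\t.\tParent=tx1",
    "chr1\tsrc\ttRNA_gene\t1500\t1580\t.\t-\t.\tID=gene2",
])

# One group per feature type, analogous to one track per type.
tracks = group_by_feature_type(EXAMPLE)

# Illustrative subset of gene-like types (an assumption, not the full list).
GENE_LIKE = {"gene", "ncRNA_gene", "plastid_gene", "tRNA_gene"}
gene_track = [
    line
    for feature_type, lines in tracks.items()
    if feature_type in GENE_LIKE
    for line in lines
]

print(sorted(tracks))
print(len(gene_track))
```

Collecting several related types into one group is what makes a combined track convenient for analyses that need all gene-like annotations at once.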
For some genomes, chromosome bands (ideograms) are available (figure 11.5).
Please note that hg18 and hg19 variants downloaded from UCSC do not include variants on the mitochondrial genome.
Figure 11.5: The ideogram is particularly useful when used in combination with other tracks in a track list. In this figure the ideogram is highlighted with a red box.
Footnotes
- 11.1: The data listed under the Download Genomes tab is neither provided by nor hosted by QIAGEN.