Download Reference Genome Data
The CLC Genomics Workbench offers an easy way of retrieving popular reference data sources such as genes, variant annotations and genome sequences as tracks. The data itself is not provided or hosted by QIAGEN. The workbench only provides an easy way to retrieve data that should otherwise have been downloaded and imported:
Legacy Tools | Download Reference Genome Data ()
This opens the dialog shown in figure 33.3:
Figure 33.3: Selecting an organism for download.
Select the organism of interest and click Next. The list of organisms is dynamically updated by QIAGEN independently of Workbench versions, so you will always see the most recent list of organisms.
The list just shows a selection of some of the popular organisms - if you do not find what you are looking for, there is always the possibility to download and import the data as well.
Once you have clicked Next, you will be asked whether you wish to download the genome sequence or whether you already have it available as shown in figure 33.4:
Figure 33.4: Selecting a reference genome sequence if available.
If you do not already have an existing sequence imported into the CLC Genomics Workbench, it will be downloaded automatically from Ensembl. If you already have a reference sequence, it has to match the genome definition built into the download tool. This means that the name and length of the chromosomes in your reference sequence have to match the genome definition of the tool.
Clicking Next allows you to select which types of annotation data you wish to download as shown in figure 33.5:
Figure 33.5: Selecting different types of annotation data for human hg19.
This step is different from organism to organism and depends on the data sources that QIAGEN has included for download. In the example in figure 33.5 showing hg19 build of the human genome, a lot of variant data is available from e.g. dbSNP. Please note that for human data, a difference between the UCSC genome build and Ensembl/NCBI exists, which means that variants downloaded from UCSC will not be annotated on the mitochondrial genome when using this download tool (see http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19).
Both the data provider and the size of the data is listed in the dialog, and at the bottom you can see the size of all the selected downloads. Please note that the file size displayed in the setup window for the Download Genomes tool refers to the size of the compressed text files, which the tool is retrieving from the provider's depository. The size of the track objects will be, after decompression and conversion from text to the .clc track format, notably larger.
All data downloaded with this tool will be tracks (either sequence tracks or various kinds of annotation tracks). The Ensemble human gene annotation download will produce various tracks describing the chromosomal positions of features such as exons, genes, untranslated region (UTRs), transcripts, selenocysteines, coding sequences (CDSs) and mRNAs.