References management

The Reference Data Manager (figure 8.1) offers an easy way of retrieving popular reference data sources such as genes, variant annotations and genome sequences as tracks.

Figure 8.1: Click on the References button to open the Reference Data Manager. Here, public reference data and QIAGEN Sets can be downloaded, and custom data sets can be imported and configured.

The total size of the reference data is indicated when selecting the elements to download. Download time depends on your network connection and can be several hours on slower connections.
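As a rough guide to how the reported size translates into download time, the arithmetic can be sketched as below. This is only an illustration: the function name and the example figures (30 GB, 20 Mbit/s) are hypothetical, a gigabyte is assumed to be 10^9 bytes, and real downloads are usually slower than the nominal connection speed.

```python
def estimate_download_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Return a rough download time in hours.

    size_gb: total size shown when selecting elements to download,
             in gigabytes (assumed here to mean 10**9 bytes).
    bandwidth_mbps: sustained network throughput in megabits per second.
    """
    if size_gb < 0 or bandwidth_mbps <= 0:
        raise ValueError("size must be non-negative and bandwidth positive")
    size_megabits = size_gb * 1000 * 8  # gigabytes -> megabits
    seconds = size_megabits / bandwidth_mbps
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical figures: a 30 GB selection over a 20 Mbit/s connection.
    print(f"{estimate_download_hours(30, 20):.1f} hours")  # prints "3.3 hours"
```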

Where reference data is downloaded from

Data downloaded using the Download Genomes tab comes from public repositories such as Ensembl, NCBI, and UCSC. This type of data is neither provided nor hosted by QIAGEN. The list of organisms is dynamically updated.

The QIAGEN Sets tab allows you to download curated Reference Data Sets directly from a QIAGEN reference data repository. The location of the repository can be changed within the Preferences of the Workbench. This is only relevant if your site is hosting a mirror of this area.

Downloading reference data

By default, data downloaded using the Reference Data Manager is stored in a folder in your home area called CLC_References. If such a folder does not already exist, it will be created and added as a Workbench File Location automatically when you first start up the CLC Genomics Workbench.

In the top right-hand corner of the Reference Data Manager, the option "Locally" next to "Manage Reference Data" indicates that data will be downloaded to the CLC Genomics Workbench. The amount of free space available is reported just below this (figure 8.2).

Figure 8.2: Reference data is downloaded to the CLC_References area of the Workbench when the "Manage Reference Data" option is set to "Locally".
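The free space shown in the Reference Data Manager can also be checked outside the Workbench before starting a large download. The following is a minimal sketch using the Python standard library. It assumes the default CLC_References folder in your home area; the function name, the 10% safety margin, and the 30 GB figure are hypothetical.

```python
import shutil
from pathlib import Path


def has_room_for_download(location: Path, download_bytes: int, margin: float = 0.1) -> bool:
    """Return True if `location` has space for the download plus a safety margin."""
    free = shutil.disk_usage(location).free
    return free >= download_bytes * (1 + margin)


if __name__ == "__main__":
    # Assumed default location: a CLC_References folder in the home area.
    references = Path.home() / "CLC_References"
    # Fall back to the home area itself if the folder has not been created yet.
    target = references if references.exists() else Path.home()
    needed = 30 * 10**9  # hypothetical 30 GB selection
    print("Enough space:", has_room_for_download(target, needed))
```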

Downloading data to a CLC_References area on a server

If you are logged into a CLC Genomics Server that has been configured with a File System Location called CLC_References, then the "Manage Reference Data" drop-down menu in the top right corner of the Reference Data Manager will show the option "On Server". If this is selected, reference data will be downloaded to the CLC_References area on the server.

If you have chosen "On Server" and your CLC Genomics Server is set up to send jobs to grid nodes, you can choose which grid preset to use for downloading data under the Download Genomes tab via a drop-down menu to the left of the Download button (figure 8.3).

Figure 8.3: Reference data will be downloaded to the CLC_References area on a CLC Genomics Server when the "On Server" option is selected. For grid setups, the grid preset to use when downloading data under the Download Genomes tab can be selected, as shown here.

By default, data will be downloaded directly to the CLC_References location on the server. Downloads can be configured to go via the Workbench using settings within the Workbench Preferences. This can be useful if the CLC Genomics Server does not have access to the external network but the CLC Genomics Workbench does.

Changing the reference data location

You can specify a different location to download reference data to on the CLC Genomics Workbench. This is recommended if you do not have enough space in the default area. To do this, go to the Navigation Area and:

        Right-click on the folder "CLC_References" | Choose "Location" | Choose "Specify Reference Location"

If it does not already exist, this folder will be created. It is then registered as the place to download reference data to. Workflows configured to use particular Reference Data Sets will look in this new location for those.

This action does not remove the old CLC_References folder or its contents. Standard system tools should be used to delete these items if they are no longer needed. Alternatively, this data can be moved to the new location using standard system tools.
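If you prefer a script to a file manager for moving the old data, the move could be sketched as follows. This is an illustration using the Python standard library, not a tool provided with the Workbench. The function name is hypothetical, and the choice to skip items that already exist at the new location is an assumption made here to avoid overwriting data.

```python
import shutil
from pathlib import Path


def move_reference_data(old_root: Path, new_root: Path) -> list[str]:
    """Move every top-level item from old_root into new_root.

    Items that already exist in new_root are skipped, so nothing at the
    new location is overwritten. Returns the names of the moved items.
    """
    new_root.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materialises the listing before anything is moved.
    for item in sorted(old_root.iterdir()):
        destination = new_root / item.name
        if destination.exists():
            continue  # leave data already at the new location untouched
        shutil.move(str(item), str(destination))
        moved.append(item.name)
    return moved
```

Skipped items remain in the old folder, so they can be compared by hand before being deleted.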

Reference data on non-networked systems

CLC Genomics Workbench may be installed on systems without access to the external network. In that case, the following steps can be followed to import reference data to the non-networked Workbench:

  1. Install CLC Genomics Workbench on a machine with access to the external network.
  2. Download an evaluation license via the Workbench License Manager. If you have problems obtaining an evaluation license this way, please write to us at [email protected].
  3. Use the Reference Data Manager on the networked Workbench to download the reference data of interest. By default, this would be downloaded to a folder called CLC_References.
  4. When the download is completed, copy the CLC_References folder and all its contents to a location where the machines with the CLC software installed can access it.
  5. Get the software to refer to that folder for reference data: in the Navigation Area of the non-networked Workbench, right-click on the CLC_References folder and choose the option "Specify Reference Location...". Choose the folder you copied from the networked Workbench and click Select.

You can then access reference data using the Reference Data Manager.
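The copy in step 4 involves many large files, so it can be worth confirming that everything arrived before pointing the non-networked Workbench at the folder. A minimal sketch of a copy followed by a check of file names and sizes is shown below. The function names are hypothetical, and comparing sizes is a quick sanity check only; it does not detect corrupted content of the same length.

```python
import shutil
from pathlib import Path


def manifest(root: Path) -> dict[str, int]:
    """Map each file's path relative to root onto its size in bytes."""
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def copy_and_verify(source: Path, destination: Path) -> bool:
    """Copy source into destination, then check every source file is present
    at the destination with the same size."""
    shutil.copytree(source, destination, dirs_exist_ok=True)
    copied = manifest(destination)
    return all(copied.get(name) == size for name, size in manifest(source).items())
```

For example, `copy_and_verify(Path.home() / "CLC_References", Path("/path/to/shared/CLC_References"))` would copy the default folder to a shared location, where the destination path is a placeholder for a location of your choosing.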