References management

The Reference Data Manager (figure 9.1) offers an easy way to find and download reference data provided by QIAGEN and some other public resources, and to manage reference data in a controlled manner.

When data is downloaded using the Reference Data Manager, it is stored in a folder in your home area called CLC_References. If such a folder does not already exist, it will be created and added as a Workbench File Location automatically when you first start up the CLC Genomics Workbench. Data can only be added to or deleted from this area using the Reference Data Manager, providing an extra level of control over the reference data in use at any given time.

The location that reference data is downloaded to on the CLC Genomics Workbench can be changed. Alternatively, a reference data location on a CLC Genomics Server can be used. Reference data can take up a lot of space, so it can be desirable to use a location other than the default. Both of these routes support the use of a shared reference data location. Further details are provided below.

In addition to data downloaded via the Reference Data Manager, your own reference data can be imported such that it is put under its control. This is described on http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Imported_Data.html.

Downloading data using the Reference Data Manager

To open the Reference Data Manager, click on the References button in the top Toolbar or go to the Utilities menu and select Manage Reference Data.

Image datamanagementtool
Figure 9.1: The Reference Data Manager offers access to data from several sources. Here, the Download Genomes tab has been selected, opening options for downloading reference data for various organisms. Details about the available data are provided in the right-hand pane.

The total size of the reference data is indicated when selecting the elements to download. The amount of time it will take to download this data depends on your network connection, but can take several hours on slower connections.
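As a rough illustration of how download time scales with data size and connection speed, the arithmetic can be sketched as follows. This is not part of the CLC software. The function name, the 50 GB example size, and the two connection speeds are assumptions chosen for illustration, and real downloads are usually slower than this ideal figure because of protocol overhead and shared bandwidth.

```python
def estimated_download_hours(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Return an idealized download time in hours.

    size_gb is the total size shown in the Reference Data Manager (decimal GB).
    bandwidth_mbit_s is the sustained connection speed in megabits per second.
    """
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth_mbit_s must be positive")
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 MB = 8 megabits
    seconds = size_megabits / bandwidth_mbit_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 50 GB of reference data on two connection speeds.
    for speed in (10, 100):
        hours = estimated_download_hours(50, speed)
        print(f"{speed} Mbit/s: about {hours:.1f} hours")
```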

Download Genomes: Data under this tab is downloaded directly from public repositories such as Ensembl, NCBI, and UCSC. It is neither provided by nor hosted by QIAGEN.

QIAGEN Sets: Data under this tab is provided by QIAGEN. It is organized into Reference Sets, which are sets of data elements often used together. For example, a given Reference Set might have the hg38 reference sequence and genes, CDS, and other data types based on that reference genome.

If your site hosts a mirror of the QIAGEN reference data repository, see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Advanced_preferences.html for information about how to configure your Workbench to refer to that location.

Downloading reference data locally or to a CLC Server

By default, a reference data location on your CLC Genomics Workbench is used. This is indicated by the option "Locally" being selected for where to "Manage Reference Data" at the top right-hand side of the Reference Data Manager. The amount of free space available is reported just below this (figure 9.2).

Image reference_local
Figure 9.2: Reference data is downloaded to the CLC_References area of the Workbench when the "Manage Reference Data" option is set to "Locally".

When you are logged into a CLC Genomics Server that has been configured with a File System Location called CLC_References, you will be able to select the option "On Server". When selected, reference data is downloaded to the CLC_References area on the CLC Server. By default, it is downloaded directly to the CLC Server, but downloads can be configured to go via the Workbench instead using settings within the Workbench Preferences. This can be useful if the CLC Server does not have access to the external network but the CLC Genomics Workbench does. See http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Advanced_preferences.html for information about how to configure Workbench preferences.

If you have chosen "On Server" and your CLC Genomics Server is set up to send jobs to grid nodes, you can choose which grid preset to use for downloading data under the Download Genomes tab via a drop-down menu to the left of the Download button (figure 9.3).

Image reference_server
Figure 9.3: Reference data will be downloaded to the CLC_References area on a CLC Genomics Server when the "On Server" option is selected. For grid setups, the grid preset to use when downloading data under the Download Genomes tab can be selected, as shown here.

Changing the reference data location

You can specify a different location to download reference data to on the CLC Genomics Workbench. This is recommended if you do not have enough space in the default area. To do this, go to the Navigation Area and:

        Right-click on the folder "CLC_References" | Choose "Location" | Choose "Specify Reference Location"

The selected folder will be registered as the location to associate with the Workbench CLC_References Location. This means that data downloaded via the Reference Data Manager will be stored in that folder, and that workflows with data inputs configured using workflow roles will expect to find the relevant data elements under this folder. See http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Configuring_input_output_elements.html for more information about workflow roles.

This does not remove the old CLC_References folder or its contents. Standard system tools should be used to delete these items if they are no longer needed.
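Before choosing a new folder, it can help to check how much free space a candidate location has and how much the existing CLC_References folder occupies. The following is a minimal sketch using only the Python standard library. It is not part of the CLC software, and the function names and the 10% safety margin are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def folder_size_bytes(folder: Path) -> int:
    """Return the combined size in bytes of all files under folder."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def has_enough_space(location: Path, required_bytes: int, margin: float = 0.1) -> bool:
    """Check that the disk holding location has room for required_bytes.

    margin is an assumed safety buffer (10% by default), not a CLC requirement.
    location must be an existing directory.
    """
    free_bytes = shutil.disk_usage(location).free
    return free_bytes >= required_bytes * (1 + margin)
```

For example, `folder_size_bytes(Path.home() / "CLC_References")` reports the size of the default folder, and `has_enough_space(candidate, size)` can be called with the total size shown in the Reference Data Manager, where `candidate` is whichever existing folder you are considering.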

Reference data on non-networked systems

If the CLC Genomics Workbench is installed on systems without access to the external network, the following steps can be followed to import reference data to the non-networked Workbench:

  1. Install CLC Genomics Workbench on a machine with access to the external network.
  2. Download an evaluation license via the Workbench License Manager. If you have problems obtaining an evaluation license this way, please write to us at ts-bioinformatics@qiagen.com.
  3. Use the Reference Data Manager on the networked Workbench to download the reference data of interest. By default, this would be downloaded to a folder called CLC_References.
  4. When the download is completed, copy the CLC_References folder and all its contents to a location where the machines with the CLC software installed can access it.
  5. Get the software to refer to that folder for reference data: in the Navigation Area of the non-networked Workbench, right-click on the CLC_References folder and choose the option "Specify Reference Location...". Choose the folder you copied from the networked Workbench and click Select.

You can then access reference data using the Reference Data Manager.
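The copy in step 4 can be done with any standard system tool. As one possible approach, the sketch below copies the folder with the Python standard library and then compares file names and sizes between the source and the copy. It is not part of the CLC software; the function names are assumptions made for illustration, and the size comparison is a basic completeness check rather than a checksum verification.

```python
import shutil
from pathlib import Path
from typing import Dict


def relative_file_sizes(folder: Path) -> Dict[str, int]:
    """Map each file's path relative to folder onto its size in bytes."""
    return {
        p.relative_to(folder).as_posix(): p.stat().st_size
        for p in folder.rglob("*")
        if p.is_file()
    }


def copy_reference_folder(source: Path, destination: Path) -> None:
    """Copy source and all its contents to destination, then verify the copy.

    destination must not already exist; shutil.copytree raises an error if it does.
    """
    shutil.copytree(source, destination)
    if relative_file_sizes(source) != relative_file_sizes(destination):
        raise RuntimeError(f"Copy of {source} to {destination} is incomplete")
```

A call might look like `copy_reference_folder(Path.home() / "CLC_References", Path("/mnt/shared/CLC_References"))`, where the destination path is a hypothetical shared location and should be replaced with one the non-networked machines can reach.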


