Download and configure reference data

The first time you open Biomedical Genomics Workbench, you will be presented with the dialog box shown in figure 12.2, which informs you that data are available for download to either the local or the server CLC_References repository. If you check the "Never show this dialog again" option, you will subsequently only be presented with the dialog box when updated versions of the reference data are available.

Image alert_new_references_available
Figure 12.2: Notification that new versions of the reference data are available.

Click on the button labeled Yes. This will take you to the wizard shown in figure 12.3.

Image manage_reference_data_beforedownload_all
Figure 12.3: The Manage Reference Data wizard gives access to the reference data that are required to be able to run the ready-to-use workflows.

This wizard can also be accessed from the upper right corner of the Biomedical Genomics Workbench by clicking on Data Management (Image search_database_16_h_p) (figure 12.4).

Image datamanagement_button
Figure 12.4: Click on the button labeled "Data Management" to open the "Manage Reference Data" dialog, where you can download and configure the reference data needed to run the ready-to-use workflows.

The "Manage Reference Data" wizard gives access to all the reference data that are used in the ready-to-use workflows. From the wizard you can download and configure the reference data.

In the upper part of the wizard you can find two tiles called "QIAGEN Reference Data Library" (Image library_data_management) and "Custom Reference Data Sets" (Image custom_data_management).

On the left-hand side, you can use the drop-down menu to choose where you want to manage the reference data. If you choose "Locally", the Download, Delete and Apply buttons will work on the local reference data. If you choose "On Server" (only available if you are connected to a server), the buttons will work on the reference data on the server you are connected to (figure 12.5).

Image reference_local_server
Figure 12.5: Reference data can be available locally or on the server.

You can also check how much free space is available for the Reference folder on your local disk or on the server. The drop-down menu also allows you to check which datasets have been downloaded locally or on the server. You can see this in the left panel of the reference data manager.
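The same kind of free-space check can be approximated outside the Workbench before starting a large download. The following is a minimal sketch, not part of Biomedical Genomics Workbench: it assumes Python 3, and the folder path and the helper names `free_space_gb` and `has_room_for` are hypothetical and chosen for illustration only.

```python
import shutil
from pathlib import Path


def free_space_gb(folder):
    """Return the free space, in gigabytes, on the disk that holds `folder`."""
    usage = shutil.disk_usage(Path(folder))
    return usage.free / 1e9


def has_room_for(folder, required_gb):
    """Return True if the disk holding `folder` has at least `required_gb` free."""
    return free_space_gb(folder) >= required_gb


if __name__ == "__main__":
    # Hypothetical location of the local reference folder; adjust to your setup.
    reference_folder = Path.home() / "CLC_References"
    # Fall back to the home directory if the folder does not exist yet.
    target = reference_folder if reference_folder.exists() else Path.home()
    print(f"{free_space_gb(target):.1f} GB free on the disk holding {target}")
```

Compare the reported value with the size shown for the set or element you intend to download.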

On the "QIAGEN Reference Data Library" tile, you can see the list of all available reference data under two headers: Reference Data Sets and Reference Data Elements. Two icons indicate whether you have already downloaded the data to your Reference folder (Image check_data_management) or not (Image plus_data_management).

When you select a reference set or an element, the window on the right shows the size of the folder as well as some complementary information about the reference database. For Reference Data Sets, a table lists the elements included in the set with their version numbers and respective sizes, along with the workflows affected by the set.

Here is the list of the Reference Data Sets and their approximate size:

Each Reference Data Set is a compilation of Reference Data Elements. Downloading a set will automatically download the elements it is made of, but you can also download elements individually under the Reference Data Elements folder. Note that data for hg19 is available for the whole genome as well as for the individual chromosomes 5, 14, 17, 21 and 22.

Data that has not yet been downloaded is represented by a plus icon (Image plus_data_management). Select the set or element you would like to download, and click on the Download button. Once the download has started, the Download button fades out and you can check the progress of the download in the Processes tab below the toolbox (figure 12.6).

Image download_progress
Figure 12.6: Click on the info button to see the legal notice and license information.

Once the reference data has been downloaded, the set or element is marked with a check icon (Image check_data_management).
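The relationship between sets, elements and the two status icons can be pictured with a small model. This is an illustrative sketch only and does not reflect how the Workbench is implemented: the class names, the element names, and the rule that a set shows the check icon once all of its elements are present are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReferenceElement:
    """A single Reference Data Element (hypothetical model)."""
    name: str
    downloaded: bool = False

    def icon(self) -> str:
        # "check" for downloaded data, "plus" for data not yet downloaded.
        return "check" if self.downloaded else "plus"


@dataclass
class ReferenceDataSet:
    """A Reference Data Set, modeled as a compilation of elements."""
    name: str
    elements: List[ReferenceElement] = field(default_factory=list)

    def icon(self) -> str:
        # Assumption: a set counts as downloaded when all its elements are.
        return "check" if all(e.downloaded for e in self.elements) else "plus"

    def download(self) -> List[str]:
        """Download every element that is still missing; return their names."""
        fetched = []
        for element in self.elements:
            if not element.downloaded:
                element.downloaded = True
                fetched.append(element.name)
        return fetched


# Placeholder element names, not actual entries in the data library.
sequence = ReferenceElement("sequence")
genes = ReferenceElement("genes")
example_set = ReferenceDataSet("example set", [sequence, genes])

sequence.downloaded = True       # one element was downloaded individually
print(example_set.download())    # only the missing element is fetched
print(example_set.icon())
```

In this model, an element that was downloaded individually is skipped when the set containing it is downloaded later.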

Once you have finished downloading the appropriate Reference Data Set, click on the button labeled Apply, and the workflows will automatically be configured with all the relevant reference data available. The "Applied" column in the right panel of the reference data manager indicates whether the dataset has been applied to the location specified in the drop-down menu. For example, a "Yes" in the "Applied" column when the drop-down menu is set to "On Server" means that the data will be used from the server when the affected workflows are run. This is the case even if you choose to execute the workflow locally (i.e. in the workbench). If the "Applied" column contains "Yes" when the drop-down menu is set to "Locally", the data will be used from the local reference folder when the affected workflows are run, and you will not be able to execute these workflows on the server (figure 12.7).

Image applied_server
Figure 12.7: Check where your reference data is applied by looking at the column "Applied" in the data set description.
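The meaning of a "Yes" in the "Applied" column can be summarized as a small decision function. This is a sketch of the rules described in the text, not Workbench code: the function and field names are hypothetical, and the case where the column does not contain "Yes" is returned as `None` because the text does not describe it.

```python
from typing import NamedTuple, Optional


class Resolution(NamedTuple):
    data_source: str          # where the workflow reads its reference data
    runs_in_workbench: bool   # workflow can be executed locally
    runs_on_server: bool      # workflow can be executed on the server


def resolve_applied(location: str, applied: bool) -> Optional[Resolution]:
    """Interpret the "Applied" column for a given drop-down menu setting.

    `location` is the drop-down value: "Locally" or "On Server".
    `applied` is True when the "Applied" column shows "Yes".
    """
    if location not in ("Locally", "On Server"):
        raise ValueError(f"unknown location: {location!r}")
    if not applied:
        return None  # not covered by the description above
    if location == "On Server":
        # Server data is used even when the workflow is executed locally.
        return Resolution("server", True, True)
    # Locally applied data cannot be used for execution on the server.
    return Resolution("local", True, False)


print(resolve_applied("On Server", True))
print(resolve_applied("Locally", True))
```

Read this way, data applied "On Server" is the less restrictive choice when workflows may be run in both places.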

For references such as the "1000 Genomes Project" and "HapMap" databases, which contain more than one reference data file, the workflow will initially be configured with all populations available, and you will be able to specify which reference data to use directly in the workflow wizard.

You can also modify a pre-existing Reference Data Set so that it contains only the population you want to work with. In the Data Management wizard, select the Reference Data Set you are interested in and click on Create Custom Set. Then select the version of the 1000 Genomes or HapMap database you wish to work with (figure 12.8).

Image custom_population
Figure 12.8: Select the version of the 1000 Genomes or HapMap database you want to work with, or select the option "custom".

A pop-up window will open where you can select the population you want to work with. Alternatively, click on the option "custom" instead of a version and choose the population of your choice from the CLC_References folder (figure 12.9).

Image reference_population
Figure 12.9: Select the version of the 1000 Genomes or HapMap database you want to work with, or select the option "custom".

Three-letter codes are used to specify the population from which the different reference data originate (e.g. ASW = Americans of African Ancestry in SW USA). For the phase 3 HapMap population codes, please see http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html and for the 1000 Genomes Project see http://www.ensembl.org/Help/Faq?id=328.
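When handling many population-specific files, a small lookup table can make the three-letter codes easier to read. The sketch below is illustrative only: it covers a handful of well-known codes rather than the complete lists on the pages linked above, and the function name `describe_population` is hypothetical.

```python
# Partial table of population codes; see the linked pages for the full lists.
POPULATION_CODES = {
    "ASW": "Americans of African Ancestry in SW USA",
    "CEU": "Utah residents with Northern and Western European ancestry",
    "CHB": "Han Chinese in Beijing, China",
    "JPT": "Japanese in Tokyo, Japan",
    "YRI": "Yoruba in Ibadan, Nigeria",
}


def describe_population(code: str) -> str:
    """Return a description for a three-letter population code.

    The lookup ignores case and surrounding whitespace. Codes missing from
    the partial table above are reported as unknown rather than guessed.
    """
    key = code.strip().upper()
    return POPULATION_CODES.get(key, f"unknown population code: {key}")


print(describe_population("asw"))
print(describe_population("YRI"))
```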

The Delete button allows users to delete locally installed reference data, whereas only administrators can delete reference data installed on the server. Deleting is useful if you suspect that a downloaded reference is corrupt and needs to be downloaded again, or if you need to free up disk space, e.g. locally.

Note: Custom reference data sets are specific to the workbench on which they are created, and will not appear in other workbenches connected to the same server.

At the bottom of the wizard you can find: