Download and configure reference data

The first time you open CLC Cancer Research Workbench you will be presented with the dialog box shown in figure 11.2, which informs you that data are available for download to either the local or the server CLC_References repository. If you check "Never show this dialog again", you will subsequently be presented with the dialog box only when updated versions of the reference data are available.

Image alert_new_references_available
Figure 11.2: Notification that new versions of the reference data are available.

Click on the button labeled Yes. This will take you to the wizard shown in figure 11.3.

Image manage_reference_data_beforedownload_all
Figure 11.3: The Manage Reference Data wizard gives access to the reference data that are required to be able to run the ready-to-use workflows. The default view shows the references that are used in the workflows. With the "Show All" button the reference list can be expanded with additional (optional) reference data that you may find useful.

This wizard can also be accessed from the upper right corner of the CLC Cancer Research Workbench by clicking on Data Management (Image search_database_16_h_p), as shown in figure 11.4.

Image datamanagement_button
Figure 11.4: Click on the button labeled "Data management" to open the "Manage Reference Data" dialog, where you can download and configure the reference data that are necessary to run the ready-to-use workflows.

The "Manage Reference Data" wizard gives access to all the reference data that are used in the ready-to-use workflows. From the wizard you can download and configure the reference data. A button labeled "Show All" at the bottom of the dialog can be used to expand the list with additional reference data that are not required for any of the workflows (e.g. Gene Ontology). These extra reference data are provided as a service for users who would like to include information from these databases in their data analyses.

Icons are used in the "Manage Reference Data" wizard to give a quick overview of the current status of each reference. References with the status "Not downloaded and / or unconfigured", "Workflows use different versions" or "Selected version is inconsistent / not fully downloaded" are marked with a red exclamation mark (Image Exclamation_Red_16_n_p). References that are "Up to date and configured" are marked with a green check mark (Image Green_Checkmark_16_n_p). When a new version of a reference data set is available, you will see a green mark labeled "New" (Image referencemanager_new_16_n_p).

Guide to the "Manage Reference Data" wizard:

If you are connected to a CLC Server, you will be asked whether to save the downloaded reference data to your Workbench or to your Server when you click on the button labeled Download or Download All (see figure 11.9). You will see this dialog the first time you download data. After this, the dialog will appear only when both the Local and the Server versions need updating. If a new version is found for only Local or only Server, the data will automatically be downloaded to that location.
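The rule for when the save-location dialog appears can be summarized in a small sketch. This is only an illustration of the behavior described above for a Workbench that is connected to a CLC Server; the function name `download_target` and its parameters are hypothetical and are not part of the Workbench.

```python
def download_target(first_download, local_needs_update, server_needs_update):
    """Illustrative model of the save-location rule (hypothetical helper).

    Returns "ask" when the dialog is shown, "local" or "server" when the
    data are downloaded to that location automatically, and None when
    nothing needs to be downloaded.
    """
    # The dialog is shown the first time data are downloaded, and later
    # only when both the Local and the Server versions need updating.
    if first_download or (local_needs_update and server_needs_update):
        return "ask"
    # A new version for only one location is downloaded there directly.
    if local_needs_update:
        return "local"
    if server_needs_update:
        return "server"
    return None


print(download_target(False, True, False))
```

In this sketch a later update that affects only the Local repository is downloaded without any dialog, which matches the description above.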

Image save_locally_or_on_server
Figure 11.9: Select where to save the downloaded reference data. Please be aware that the total size of all reference data (in April 2014) is about 12 GB when compressed, so it can take some time to download all of it. When unzipped, the same reference data take up about 75 GB.
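Before downloading everything, it can be worth checking that the target disk has enough room. The sketch below uses the April 2014 sizes quoted above (about 12 GB compressed, about 75 GB unzipped). It assumes decimal gigabytes and, as a cautious assumption not stated in this manual, that the compressed archives and the unzipped data may both be on disk at the same time. The helper names are hypothetical.

```python
import shutil

GB = 10**9            # assumption: decimal gigabytes
COMPRESSED_GB = 12    # approximate compressed size, April 2014
UNZIPPED_GB = 75      # approximate unzipped size, April 2014


def required_bytes(keep_archives=True):
    """Approximate space needed for all reference data.

    keep_archives=True is the cautious case where archives and unzipped
    data coexist on disk (an assumption, not documented behavior).
    """
    total_gb = UNZIPPED_GB + (COMPRESSED_GB if keep_archives else 0)
    return total_gb * GB


def has_enough_space(free_bytes, keep_archives=True):
    """Return True if free_bytes covers the estimated requirement."""
    return free_bytes >= required_bytes(keep_archives)


def free_bytes_at(path):
    """Free space in bytes on the file system that holds path."""
    return shutil.disk_usage(path).free


# Example: check the disk that holds the current directory.
print(has_enough_space(free_bytes_at(".")))
```

Point `free_bytes_at` at the disk that holds your CLC_References location to check the relevant file system.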

When the reference data have been downloaded, the workflows will automatically be configured with the reference data. However, in some cases reference data are available from different population subgroups. This is the case for HapMap and the 1000 Genomes Project. Three-letter codes are used to specify the population that the different reference data originate from (e.g. ASW = Americans of African Ancestry in SW USA). For the phase 3 HapMap population codes, please see http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html and for the 1000 Genomes Project see http://www.ensembl.org/Help/Faq?id=328.
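If you handle population codes in your own scripts or notes, a small lookup table helps avoid typing mistakes. The sketch below is a deliberately incomplete illustration: only ASW is taken from this manual, the other four entries are commonly used codes shared by HapMap and the 1000 Genomes Project, and the full lists are on the pages linked above. The names `POPULATIONS` and `describe_population` are hypothetical.

```python
# Partial table of three-letter population codes (illustrative only).
POPULATIONS = {
    "ASW": "Americans of African Ancestry in SW USA",
    "CEU": "Utah residents with Northern and Western European ancestry (CEPH)",
    "CHB": "Han Chinese in Beijing, China",
    "JPT": "Japanese in Tokyo, Japan",
    "YRI": "Yoruba in Ibadan, Nigeria",
}


def describe_population(code):
    """Return the description for a three-letter population code.

    Raises KeyError for codes missing from this partial table, so that
    a mistyped code is noticed instead of silently accepted.
    """
    key = code.strip().upper()
    if key not in POPULATIONS:
        raise KeyError(f"Unknown population code: {code!r}")
    return POPULATIONS[key]


print(describe_population("asw"))
```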

Whenever a workflow uses reference data that are available from more than one population, the workflow will initially be configured with all populations available. The user must then specify which population to use in one of the wizard steps that appear when starting the workflow. How to configure your workflow with the right population is described in Download and configure reference data.

Figure 11.10 shows the CLC_References folder. If you open the folders holding the reference data, you can see that different populations are available for HapMap and the 1000 Genomes Project.

Image refdata_available_populationcodes
Figure 11.10: For the 1000 Genomes Project and HapMap reference data, data are available from different populations. For these two databases the user must manually specify the relevant population to be used in the workflows. If the user chooses not to select a population manually, the workflow will use a randomly selected population.