Download and configure reference data
The first time you open Biomedical Genomics Workbench you will be presented with the dialog box shown in figure 11.2, which informs you that data are available for download to either the local or the server CLC_References repository. If you check "Never show this dialog again", you will subsequently only be presented with the dialog box when updated versions of the reference data are available.
Figure 11.2: Notification that new versions of the reference data are available.
Click on the button labeled Yes. This will take you to the wizard shown in figure 11.3.
Figure 11.3: The Manage Reference Data wizard gives access to the reference data that are required to be able to run the ready-to-use workflows.
This wizard can also be accessed from the upper right corner of the Biomedical Genomics Workbench by clicking on Data Management (figure 11.4).
Figure 11.4: Click on the button labeled "Data Management" to open the "Manage Reference Data" dialog where you can download and configure the reference data that are necessary to be able to run the ready-to-use workflows.
The "Manage Reference Data" wizard gives access to all the reference data that are used in the ready-to-use workflows and in the tutorials. From the wizard you can download and configure the reference data.
In the upper part of the wizard you can find two tiles called "QIAGEN Reference Data Library" and "Custom Reference Data Sets".
On the left-hand side, you can use the drop-down menu to choose where you want to manage the reference data. If you choose "Locally", the Download, Delete and Apply buttons will work on the local reference data. If you choose "On Server" (only available if you are connected to the server), the buttons will work on the reference data on the server you are connected to (figure 11.5).
Figure 11.5: Reference data can be available locally or on the server.
You can also check how much free space is available for the Reference folder on your local disk or on the server. The drop-down menu also allows you to check which datasets have been downloaded locally or on the server. You can see this in the left panel of the reference data manager.
When on the "QIAGEN Reference Data Library" tile, you can see the list of all available reference data under four headers: Reference Data Sets, Reference Data Elements, Tutorial Reference Data Sets and Tutorial Reference Data Elements. An icon next to each entry indicates whether the data has already been downloaded to your Reference folder (check icon) or not (plus icon).
When selecting a reference set or an element, the window on the right shows the size of the folder as well as some additional information about the reference database. For Reference Data Sets, a table lists the elements included in the set with their version numbers and respective sizes, as well as the workflows affected by the set.
Here is the list of the Reference Data Sets and their approximate sizes:
Reference Data Sets
- hg38 96 GB with Ensembl v81, dbSNP v142, ClinVar 20150901
- hg38 88 GB with Ensembl v80, dbSNP v142, ClinVar 20150629
- hg19 63 GB with Ensembl v74, dbSNP v138, ClinVar 20131203
- QIAGEN Gene Reads Panels hg19 8 MB with Ensembl v74
- Mouse 15 GB with Ensembl v80
- Rat 5.5 GB with Ensembl v79
Tutorial Reference Data Sets
- chr 5 of hg19 4.5 GB for use with the Identification of Variants in a Tumor Sample tutorial
- chr 14 of hg19 2.3 GB for use with the Copy Number Variant Detection tutorial
- chr 17 of hg19 2 GB for use with the RNA-Seq Analysis of Human Breast Cancer Data tutorial
- chr 21 of hg19 1 GB for use with the ChIP Sequencing tutorial
- chr 22 of hg19 1 GB for use with the Identification of Somatic Variants in a Matched Tumor-Normal Pair tutorial
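Because the larger sets need tens of gigabytes, it can help to compare the free space on the target disk with the approximate sizes listed above before starting a download. The following is a minimal sketch in Python, not part of the Workbench: the sizes are copied from the list above and treated as decimal gigabytes, while the function names and the folder path are hypothetical and must be adapted to wherever your CLC_References folder is located.

```python
import shutil

# Approximate sizes in GB, copied from the list of Reference Data Sets above.
# The dictionary keys are illustrative labels, not official identifiers.
REFERENCE_SET_SIZES_GB = {
    "hg38 (Ensembl v81)": 96,
    "hg38 (Ensembl v80)": 88,
    "hg19": 63,
    "QIAGEN Gene Reads Panels hg19": 0.008,  # 8 MB
    "Mouse": 15,
    "Rat": 5.5,
}


def free_bytes_at(path):
    """Return the number of free bytes on the disk that holds `path`."""
    return shutil.disk_usage(path).free


def sets_that_fit(free_bytes, sizes_gb=REFERENCE_SET_SIZES_GB):
    """Return the names of the sets whose approximate size fits in `free_bytes`.

    Sizes are assumed to be decimal gigabytes (1 GB = 10**9 bytes).
    """
    free_gb = free_bytes / 1e9
    return [name for name, size in sizes_gb.items() if size <= free_gb]


if __name__ == "__main__":
    # Hypothetical location; replace with your own reference data folder.
    reference_folder = "."
    free = free_bytes_at(reference_folder)
    print(f"Free space: {free / 1e9:.1f} GB")
    for name in sets_that_fit(free):
        print(f"Enough room for: {name}")
```

The check is only a rough guide, since the listed sizes are approximate and each set is checked on its own rather than in combination with others.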
Each Reference Data Set is made of a compilation of Reference Data Elements. Downloading sets will automatically download the elements the set is made of, but you can also download elements individually under the Reference Data Elements folder.
- For Homo sapiens
  - Sequence hg38
  - Sequence hg19 (whole genome and chromosome specific)
  - dbSNP 142
  - dbSNP 138 (whole genome and chromosome specific)
  - dbSNP Common 142
  - dbSNP Common 138 (whole genome and chromosome specific)
  - HapMap phase_3_ensembl_v80 and phase_3 (whole genome and chromosome specific)
  - Genes ensembl_v80, ensembl_v73, ensembl_v74 (whole genome and chromosome specific)
  - Conservation Scores PhastCons hg38
  - Conservation Scores PhastCons hg19 (whole genome and chromosome specific)
  - ClinVar 20150629 and 20130930 (whole genome and chromosome specific), 20131203 (whole genome and chromosome specific)
  - 1000 Genomes Project phase_3 and phase_1 (whole genome and chromosome specific)
  - Gene Ontology 20150630 and 20131027 (whole genome and chromosome specific)
  - CDS ensembl_v80 and ensembl_v74 (whole genome and chromosome specific)
  - mRNA ensembl_v80 and ensembl_v74 (whole genome and chromosome specific)
  - Target Regions qiagen_v2.01_hg38, qiagen_v2.01 (whole genome and chromosome specific) and qiagen_v2 (whole genome and chromosome specific)
  - Target Primers qiagen_v2.01_hg38, qiagen_v2.01 (whole genome and chromosome specific) and qiagen_v2 (whole genome and chromosome specific)
- For Mus musculus
  - CDS ensembl_v80
  - Conservation Scores PhastCons mm10
  - dbSNP ensembl_v80
  - Gene Ontology 20150630
  - Genes ensembl_v80
  - mRNA ensembl_v80
  - Sequence ensembl_v80
- For Rattus norvegicus
  - CDS ensembl_v79
  - Conservation Scores PhastCons Rnor_5.0
  - dbSNP ensembl_v79
  - Gene Ontology 20150630
  - Genes ensembl_v79
  - mRNA ensembl_v79
  - Sequence ensembl_v79
Data that has not been downloaded yet is represented by a plus icon. Select the set or element you would like to download, and click on the Download button. While the data is downloading, the Download button fades out and you can check the progress of the download in the Processes tab below the toolbox (figure 11.6).
Figure 11.6: Click on the info button to see the legal notice and license information.
Once the reference data has been downloaded, the set or element is marked with a check icon.
Once you have finished downloading the appropriate Reference Data Set, click on the button labeled Apply, and the workflows will automatically be configured with all the relevant reference data available. The "Applied" column in the right panel of the reference data manager shows whether the dataset has been applied to the location specified in the drop-down menu. For example, a "Yes" in the "Applied" column when the drop-down menu is set to "On Server" means that the given data will be used from the server when the affected workflows are run. This will be the case even if you choose to execute the workflow locally (i.e. in the workbench). If the "Applied" column contains "Yes" when the drop-down menu is set to "Locally", the given data will be used from the local reference folder when the affected workflows are run. In that case you will not be able to execute these workflows on the server (figure 11.7).
Figure 11.7: Check where your reference data is applied by looking at the column "Applied" in the data set description.
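The rule described above for the "Applied" column can be summarized as a small lookup. The sketch below is purely illustrative and is not an API of the Workbench: the function name, the dictionary keys and the labels "workbench" and "server" are assumptions chosen for this example; only the two drop-down values and the resulting behavior come from the description above.

```python
def describe_applied(applied_location):
    """Summarize the effect of a "Yes" in the "Applied" column.

    `applied_location` is the drop-down value for which "Applied" shows
    "Yes": either "Locally" or "On Server". The returned dictionary is a
    hypothetical structure used only for illustration.
    """
    if applied_location == "On Server":
        # Data is read from the server, even when the workflow is
        # executed locally in the workbench.
        return {"data_read_from": "server",
                "can_run_in": ["workbench", "server"]}
    if applied_location == "Locally":
        # Data is read from the local reference folder, so the affected
        # workflows cannot be executed on the server.
        return {"data_read_from": "local",
                "can_run_in": ["workbench"]}
    raise ValueError(f"Unknown location: {applied_location!r}")


if __name__ == "__main__":
    for location in ("Locally", "On Server"):
        print(location, "->", describe_applied(location))
```

In practice this means that a set intended for server execution should be applied with the drop-down menu set to "On Server".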
The Reference Data Sets also contain a Create Custom Set ... button that allows you to create your own set of reference data starting from an existing data set (see Create a custom Reference Data Set).
The Delete button allows users to delete locally installed reference data, whereas only administrators can delete reference data installed on the server. Deleting can be useful if you suspect that a downloaded reference is corrupt and needs to be re-downloaded, or if you need to free up space, e.g. locally.
At the bottom of the wizard you can find:
- A button labeled "Help" that links to the section in the Biomedical Genomics Workbench reference manual that describes the "Manage Reference Data" button.
- A button labeled "Close". Click on this to close the wizard.