Identify and Annotate Variants (WES)

The "Identify and Annotate Variants" tool identifies and annotates variants in a single sample. It runs a workflow that combines the "Identify Variants" and "Annotate Variants" workflows.

The tool runs an internal workflow that starts by mapping the sequencing reads to the human reference sequence. A local realignment is then performed to improve the subsequent variant detection. After the variants have been detected, they are annotated with gene names, amino acid changes, conservation scores, information from clinically relevant variants present in the COSMIC and ClinVar databases, and information from common variants present in the common dbSNP, HapMap, and 1000 Genomes databases. Furthermore, a detailed mapping report or a targeted region report (whole exome and targeted amplicon analysis) is created so that you can inspect the overall coverage and mapping specificity.

Import your targeted regions

A file with the genomic regions targeted by the amplicon or hybridization kit is available from the vendor of the enrichment kit and sequencing machine. To obtain this file, contact the vendor and ask them to send you the target regions file. You will receive the file in either .bed or .gff format.

To import the file:

        Go to the toolbar | Import | Tracks
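
Before importing, it can be useful to sanity-check a .bed file received from the vendor. The following is a minimal sketch, not part of the workbench: the function names read_bed_regions and total_target_size are hypothetical, and the code assumes the standard BED layout (chromosome, 0-based start, exclusive end in the first three columns). It also assumes that empty intervals are not meaningful for target regions.

```python
def read_bed_regions(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    Header lines ("track", "browser", "#") and blank lines are skipped.
    Assumption: a target region must be non-empty, so end must exceed start.
    """
    regions = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"line {number}: invalid interval {start}-{end}")
        regions.append((chrom, start, end))
    return regions


def total_target_size(regions):
    """Sum of region lengths in bases (BED ends are exclusive)."""
    return sum(end - start for _, start, end in regions)


example = [
    "track name=targets",
    "chr1\t100\t200\tamplicon_1",
    "",
    "chr2\t50\t80",
]
regions = read_bed_regions(example)
print(len(regions), "regions,", total_target_size(regions), "bases")
```

For a real file, the lines of the opened file can be passed in place of the example list. Overlapping regions are counted twice by total_target_size in this sketch.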

How to run the "Identify and Annotate Variants" ready-to-use workflow

  1. Go to the toolbox and double-click on the "Identify and Annotate Variants" ready-to-use workflow (figure 13.42).

    Image run_identify_and_filter_variants_wes
    Figure 13.42: The ready-to-use workflows are found in the toolbox.

    This will open the wizard shown in figure 13.43 where you can select the sequencing reads from the sample that should be analyzed.

    Image annotate_and_filter_variants_wizardstep1_wes
    Figure 13.43: Please select all sequencing reads from the sample to be analyzed.

    If several samples should be analyzed, the tool has to be run in batch mode. To do this, tick "Batch" at the bottom of the wizard (figure 13.43) and select the folder that holds the data you wish to analyze. Batch mode is also the right choice if your sequencing data are stored in separate folders.

    When you have selected the sample(s) you wish to analyze, click on the button labeled Next.

  2. In the next wizard step (figure 13.44) you can select the population from the 1000 Genomes project that you would like to use for annotation.

    Image identify_and_annotate_variants_step3_wes
    Figure 13.44: Select the population from the 1000 Genomes project that you would like to use for annotation.

  3. In the next wizard step (figure 13.45) you can select the target region track and specify the minimum read coverage that should be present in the targeted regions.

    Image identify_and_annotate_variants_step4_wes
    Figure 13.45: Select the track with targeted regions from your experiment.

  4. Click on the button labeled Next, which will take you to the next wizard step (figure 13.46). In this dialog, you have to specify the parameters for the variant detection. The parameters that can be adjusted in the variant detection step are described under the "Low Frequency Variant Detection" tool in the CLC Cancer Research Workbench user manual (http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Low_Frequency_Variant_Detection.html). Because the general filters apply to all variant detectors available in CLC Cancer Research Workbench, they are described in a separate section called "Filters" (see http://www.clcsupport.com/clccancerresearchworkbench/current/index.php?manual=Filters.html). If you click on "Locked Settings", you will be able to see all parameters used for variant detection in the ready-to-use workflow.

    Image identify_and_annotate_variants_step5_wes
    Figure 13.46: Specify the parameters for variant calling.

  5. Click on the button labeled Next, which will take you to the next wizard step (figure 13.47). In this dialog you can specify the target regions track. The variants found outside the targeted region will be removed at this step in the workflow.

    Image identify_and_annotate_variants_step6_wes
    Figure 13.47: In this wizard step you can specify the target regions track. Variants found outside these regions will be removed.

  6. Click on the button labeled Next, which will take you to the next wizard step (figure 13.48). Once again, select the relevant population from the 1000 Genomes project. This will add information from the 1000 Genomes project to your variants.

    Image identify_and_annotate_variants_step7_wes
    Figure 13.48: Select the relevant population from the 1000 Genomes project. This will add information from the 1000 Genomes project to your variants.

  7. Click on the button labeled Next, which will take you to the next wizard step (figure 13.49). At this step you can select a population from the HapMap database. This will add information from the HapMap database to your variants.

    Image identify_and_annotate_variants_step8_wes
    Figure 13.49: Select a population from the HapMap database. This will add information from the HapMap database to your variants.

  8. In this wizard step (figure 13.50) you can check the selected settings by clicking on the button labeled Preview All Parameters. The Preview All Parameters wizard is read-only: you can check the settings, but it is not possible to make any changes at this point.

    Image identify_and_annotate_variants_step9_wes
    Figure 13.50: Check the settings and save your results.

  9. Choose to Save your results and press Finish.

    Note! If you choose to open the results, the results will not be saved automatically. You can always save the results at a later point.

Output from the Identify and Annotate Variants workflow

The "Identify and Annotate Variants" tool produces several outputs.

Please do not delete any of the produced files individually, as some of them are linked to other outputs. Always delete all of them at the same time.

A good place to start is the mapping report, where you can see whether the coverage is sufficient in the regions of interest (e.g. > 30x). Furthermore, please check that at least 90% of the reads are mapped to the human reference sequence. In the case of a targeted experiment, please also check that the majority of the reads map to the targeted region.
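
The checks described above can be sketched as a small script applied to numbers read off the mapping report. This is only an illustration: the function name check_mapping_qc and its parameters are hypothetical, the values must be copied from the report by hand, and "the majority" is interpreted here as more than 50% of the mapped reads.

```python
def check_mapping_qc(mean_coverage, total_reads, mapped_reads, on_target_reads=None):
    """Return a list of failed checks (empty if all checks pass).

    Thresholds follow the guidance in the text: coverage above 30x,
    at least 90% of reads mapped, and (for targeted experiments) a
    majority of mapped reads on target. The 50% cut-off for "majority"
    is an assumption of this sketch.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    failed = []
    if mean_coverage <= 30:
        failed.append("coverage")
    if mapped_reads / total_reads < 0.90:
        failed.append("mapped_fraction")
    # Only evaluated for targeted experiments, where on-target counts exist.
    if on_target_reads is not None and mapped_reads > 0:
        if on_target_reads / mapped_reads <= 0.5:
            failed.append("on_target_fraction")
    return failed


# Illustrative values only, not taken from a real report.
good_sample = check_mapping_qc(45.0, 1000, 950, on_target_reads=800)
poor_sample = check_mapping_qc(20.0, 1000, 850, on_target_reads=300)
print(good_sample)
print(poor_sample)
```

An empty list means the sample passed all three checks; otherwise the list names the checks that need a closer look in the report.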

Next, open the Genome Browser View file (see figure 13.51).

The Genome Browser View includes a track of the identified annotated variants in the context of the human reference sequence, genes, transcripts, coding regions, targeted regions, mapped sequencing reads, clinically relevant variants in the COSMIC and ClinVar databases, and common variants in the common dbSNP, HapMap, and 1000 Genomes databases.

Image annotate_and_filter_variants_result1_wes
Figure 13.51: Genome Browser View to inspect identified variants in the context of the human genome and external databases.

To see the level of nucleotide conservation (from a multiple alignment with many vertebrates) in the region around each variant, a track with conservation scores is added as well.

Double-clicking on the annotated variant track in the Genome Browser View opens a table that includes all variants and the added information/annotations (see figure 13.52).

Image annotate_and_filter_variants_result2_wes
Figure 13.52: Genome Browser View with an open track table to inspect identified somatic variants more closely in the context of the human genome and external databases.

The added information will help you to identify candidate variants for further research. For example, known cancer-associated variants (present in the COSMIC database) or variants known to play a role in drug response or other clinically relevant phenotypes (present in the ClinVar database) can easily be seen.

Variants that are not found in COSMIC or ClinVar can, for example, be prioritized based on whether they cause a change at the amino acid level. A high level of conservation at the variant position across many vertebrates or mammals can also hint that the region has an important functional role, so variants with a conservation score above 0.9 (PhastCons score) should be given higher priority. The variants can be filtered further on their annotations using the table filter at the top of the table.
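
The prioritization described above can be sketched for variants exported from the table and processed outside the workbench. This is a rough illustration rather than the workbench's own method: the dictionary keys (in_cosmic, in_clinvar, amino_acid_change, conservation_score) are hypothetical stand-ins for the table columns, and the weights are arbitrary choices that only encode the order of preference given in the text.

```python
def priority_score(variant, conservation_cutoff=0.9):
    """Score a variant; higher means higher priority.

    Weights are illustrative assumptions: a database hit outweighs an
    amino acid change, which outweighs high conservation alone.
    """
    score = 0
    if variant.get("in_cosmic") or variant.get("in_clinvar"):
        score += 4
    if variant.get("amino_acid_change"):
        score += 2
    conservation = variant.get("conservation_score")
    if conservation is not None and conservation > conservation_cutoff:
        score += 1
    return score


def prioritize(variants):
    """Return variants sorted from highest to lowest priority."""
    return sorted(variants, key=priority_score, reverse=True)


# Made-up example rows; the identifiers and values are not real data.
variants = [
    {"id": "v1", "amino_acid_change": "", "conservation_score": 0.2},
    {"id": "v2", "amino_acid_change": "p.Gly12Asp", "conservation_score": 0.95},
    {"id": "v3", "in_cosmic": True, "amino_acid_change": "", "conservation_score": 0.1},
    {"id": "v4", "amino_acid_change": "p.Ala5Val", "conservation_score": 0.5},
]
ranked_ids = [variant["id"] for variant in prioritize(variants)]
print(ranked_ids)
```

A single additive score keeps the sketch short; a real analysis would likely treat these criteria as separate filters so that each decision stays visible.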

If you wish to always apply the same filter criteria, use the "Create new Filter Criteria" tool to specify the filter and extend the "Identify and Annotate Variants" workflow with the "Identify Candidate Tool" (configured with the Filter Criterion). See the reference manual for more information on how preinstalled workflows can be edited.

Please note that if none of the variants are present in COSMIC, ClinVar or dbSNP, the corresponding annotation column headers will be missing from the result.

If you would like to change the databases or the database versions used, please use "Data Management".