If you are analyzing a list of variants that have been detected in a tumor or blood sample where no control sample is available from the same subject, you can use the Filter Somatic Variants (WGS) template workflow to identify potential somatic variants. The purpose of this template workflow is to use publicly available (or your own) databases, with common variants in a population, to extract potential somatic variants whenever no control/normal sample from the same subject is available.
This workflow accepts variant tracks () (e.g. the output from the Identify Variants template workflow) as input. Variants that are identical to the human reference sequence are first filtered away and then variants found in the Common dbSNP, 1000 Genomes Project, and HapMap databases are deleted. Variants in those databases are assumed to not contain relevant somatic variants.
Please note that this tool will likely also remove inherited cancer variants that are present at a low percentage in a population.
Next, the remaining somatic variants are annotated with gene names, amino acid changes, conservation scores and information from ClinVar (known variants with medical impact) and dbSNP (all known variants).
To run the Filter Somatic Variants (WGS) workflow, go to:
Toolbox | Template Workflows | Biomedical Workflows () | Whole Genome Sequencing () | Somatic Cancer () | Filter Somatic Variants ()
- Double-click on the Filter Somatic Variants (WGS) to start the analysis. If you are connected to a server, you will first be asked where you would like to run the analysis.
- Next, you will be asked to select the variant track you would like to use for filtering somatic variants (figure 19.13).
- In the next dialog, you have to select which data set should be used to filter somatic variants (figure 19.14).
- The 1000 Genomes population(s) is bundled in the downloaded reference dataset and therefore preselected in this step. If you want to use another variant track you can browse in the navigation area for the preferred track to use (figure 19.15).
- For databases that provide data from more than one population as HapMap does, the populations relevant to the data set can be specified. Click on the plus symbol () and choose the population that matches the population your samples are derived from (figure 19.16). Please note that different populations are available and can be downloaded via the Reference Data Manager found in the top right corner of the CLC Workbench.
- In the last wizard step you can check the selected settings by clicking on the button labeled Preview All Parameters.
In the Preview All Parameters wizard you can only check the settings, and if you wish to make changes you have to use the Previous button from the wizard to edit parameters in the relevant windows.
- Choose to Save your results and click Finish.
Two types of output are generated:
- Somatic Candidate Variants Track that holds the variant data. This track is also included in the Track List. If you hold down the Ctrl key (Cmd on Mac) while clicking on the table icon in the lower left side of the View Area, you can open the table view in split view. The table and the variant track are linked together, and when you click on a row in the table, the track view will automatically bring this position into focus.
- Track List Filter Somatic Variants A collection of tracks presented together. Shows the somatic candidate variants together with the human reference sequence, genes, transcripts, coding regions, and variants detected in ClinVar, 1000 Genomes, and the PhastCons conservation scores (see figure 19.17).
The track with the conservation scores allows you to see the level of nucleotide conservation (from a multiple alignment with many vertebrates) in the region around each variant. Mapped sequencing reads as well as other tracks can be easily added to the Track List.
Open the variant track as a table showing all variants and the added information/annotations (see figure 19.18.
Adding information from other sources may help you identify interesting candidate variants for further research. E.g. common genetic variants (present in the HapMap database) or variants known to play a role in drug response or other relevant phenotypes (present in the ClinVar database) can easily be identified. Further, variants not found in the ClinVar databases, can be prioritized based on amino acid changes in case the variant causes changes on the amino acid level.
A high conservation level, between different vertebrates or mammals, in the region containing the variant, can also be used to give a hint about whether a given variant is found in a region with an important functional role. If you would like to use the conservation scores to identify interesting variants, we recommend that variants with a conservation score of more than 0.9 (PhastCons score) is prioritized over variants with lower conservation scores.
It is possible to filter variants based on their annotations. This type of filtering can be facilitated using the table filter found at the top part of the table. If you are performing multiple experiments where you would like to use the exact same filter criteria, you can include in a workflow the Filter on Custom Criteria tool configured with the desired set of criteria.