Analyze Viral Hybrid Capture Panel Data

The Analyze Viral Hybrid Capture Panel Data template workflow is designed for detecting viruses, calculating their abundances, and calling variants in the identified viruses, using data generated with hybrid capture panels. The workflow (figure 3.1) trims the reads, creates a QC report, removes host DNA from the dataset, calculates viral abundances, and maps the reads to the most abundant viral reference for variant calling.

Note: After taxonomic profiling, the viral reads are downsampled to a maximum of 500,000 read pairs.

Note: If your panel does not contain control genes, modify the workflow: right-click the workflow in the Toolbox and open a copy of it. Remove the Map Reads to Human Control Genes workflow element together with its input and output, and save the modified workflow. When you run the modified workflow, human control genes are no longer required.

Image viralhybridcapture_wf
Figure 3.1: Analyze Viral Hybrid Capture Panel Data workflow.
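
As an illustration of the downsampling noted above, the following Python sketch caps a collection of read pairs at a maximum count. It is not the Workbench's implementation: the function name downsample_pairs, the seed parameter, and the choice of uniform random sampling without replacement are assumptions made for this example.

```python
import random


def downsample_pairs(pairs, max_pairs=500_000, seed=0):
    """Return at most max_pairs read pairs.

    Assumption: pairs are sampled uniformly without replacement.
    The sampling strategy used by the workflow itself is not
    described in this section.
    """
    pairs = list(pairs)
    if len(pairs) <= max_pairs:
        # Nothing to do: the sample is already within the limit.
        return pairs
    rng = random.Random(seed)  # fixed seed for reproducible output
    return rng.sample(pairs, max_pairs)
```

Samples with fewer read pairs than the limit pass through unchanged, so only deeply sequenced samples are affected.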

Preliminary steps to run the Analyze Viral Hybrid Capture Panel Data workflow

Before starting the workflow,

How to run the Analyze Viral Hybrid Capture Panel Data workflow

To run the workflow, go to:

        Toolbox | Template Workflows | Microbial Workflows | Metagenomics | Taxonomic Analysis | Analyze Viral Hybrid Capture Panel Data

  1. Specify the sample(s) or folder(s) of samples you would like to analyze (figure 3.2) and click Next. Note that if you select several items, they will be run as batch units.

    Image viralhybridcapture_input
    Figure 3.2: Select the reads from the sample(s) you would like to analyze

  2. Specify the human control genes as a sequence list (figure 3.3). Alternatively, if your panel does not contain control genes, modify the workflow: right-click the workflow in the Toolbox and open a copy of it. Then remove the Map Reads to Reference tool together with its input and output, and save the modified workflow. When you run the modified workflow, human control genes are no longer required.

    Image viralhybridcapture1
    Figure 3.3: Select the human control genes or reference

  3. Define batch units, either by using the organisation of the input data to create one run per input, or by using a metadata table. Click Next.
  4. The next wizard window gives you an overview of the samples present in the selected folder(s). Choose which of these samples you want to analyze in case you are not interested in analyzing all the samples from a particular folder (figure 3.4).

    Image viralhybridcapture_batch
    Figure 3.4: Choose which of the samples present in the selected folder(s) you want to analyze.

  5. You can specify a trim adapter list and set up parameters if you would like to trim your sequences (figure 3.5).

    Image viralhybridcapture2
    Figure 3.5: Choose trimming settings and optionally add an adapter trim list for trimming sequencing reads.

    The parameters that can be set are:

    • Trim ambiguous nucleotides: if checked, this option trims the sequence ends based on the presence of ambiguous nucleotides (typically N).
    • Maximum number of ambiguities: defines the maximal number of ambiguous nucleotides allowed in the sequence after trimming.
    • Trim using quality scores: if checked, and if the sequence files contain quality scores from a base-caller algorithm, this information can be used for trimming sequence ends.
    • Quality limit: defines the minimal value of the Phred score for which bases will not be trimmed.
    • Trim adapter list: specifying a trim adapter list is optional but recommended to ensure the highest quality data for your analysis (figure 3.5).

    Learn about trim adapter lists at http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Trim_adapter_list.html
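
To make the trimming parameters above more concrete, here is a simplified Python sketch of end trimming by quality score and by ambiguous nucleotides. It is only an approximation for illustration and should not be read as the algorithm the Workbench uses: the function names, the default values, and the simple end-stripping strategy for quality scores are all assumptions.

```python
def trim_by_quality(seq, quals, quality_limit=20):
    """Strip bases from both ends while their Phred score is below
    quality_limit (hypothetical default). Bases at or above the limit
    stop the trimming, mirroring the "Quality limit" description."""
    start, end = 0, len(seq)
    while start < end and quals[start] < quality_limit:
        start += 1
    while end > start and quals[end - 1] < quality_limit:
        end -= 1
    return seq[start:end], quals[start:end]


def trim_ambiguous(seq, max_ambiguities=2):
    """Return the longest stretch of seq containing at most
    max_ambiguities ambiguous nucleotides (N); the ends outside
    that stretch are trimmed away."""
    best_start, best_end = 0, 0
    left = 0
    n_count = 0
    for right, base in enumerate(seq):
        if base in "Nn":
            n_count += 1
        # Shrink the window from the left until it is valid again.
        while n_count > max_ambiguities:
            if seq[left] in "Nn":
                n_count -= 1
            left += 1
        if right + 1 - left > best_end - best_start:
            best_start, best_end = left, right + 1
    return seq[best_start:best_end]
```

For example, with max_ambiguities set to 1, the sequence NNNACGTNACNN is reduced to ACGTNAC, the longest stretch containing a single N.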

  6. In the next wizard window, "Taxonomic Profiling", select the viral reference database index you will use to map the reads (figure 3.6). You can also enable "Filter host reads", in which case you must specify the index of the host genome (for human viruses, for example, Homo sapiens GRCh38). Note that if your panel uses human control genes, a taxonomic profiling index of the human genome should be used as "Host index".

    Image viralhybridcapture3
    Figure 3.6: Select taxonomic profiling index

  7. In the next wizard window, select the viral reference database you will use to find the best matching reference (figure 3.7). The best matching reference will be used for read mapping and variant calling. If you wish to have variant calls annotated with amino acid changes, the input database should contain CDS annotations.

    Image viralhybridcapture4
    Figure 3.7: Select viral reference database

  8. In the next wizard window, specify the parameters for the Low Frequency Variant Detection tool (figure 3.8). Note that variants are filtered after variant detection to coverage > 30x and frequency ≥ 20%.

    Image viralhybridcapture5
    Figure 3.8: Specify the parameters to be used by the Low Frequency Variant Detection tool.
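
The post-detection filter mentioned in this step (coverage above 30x and frequency of at least 20%) can be sketched in Python as follows. The dictionary keys "coverage" and "frequency" and the function name are hypothetical and chosen only for this example; they do not correspond to a Workbench API.

```python
def passes_post_filter(variant, min_coverage=30, min_frequency=20.0):
    """Keep a variant only if coverage is strictly greater than
    min_coverage and frequency (in percent) is at least min_frequency."""
    return (variant["coverage"] > min_coverage
            and variant["frequency"] >= min_frequency)


# Hypothetical variant records for illustration only.
variants = [
    {"name": "A123G", "coverage": 31, "frequency": 20.0},
    {"name": "C456T", "coverage": 30, "frequency": 55.0},
    {"name": "G789A", "coverage": 400, "frequency": 19.9},
]
kept = [v["name"] for v in variants if passes_post_filter(v)]
```

Note the asymmetry in the thresholds: a variant at exactly 30x coverage is removed, whereas a variant at exactly 20% frequency is kept.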

    The parameters that can be set are:

    • Required significance: The required significance level for low frequency variant calls.
    • Base quality filter: The base quality filter can be used to ignore the reads whose nucleotide at the potential variant position is of dubious quality.
    • Neighborhood radius: Determine how far away from the current variant the quality assessment should extend.
    • Minimum central quality: Reads whose central base has a quality below the specified value will be ignored. This parameter does not apply to deletions since there is no "central base" in these cases.
    • Minimum neighborhood quality: Reads for which the minimum quality of the bases is below the specified value will be ignored.
    • Read direction filter: The read direction filter removes variants that are almost exclusively present in either forward or reverse reads.
    • Direction frequency %: Variants that are not supported by at least this frequency of reads from each direction are removed.
    • Relative read direction filter: The relative read direction filter attempts to do the same thing as the Read direction filter, but does this in a statistical, rather than absolute, sense: it tests whether the distribution among forward and reverse reads of the variant-carrying reads is different from that of the total set of reads covering the site. The statistical, rather than absolute, approach makes the filter less stringent.
    • Significance %: Variants whose read direction distribution differs significantly from the expected distribution, in a test at this level, are removed. The lower you set the significance cut-off, the fewer variants will be filtered out.
    • Read position filter: This filter removes variants that are located differently in the reads carrying them than would be expected given the general location of the reads covering the variant site.
    • Remove pyro-error variants: This filter can be used to remove insertions and deletions in the reads that are likely to be due to pyro-like errors in homopolymer regions. There are two parameters that must be specified for this filter:
        • In homopolymer regions with minimum length: Only insertion or deletion variants in homopolymer regions of at least this length will be removed.
        • With frequency below: Only insertion or deletion variants whose frequency (ignoring all non-reference and non-homopolymer variant reads) is lower than this threshold will be removed.
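
As a rough illustration of the Read direction filter and its Direction frequency % parameter described above, the Python sketch below checks whether the less represented read direction still reaches the threshold. It assumes that the frequency is computed among the reads supporting the variant; this interpretation, the function name, and the argument names are assumptions for this example, not a description of the tool's internals.

```python
def passes_read_direction_filter(forward_reads, reverse_reads,
                                 direction_frequency_pct):
    """Return True if both directions contribute at least
    direction_frequency_pct percent of the variant-supporting reads.

    forward_reads / reverse_reads: counts of reads supporting the
    variant in each direction (assumed interpretation).
    """
    total = forward_reads + reverse_reads
    if total == 0:
        # No supporting reads at all: nothing to keep.
        return False
    minority_pct = 100.0 * min(forward_reads, reverse_reads) / total
    return minority_pct >= direction_frequency_pct
```

With a threshold of 5%, a variant seen in 95 forward and 5 reverse reads would be kept under this sketch, while one seen in 99 forward reads and 1 reverse read would be removed.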

  9. In the Result handling window, click the Preview All Parameters button to preview, but not change, all parameters. Choose to save the results (we recommend creating a new folder for them) and click Finish.

The output will be saved in the folder you selected. An example of the output can be seen in figure 3.9.

Image viralhybridcapture_output
Figure 3.9: Output of analysis of viral hybrid capture panel data

The output generated for each sample is:

For each batch analysis run, the following outputs are generated: