The Analyze Viral Hybrid Capture Panel Data workflow is designed for detecting viruses, calculating their abundances, and calling variants for the identified viruses, using data generated with hybrid capture panels. The workflow (figure 6.21) performs read trimming, creates a QC report, cleans the dataset of host DNA, calculates viral abundances, and maps the reads to the most abundant viral reference for variant calling.
Note: After taxonomic profiling, the viral reads are downsampled to a maximum of 500,000 read pairs.
Also note: If your panel does not contain control genes, modify the workflow by right-clicking on it in the tool box and opening a copy of the workflow. Then remove the Map Reads to Human Control Genes tool plus its input and output, and save the modified workflow. When you run the modified workflow, human control genes are no longer required.
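The downsampling step mentioned in the note above can be pictured with a minimal Python sketch. It assumes simple random sampling of whole read pairs; the manual does not state how the workflow selects reads, so the function name `downsample_pairs`, the seed, and the sampling strategy are illustrative assumptions only.

```python
import random

MAX_READ_PAIRS = 500_000  # cap stated in the workflow description


def downsample_pairs(pairs, max_pairs=MAX_READ_PAIRS, seed=0):
    """Return at most max_pairs read pairs, keeping mates together.

    Assumption: uniform random sampling; the workflow's actual
    selection method is not documented here.
    """
    pairs = list(pairs)
    if len(pairs) <= max_pairs:
        return pairs  # already below the cap, nothing to remove
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(pairs, max_pairs)
```

Each element of `pairs` stands for one read pair (for example a tuple of two reads), so both mates are either kept or dropped together.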
Preliminary steps to run the Analyze Viral Hybrid Capture Panel Data workflow
Before starting the workflow:
- Download reference viral genomes using either the Download Custom Microbial Reference Database tool (see the Working with databases chapter) or the viral databases from the Download Curated Microbial Reference Database tool, or create a database using the Create Annotated Sequence List tool.
- Create a taxonomic profiling index for calculating abundance (see Create Taxonomic Profiling Index).
To run the workflow:
- Specify the sample(s) or folder(s) of samples you would like to analyze (figure 6.22) and click Next. Note that if you select several items, they will be run as batch units.
- Specify the human control genes as a sequence list (figure 6.23). Alternatively, if your panel does not contain control genes, modify the workflow by right-clicking on it in the tool box and opening a copy of the workflow. Then remove the Map Reads to Reference tool plus its input and output, and save the modified workflow. When you run the modified workflow, human control genes are no longer required.
- Define batch units, either by using the organisation of the input data to create one run per input or by using a metadata table. Click Next.
- The next wizard window gives you an overview of the samples present in the selected folder(s). Choose which of these samples you want to analyze in case you are not interested in analyzing all the samples from a particular folder (figure 6.24).
- You can specify a trim adapter list and set up parameters if you would like to trim your sequences (figure 6.25).
The parameters that can be set are:
- Trim ambiguous nucleotides: if checked, this option trims the sequence ends based on the presence of ambiguous nucleotides (typically N).
- Maximum number of ambiguities: defines the maximal number of ambiguous nucleotides allowed in the sequence after trimming.
- Trim using quality scores: if checked, and if the sequence files contain quality scores from a base-caller algorithm, this information can be used for trimming sequence ends.
- Quality limit: defines the minimal value of the Phred score for which bases will not be trimmed.
- Trim adapter list: Specifying a trim adapter list is optional but recommended to ensure the highest quality data for your typing analysis (figure 6.25).
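The two end-trimming options above can be illustrated with a simplified Python sketch. This is not the Trim Sequences tool's actual algorithm: the quality step simply strips bases from both ends while their Phred score is below the limit, and the ambiguity step keeps the longest stretch containing at most the allowed number of ambiguous bases. The function names and the default values (20 and 2) are assumptions chosen for illustration.

```python
def trim_by_quality(seq, quals, quality_limit=20):
    """Strip bases from both ends while their Phred score is below the limit.

    Simplification: real trimmers often use a running-sum method instead.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < quality_limit:
        start += 1
    while end > start and quals[end - 1] < quality_limit:
        end -= 1
    return seq[start:end], quals[start:end]


def trim_ambiguous(seq, max_ambiguities=2, ambiguous="N"):
    """Return the longest stretch holding at most max_ambiguities ambiguous bases."""
    best_start, best_end = 0, 0
    left = 0
    count = 0  # ambiguous bases inside the current window
    for right, base in enumerate(seq):
        if base in ambiguous:
            count += 1
        # Shrink the window from the left until it is valid again.
        while count > max_ambiguities:
            if seq[left] in ambiguous:
                count -= 1
            left += 1
        if right + 1 - left > best_end - best_start:
            best_start, best_end = left, right + 1
    return seq[best_start:best_end]
```

For example, `trim_by_quality("ACGTAC", [5, 10, 30, 35, 30, 8], 20)` keeps only the three central bases, and `trim_ambiguous("NNACGTNACGTANN", 1)` drops the ambiguous ends while tolerating the single internal N.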
- In the next wizard window, "Taxonomic Profiling", select the viral reference database index against which the reads will be mapped (figure 6.26). It is also possible to enable "Filter host reads", in which case you must specify the index of the host genome (for example, Homo sapiens GRCh38 for human viruses). Note that if your panel uses human control genes, a taxonomic profiling index of the human genome should be used as "Host index".
- In the next wizard window, select the viral reference database you will use to find the best matching reference (figure 6.27). The best matching reference will be used for read mapping and variant calling. If you wish to have variant calls annotated with amino acid changes, the input database should contain CDS annotations.
- In the next wizard window, specify the parameters for the Low Frequency Variant Detection tool (figure 6.28). Note that variants are filtered after variant detection to coverage >= 30x and frequency >= 20%.
The parameters that can be set are:
- Required significance: The required significance level for low frequency variant calls.
- Base quality filter: The base quality filter can be used to ignore the reads whose nucleotide at the potential variant position is of dubious quality.
- Neighborhood radius: Determines how far away from the current variant the quality assessment should extend.
- Minimum central quality: Reads whose central base has a quality below the specified value will be ignored. This parameter does not apply to deletions since there is no "central base" in these cases.
- Minimum neighborhood quality: Reads for which the minimum quality of the bases within the neighborhood is below the specified value will be ignored.
- Read direction filter: The read direction filter removes variants that are almost exclusively present in either forward or reverse reads.
- Direction frequency %: Variants that are not supported by at least this frequency of reads from each direction are removed.
- Relative read direction filter: The relative read direction filter attempts to do the same thing as the Read direction filter, but does this in a statistical, rather than absolute, sense: it tests whether the distribution among forward and reverse reads of the variant carrying reads is different from that of the total set of reads covering the site. The statistical, rather than absolute, approach makes the filter less stringent.
- Significance %: Variants whose read direction distribution differs significantly from the expected distribution, when tested at this level, are removed. The lower you set the significance cut-off, the fewer variants will be filtered out.
- Read position filter: This filter removes variants that are located differently in the reads carrying them than would be expected given the general location of the reads covering the variant site.
- Remove pyro-error variants: This filter can be used to remove insertions and deletions in the reads that are likely to be due to pyro-like errors in homopolymer regions. There are two parameters that must be specified for this filter:
- In homopolymer regions with minimum length: Only insertion or deletion variants in homopolymer regions of at least this length will be removed.
- With frequency below: Only insertion or deletion variants whose frequency (ignoring all non-reference and non-homopolymer variant reads) is lower than this threshold will be removed.
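As an illustration of the Read direction filter and its Direction frequency % parameter described above, the following Python sketch keeps a variant only when the less common read direction accounts for at least the given percentage of the variant-supporting reads. That interpretation of the percentage, the function name, and the default of 5.0 are assumptions for this sketch, not the tool's documented implementation.

```python
def passes_read_direction_filter(forward_reads, reverse_reads,
                                 direction_frequency_pct=5.0):
    """Return True if both read directions support the variant sufficiently.

    forward_reads / reverse_reads: counts of variant-supporting reads
    in each direction (hypothetical inputs for this sketch).
    """
    total = forward_reads + reverse_reads
    if total == 0:
        return False  # no supporting reads at all
    # Percentage of supporting reads coming from the rarer direction.
    minority_pct = 100.0 * min(forward_reads, reverse_reads) / total
    return minority_pct >= direction_frequency_pct
```

A variant seen in 99 forward reads and 1 reverse read would be removed at a 5% setting, whereas one seen in 95 forward and 5 reverse reads would be kept.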
- In the Result handling window, pressing the button Preview All Parameters allows you to preview - but not change - all parameters. Choose to save the results (we recommend creating a new folder for them) and click Finish.
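The post-detection filtering noted for the Low Frequency Variant Detection step (coverage >= 30x and frequency >= 20%) amounts to a simple threshold check per variant. The Python sketch below shows the idea; the `Variant` record and its field names are hypothetical and do not correspond to the workbench's variant track format.

```python
from dataclasses import dataclass

MIN_COVERAGE = 30         # 30x, threshold stated in the workflow description
MIN_FREQUENCY_PCT = 20.0  # 20%, threshold stated in the workflow description


@dataclass
class Variant:
    # Hypothetical record; field names are chosen for this sketch only.
    position: int
    coverage: int
    frequency_pct: float


def post_filter(variants):
    """Keep variants that meet both the coverage and the frequency threshold."""
    return [
        v for v in variants
        if v.coverage >= MIN_COVERAGE and v.frequency_pct >= MIN_FREQUENCY_PCT
    ]
```

Both conditions must hold, so a high-frequency variant at 25x coverage is dropped just like a 5% variant at 500x coverage.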
The output generated for each sample is:
- QC report raw reads: QC report on the raw reads. Contains information on number of input reads, length, quality and nucleotide distributions.
- Viral reads: list of the sequences that were successfully trimmed and mapped to the best reference.
- TaxPro report: output from the Taxonomic Profiling tool. Contains information on number of reads mapping to the reference database and the host, if host filtering was enabled.
- Read mapping human control genes: mapping of the reads to the human control genes.
- Read mapping human control genes report: contains information on the read mapping to the selected controls such as number of reads mapped and read length distributions.
- Read mapping: output from the Local Realignment tool, mapping of the reads to the best match reference.
- Best reference report: contains information on best reference and the number of reads and unique reads mapped to this reference.
- Best match sequence: the sequence that matches the data best according to the Find Best References using Read Mapping tool.
- Consensus sequence: a consensus sequence generated from the read mapping. Consensus is not calculated in low coverage regions. These positions are instead replaced with Ns.
- Low coverage areas: track of regions of the best match reference genome with coverage < 30x.
- Trim report: report from the Trim Sequences tool (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Trim_output.html).
- Abundance table: contains abundance for all detected viral species.
- Viral reads: read mapping to the viral reference database.
- Annotated variant track: output from the Low Frequency Variant Detection tool. Note that variants are post-filtered to coverage >= 30x and frequency >= 20%.
- Amino acid track: only generated if the best match contains CDS annotations.
- Combined report: combines the information from the output report including QC, taxonomic profiling and mapping reports.
- Merged abundance table: A table containing abundance for all input samples. See figure 6.30 for an example of an analysis run of two HPV samples.
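To make the Low coverage areas and Consensus sequence outputs listed above more concrete, the following Python sketch derives low-coverage intervals from a per-position coverage list and replaces the corresponding consensus positions with Ns. It assumes that the consensus masking uses the same 30x threshold as the low coverage track and that coordinates are 0-based and half-open; neither detail is stated in this section, and the function names are illustrative.

```python
LOW_COVERAGE_THRESHOLD = 30  # coverage < 30x counts as low coverage


def low_coverage_regions(coverage, threshold=LOW_COVERAGE_THRESHOLD):
    """Return half-open (start, end) intervals where coverage < threshold."""
    regions = []
    start = None  # start of the low-coverage run currently being extended
    for pos, depth in enumerate(coverage):
        if depth < threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:  # run extends to the end of the reference
        regions.append((start, len(coverage)))
    return regions


def mask_consensus(consensus, regions):
    """Replace consensus bases inside the given regions with N."""
    masked = list(consensus)
    for start, end in regions:
        masked[start:end] = "N" * (end - start)
    return "".join(masked)
```

With a coverage list of `[5, 10, 40, 50, 12, 35, 2]`, three low-coverage intervals are reported, and masking the consensus `"ACGTACG"` with them leaves only the bases at well-covered positions.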
Merged abundance tables can be used as input for various tools: