Three workflows are available for analyzing SARS-CoV-2 data (figure 4.1): a generic workflow for use with ARTIC V3 SARS-CoV-2 primer designs, a workflow customized for use with Ion AmpliSeq SARS-CoV-2 Research Panel data, and a workflow customized for use with QIAseq DIRECT SARS-CoV-2 Panel data. All workflows can take one or multiple samples as input, so a single workflow run can be used to analyze a single sample or to compare multiple samples.
The general approach of all three workflows is mapping the reads to a reference, generating a consensus sequence from the mapping, calling variants, and generating outputs that allow for efficient review of results, including cross-sample comparison.
Two variant tracks are produced by each workflow. One contains variants likely to be true variants, those with frequencies between 50% and 100%. The other contains all potential variants, including low frequency variants, down to a default cut-off of between 10% and 20%, depending on the sequencing technology. Potential low frequency variants are likely to need further validation: they may represent new mutations in the sample, but they may also be due to other factors, for example reverse transcriptase or sequencing errors.
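The split into two variant tracks can be illustrated with a minimal Python sketch. This is not the workflow's implementation: the `Variant` class, the `split_variant_tracks` function, the inclusive handling of the thresholds, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; not a CLC data type."""
    position: int     # 1-based reference position
    ref: str          # reference allele
    alt: str          # variant allele
    frequency: float  # percentage of reads supporting the variant (0-100)


def split_variant_tracks(variants, low_frequency_cutoff=10.0,
                         high_frequency_cutoff=50.0):
    """Return (likely_true, all_potential) lists of variants.

    Assumption: both thresholds are treated as inclusive. The low
    frequency cut-off is a parameter because the default depends on
    the sequencing technology (between 10% and 20%).
    """
    all_potential = [v for v in variants
                     if v.frequency >= low_frequency_cutoff]
    likely_true = [v for v in all_potential
                   if v.frequency >= high_frequency_cutoff]
    return likely_true, all_potential


# Made-up example values, for illustration only.
example = [
    Variant(100, "C", "T", 99.2),
    Variant(200, "C", "T", 51.0),
    Variant(300, "G", "A", 18.5),
    Variant(400, "A", "G", 4.0),
]
likely_true, all_potential = split_variant_tracks(
    example, low_frequency_cutoff=10.0)
```

With a 10% cut-off, the variant at 4% is dropped from both lists, the variant at 18.5% appears only among the potential variants, and the two variants above 50% appear in both.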
In more detail, each workflow takes this general approach:
- Reads are trimmed, as needed for the sequencing protocol used.
- Reads are mapped to a reference.
- The mapping is locally re-aligned using a guidance track.
- Marginal reads, i.e. reads that contain large unaligned ends, as well as primers (when relevant), are removed from the mapping.
- A consensus sequence is generated from the mapping using Create Consensus Sequence from Variants, where consensus calls are made by substitution of identified variants found in a sample. Areas with coverage below 30x will be represented by ambiguous nucleotides (N).
- The Low Frequency Variant Detection tool is used to call variants in the mapping, and the variants are further filtered based on different quality metrics. Two variant tracks are generated, one with variants of frequencies above 50% and another with variants of frequencies down to the low frequency cut-off (default between 10% and 20%).
- Reports are generated by various tools in the workflow, and summaries of these reports are collected together and output as a combined report, which can be used for quality control.
- Track lists are generated, allowing for detailed, visual review of results.
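The consensus step in the list above can be sketched in Python as follows. This is a simplified illustration, not the Create Consensus Sequence from Variants tool: the function name `build_consensus` and its inputs are hypothetical, and only single-nucleotide substitutions are handled (insertions and deletions are ignored).

```python
def build_consensus(reference, substitutions, coverage, min_coverage=30):
    """Substitute variants into the reference and mask low coverage.

    reference:     reference sequence as a string
    substitutions: dict mapping 1-based position -> variant base
                   (hypothetical input format, SNVs only)
    coverage:      per-position read depth, same length as reference
    min_coverage:  positions with depth below this become 'N'
    """
    if len(coverage) != len(reference):
        raise ValueError("coverage must have one value per reference base")

    bases = list(reference)

    # Consensus calls are made by substituting identified variants.
    for position, alt in substitutions.items():
        bases[position - 1] = alt

    # Areas with coverage below the threshold are represented by N.
    for index, depth in enumerate(coverage):
        if depth < min_coverage:
            bases[index] = "N"

    return "".join(bases)


# Made-up toy data, for illustration only.
consensus = build_consensus(
    reference="ACGTACGT",
    substitutions={2: "T", 7: "A"},
    coverage=[40, 35, 10, 50, 60, 29, 30, 45],
)
print(consensus)  # ATNTANAT
```

In this toy example, positions 3 and 6 have coverage below 30x and become N, while position 7, at exactly 30x, keeps its substituted base.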
Part of each workflow runs on each sample individually, with the per-sample results then being combined to aid inter-sample comparison. Thus, it is assumed that data for multiple samples will be provided when the workflow is launched. If data for only one sample is provided, the workflow will still run, and the results for the individual sample are still valid.
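As an illustration of how per-sample results might be combined for inter-sample comparison, the following Python sketch finds variants shared between samples. The function `shared_variants`, the tuple representation of a variant, and the sample names are assumptions for illustration and do not reflect how the workflows store or compare results.

```python
from collections import defaultdict


def shared_variants(per_sample_variants, min_samples=2):
    """Return variants found in at least `min_samples` samples.

    per_sample_variants: dict mapping sample name -> iterable of
        (position, ref, alt) tuples (hypothetical representation)
    Returns a dict mapping each shared variant to the sorted list of
    samples that carry it.
    """
    carriers = defaultdict(set)
    for sample, variants in per_sample_variants.items():
        for variant in variants:
            carriers[variant].add(sample)

    return {
        variant: sorted(samples)
        for variant, samples in carriers.items()
        if len(samples) >= min_samples
    }


# Made-up per-sample results, for illustration only.
results = {
    "sample_A": [(100, "C", "T"), (300, "G", "A")],
    "sample_B": [(100, "C", "T"), (200, "C", "T")],
    "sample_C": [(100, "C", "T"), (300, "G", "A")],
}
shared = shared_variants(results)
```

Here the variant at position 100 is shared by all three samples, the variant at position 300 by two, and the variant at position 200 is unique to one sample and so is not reported. With a single sample as input, the function returns an empty result, while the per-sample variant list itself is unaffected.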
The workflow outputs can be used with the tools in CLC Microbial Genomics Module. Examples include:
- Understanding sample contamination through taxonomic profiling of unmapped reads.
- Functional analysis with BLAST and DIAMOND.
- Tree construction from consensus sequences or variant calls to trace the evolution of the virus.
The workflows and their outputs are described in the following sections:
- Identify ARTIC V3 SARS-CoV-2 Low Frequency and Shared Variants (Illumina)
- Identify QIAseq SARS-CoV-2 Low Frequency and Shared Variants (Illumina)
- Identify Ion AmpliSeq SARS-CoV-2 Low Frequency and Shared Variants (Ion Torrent)
- SARS-CoV-2 workflow output