Two workflows are available for analyzing SARS-CoV-2 data (figure 4.1): one is customized for use with Ion AmpliSeq SARS-CoV-2 Research Panel data, and the other is customized for use with QIAseq SARS-CoV-2 Panel data. Both workflows accept one or more samples as input, so a single workflow run can be used to analyze a single sample or to compare multiple samples.
Both workflows map the reads to a reference, generate a consensus sequence from the mapping, call variants, and generate outputs that allow for efficient review of results, including cross-sample comparison.
Each workflow produces two variant tracks: one containing variants likely to be true variants (frequencies between 50% and 100%), and another containing potential variants (frequencies between 20% and 50%). Potential variants are likely to need further validation: they may represent new mutations in the sample, but they may also be due to other factors, such as reverse transcriptase errors or sequencing errors.
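As an illustration of the two frequency bands described above, the following minimal Python sketch splits a list of variants into "likely" and "potential" groups. It is not part of the workflows. The `Variant` type, the `split_by_frequency` function and the example values are hypothetical, and the handling of a variant at exactly 50% (placed in the "likely" group here) is an assumption.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical, simplified variant record used only for this sketch."""
    position: int
    ref: str
    alt: str
    frequency: float  # percentage of reads supporting the variant, 0-100


def split_by_frequency(variants, potential_min=20.0, likely_min=50.0):
    """Split variants into (likely, potential) lists by frequency.

    Assumption: a frequency exactly equal to likely_min counts as "likely".
    Variants below potential_min are placed in neither list.
    """
    likely, potential = [], []
    for v in variants:
        if v.frequency >= likely_min:
            likely.append(v)
        elif v.frequency >= potential_min:
            potential.append(v)
    return likely, potential


# Made-up example values, for illustration only.
example = [
    Variant(241, "C", "T", 99.1),
    Variant(3037, "C", "T", 35.0),
    Variant(100, "A", "G", 5.0),
]
likely, potential = split_by_frequency(example)
print([v.position for v in likely])     # [241]
print([v.position for v in potential])  # [3037]
```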
In more detail, each workflow carries out the following steps:
- Reads are trimmed, as needed for the sequencing protocol used.
- Reads are mapped to a reference.
- The mapping is locally re-aligned using a guidance track.
- Marginal reads, i.e. reads that are error-prone or contain large unaligned ends, are removed from the mapping. Primers are also removed, when relevant.
- A consensus sequence is generated from the mapping using Extract Consensus Sequence, where consensus calls are made by voting using quality scores. Areas with coverage below 30x will be represented by ambiguous nucleotides.
- The Fixed Ploidy Variant Detection tool is used to call variants in the mapping. Two variant tracks are generated, one containing variants with frequencies above 50% and another containing variants with frequencies above 20%.
- Reports are generated by various tools in the workflow, and summaries of these reports are collected together and output as a combined report, which can be used for quality control.
- Track lists are generated, allowing for detailed, visual review of results.
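The consensus step in the list above can be pictured with a small Python sketch. This is a toy model under stated assumptions, not the implementation of Extract Consensus Sequence: each alignment column is represented as a list of hypothetical `(base, quality)` pairs, the base with the highest summed quality wins the vote, and `N` is assumed as the ambiguity symbol for positions with coverage below 30x.

```python
from collections import defaultdict


def consensus_base(pileup, min_coverage=30):
    """Return the consensus base for one column of a read mapping.

    pileup: list of (base, quality) pairs, one per read covering the position.
    Positions with fewer than min_coverage reads are reported as "N"
    (assumed ambiguity symbol for this sketch).
    """
    if not pileup or len(pileup) < min_coverage:
        return "N"
    votes = defaultdict(int)
    for base, quality in pileup:
        votes[base] += quality  # each read votes with its quality score
    return max(votes, key=votes.get)


def consensus_sequence(columns, min_coverage=30):
    """Build a consensus string from a list of per-position pileups."""
    return "".join(consensus_base(col, min_coverage) for col in columns)


# Made-up pileups: 40 reads, 10 reads (low coverage), and a mixed column.
columns = [
    [("A", 30)] * 40,
    [("C", 30)] * 10,
    [("G", 10)] * 20 + [("T", 35)] * 15,
]
print(consensus_sequence(columns))  # ANT
```

In the third column, G is seen in more reads than T, but T wins because its summed quality is higher (525 versus 200). This is the sense in which the vote is quality-weighted in this sketch.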
Part of each workflow runs on each sample individually, and the per-sample results are then combined to aid inter-sample comparison. Thus, it is assumed that data for multiple samples will be provided when the workflow is launched. If data for only one sample is provided, the workflow will still run, and the results for that sample are still valid.
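The idea of combining per-sample results for comparison can be sketched as follows. This is only an illustration of the principle, not the method used by the workflows. The `variants_by_sample` structure, the `samples_per_variant` function and the sample names are hypothetical.

```python
def samples_per_variant(variants_by_sample):
    """Invert per-sample variant sets into a variant -> samples table.

    variants_by_sample: dict mapping a sample name to a set of
    (position, ref, alt) tuples called in that sample.
    """
    table = {}
    for sample, variants in variants_by_sample.items():
        for variant in variants:
            table.setdefault(variant, set()).add(sample)
    return table


# Made-up calls for two samples, for illustration only.
calls = {
    "sample_1": {(241, "C", "T"), (3037, "C", "T")},
    "sample_2": {(241, "C", "T")},
}
table = samples_per_variant(calls)
shared = sorted(v for v, samples in table.items() if len(samples) == len(calls))
print(shared)  # [(241, 'C', 'T')]
```

With a single sample as input, the same function still returns a valid table, in which every variant is trivially present in all samples.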
The workflow outputs can be used with the tools in CLC Microbial Genomics Module. Examples include:
- Understanding sample contamination through taxonomic profiling of unmapped reads.
- Functional analysis with BLAST and DIAMOND.
- Tree construction from consensus sequences or variant calls to trace the evolution of the virus.