Analyze QIAseq xHYB Mycobacterium Tuberculosis Panel Data (Human host)
The Analyze QIAseq xHYB Mycobacterium Tuberculosis Panel Data (Human host) template workflow performs spoligotyping for lineage detection and identifies high-frequency antimicrobial drug resistance variants. It is suitable for analysis of samples from human hosts generated with the QIAseq xHYB Mycobacterium tuberculosis Panel.
To analyze samples that are not from human hosts, create a copy of the workflow and edit it to fit your application, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Template_workflows.html. The workflow element Map Reads to Human Control Genes is relevant for human data only and should be deleted from the copy. In addition, if a host genome is not relevant for your application, open the Taxonomic Profiling workflow element and uncheck Filter host reads.
Once the workflow copy is customized, you can install it to make it available from the Toolbox, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Installing_workflow.html.
To run the workflow using a variant database other than the default one, you need to modify the workflow elements where the database name appears as a column header, such as Filter for WHO variants and WHO variant associated with resistance.
QIAGEN reference data set
The QIAseq xHYB Mycobacterium tuberculosis Panel reference data set is available from the QIAGEN Sets Reference Data Library, accessible via References in the top Toolbar. It includes the Mycobacterium tuberculosis reference genome H37Rv and the WHO Mycobacterium tuberculosis variant database based on the WHO Mycobacterium tuberculosis mutation catalogue, see Reference Data Elements.
Like the template workflow, the reference data set is designed for human samples. It contains both a human host taxonomic profiling index, and a sequence list with human control genes for use in the workflow step Map Reads to Human Control Genes.
For analysis of samples not from human hosts, if a host is relevant for your application, you can create a host taxonomic profiling index from your host reference genome using Create Taxonomic Profiling Index, see Create Taxonomic Profiling Index.
The workflow analysis
The raw Mycobacterium tuberculosis whole genome sequencing reads are trimmed for low quality, read-through adapter sequences, and G homopolymers. Trimmed reads are used as input for the separate spoligotyping analysis.
In the Taxonomic Profiling step, reads that map to the human host index are filtered. As a quality control step, these reads are subsequently mapped to the human control genes defined for the panel. In addition to human reads, reads identified as belonging to taxonomies other than Mycobacterium tuberculosis are excluded from downstream analysis.
The remaining reads are mapped to the Mycobacterium tuberculosis reference genome, and variants are called from this read mapping. The lineage of the reference genome may differ from the lineage reported by the spoligotyping step. Using the same reference genome for mapping and variant calling across samples ensures comparability of variants and facilitates alignment with variant databases, such as the WHO Mycobacterium tuberculosis mutation catalogue, which are based on a specific genome. Variant calling is optimized for detecting resistance variants in the dominant strain of an infection: variants with a frequency below 50% will typically not be reported.
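To illustrate the effect of the 50% frequency criterion, the following minimal Python sketch computes a variant frequency as the number of reads supporting the allele divided by the number of reads covering the position, and keeps only variants at or above a cutoff. It is not the variant caller used by the workflow. The function names, the read counts, and the treatment of 50% as a hard, inclusive cutoff are assumptions made for illustration only.

```python
def allele_frequency(supporting_reads: int, covering_reads: int) -> float:
    """Return the percentage of covering reads that support the variant allele."""
    if covering_reads <= 0:
        raise ValueError("covering_reads must be positive")
    return 100.0 * supporting_reads / covering_reads


def is_dominant(supporting_reads: int, covering_reads: int,
                min_frequency: float = 50.0) -> bool:
    """Assumption: a variant counts as dominant at or above min_frequency percent."""
    return allele_frequency(supporting_reads, covering_reads) >= min_frequency


# Hypothetical candidates: (name, supporting reads, covering reads)
candidates = [("variant_a", 180, 200), ("variant_b", 30, 200)]
reported = [name for name, sup, cov in candidates if is_dominant(sup, cov)]
print(reported)
```

In this made-up example, variant_b is supported by 15% of the covering reads and is dropped, which is the behavior to expect for variants present only in a minority strain.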
Detected variants are compared to a drug resistance variant database and annotated with drug resistance information.
Launching the workflow
Before launching the workflow, make sure to download the QIAseq xHYB Mycobacterium tuberculosis Panel reference data set.
The Analyze QIAseq xHYB Mycobacterium Tuberculosis Panel Data (Human host) workflow is at:
Toolbox | Template Workflows | Microbial Workflows | QIAseq Analysis | Analyze QIAseq xHYB Mycobacterium Tuberculosis Panel Data (Human host)
Launch the workflow and step through the wizard.
- Select the sequence list(s) containing the reads to analyze. If selecting multiple inputs from different samples, check the Batch option, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Running_workflows_in_batch_mode.html. Click on Next.
- Choose the option "Use the default reference data" (figure 2.68). Click on Next.
- If Batch was checked in step 1, choose whether batch units should be defined based on organization of the input data, or by provided metadata. In the next step, review the batch units resulting from your selections above. Click on Next.
- Specify the spoligotyping settings (figure 2.69). Using the default values is usually sufficient, but we recommend taking a look at the spoligotyping report afterwards to make sure the results are as expected.
- Finally, select a location to save outputs to and click on Finish.
Figure 2.68: Select reference data set.
Figure 2.69: Select the minimum threshold settings for spoligotyping.
Workflow outputs and how to interpret them
The outputs provided by the workflow are:
- QC & Reports. Folder containing the individual reports generated during the analysis.
- All reports from the sample report are found here in their full length.
- Tracks. Folder containing various tracks.
- Genome, Gene and CDS tracks based on the Mycobacterium tuberculosis reference used.
- Read_mapping. Reads track of the sample reads mapping to the reference genome.
- Amino_acid_track. Track to see amino acids and potential changes in coding sequences of the reference genome.
- Human_control_genes_read_mapping. Track to see mapping of the human host reads to the control genes.
- Variants. Folder containing all the variant tracks generated during the analysis.
- Raw_variants. Variant track containing all detected raw variants, i.e., before adjustment with Join Nearby Variants and before annotation.
- WHO_variants_detected. Variant track containing only variants from the WHO resistance database.
- Novel_variants_detected. Variant track containing only variants that are not graded by the WHO.
- Genome Browser. A track list containing the reference genome, gene, CDS, read mapping, variant, and amino acid changes tracks.
- QIAseq xHYB Mycobacterium Tuberculosis Analysis Report. Sample report containing results of the analysis. The sample report is curated to contain the most important information for analysis interpretation, but all full reports can be found in the QC & Reports folder.
- Annotated variants. Variant track containing all detected variants after adjustment, annotated with WHO resistance, amino acid changes, and gene information.
The sample report "QIAseq xHYB Mycobacterium Tuberculosis Analysis Report" is the main output of the workflow. It gives an overview of the analysis results for the sample, covering both quality control and detected drug resistance. An example of the report can be seen in figure 2.70.
Figure 2.70: An example report from the Analyze QIAseq xHYB Mycobacterium Tuberculosis Panel Data (Human host) workflow.
The report contains the following sections:
- Sections 1-5 contain quality metrics for the analysis:
- QC for sequencing reads. A summary of the number of raw reads and their quality. If the reads are of too low quality the results may be unreliable.
- Trim reads. A summary of the read trimming. If the percentage of reads after trim is low or the average read length after trimming is considerably lower than before trimming, it may be a sign that something is wrong with the sample reads.
- Human control genes coverage. A summary of the host reads mapping to the human control genes. The coverage can be low, but there should be some reads mapping to the genes. If not, something may have gone wrong during the sample prep, or the sample was not made with the QIAseq xHYB Mycobacterium tuberculosis Panel.
- Remove duplicate mapped reads. A high percentage of duplicates may indicate that the sample contains little gDNA.
- QC for read mapping. For the QIAseq xHYB Mycobacterium tuberculosis panel, the coverage percentage should be close to 100%. Also, most of the reads after trimming (see Reads after trim in the Trim reads section) should be mapped. If this is not the case, there may have been an issue with the sample prep.
- Sections 6-10 contain lineage and variant results from the analysis:
- Spoligotype Mycobacterium tuberculosis. Results of spoligotyping. This reports on the detected SIT, lineage, sublineage, and spoligotype pattern. It can be a good idea to take a look at the coverage plot in the full spoligotyping report (/QC & Reports/Spoligotyping_report), to ascertain whether the minimum threshold has been correctly set. For additional information about the spoligotype report content, see Spoligotype Mycobacterium Tuberculosis output.
- WHO 2023 variants associated with resistance. Variants detected in the sample that have been graded "1)" or "2)" for at least one drug by the WHO. As variants can be graded for multiple drugs with different grades, this section may contain grades of "3)" and higher as well. For information about WHO grading, see Reference Data Elements.
- WHO 2023 variants of uncertain significance. Variants detected in the sample that have been graded "3)" for at least one drug by the WHO, but not "1)" or "2)". As variants can be graded for multiple drugs with different grades, this section may contain grades of "4)" and higher as well.
- WHO 2023 variants not associated with resistance. Variants detected in the sample that have only been graded "4)" or "5)" by the WHO.
- Novel variants in antibiotic resistance genes. Variants detected in the sample that are not graded by the WHO. The report only contains variants in known resistance genes, and excludes variants in protein-coding regions that result in synonymous mutations. To view all detected novel variants, look at the "/Variants/Novel_variants_detected" variant track.
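The way a variant's WHO grades determine the report section it appears in can be sketched as follows. This is a minimal illustration, not the workflow's implementation: the function name report_section, the dictionary input, and the example grades are hypothetical. It assumes grades are given as integers 1 to 5 keyed by drug, and that the lowest grade number across all drugs decides the section. The additional filters applied to the novel variants section (known resistance genes only, synonymous mutations excluded) are not modeled.

```python
from typing import Mapping


def report_section(grades_by_drug: Mapping[str, int]) -> str:
    """Map the WHO grades of one variant (drug -> grade 1..5) to a report section."""
    if not grades_by_drug:
        # Not graded by the WHO for any drug
        return "Novel variants in antibiotic resistance genes"
    if any(grade not in (1, 2, 3, 4, 5) for grade in grades_by_drug.values()):
        raise ValueError("WHO grades must be integers from 1 to 5")
    best = min(grades_by_drug.values())  # lowest number decides the section
    if best <= 2:
        return "WHO 2023 variants associated with resistance"
    if best == 3:
        return "WHO 2023 variants of uncertain significance"
    return "WHO 2023 variants not associated with resistance"


# Made-up grades: graded 1 for one drug and 3 for another
print(report_section({"rifampicin": 1, "isoniazid": 3}))
```

Because the lowest grade number decides the section, a variant graded "1)" for one drug and "3)" for another is listed under variants associated with resistance, with both grades shown. This is why that section may also contain grades of "3)" and higher.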
The variant table reports contain the following columns:
- Gene. For WHO variants, this is the gene with which the variant is associated. For Novel variants, it is the gene in which the variant is located.
- Pos.. The genomic position of the variant within the reference genome.
- Variant. (Only WHO variants). The name(s) of the variant as given by WHO. The name consists of the gene in which the variant is located, along with the corresponding position and change, either as a nucleotide or amino acid change.
- AA change. (Only Novel variants). This describes the change on the protein level. For example, single amino-acid changes caused by SNVs are listed as p.Gly261Cys, denoting that in the protein sequence (hence the "p.") the Glycine at position 261 is changed into Cysteine. Frame-shifts caused by nucleotide insertions and deletions are listed with the extension fs, for example p.Pro244fs denoting a frameshift at position 244 coding for Proline. For further details about HGVS nomenclature as relates to proteins, see http://varnomen.hgvs.org/recommendations/protein/.
- Freq.. The number of reads supporting the allele divided by the number of reads covering the position of the variant. Note that variants with a frequency below 50% will typically not be reported.
- QUAL. Measure of the significance of a variant, i.e., a quantification of the evidence (read count) supporting the variant, relative to the coverage and what could be expected to be seen by chance, given the error rates in the data. For additional information, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Variant_tracks.html.
- Drug. (Only WHO variants). The antimicrobial resistance drug(s) for which the variant is graded.
- Grade. (Only WHO variants). The grade of drug resistance determined for the variant.
For more info on the WHO variant database, including the resistance grades, see Reference Data Elements.
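When post-processing exported variant tables, the two AA change forms described above (a single amino acid substitution such as p.Gly261Cys and a frameshift such as p.Pro244fs) can be split into their parts with a small parser. The sketch below is only an illustration under stated assumptions: the names ProteinChange and parse_aa_change are hypothetical, and only these two forms are handled. Other HGVS protein notations are not covered and return None.

```python
import re
from typing import NamedTuple, Optional


class ProteinChange(NamedTuple):
    ref: str            # three-letter code of the reference amino acid
    position: int       # position in the protein sequence
    alt: Optional[str]  # three-letter code of the new amino acid, None for frameshifts
    frameshift: bool


# Covers only "p.<Ref><pos><Alt>" and "p.<Ref><pos>fs"
_AA_CHANGE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)(fs|[A-Z][a-z]{2})$")


def parse_aa_change(text: str) -> Optional[ProteinChange]:
    """Parse a simple substitution or frameshift; return None for anything else."""
    match = _AA_CHANGE.match(text)
    if match is None:
        return None
    ref, position, tail = match.groups()
    if tail == "fs":
        return ProteinChange(ref, int(position), None, True)
    return ProteinChange(ref, int(position), tail, False)


print(parse_aa_change("p.Gly261Cys"))
print(parse_aa_change("p.Pro244fs"))
```

Returning None for unrecognized strings, rather than raising an error, lets a caller keep the original text for notations this sketch does not cover.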