Somatic variant detection
Somatic variants are called in the following steps.
Firstly, positions of interest that may contain variation not due to sequencing errors are identified. This identification is subject to user-controllable parameters (see the options under Variant detection and Variant detection general filters, LightSpeed Fastq to Somatic Variants). Groups of adjacent positions of interest form a cluster. Often, such a cluster is just a single position, but it may be arbitrarily long.
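The grouping of positions of interest into clusters can be illustrated with a minimal Python sketch. It assumes that "adjacent" means consecutive reference coordinates; the function name cluster_positions is hypothetical and not part of the tool.

```python
def cluster_positions(positions):
    """Group positions of interest into clusters of consecutive positions.

    Assumption: two positions are 'adjacent' when their coordinates differ by 1.
    """
    clusters = []
    for pos in sorted(set(positions)):
        if clusters and pos == clusters[-1][-1] + 1:
            # Extends the current cluster.
            clusters[-1].append(pos)
        else:
            # Starts a new cluster (possibly a single position).
            clusters.append([pos])
    return clusters


print(cluster_positions([101, 102, 103, 250, 400, 401]))
# [[101, 102, 103], [250], [400, 401]]
```

In this example, position 250 forms a cluster of its own, which corresponds to the common single-position case.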
For each of these clusters, all overlapping read fragments are reduced to their intersection with the sites of the cluster. These reduced fragments are then used in the further analysis of the cluster.
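As an illustration only, the reduction of fragments to the sites of a cluster might look as follows. The representation of a fragment as a dictionary from position to observed symbol, and the function names, are assumptions made for this sketch.

```python
def reduce_fragment(fragment, cluster_sites):
    """Keep only the observations of a fragment that fall on cluster sites."""
    return {pos: fragment[pos] for pos in cluster_sites if pos in fragment}


def reduce_fragments(fragments, cluster_sites):
    """Reduce all fragments and drop those that do not overlap the cluster."""
    reduced = (reduce_fragment(f, cluster_sites) for f in fragments)
    return [r for r in reduced if r]


fragments = [
    {100: "A", 101: "C", 102: "T"},
    {102: "G", 103: "A", 104: "A"},
    {200: "T"},  # does not overlap the cluster
]
print(reduce_fragments(fragments, [101, 102, 103]))
# [{101: 'C', 102: 'T'}, {102: 'G', 103: 'A'}]
```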
To identify which underlying haplotypes are present within a given cluster, the pairwise compatibility of the fragments is determined. Once this is known, the largest groups of such pairwise-compatible fragments are formed. Each nonconflicting group is then turned into a haplotype candidate by piecing together the information from the fragments within the group.
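The following sketch shows one possible reading of this step, not the tool's actual implementation. It assumes that two reduced fragments are compatible when they agree at every position they share (fragments without shared positions are treated as compatible), and it interprets the "largest groups" as maximal sets of pairwise-compatible fragments, found here with a basic Bron-Kerbosch enumeration. All function names are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Two fragments are compatible if they agree on all shared positions."""
    return all(a[pos] == b[pos] for pos in a.keys() & b.keys())


def maximal_groups(fragments):
    """Return maximal sets of pairwise-compatible fragments (as index lists)."""
    n = len(fragments)
    neighbours = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if compatible(fragments[i], fragments[j]):
            neighbours[i].add(j)
            neighbours[j].add(i)

    groups = []

    def expand(current, candidates, excluded):
        # Basic Bron-Kerbosch recursion without pivoting.
        if not candidates and not excluded:
            groups.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & neighbours[v], excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return sorted(groups)


def merge(fragments, group):
    """Piece together a haplotype candidate from the fragments in a group."""
    haplotype = {}
    for i in group:
        haplotype.update(fragments[i])
    return dict(sorted(haplotype.items()))


fragments = [
    {101: "C", 102: "T"},
    {102: "T", 103: "A"},
    {101: "C", 102: "G"},
    {102: "G", 103: "A"},
]
for group in maximal_groups(fragments):
    print(group, merge(fragments, group))
# [0, 1] {101: 'C', 102: 'T', 103: 'A'}
# [2, 3] {101: 'C', 102: 'G', 103: 'A'}
```

Because the fragments within a group never conflict, merging them by a simple union of their observations is well defined.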
Once a list of haplotypes believed to be present in a given region is constructed, each of them needs to be assigned a count. Counts are assigned to the haplotypes per position. In doing so, the haplotype-based per-position counts are compared to the fragment-based per-position counts, so that the cumulative difference over all positions is minimized. This ensures that the assigned counts best reconcile the observed fragments with the underlying haplotypes.
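The idea of minimizing the cumulative per-position difference can be illustrated with a toy example. The sketch below is not the algorithm used by the tool: it assigns a single count per haplotype and uses an exhaustive search, which is only feasible for very small inputs. The data layout (observed counts keyed by position and symbol) and all function names are assumptions.

```python
from itertools import product


def position_counts(haplotypes, hap_counts):
    """Per-position, per-symbol counts implied by the haplotype counts."""
    totals = {}
    for haplotype, n in zip(haplotypes, hap_counts):
        for pos, symbol in haplotype.items():
            totals[(pos, symbol)] = totals.get((pos, symbol), 0) + n
    return totals


def cumulative_difference(haplotypes, hap_counts, observed):
    """Sum of absolute differences between implied and observed counts."""
    predicted = position_counts(haplotypes, hap_counts)
    keys = predicted.keys() | observed.keys()
    return sum(abs(predicted.get(k, 0) - observed.get(k, 0)) for k in keys)


def assign_counts(haplotypes, observed):
    """Exhaustive search for the counts with the smallest cumulative difference."""
    upper = max(observed.values(), default=0)
    best = min(
        product(range(upper + 1), repeat=len(haplotypes)),
        key=lambda counts: cumulative_difference(haplotypes, counts, observed),
    )
    return list(best)


haplotypes = [{101: "C", 102: "T"}, {101: "C", 102: "G"}]
# Fragment-based counts per (position, symbol).
observed = {(101, "C"): 10, (102, "T"): 7, (102, "G"): 3}
print(assign_counts(haplotypes, observed))
# [7, 3]
```

Here the counts 7 and 3 reproduce the observed per-position counts exactly, so the cumulative difference is 0.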
Notes
In contrast to the germline variant caller (Germline variant detection), the somatic variant caller makes no assumptions about the ploidy of a sample, and thus allows sensitive detection of variant alleles at any frequency, including low frequencies.
For insertions only, unaligned ends that are shorter than the full insertion, but match the insertion sequence, contribute to the count and coverage.
Variant types
LightSpeed Fastq to Somatic Variants reports SNPs, MNVs, InDels and replacements, provided that the variants are contained within at least one paired-end read.
Variant annotations
Variants identified by LightSpeed Fastq to Somatic Variants are annotated with the following basic information: Chromosome, Region, Type, Reference, Allele, Reference allele, Length, Zygosity, Count, Coverage, Frequency, Forward read count, Reverse read count, Forward read coverage, Reverse read coverage, Forward/reverse balance and Genotype.
Read more about these general variant annotations here: https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Variant_tracks.html.
In addition, the following LightSpeed-specific annotations are available:
- General annotations:
- Average quality: The average base quality score of the bases supporting a variant. It is calculated by adding the Q scores of the nucleotides supporting the variant and dividing this sum by the number of nucleotides supporting the variant. For deletions, the average quality score reported is the lower of the average qualities of the two bases neighboring the deleted one. For insertions, the average quality is calculated for each of the inserted bases in the reads supporting the insertion, and the minimum of the average base qualities is reported. Average quality is only calculated for non-reference alleles; for reference alleles, no average quality is reported.
- p-value - global error rate: The p-value from a binomial test given count, coverage and an error rate of 0.005. Note that if UMIs are utilized, i.e., in the UMI step a UMI preset has been selected or a custom read structure with UMIs has been specified (see LightSpeed Fastq to Somatic Variants), an error rate of 0.004 is used.
- p-value - global error rate (phred scaled): The log-transformed p-value - global error rate.
- p-value - local error rate: The minimum p-value from two individual tests: 1. A binomial test given forward count, forward coverage and a local error rate for forward reads estimated from the data. 2. A binomial test given reverse count, reverse coverage and a local error rate for reverse reads estimated from the data.
- p-value - low complexity: The p-value from a binomial test given count and coverage. This p-value is only calculated for variants located in positions where two upstream and two downstream reference symbols are identical to the variant. For sites that do not meet this criterion, a p-value of 0 is reported.
- Homopolymer/STR: Yes/No annotation. Yes, if the variant meets the minimum repeat count, minimum repeat region length and maximum repeat element length specified in the wizard when calling variants. No, if one or more of the thresholds are not met.
- Repeat count: The number of repeats excluding the variant. For example, if a reference allele "AAAA" is called, and a low-frequency stutter insertion allele "AAAAA" is called, the repeat unit length is 1 and the repeat count is 4.
- Repeat unit length: The length of a repeat unit. If the repeat is a homopolymer, the unit length is 1.
- Strand balance score: 1 - (p-value from binomial test given forward count, count, and forward count/coverage).
- Inferred from unaligned ends: Yes/No annotation indicating whether the variant is a tandem duplication inferred from unaligned ends during detection of structural variants.
- Subtype: Annotation indicating that an insertion is a tandem duplication. This annotation is added to tandem duplications inferred from unaligned ends during detection of structural variants, and also to insertions called by the standard variant caller that perfectly match a tandem duplication called during structural variant detection.
- Nearby similar called variant: Annotation indicating if tandem duplications inferred from unaligned ends during structural variant detection resemble, but are not identical to, an insertion called by the standard somatic variant detection.
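The average quality and error rate p-values described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the binomial test is assumed to be one-sided (the probability of observing at least the given count at the given error rate), the phred scaling is assumed to be -10 * log10(p), and the local error rates are taken as inputs because their estimation is not described here. All function names are hypothetical.

```python
import math


def average_quality(qualities):
    """Sum of the Q scores divided by the number of supporting nucleotides."""
    return sum(qualities) / len(qualities)


def insertion_average_quality(per_read_qualities):
    """Minimum over the inserted bases of the per-base average quality.

    per_read_qualities holds one list of Q scores per supporting read,
    with one score per inserted base.
    """
    per_base_averages = [average_quality(column) for column in zip(*per_read_qualities)]
    return min(per_base_averages)


def binomial_upper_tail(count, coverage, error_rate):
    """P(X >= count) for X ~ Binomial(coverage, error_rate).

    Assumption: the binomial test is one-sided.
    """
    return sum(
        math.comb(coverage, k) * error_rate**k * (1 - error_rate) ** (coverage - k)
        for k in range(count, coverage + 1)
    )


def phred_scaled(p_value):
    """Assumed log transformation: -10 * log10(p)."""
    return math.inf if p_value == 0 else -10 * math.log10(p_value)


def local_error_rate_p_value(fwd_count, fwd_coverage, fwd_error_rate,
                             rev_count, rev_coverage, rev_error_rate):
    """Minimum of the forward and reverse binomial test p-values."""
    return min(
        binomial_upper_tail(fwd_count, fwd_coverage, fwd_error_rate),
        binomial_upper_tail(rev_count, rev_coverage, rev_error_rate),
    )


print(average_quality([30, 40, 20]))                    # 30.0
print(insertion_average_quality([[30, 20], [40, 20]]))  # 20.0

# Global error rate p-value for 5 supporting reads at coverage 100.
p = binomial_upper_tail(5, 100, 0.005)  # 0.004 would be used with UMIs
print(p, phred_scaled(p))
```

The explicit summation keeps the sketch free of dependencies. For high coverage, a statistics library would be a more practical choice.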
- Annotations added to variants that are called from UMI reads:
- Count (singleton UMI): The number of singleton UMI read pairs supporting the allele.
- Count (big UMI): The number of big UMI read pairs supporting the allele.
- Proportion (singleton UMIs): The fraction of singleton UMI read pairs relative to all UMI read pairs supporting the allele.
- Average size (UMIs): Average number of read pairs per UMI.
- Average size (simplex UMIs): Average number of read pairs per UMI for simplex UMI read pairs. The annotation is only added for duplex UMI protocols.
- Count (duplex UMIs): The number of duplex UMI read pairs supporting the allele. The annotation is only added for duplex UMI protocols.
- Average size (duplex UMIs): Average number of read pairs per UMI for duplex UMI read pairs. The annotation is only added for duplex UMI protocols.
Note that for insertions, counts from unaligned ends that are shorter than the full insertion, but match the insertion sequence, are included in the variant annotations Count, Coverage, Frequency, Count (singleton UMI), Count (big UMI), and Proportion (singleton UMIs). Counts from unaligned ends are not included in Forward read count, Reverse read count, Forward read coverage, Reverse read coverage and Forward/reverse balance.
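Some of the UMI annotations listed above reduce to simple arithmetic, as the following sketch illustrates. It assumes that each UMI supporting the allele is summarized by the number of read pairs it contains and that a singleton UMI contains exactly one read pair. The function name and the dictionary keys are hypothetical, and the size at which a UMI counts as "big" is not stated here, so Count (big UMI) is left out.

```python
def umi_summary(group_sizes):
    """Summarize the UMIs supporting an allele.

    group_sizes holds the number of read pairs in each supporting UMI.
    """
    n = len(group_sizes)
    singletons = sum(1 for size in group_sizes if size == 1)
    return {
        "count_singleton": singletons,
        "proportion_singleton": singletons / n if n else 0.0,
        "average_size": sum(group_sizes) / n if n else 0.0,
    }


print(umi_summary([1, 1, 3, 5]))
# {'count_singleton': 2, 'proportion_singleton': 0.5, 'average_size': 2.5}
```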