Somatic variant detection

To call somatic variants, a number of steps are followed.

Firstly, positions of interest that may contain variation due to sequencing errors are identified. Two types of error are considered i) error that is due to the preceding sequenced nucleotides, and ii) error in homopolymer regions, which is typically dependent on how much the sample has been amplified. Empirical error distributions are made from both of these sequencing error types. For example, the probability of seeing a polyA homopolymer of length 18 given that the true length is 20 will be calculated.

Secondly, positions of interest that may contain variation not due to sequencing errors are identified. This identification is subject to user-controllable parameters (see the options under Variant detection and Variant detection general filters, LightSpeed Fastq to Somatic Variants). Groups of adjacent positions of interest form a cluster. Often, such a cluster is just a single position, but it may be arbitrarily long.

For each of these clusters, all overlapping read fragments are reduced to their intersection with the sites of the cluster. These reduced fragments are then used in the further analysis of the cluster.

To identify which underlying haplotypes are present within a given cluster, the pairwise compatibility of the fragments is determined. Once this is known, the largest groups of such pairwise-compatible fragments are formed. Each nonconflicting group is then turned into a haplotype candidate by piecing together the information from the fragments within the group.

At this stage, homopolymer variants are filtered. The filtering is applied when a cluster contains variants that imply at least two homopolymer haplotypes, and where at least one of the homopolymers has length >=5. The filtering removes homopolymer variants that are likely to have arisen by error from the other homopolymers, as determined by the empirical homopolymer error model. If homopolymer variants with lengths $ l_1, l_2, \dots l_n$, and counts $ c_1, c_2, \ldots, c_n$ respectively are present in the sample, then the filtering proceeds as follows:

  1. Calculate the frequencies of the homopolymers that would maximize the likelihood given the observed counts under the homopolymer model. The likelihood is (up to a normalizing constant):

    $\displaystyle \prod_{j=1}^{n} (\sum_{i=1, i \notin F}^{n} P(l_j \vert l_i) f_i) ^{c_j} $

    where $ P(l_j \vert l_i)$ is the probability of observing a homopolymer of length $ l_j$ by error when the true length is $ l_i$, and $ F$ is the set of indexes of homopolymers that have been filtered. The maximum likelihood is found by expectation-maximization.

  2. Omit each homopolymer variant in turn, starting with the homopolymer with fewest counts and again calculate the frequencies of the remaining homopolymers that would maximize the likelihood.
  3. If the Bayesian Information Criterion improves by removing the homopolymer variant, then it is filtered away. The Bayesian Information Criterion embodies our preference for simple explanations of the data by applying a penalty to the maximum likelihood for each unfiltered variant.

Once a list of haplotypes believed to be present in a given region is constructed, each of them needs to be assigned a count. Counts are assigned per-position to the haplotypes. In doing so, the haplotype-based per position counts are compared to the fragment-based per position counts to make sure the cumulative difference for all positions is minimized. This ensures assigning the counts that best reconcile the observed fragments with the underlying haplotypes.

Notes

In contrast to the germline variant caller (Germline variant detection), the somatic variant caller makes no assumptions about the ploidy of a sample, and thus allows for sensitive detection of variant alleles at low frequencies.

For insertions only, unaligned ends that are shorter than the full insertion, but matches the insertion sequence, contribute to the count and coverage.

A limit of maximum three alleles is enforced for each homopolymer locus and for alleles specifically marked with STR "Yes" that affect the same short tandem repeat. The alleles with the highest read counts are retained. See the description of the STR annotations and filter option under Variant filters in LightSpeed Fastq to Somatic Variants for details about STR annotation.

Variant types

LightSpeed Fastq to Somatic Variants reports SNPs, MNVs and InDels and replacements provided that the variants are contained within at least one paired end read and that their count and frequency satisfies the user-provided minimum requirements.

Variant annotations

Variants identified by LightSpeed Fastq to Somatic Variants are annotated with the following basic information: Chromosome, Region, Type, Reference, Allele, Reference allele, Length, Zygosity, Count, Coverage, Frequency, Forward read count, Reverse read count, Forward read coverage, Reverse read coverage, Forward/reverse balance and Genotype.

Read more about these general variant annotations here: https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Variant_tracks.html.

In addition, the following LightSpeed specific annotations are available:

Note that for insertions, counts from unaligned ends that are shorter than the full insertion, but matches the insertion sequence, are included in the variant annotations Count, Coverage, Frequency, Count (singleton UMI), Count (big UMI), and Proportion (singleton UMIs). Counts from unaligned ends are not included in Forward read count, Reverse read count, Forward coverage, reverse coverage and Forward/reverse balance.