Germline variant detection

Based on the read mapping, germline variants are identified at positions where the read alignment supports a significant difference from the reference genome.

This is achieved through a site model, where each position is first assigned a likelihood for each of the genotypes A, C, T, G, N or missing. The algorithm then iterates over the read mapping and adjusts likelihoods per position for each genotype based on observations in the data until the likelihoods no longer change. Note that broken read pairs are not considered.

Each position is then inspected, and positions where the most likely genotype(s) are different from the reference sequence are identified.

At this stage, homopolymer variants with a homopolymer length of >=5 are re-called. This is done by calculating the likelihood of all possible genotypes based on the homopolymer length variants found in the reads, with the assumption that all other homopolymer variants in the reads arise by error. The likelihood is (up to a normalizing constant):

$\displaystyle \prod_{j=1}^{n} \left( \sum_{i} P(l_j \vert l_i) f_i \right)^{c_j}$

where $P(l_j \vert l_i)$ is the probability of observing a homopolymer of length $l_j$ by error when the true length is $l_i$, and $c_j$ is the number of fragments with a homopolymer of length $l_j$. $f_i$ is the frequency of the homopolymer with length $l_i$ according to the genotype under consideration. For example, for diploid models this frequency can be 0, 0.5, or 1.0. The probabilities $P(l_j \vert l_i)$ are determined from the sample, by counting the number of homopolymer errors at positions that appear to be homozygous.
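The likelihood above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names, the dictionary layout and in particular `toy_error_prob` (a made-up error model, whereas the tool estimates these probabilities from the sample) are assumptions for illustration. Each observed length is scored by a mixture over the true lengths in the candidate genotype, weighted by the genotype frequency of each true length, and the work is done in log space to avoid underflow.

```python
from math import log


def homopolymer_log_likelihood(counts, genotype_freqs, error_prob):
    """Log-likelihood of the observed length counts, up to a constant.

    counts:         {observed length: number of fragments with that length}
    genotype_freqs: {true length: frequency of that length under the genotype}
    error_prob:     function (observed, true) -> P(observed | true)
    """
    total = 0.0
    for observed, c in counts.items():
        if c == 0:
            continue
        # Mixture over the true lengths present in the candidate genotype.
        mixture = sum(error_prob(observed, true) * f
                      for true, f in genotype_freqs.items())
        if mixture <= 0.0:
            return float("-inf")
        total += c * log(mixture)
    return total


def toy_error_prob(observed, true):
    # Hypothetical error model: 90% correct, 5% one base short, 5% one base long.
    return {0: 0.90, 1: 0.05}.get(abs(observed - true), 0.0)


# Diploid candidates for a locus where lengths 8 and 9 were observed.
genotypes = {
    "8/8": {8: 1.0},
    "8/9": {8: 0.5, 9: 0.5},
    "9/9": {9: 1.0},
}
counts = {8: 18, 9: 2}
scores = {name: homopolymer_log_likelihood(counts, freqs, toy_error_prob)
          for name, freqs in genotypes.items()}
best = max(scores, key=scores.get)
```

With these toy numbers the two fragments of length 9 are better explained as errors than as a second allele, so the homozygous 8/8 genotype scores highest.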

The final homopolymer variants are those that maximize the likelihood. However, if the maximum likelihood genotype is nearly homozygous (by which we mean that all haplotypes except one have the same variant), then we perform an additional test to see whether a (ploidy − 0.1):0.1 frequency ratio between the two variants has a higher likelihood than the (ploidy − 1):1 frequency ratio. If it does, then we call the variant as homozygous. This ensures that low levels of noise are tolerated, and improves the accuracy of homopolymer calls.
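The additional test for nearly homozygous genotypes can be sketched as follows. This is an illustrative reading of the description, not the tool's code: it assumes the two ratios are converted to frequencies by dividing by the ploidy, and `toy_error_prob` is a hypothetical error model. The caller is assumed to have already established that the maximum likelihood genotype is nearly homozygous.

```python
from math import log


def log_likelihood(counts, freqs, error_prob):
    """Log-likelihood of observed length counts under the given frequencies."""
    total = 0.0
    for observed, c in counts.items():
        if c == 0:
            continue
        mixture = sum(error_prob(observed, true) * f for true, f in freqs.items())
        if mixture <= 0.0:
            return float("-inf")
        total += c * log(mixture)
    return total


def call_as_homozygous(counts, major, minor, ploidy, error_prob):
    """Compare a low-noise frequency ratio against the one-haplotype ratio."""
    # Assumption: ratios become frequencies by dividing by the ploidy.
    noise = {major: (ploidy - 0.1) / ploidy, minor: 0.1 / ploidy}
    one_haplotype = {major: (ploidy - 1) / ploidy, minor: 1 / ploidy}
    return (log_likelihood(counts, noise, error_prob)
            > log_likelihood(counts, one_haplotype, error_prob))


def toy_error_prob(observed, true):
    # Hypothetical error model: 98% correct, 1% one base short, 1% one base long.
    return {0: 0.98, 1: 0.01}.get(abs(observed - true), 0.0)


# Diploid example: 16 fragments of length 8 and 4 of length 9.
skewed = call_as_homozygous({8: 16, 9: 4}, major=8, minor=9,
                            ploidy=2, error_prob=toy_error_prob)
# Diploid example: an even 10/10 split.
even = call_as_homozygous({8: 10, 9: 10}, major=8, minor=9,
                          ploidy=2, error_prob=toy_error_prob)
```

Under this toy model the skewed 16/4 split is better explained by the low-noise ratio and would be called homozygous, while the even split would remain heterozygous.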

Notes

Special handling is applied to variants supported by only 1 read that have a coverage of 1 or 2. For details, see the description of the Minimum allele count option under Variant filters in LightSpeed Fastq to Germline Variants.

A maximum of three alleles is enforced for each homopolymer locus and for alleles specifically marked with STR "Yes" that affect the same short tandem repeat. The alleles with the highest read counts are retained. See the description of the STR annotations and filter option under Variant filters in LightSpeed Fastq to Germline Variants for details about STR annotation.
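As a small illustration of the allele limit, the following Python sketch keeps the alleles with the highest read counts at one locus. The function name and input layout are hypothetical, and the tie-breaking behavior (ties keep their input order) is an assumption, since the text does not describe how ties are resolved.

```python
def retain_top_alleles(allele_counts, max_alleles=3):
    """Keep at most max_alleles alleles, preferring the highest read counts.

    allele_counts: {allele sequence: supporting read count} for one locus.
    """
    # sorted() is stable, so alleles with equal counts keep their input order.
    ranked = sorted(allele_counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_alleles])


locus = {"AAAAA": 40, "AAAAAA": 25, "AAAAAAA": 3, "AAAA": 9}
kept = retain_top_alleles(locus)
```

Here the allele supported by only 3 reads is the one dropped.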

When the option "Use non-specific reads for variant detection" is enabled, the sensitivity to heterozygous events is increased at sites with at least 80% ambiguous reads. The reason is that non-specific reads often spread variant alleles across two or more similar sites, resulting in alleles with lower than 50% allele frequency at the individual sites.

Variant types

LightSpeed Fastq to Germline Variants reports SNVs, MNVs, InDels and replacements, provided that the variants are contained within at least one paired-end read.

Variant annotations

Variants identified by LightSpeed Fastq to Germline Variants are annotated with the following basic information: Chromosome, Region, Type, Reference, Allele, Reference allele, Length, Zygosity, Count, Coverage, Frequency, QUAL and Genotype. Only single base pair variants that are not adjacent to any other variants are assigned a QUAL score.

Read about general variant annotations here: https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Variant_tracks.html.

General annotations:
- Count The number of 'countable' reads supporting the allele. The 'countable' reads are those that are used by the variant detection tool when calling the variant. Which reads are 'countable' depends on the user settings when the variant calling is performed - if e.g. the user has chosen 'Ignore non-specific matches', reads belonging to non-specific read pairs are not 'countable'.
  For paired-end sequencing data:
  - Reads where R1 and R2 overlap only represent one fragment, and are counted only as one.
  - Reads where R1 and R2 disagree do not contribute to the count.
  For homopolymer variants, counts are assigned according to the homopolymer error model. This typically means that the assigned count is higher than is seen in the read mapping, because some observations of a given length are inferred to come from a more abundant homopolymer of a different length.
  For insertions, unaligned ends that are shorter than the full insertion, but match the insertion sequence, contribute to the count.
- Coverage The fragment coverage at this position. Both reads that support one of the called alleles (reads that are 'countable'), and reads that do not, are included.
  For paired-end sequencing data:
  - Reads where R1 and R2 overlap only represent one fragment, and are counted only as one when calculating the coverage.
  - Reads where R1 and R2 disagree do not contribute to the coverage.
- Average quality The average base quality score of the bases supporting a variant. The average quality score is calculated by adding the Q scores of the nucleotides supporting the variant and dividing this sum by the number of nucleotides supporting the variant. For deletions, the average quality score reported is the lower of the average qualities of the two bases neighboring the deletion. For insertions, the average quality is calculated for each of the inserted bases in the reads supporting the insertion, and the minimum of the average base qualities is reported. Average quality is only calculated for non-reference alleles; for reference alleles, no average quality is reported.
- Average mapping quality The average read mapping quality of specific proper read pairs supporting a variant. Mapping quality represents the confidence in how accurately each read is aligned to its genomic position based on alignment scores. A greater difference between the highest and second-best alignment scores corresponds to higher mapping quality, indicating reduced ambiguity in placement. Additional penalties are applied for reads where both ends are unaligned or where the read contains numerous mismatches to the reference sequence.
- Average mapping quality (incl. non-specific) Average mapping quality calculated from both specific and non-specific reads supporting a variant.
- STR Yes/No annotation. Yes, if the variant meets minimum repeat count, minimum repeat region length and maximum repeat element length specified in the wizard when calling variants. No, if one or more of the thresholds are not met.
- Repeat count The number of repeats excluding the variant. For example, if a reference allele "AGAGAGAG" is called together with a low-frequency stutter insertion allele "AGAGAGAGAG", the repeat unit length is 2 and the repeat count is 4.
- Repeat unit length The length of a repeat unit. For example, for the dinucleotide repeat "AGAGAGAG", the repeat unit length is 2.
- Strand balance score 1 - (p-value from binomial test given forward count, count, and forward count/coverage).
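The strand balance score in the list above can be illustrated with a minimal Python sketch. Several details are assumptions rather than documented behavior: the test is taken to be a two-sided exact binomial test, the allele's forward count is the number of successes, the allele's count is the number of trials, and the third quantity is passed in as the expected forward fraction. All function and parameter names are hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one.
    """
    probs = [comb(trials, i) * p**i * (1 - p)**(trials - i)
             for i in range(trials + 1)]
    # Small tolerance so that outcomes tied with the observed one are included.
    threshold = probs[successes] * (1 + 1e-7)
    return min(1.0, sum(q for q in probs if q <= threshold))


def strand_balance_score(forward_count, count, expected_forward_fraction):
    """1 - p-value, following the annotation's formula."""
    return 1.0 - binomial_two_sided_p(forward_count, count,
                                      expected_forward_fraction)


balanced = strand_balance_score(5, 10, 0.5)    # half of the reads are forward
one_sided = strand_balance_score(10, 10, 0.5)  # every read is forward
```

Under these assumptions, an allele whose supporting reads split evenly between strands scores close to 0, while an allele seen only on the forward strand scores close to 1.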
Annotations added to variants that are called from UMI reads:
- Count (singleton UMI) The number of singleton UMI read pairs supporting the allele.
- Count (big UMI) The number of big UMI read pairs supporting the allele.
- Proportion (singleton UMIs) The fraction of singleton UMI read pairs relative to all UMI read pairs supporting the allele.
- Average size (UMIs) Average number of read pairs per UMI.
- Average size (simplex UMIs) Average number of read pairs per UMI for simplex UMI read pairs. The annotation is only added for duplex UMI protocols.
- Count (duplex UMIs) The number of duplex UMI read pairs supporting the allele. The annotation is only added for duplex UMI protocols.
- Average size (duplex UMIs) Average number of read pairs per UMI for duplex UMI read pairs. The annotation is only added for duplex UMI protocols.

Note that for insertions, counts from unaligned ends that are shorter than the full insertion, but match the insertion sequence, are included in the variant annotations Count, Coverage, Frequency, Count (singleton UMI), Count (big UMI), and Proportion (singleton UMIs). Counts from unaligned ends are not included in Forward read count, Reverse read count, Forward coverage, Reverse coverage and Forward/reverse balance.
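To make the relationship between some of the UMI annotations concrete, here is a minimal Python sketch. It is not the tool's implementation, and it assumes that each UMI supporting an allele is summarized by its size (the number of read pairs sharing that UMI) and that a singleton UMI is one with size 1. The function name and input format are hypothetical, and the "big UMI" and duplex annotations are left out because their definitions are not given here.

```python
def umi_summary(umi_sizes):
    """Summarize the UMIs supporting one allele.

    umi_sizes: list with the number of read pairs behind each supporting UMI.
    """
    total = len(umi_sizes)
    # Assumption: a singleton UMI is backed by exactly one read pair.
    singletons = sum(1 for size in umi_sizes if size == 1)
    return {
        "Count (singleton UMI)": singletons,
        "Proportion (singleton UMIs)": singletons / total if total else None,
        "Average size (UMIs)": sum(umi_sizes) / total if total else None,
    }


summary = umi_summary([1, 1, 3, 5])
```

For four supporting UMIs of sizes 1, 1, 3 and 5, two are singletons, so the singleton proportion is 0.5 and the average size is 2.5.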

Phasing

The tool LightSpeed Long Reads to Germline Variants (beta) adds phasing information to variants, provided read evidence supports it.

At positions where one of the alleles is filtered late in the variant calling algorithm, the other alleles at the same position are not included in a phase set.

The phasing information is available in the genotype track, see Genotype track.
