Detect MSI Status

Detect MSI Status can be used to detect if a sample contains unstable microsatellites. It is available from the Tools menu at:

        Tools | Biomedical Genomics Analysis | Oncology Score Estimation | Detect MSI Status

The tool determines whether a sample is microsatellite stable by comparing it to a baseline composed of multiple microsatellite stable (MSS) samples. Baselines can be created using the Generate MSI Baseline tool (see Generate MSI Baseline). The comparison is performed separately for each microsatellite locus and consists of evaluating whether the variation in the length distribution of the microsatellite observed in the sample broadly matches the variation observed in the baseline samples.

We recommend generating the MSI baseline from samples sequenced under the same lab conditions as the sample whose MSI status is being determined. The Detect MSI Status tool automatically inherits parameters from the selected MSI baseline and uses these parameters for generating a length distribution of the sample. This ensures that the length distributions are comparable between the baseline and the sample.

A microsatellite locus is said to be unstable when the length of the repeat region (e.g., tandem repeat of A nucleotides) is significantly different from the length in microsatellite stable (MSS) samples. To measure the locus lengths in a sample read mapping, the following steps are used:

  1. For a given microsatellite locus, the flanking signature regions are identified in the reference genome on both sides of the locus. For example, if the flanking signature is 8 bp long and the sequence is GACTGCTGGAAAAAAAAAATTTCGTAGC - where the sequence of repeated A's is the microsatellite - the left flanking signature is ACTGCTGG and the right flanking signature is TTTCGTAG.

  2. The tool searches for the left and right flanking signatures in all reads intersecting the locus.

  3. The flanking signature might be present more than once in a read. This is increasingly likely with shorter flanking signatures. To account for this, we compare the nucleotide distribution of the microsatellite locus observed in the read to that of the reference sequence. This is done by determining the absolute difference in nucleotide fraction between the reference sequence and the read sequence for each of the four nucleotides (A, C, G and T). Reads for which the sum of these absolute differences over all four nucleotides is larger than 0.3 are removed. For example, for a 10 bp long homopolymer A region, the reference sequence has nucleotide fractions of 1.0 A's and 0.0 C's, G's, and T's. If a read has one A to C mismatch in the homopolymer region, its nucleotide fractions would be 0.9 A's, 0.1 C's, and 0.0 G's and T's. In this case the sum of the absolute differences would be 0.2, which is smaller than 0.3, hence the read is used in the MSI analysis.

  4. For every read where both flanking signatures are identified and the nucleotide distribution of the locus is similar to the reference sequence, the length of the locus is used to update a frequency distribution of microsatellite locus lengths. A paired-end read is only used if at least one of the reads in the read pair contains both flanking signatures. If, for example, read 1 contains only the left flanking signature (GACTGCTGGAAAAAA) and read 2 contains only the right flanking signature (AAAAAATTTCGTAGC), the read pair cannot be used for MSI detection since it is not possible to determine the length of the microsatellite. If both reads in the read pair contain both the left and right flanking signatures, the reads will count as one (and not two) in the frequency distribution, since the two reads originate from the same DNA fragment.
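The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: all function names and the 0-based, half-open locus coordinates are hypothetical, reads are assumed to be plain strings already in reference orientation, only the first occurrence of each flanking signature is considered, and when both reads of a pair span the locus the length from the first read is used (the text does not say how differing lengths are resolved).

```python
from collections import Counter

# Example locus from step 1: ten A's between two flanking regions.
REFERENCE = "GACTGCTGG" + "A" * 10 + "TTTCGTAGC"
LOCUS_START, LOCUS_END = 9, 19   # hypothetical 0-based, half-open coordinates
FLANK_LEN = 8
MAX_FRACTION_DIFF = 0.3          # cutoff stated in step 3


def flanking_signatures(reference, start, end, flank_len):
    """Step 1: take flank_len bases on each side of the locus."""
    return reference[start - flank_len:start], reference[end:end + flank_len]


def nucleotide_fractions(seq):
    """Fraction of A, C, G and T in seq (all zero for an empty sequence)."""
    return {b: (seq.count(b) / len(seq) if seq else 0.0) for b in "ACGT"}


def fraction_difference(ref_locus, read_locus):
    """Step 3: sum of absolute differences in nucleotide fractions."""
    ref = nucleotide_fractions(ref_locus)
    obs = nucleotide_fractions(read_locus)
    return sum(abs(ref[b] - obs[b]) for b in "ACGT")


def locus_length(read, left, right, ref_locus):
    """Steps 2-3: locus length in one read, or None if the read is unusable."""
    i = read.find(left)
    if i < 0:
        return None
    start = i + len(left)
    j = read.find(right, start)
    if j < 0:
        return None
    candidate = read[start:j]
    if fraction_difference(ref_locus, candidate) > MAX_FRACTION_DIFF:
        return None
    return len(candidate)


def length_distribution(fragments, left, right, ref_locus):
    """Step 4: each fragment (one read, or a read pair) counts at most once."""
    counts = Counter()
    for reads in fragments:
        lengths = [locus_length(r, left, right, ref_locus) for r in reads]
        lengths = [n for n in lengths if n is not None]
        if lengths:
            counts[lengths[0]] += 1  # assumption: use the first usable read
    return counts


LEFT, RIGHT = flanking_signatures(REFERENCE, LOCUS_START, LOCUS_END, FLANK_LEN)
REF_LOCUS = REFERENCE[LOCUS_START:LOCUS_END]

fragments = [
    (REFERENCE,),                            # single read with 10 A's
    ("ACTGCTGG" + "A" * 12 + "TTTCGTAG",),   # single read with 12 A's
    (REFERENCE, REFERENCE),                  # pair: both reads span the locus
    ("GACTGCTGGAAAAAA", "AAAAAATTTCGTAGC"),  # pair from step 4: unusable
]
distribution = length_distribution(fragments, LEFT, RIGHT, REF_LOCUS)
```

With these inputs, the pair in which both reads span the locus contributes a single count, and the pair whose reads each contain only one flanking signature contributes nothing.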

After counting the locus lengths in all reads, the statistical variation of the length distribution is calculated and compared to the baseline to determine if the locus is stable or unstable. If the proportion of unstable microsatellite loci is higher than a predefined threshold, the sample is considered MSI-low or MSI-high, depending on the settings.
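The final sample-level call can be illustrated with a short Python sketch. It assumes that each locus has already been called stable or unstable, and that two thresholds (one for MSI-low, one for MSI-high) are supplied by the caller. The function names, the two-threshold layout, and the threshold values used in the example are hypothetical and are not the tool's defaults. The per-locus statistical comparison against the baseline is not shown.

```python
def unstable_fraction(locus_calls):
    """Proportion of loci called unstable; locus_calls maps locus name -> bool."""
    if not locus_calls:
        raise ValueError("no loci were evaluated")
    return sum(locus_calls.values()) / len(locus_calls)


def classify_sample(fraction, low_threshold, high_threshold):
    """Label a sample from its unstable fraction (thresholds are assumptions)."""
    if fraction > high_threshold:
        return "MSI-high"
    if fraction > low_threshold:
        return "MSI-low"
    return "MSS"


# Hypothetical per-locus calls for one sample: True means unstable.
calls = {
    "locus_1": True,
    "locus_2": False,
    "locus_3": False,
    "locus_4": True,
    "locus_5": False,
}
fraction = unstable_fraction(calls)  # 2 of 5 loci are unstable
status = classify_sample(fraction, low_threshold=0.1, high_threshold=0.3)
```

The comparisons are strict, matching the wording "higher than a predefined threshold": a sample whose unstable fraction equals a threshold stays in the lower category.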

When running Detect MSI Status, you will first need to select a read mapping. In the next dialog, specify an MSI baseline track (figure 8.5).

Image detect_msi_status_wizard
Figure 8.5: Top: Parameters for Detect MSI Status. Bottom: MSI baselines from Reference Data Manager.

Two baselines are available in the Reference Data Manager:

  • dna_msisensor2_baseline_v1.3 is for QIAseq Targeted DNA panels, suitable for Human TMB and MSI Panel (DHS-8800Z) and Multimodal Pan-Cancer Panel (UHS-5000Z). The baseline is created using 30 MSS samples that were mapped to hg38 (no alternative analysis set) and processed with the Generate MSI Baseline tool using default parameters and the msisensor2_loci_v1.0 loci track.

  • dna_pro_msi_baseline_9_loci_demo_v1.0 is for QIAseq Targeted DNA Pro panels (PHS-001Z, PHS-002Z, PHS-101Z, PHS-102Z, PHS-202Z, PHS-205Z, PHS-3000Z, PHS-3100Z, PHS-3200Z). It is generated using 20 MSS samples that were mapped to hg38 (no alternative analysis set) and processed with the Generate MSI Baseline tool using default parameters and the qiaseq_msi_9_loci_v1.0 loci track. The baseline is only for demo use since it is generated with fewer than 30 samples.

The following parameters can be adjusted:


