Detect MSI Status

Detect MSI Status can be used to detect if a sample contains unstable microsatellites. The tool can be found in the Toolbox:

        Toolbox | Biomedical Genomics Analysis | Oncology Score Estimation | Detect MSI Status

The tool detects whether a sample is stable or not by comparing it to a baseline composed of multiple microsatellite stable (MSS) samples. Baselines can be created using the Generate MSI Baseline tool (see Generate MSI Baseline). This comparison is performed for each microsatellite locus and consists of evaluating whether the variations in the length distribution of the microsatellite observed in the sample are generally the same as the variations observed in the baseline samples.

We recommend that the MSI baseline is generated using samples that are sequenced under the same lab conditions as the sample for which the MSI status is calculated.

The same parameters must be used for generating the MSI baseline and detecting the MSI status in order to be able to compare the two length distributions. Therefore, Detect MSI Status automatically determines the parameters from the selected MSI baseline (see Generate MSI Baseline).

A microsatellite locus is said to be unstable when the length of the repeat region (e.g., tandem repeat of A nucleotides) is significantly different from the length in a normal sample. To measure the lengths in a sample, the following steps are used:

  1. For a given microsatellite locus, the flanking signature regions are identified in the reference genome on both sides of the locus. For example, if the flanking signature is 8 bp long and the sequence is ...CTGACTGCTGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTCGTAGCA... - where the sequence of repeated A's is the microsatellite - the left flanking signature is ACTGCTGG and the right flanking signature is TTTCGTAG.

  2. For every read that intersects the locus, we check if the left and right flanking signatures are present in the read.

  3. When the flanking signature is short, it can be misidentified in a read, because there is a non-negligible chance that the same sequence also occurs elsewhere in the read. To remove such reads, the nucleotide distribution of the reference sequence corresponding to the microsatellite locus is compared to the distribution from the read. For each of the four nucleotides (A, C, G and T), the absolute difference in nucleotide fractions between the reference sequence and the read sequence is determined. Reads for which the sum of these differences over all four nucleotides is larger than 0.3 are removed. For example, for a 10 bp long homopolymer A region, the reference sequence has nucleotide fractions of 1.0 As and 0.0 Cs, Gs, and Ts. If a read has one A to C mismatch in the homopolymer region, its nucleotide fractions would be 0.9 As, 0.1 Cs, and 0.0 Gs and Ts. In this case, the sum of the absolute differences would be 0.2, which is smaller than 0.3, hence the read is used in the MSI analysis.

  4. For every read where both flanking signatures are identified and the nucleotide distribution of the locus is similar to the reference sequence, the length of the locus is used to update a frequency distribution of microsatellite locus lengths. For paired-end reads, each individual read in the pair must contain both flanking signatures. For example, if read 1 contains only the left flanking signature (CTGACTGCTGGAAAAAAAAAAA) and read 2 contains only the right flanking signature (AAAAAAAAAAAAAAATTTCGTAGCA), the read pair cannot be used for MSI detection since it is not possible to determine the length of the microsatellite. If both reads in a pair contain both the left and the right flanking signatures, the read pair counts as two in the frequency distribution.
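
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: all function names are hypothetical, the signatures are located by exact string search, and the repeat is shortened to 20 A's for readability. Only the 8 bp signature length and the 0.3 cutoff are taken from the text. Reads are treated individually, so two mates that each contain both signatures contribute two counts.

```python
from collections import Counter

FILTER_THRESHOLD = 0.3  # cutoff on the summed fraction differences (from the text)

def flanking_signatures(reference, locus_start, locus_end, flank_len=8):
    """Step 1: signatures on both sides of reference[locus_start:locus_end]."""
    left = reference[locus_start - flank_len:locus_start]
    right = reference[locus_end:locus_end + flank_len]
    return left, right

def locus_in_read(read, left, right):
    """Step 2: sequence between the two signatures, or None if one is missing."""
    i = read.find(left)
    if i < 0:
        return None
    start = i + len(left)
    j = read.find(right, start)
    if j < 0:
        return None
    return read[start:j]

def nucleotide_fractions(seq):
    """Fraction of A, C, G and T in seq (all zero for an empty sequence)."""
    n = len(seq)
    return {base: (seq.count(base) / n if n else 0.0) for base in "ACGT"}

def passes_filter(ref_locus, read_locus):
    """Step 3: keep the read unless the summed differences exceed the cutoff."""
    ref = nucleotide_fractions(ref_locus)
    obs = nucleotide_fractions(read_locus)
    total_diff = sum(abs(ref[base] - obs[base]) for base in "ACGT")
    return total_diff <= FILTER_THRESHOLD

def length_distribution(reads, reference, locus_start, locus_end, flank_len=8):
    """Step 4: frequency distribution of locus lengths over usable reads."""
    left, right = flanking_signatures(reference, locus_start, locus_end, flank_len)
    ref_locus = reference[locus_start:locus_end]
    counts = Counter()
    for read in reads:
        locus = locus_in_read(read, left, right)
        if locus is not None and passes_filter(ref_locus, locus):
            counts[len(locus)] += 1
    return counts

# Illustrative data: the example sequence with the repeat shortened to 20 A's.
reference = "CTGACTGCTGG" + "A" * 20 + "TTTCGTAGCA"
reads = [
    "GACTGCTGG" + "A" * 20 + "TTTCGTAGC",          # both signatures, length 20
    "GACTGCTGG" + "A" * 18 + "TTTCGTAGC",          # both signatures, length 18
    "CTGACTGCTGG" + "A" * 11,                      # left signature only: unusable
    "A" * 15 + "TTTCGTAGCA",                       # right signature only: unusable
    "ACTGCTGG" + "ACGTACGTAC" + "TTTCGTAG",        # fails the nucleotide filter
]
distribution = length_distribution(reads, reference, 11, 31)
```

With this input, only the first two reads contribute to `distribution`; the third and fourth lack one signature, and the fifth is removed by the nucleotide filter.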

Detect MSI Status then measures the statistical variation of the length distribution of each microsatellite locus and decides for each locus if it is stable or not by comparing the statistical variation of the test sample with the baseline. If the proportion of unstable microsatellite loci is higher than a predefined threshold, the sample is considered MSI-low or MSI-high depending on the parameters.
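
The two-level decision described above (per locus, then per sample) could look roughly like the sketch below. The text does not specify which statistic the tool uses or how the baseline is represented, so everything here is an assumption made for illustration: the standard deviation of locus lengths as the measure of variation, a baseline summarised as a mean and standard deviation of that measure across MSS samples, the cutoff factor `k`, and both proportion thresholds. The function names are hypothetical as well.

```python
from math import sqrt

def length_sd(counts):
    """Standard deviation of locus lengths from a {length: read_count} mapping.

    Assumption: the standard deviation stands in for the 'statistical
    variation' of the length distribution.
    """
    total = sum(counts.values())
    mean = sum(length * n for length, n in counts.items()) / total
    variance = sum(n * (length - mean) ** 2 for length, n in counts.items()) / total
    return sqrt(variance)

def locus_is_unstable(sample_counts, baseline_mean, baseline_sd, k=3.0):
    """Assumption: a locus is unstable when the sample's variation exceeds the
    baseline mean by more than k baseline standard deviations."""
    return length_sd(sample_counts) > baseline_mean + k * baseline_sd

def classify_sample(unstable_flags, low_threshold=0.1, high_threshold=0.3):
    """Assumption: two hypothetical thresholds on the proportion of unstable loci."""
    proportion = sum(unstable_flags) / len(unstable_flags)
    if proportion > high_threshold:
        return "MSI-high"
    if proportion > low_threshold:
        return "MSI-low"
    return "MSS"
```

For example, under these assumed thresholds, `classify_sample([True] * 4 + [False] * 5)` reports a sample with 4 of 9 unstable loci as MSI-high, while a sample with no unstable loci is reported as MSS.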

When running Detect MSI Status, you will first need to select a read mapping. In the next dialog, specify an MSI baseline track (figure 8.5).

Figure 8.5: Top: Parameters for Detect MSI Status. Bottom: MSI baselines from Reference Data Manager.

Two baseline tracks are available in the Reference Data Manager: one is based on 27 microsatellites covered by the primers in the Human TMB and MSI Panel (DHS-8800Z). This baseline was created using 30 MSS samples that were mapped to hg38 (no alternative analysis set) and processed with the Generate MSI Baseline tool using default parameters. The other is a subset of the first containing 9 loci. These loci were identified using lung FFPE MSS and MSI samples and were found to perform consistently well during benchmarking. All 9 loci are mononucleotide homopolymers.

Note that the 27-loci baseline track should not be used for detecting MSI status. This baseline is intended for creating subsets of loci specific to a particular cancer type. To generate a cancer-specific baseline, investigate the quality of the loci of interest using relevant read mappings. To create a subset of the baseline, use the Create Track from Selection button, see

The following parameters can be adjusted: