Detect MSI Status

The MSI solution design is based on the general idea that a sample can be classified as stable or unstable by comparing it to a baseline composed of multiple microsatellite stable (MSS) samples. The comparison evaluates, for each microsatellite locus, whether the variation in the length distribution observed in the test sample is consistent with the variation observed in the baseline samples.

A microsatellite locus is said to be unstable when the length of the repeat region (e.g. a tandem repeat of A nucleotides) is significantly different from the length in a normal sample. The lengths in a sample are measured using the following steps:

  1. For a given microsatellite locus, we find the flanking signature regions in the reference genome on both sides of the locus. The flanking signature region length is a parameter, which is set to 8 bp by default. For example, for the sequence ...CTGACTGCTGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTCGTAGCA... - where the sequence of repeated A's is the microsatellite - the left flanking signature is ACTGCTGG and the right flanking signature is TTTCGTAG.

  2. For every read that overlaps the locus (excluding broken pairs if desired), we check whether the left and right flanking signatures can be found somewhere in the read. The signature search can either require an exact match or accept up to one mismatch.

  3. For every read with both flanking signatures, we calculate the number of base pairs between them and use this number to update a frequency distribution table of microsatellite locus lengths.
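
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's implementation: the function names, the plain-string representation of reads, and the choice to take the first matching position of each signature are all assumptions made for the example.

```python
from collections import Counter


def flanking_signatures(reference, start, end, flank_len=8):
    """Step 1: signatures on both sides of the locus reference[start:end]."""
    left = reference[start - flank_len:start]
    right = reference[end:end + flank_len]
    return left, right


def find_signature(read, signature, max_mismatches=0, start=0):
    """Index of the first window at or after `start` that matches
    `signature` with at most `max_mismatches` mismatches, or -1."""
    k = len(signature)
    for i in range(start, len(read) - k + 1):
        mismatches = sum(1 for a, b in zip(read[i:i + k], signature) if a != b)
        if mismatches <= max_mismatches:
            return i
    return -1


def locus_length(read, left, right, max_mismatches=0):
    """Steps 2 and 3: bases between the two signatures, or None if
    either signature is missing from the read."""
    left_pos = find_signature(read, left, max_mismatches)
    if left_pos < 0:
        return None
    repeat_start = left_pos + len(left)
    right_pos = find_signature(read, right, max_mismatches, start=repeat_start)
    if right_pos < 0:
        return None
    return right_pos - repeat_start


def length_distribution(reads, left, right, max_mismatches=0):
    """Frequency table of locus lengths over all reads with both signatures."""
    counts = Counter()
    for read in reads:
        length = locus_length(read, left, right, max_mismatches)
        if length is not None:
            counts[length] += 1
    return counts


# Hypothetical toy data: a 20 bp poly-A locus between the example signatures.
reference = "CTGACTGCTGG" + "A" * 20 + "TTTCGTAGCA"
left_sig, right_sig = flanking_signatures(reference, 11, 31)
toy_reads = [
    left_sig + "A" * 20 + right_sig,
    "GG" + left_sig + "A" * 18 + right_sig + "C",
    "A" * 30,  # spans neither signature, so it is ignored
]
print(dict(length_distribution(toy_reads, left_sig, right_sig)))
```

Reads that lack either signature contribute nothing to the table, which is why the number of counted reads can be lower than the coverage at the locus.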

Detect MSI Status then measures the statistical variation of the length distribution of each microsatellite locus and decides for each locus if it is stable or not by comparing the statistical variation of the test sample with the normal samples' baseline. If the proportion of unstable microsatellite loci is higher than a predefined threshold, then the sample is considered unstable.
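
The sample-level decision described above can be illustrated with a short Python sketch. The per-locus calls are assumed to be already available as booleans, and the function name, the locus names and the threshold value used below are hypothetical; the tool's own parameter names and defaults may differ.

```python
def sample_status(locus_is_unstable, max_unstable_fraction):
    """Classify a sample from per-locus calls.

    locus_is_unstable: dict mapping locus name -> True if the locus was
        called unstable (hypothetical input format).
    max_unstable_fraction: the predefined threshold on the proportion of
        unstable loci (value chosen by the caller).
    Returns (status, fraction_of_unstable_loci).
    """
    if not locus_is_unstable:
        raise ValueError("at least one locus call is required")
    unstable = sum(1 for is_unstable in locus_is_unstable.values() if is_unstable)
    fraction = unstable / len(locus_is_unstable)
    status = "Unstable" if fraction > max_unstable_fraction else "Stable"
    return status, fraction


# Hypothetical example: 9 loci, 3 of them called unstable, threshold 0.2.
calls = {f"locus_{i}": (i < 3) for i in range(9)}
print(sample_status(calls, max_unstable_fraction=0.2))
```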

To run Detect MSI Status, go to:

        Toolbox | Biomedical Genomics Analysis | Oncology Score Estimation | Detect MSI Status

First, you will need to select a UMI read mapping (figure 8.5).

Figure 8.5: Specify a UMI read mapping for the Detect MSI Status tool.

In the next dialog, specify an MSI baseline track (figure 8.6). By default, this should be the msi_baseline_9_loci_v1 baseline.

Figure 8.6: Parameters for the Detect MSI Status tool.

MSI baseline tracks are created using the Generate MSI Baseline tool.

Two template baseline tracks are available in the Reference Data Manager. One is based on a track of 27 microsatellite loci, containing all the microsatellite loci covered by the primers in the Human TMB and MSI Panel (DHS-8800Z). This baseline was created using 30 MSS samples that were mapped to the hg38 (no alternative analysis set) reference sequence and processed with the Generate MSI Baseline tool using the default parameters. The other is a subset of the first, containing 9 loci. These loci were identified using lung FFPE MSS and MSI samples and were found to perform consistently well during benchmarking. All 9 loci are mononucleotide homopolymers.

Note that the 27 loci baseline track should not be used for detecting MSI status. This baseline was included in the Reference Data Set to allow the creation of subsets of loci specific to a particular cancer type. To generate a cancer-specific baseline track, open the 27 loci baseline track along with the read mapping to investigate the quality of the loci of interest. Then select the loci of interest in the table view of the track and click on the Create Track from Selection button at the bottom of the table. Save the newly generated track in the Navigation Area.

The other parameters you can set are:

The tool can output the following files: an MSI loci track (which can be manually added to the track list produced by the TMB workflow), an MSI status report and a Baseline cross-validation report.

The MSI status of the test sample is then presented in the form of a report. The output report contains both combined and per-locus information on stability or instability, and other descriptive statistics related to the selected detection and test method. Coverage values are expressed in read coverage. Read Count refers to reads that have both delimiters and have been used for the calculation of the status. Coverage ratio allows assessment of the quality of the sequencing of the MSI loci. When few reads span a locus, it can be because the locus is highly unstable in the sample, producing longer repeat regions that are not captured by spanning reads. This is the most likely explanation if the coverage is high. It can also be because the sequencing depth is too low to yield enough reads spanning the region, but in this case the coverage should be low.
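
The reasoning about the coverage ratio can be written down as a small decision rule. The Python sketch below assumes that the coverage ratio is the Read Count divided by the coverage at the locus, which this section does not state explicitly, and the two cut-off values are purely hypothetical placeholders.

```python
def interpret_coverage_ratio(read_count, coverage, min_ratio=0.5, high_coverage=100):
    """Return (ratio, interpretation) for one locus.

    Assumptions (not taken from the tool): ratio = read_count / coverage,
    and the cut-offs min_ratio and high_coverage are illustrative only.
    """
    if coverage <= 0:
        return 0.0, "no_coverage"
    ratio = read_count / coverage
    if ratio >= min_ratio:
        return ratio, "ok"
    if coverage >= high_coverage:
        # Many reads overlap the locus but few span it: the repeat may be
        # too long for spanning reads, i.e. the locus may be highly unstable.
        return ratio, "possibly_highly_unstable"
    # Few spanning reads and low coverage: depth is probably too low.
    return ratio, "low_depth"


print(interpret_coverage_ratio(read_count=40, coverage=400))
print(interpret_coverage_ratio(read_count=5, coverage=30))
```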

When the MSI status is calculated using "Earth Mover's Distance", the report will display values for this calculation. If a value is too close to the "Stability Threshold", the stability call may not be accurate (as is the case for 2 different samples highlighted in figure 8.7).

Figure 8.7: Report showing that two different samples may have been falsely detected as Unstable or Stable.
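
To give an intuition for the Earth Mover's Distance values shown in the report, the Python sketch below computes the standard one-dimensional Earth Mover's Distance between two normalized length distributions and flags values that lie close to a threshold. How the tool itself derives and aggregates its values is not described here, so the function names, the input format (length to read count) and the margin parameter are assumptions made for illustration.

```python
def earth_movers_distance(counts_a, counts_b):
    """1D Earth Mover's Distance between two length histograms.

    Each argument maps a locus length (int, in bp) to a read count.
    Both histograms are normalized to sum to 1, so the result is the
    average number of bp that probability mass has to be moved.
    """
    if not counts_a or not counts_b:
        raise ValueError("both distributions must be non-empty")
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    lo = min(min(counts_a), min(counts_b))
    hi = max(max(counts_a), max(counts_b))
    distance = 0.0
    cumulative_diff = 0.0
    # For unit-spaced bins, EMD is the sum of absolute CDF differences.
    for length in range(lo, hi):
        cumulative_diff += counts_a.get(length, 0) / total_a
        cumulative_diff -= counts_b.get(length, 0) / total_b
        distance += abs(cumulative_diff)
    return distance


def is_borderline(distance, stability_threshold, margin):
    """True if the distance is within `margin` of the threshold
    (hypothetical helper; the margin is chosen by the user)."""
    return abs(distance - stability_threshold) <= margin


# Hypothetical example: the test sample has shifted towards shorter lengths.
baseline = {19: 5, 20: 90, 21: 5}
sample = {17: 20, 18: 30, 19: 30, 20: 20}
d = earth_movers_distance(sample, baseline)
print(d, is_borderline(d, stability_threshold=1.0, margin=0.1))
```

Calls flagged as borderline in this way are the ones that deserve a closer look in the report before the status is trusted.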

When the "Create baseline cross-validation report" option is selected, the "Detect MSI Status" tool will generate a cross-validation report in the form of a table where the MSI status is presented for each sample in the baseline sample set. The baseline cross-validation analysis is a procedure to verify whether the samples used for the creation of the baseline are suitable for this task. In this procedure, the MSI status of each sample from the baseline sample set is tested against a baseline created using all other samples of the set. Ideally, all samples are detected as stable (MSS) with a very low proportion of unstable loci. If this is not the case, you may decide to remove one or more samples from the baseline sample set if there is doubt about their real status. Note that the cross-validation analysis depends on the parameters used for detection (exactly as for a test sample), and therefore each cross-validation is only valid for the set of parameter values used in that cross-validation run.
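
The cross-validation procedure is a leave-one-out loop, which the following Python sketch illustrates. The functions `build_baseline` and `detect_status` are placeholders supplied by the caller; the toy versions used in the example (a mean length and a fixed cut-off) are hypothetical stand-ins and bear no relation to the statistics used by the tool.

```python
def cross_validate_baseline(samples, build_baseline, detect_status):
    """Leave-one-out check of a baseline sample set.

    samples: dict mapping sample name -> sample data.
    build_baseline: callable taking a dict of samples, returning a baseline.
    detect_status: callable taking (sample data, baseline), returning a status.
    Returns a dict mapping sample name -> status.
    """
    if len(samples) < 2:
        raise ValueError("at least two baseline samples are required")
    results = {}
    for name, data in samples.items():
        # Build the baseline from every sample except the one being tested.
        others = {n: d for n, d in samples.items() if n != name}
        results[name] = detect_status(data, build_baseline(others))
    return results


# Toy stand-ins (hypothetical): each sample is summarized by one mean length.
def toy_build_baseline(samples):
    return sum(samples.values()) / len(samples)


def toy_detect_status(mean_length, baseline_mean, max_shift=3.0):
    return "Unstable" if abs(mean_length - baseline_mean) > max_shift else "Stable"


toy_samples = {"s1": 20.0, "s2": 20.5, "s3": 19.5, "s4": 26.0}
print(cross_validate_baseline(toy_samples, toy_build_baseline, toy_detect_status))
```

A sample that is repeatedly called unstable against a baseline built from the remaining samples is a candidate for removal from the baseline sample set.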

This report can be used together with the Combine Reports tool (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Combine_Reports.html).