Detect MSI Status
Detect MSI Status can be used to detect if a sample contains unstable microsatellites. The tool can be found in the Toolbox:
Toolbox | Biomedical Genomics Analysis | Oncology Score Estimation | Detect MSI Status
The tool detects whether a sample is stable or not by comparing it to a baseline composed of multiple microsatellite stable (MSS) samples. Baselines can be created using the Generate MSI Baseline tool (see Generate MSI Baseline). This comparison is performed for each microsatellite locus and consists of evaluating whether the variations in the length distribution of the microsatellite observed in the sample are generally the same as the variations observed in the baseline samples.
We recommend that the MSI baseline is generated using samples that are sequenced under the same lab conditions as the sample for which the MSI status is calculated.
The same parameters must be used for generating the MSI baseline and detecting the MSI status in order to be able to compare the two length distributions. Therefore, Detect MSI Status automatically determines the parameters from the selected MSI baseline, see Generate MSI Baseline.
A microsatellite locus is said to be unstable when the length of the repeat region (e.g., tandem repeat of A nucleotides) is significantly different from the length in a normal sample. To measure the lengths in a sample, the following steps are used:
- For a given microsatellite locus, the flanking signature regions are identified in the reference genome on both sides of the locus.
For example, if the flanking signature is 8 bp long and the sequence is ...CTGACTGCTGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTCGTAGCA..., where the sequence of repeated A's is the microsatellite, the left flanking signature is ACTGCTGG and the right flanking signature is TTTCGTAG.
- For every read that intersects the locus, we check if the left and right flanking signatures are present in the read.
- When the flanking signature is short, it can be misidentified in a read, because there is a non-negligible chance that the same signature also occurs at other locations in the read.
To remove such reads, the nucleotide distribution of the reference sequence corresponding to the microsatellite locus is compared to the distribution from the read.
For each of the four nucleotides (A, C, G and T) the absolute difference in nucleotide fractions between the reference sequence and the read sequence is determined.
Reads for which the sum of these absolute differences over all four nucleotides is larger than 0.3 are removed.
For example, for a 10 bp long homopolymer A region, the reference sequence has nucleotide fractions of 1.0 As and 0.0 Cs, Gs, and Ts.
If a read has one A to C mismatch in the homopolymer region, its nucleotide fractions would be 0.9 As, 0.1 Cs, and 0.0 Gs and Ts.
In this case the sum of the absolute differences would be 0.2, which is smaller than 0.3, hence the read is used in the MSI analysis.
- For every read where both flanking signatures are identified and the nucleotide distribution of the locus is similar to the reference sequence, the length of the locus is used to update a frequency distribution of microsatellite locus lengths. For paired-end reads, the individual reads in the pair must contain both flanking signatures. For example, if read 1 contains only the left flanking signature (CTGACTGCTGGAAAAAAAAAAA) and read 2 contains only the right flanking signature (AAAAAAAAAAAAAAATTTCGTAGCA), the read pair cannot be used for MSI detection since it is not possible to determine the length of the microsatellite. If both reads in a pair contain both the left and the right flanking signatures, the read pair counts as two in the frequency distribution.
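The read-level steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's implementation: the function names (`nucleotide_fractions`, `locus_length`, `length_distribution`) are hypothetical, and the use of a plain left-to-right string search to locate the flanking signatures is an assumption, since the manual does not describe how the signatures are searched for.

```python
from collections import Counter

def nucleotide_fractions(seq):
    """Return the fraction of A, C, G and T in seq (all 0.0 for an empty sequence)."""
    counts = Counter(seq)
    n = len(seq)
    return {base: (counts[base] / n if n else 0.0) for base in "ACGT"}

def locus_length(read, left_flank, right_flank, ref_locus, max_diff=0.3):
    """Return the microsatellite length observed in a read, or None if the read is unusable.

    Assumption: signatures are located with a simple exact string search.
    """
    start = read.find(left_flank)
    if start == -1:
        return None  # left flanking signature missing
    locus_start = start + len(left_flank)
    end = read.find(right_flank, locus_start)
    if end == -1:
        return None  # right flanking signature missing
    locus = read[locus_start:end]
    # Compare nucleotide fractions of the reference locus and the read locus.
    ref = nucleotide_fractions(ref_locus)
    obs = nucleotide_fractions(locus)
    diff = sum(abs(ref[b] - obs[b]) for b in "ACGT")
    if diff > max_diff:
        return None  # composition too different: flank probably misidentified
    return len(locus)

def length_distribution(reads, left_flank, right_flank, ref_locus):
    """Count how many usable reads support each observed locus length."""
    lengths = (locus_length(r, left_flank, right_flank, ref_locus) for r in reads)
    return Counter(length for length in lengths if length is not None)

# Example with a 10 bp homopolymer A reference locus and the 8 bp signatures from the text.
LEFT, RIGHT, REF = "ACTGCTGG", "TTTCGTAG", "A" * 10
reads = [
    "CTGACTGCTGG" + "A" * 10 + "TTTCGTAGCA",      # both signatures, length 10
    "CTGACTGCTGG" + "AAAACAAAAA" + "TTTCGTAGCA",  # one A to C mismatch, difference 0.2, kept
    "CTGACTGCTGG" + "A" * 8 + "TTTCGTAGCA",       # both signatures, length 8
    "CTGACTGCTGGAAAAAAAAAAA",                     # left signature only, discarded
    "CTGACTGCTGG" + "AAACACACAA" + "TTTCGTAGCA",  # difference 0.6, discarded
]
distribution = length_distribution(reads, LEFT, RIGHT, REF)
```

In this sketch each read is evaluated on its own, so a read pair in which both mates span the locus contributes two counts, in line with the paired-end rule described above.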
Detect MSI Status then measures the statistical variation of the length distribution of each microsatellite locus and decides for each locus if it is stable or not by comparing the statistical variation of the test sample with the baseline. If the proportion of unstable microsatellite loci is higher than a predefined threshold, the sample is considered MSI-low or MSI-high depending on the parameters.
When running Detect MSI Status, you will first need to select a read mapping. In the next dialog, specify an MSI baseline track (figure 8.5).
Figure 8.5: Top: Parameters for Detect MSI Status. Bottom: MSI baselines from Reference Data Manager.
Two baseline tracks are available in the Reference Data Manager: one is based on 27 microsatellites covered by the primers in the Human TMB and MSI Panel (DHS-8800Z). This baseline was created using 30 MSS samples that were mapped to hg38 (no alternative analysis set) and processed with the Generate MSI Baseline tool using default parameters. The other is a subset of the first containing 9 loci. These loci were identified using lung FFPE MSS and MSI samples and were found to perform consistently well during benchmarking. All 9 loci are mononucleotide homopolymers. Note that the 27 loci baseline track should not be used for detecting MSI status. This baseline is intended for creating subsets of loci specific to a particular cancer type. To generate a cancer specific baseline, investigate the quality of the loci of interest using relevant read mappings. To create a subset of the baseline, use the Create Track from Selection button, see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Showing_track_in_table.html.
The following parameters can be adjusted:
- Dispersion measurement settings. For each locus in each of the baseline samples, a set of stable baseline lengths is identified, containing all lengths with a frequency of at least 75% of that of the most frequent length in the baseline distribution (also known as the peak factor method).
Lengths that are up to a certain number of bases shorter or longer than those in the peak factor method set are also added to the final set.
The final baseline length set is created by combining all baseline lengths values from the individual baseline samples.
- Baseline stable length margin extends the baseline length set by the selected margin. For example, if the peak factor method baseline length set is composed of 20, 21 and 29 and the margin is 2, the lengths 18, 19, 22, 23, 27, 28, 30 and 31 are added to the final set.
- Noise reduction threshold filters away locus lengths that are not supported by at least this number of reads.
- Minimum read count per locus defines the minimum number of reads required for a given locus to be considered testable.
- Coverage ratio measures the proportion of reads that have a microsatellite length that is present in the baseline length set. This is inversely correlated with how much a sample normally deviates from the baseline set. This metric is recommended for homogeneous baselines, i.e., relatively few observed baseline lengths forming a unimodal distribution.
- Earth mover's distance is obtained by calculating, for each length bin that is not in the baseline set, the distance from that bin to the closest baseline length bin, multiplying this distance by the proportion of reads in the bin, and summing the results over all such bins. It is correlated with the dispersion from the baseline length set and it increases linearly with the distance from the baseline set. This metric is recommended for heterogeneous baselines, i.e., a multimodal distribution of observed baseline lengths and/or relatively many observed baseline lengths.
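The dispersion settings above can be illustrated with a short Python sketch. All function names are hypothetical, a length distribution is represented as a dict mapping locus length to read count, and reading "the ratio of the bin" in the earth mover's distance as the proportion of reads in that bin is an assumption.

```python
def stable_lengths(dist, peak_factor=0.75):
    """Peak factor method: keep lengths with at least peak_factor times the top frequency."""
    peak = max(dist.values())
    return {length for length, count in dist.items() if count >= peak_factor * peak}

def extend_with_margin(lengths, margin):
    """Add every length within `margin` bases of a stable length."""
    return {length + d for length in lengths for d in range(-margin, margin + 1)}

def reduce_noise(dist, min_reads):
    """Noise reduction: drop lengths supported by fewer than min_reads reads."""
    return {length: count for length, count in dist.items() if count >= min_reads}

def coverage_ratio(dist, baseline):
    """Proportion of reads whose locus length is in the baseline length set."""
    total = sum(dist.values())
    return sum(count for length, count in dist.items() if length in baseline) / total

def earth_movers_distance(dist, baseline):
    """Sum over non-baseline bins of (distance to closest baseline length) * (read proportion)."""
    total = sum(dist.values())
    return sum(
        min(abs(length - b) for b in baseline) * count / total
        for length, count in dist.items()
        if length not in baseline
    )

# One baseline sample: 20, 21 and 29 reach 75% of the peak count (100), 22 does not.
baseline_sample = {20: 100, 21: 80, 22: 10, 29: 75}
peak_set = stable_lengths(baseline_sample)
baseline_set = extend_with_margin(peak_set, margin=2)

# A test sample after noise reduction: the single read at length 40 is filtered away.
test_sample = reduce_noise({20: 60, 24: 20, 15: 20, 40: 1}, min_reads=2)
ratio = coverage_ratio(test_sample, baseline_set)
emd = earth_movers_distance(test_sample, baseline_set)
```

With several baseline samples, the final baseline length set would be the union of the per-sample sets, as described above.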
- Statistical test settings. The locus length dispersion is measured for each locus of each sample (test and baseline samples). Therefore, at the end of this step, for each locus, there will be a dispersion value for the test sample and a set of dispersion values for the baseline samples. The tool uses a statistical variation test to decide if the value of the test sample could be considered similar to the baseline group. The tool provides the following methods to make this decision:
- Standard deviation. The sample is said to be unstable if its
- coverage ratio is less than the mean minus three times the standard deviation.
- dispersion value is more than the mean plus three times the standard deviation for earth mover's distance.
- Interquartile range. The interquartile range (IQR) is calculated by subtracting the first quartile (Q1) of the baseline values from the third (Q3). Because the coverage ratio is inversely correlated with the dispersion, the direction of the test is inverted for that metric. The sample is said to be unstable if its
- coverage ratio is less than Q1 - 1.5 * IQR.
- dispersion value is more than Q3 + 1.5 * IQR for earth mover's distance.
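The two decision rules can be sketched as follows for a single locus. The function names and the `metric` argument are hypothetical. The use of the sample standard deviation and of the default quartile method of Python's `statistics.quantiles` are assumptions, since the manual does not state which estimators the tool uses.

```python
import statistics

def unstable_by_std(test_value, baseline_values, metric):
    """Three standard deviations rule; the direction is inverted for the coverage ratio."""
    mean = statistics.mean(baseline_values)
    sd = statistics.stdev(baseline_values)  # assumption: sample standard deviation
    if metric == "coverage_ratio":
        return test_value < mean - 3 * sd
    return test_value > mean + 3 * sd  # earth mover's distance

def unstable_by_iqr(test_value, baseline_values, metric):
    """1.5 * IQR rule; the direction is inverted for the coverage ratio."""
    q1, _, q3 = statistics.quantiles(baseline_values, n=4)  # assumption: default method
    iqr = q3 - q1
    if metric == "coverage_ratio":
        return test_value < q1 - 1.5 * iqr
    return test_value > q3 + 1.5 * iqr  # earth mover's distance

# Illustrative (made-up) per-locus values from five baseline samples.
baseline_cov = [0.95, 0.96, 0.97, 0.94, 0.98]
baseline_emd = [0.0, 0.01, 0.02, 0.01, 0.0]

cov_calls = [unstable_by_std(0.60, baseline_cov, "coverage_ratio"),
             unstable_by_iqr(0.60, baseline_cov, "coverage_ratio")]
emd_calls = [unstable_by_std(0.80, baseline_emd, "earth_movers_distance"),
             unstable_by_iqr(0.80, baseline_emd, "earth_movers_distance")]
```

Both rules flag a low coverage ratio or a high earth mover's distance as unstable; they differ only in how the spread of the baseline values is estimated.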
- MSI status detection. Given the stability of the individual loci, the MSI status is determined from the percentage of unstable loci, which is compared against two predefined thresholds: one for low instability (MSI-L) and one for high instability (MSI-H). When calculating the percentage of unstable loci, only testable loci are taken into account and a minimum of 50% of the loci must be valid to make a call. If this is not the case, the status is set to Undetermined.
- Maximum percentage of unstable loci for MSS, set by default at 15%
- Minimum percentage of unstable loci for MSI-H, set by default at 40%
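The final call can be sketched as below, using the default thresholds listed above. The function name `msi_status` and the input encoding are hypothetical. Two points are assumptions rather than documented behavior: that "valid" loci means testable loci, and that the boundaries are inclusive (exactly 15% is still MSS, exactly 40% is already MSI-H).

```python
def msi_status(locus_calls, mss_max_pct=15.0, msi_h_min_pct=40.0, min_valid_pct=50.0):
    """Call the MSI status from per-locus results.

    locus_calls: list with True (unstable), False (stable) or None (not testable,
    e.g. too few reads at the locus).
    """
    testable = [call for call in locus_calls if call is not None]
    # Assumption: a locus counts as valid when it is testable.
    if not locus_calls or 100.0 * len(testable) / len(locus_calls) < min_valid_pct:
        return "Undetermined"
    pct_unstable = 100.0 * sum(testable) / len(testable)
    if pct_unstable >= msi_h_min_pct:
        return "MSI-H"
    if pct_unstable > mss_max_pct:
        return "MSI-L"
    return "MSS"

# Examples with a 9-locus baseline.
high = msi_status([True] * 4 + [False] * 5)      # 4 of 9 unstable, about 44%
low = msi_status([True] * 2 + [False] * 7)       # 2 of 9 unstable, about 22%
stable = msi_status([True] * 1 + [False] * 8)    # 1 of 9 unstable, about 11%
unknown = msi_status([True] * 2 + [None] * 7)    # only 2 of 9 loci testable
```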