Predict Methylation Profile

The Predict Methylation Profile tool estimates the composition of an input sample. The tool is primarily designed for use with the QIAseq Targeted Methyl panel "T Cell Infiltration Panel (MHS-202Z)", though it can be used in other settings.

To start the tool, go to:

        Toolbox | Epigenomics Analysis (Image epigenomics) | Bisulfite Sequencing (Image bisulfite_folder_closed_16_n_p)| Predict Methylation Profile (Image predict_sample_composition_16_n_p)

In the first dialog, choose a methylation levels track produced by the tool Call Methylation Levels. It is recommended that the Call Methylation Levels tool has been run with the option to Report unmethylated cytosines. Click Next to configure the following parameters:

Constructing a custom methylation database

A database of two or three pure types can be created by using the Create Methylation Database tool described in Create Methylation Database. Another option is to import a custom methylation database from an annotation file format such as GFF3, GTF, or BED, using Import Tracks (see

An example of a database with two pure types, "mcf" and "leuko", is provided in tab-delimited GFF3 format below. For more details of the format, refer to

##gff-version 3.1.26
1   .   sequence_feature   3567263 3567264 .   .   .   mcf=0.11;leuko=0.95
1   .   sequence_feature   3567272 3567273 .   .   .   mcf=0.16;leuko=1
1   .   sequence_feature   3567288 3567289 .   .   .   mcf=0.2;leuko=0.99

Note that every region must be annotated with every pure type - in the above example, the tool would skip regions that only had information for "mcf" and not "leuko". Also be sure to check that the imported coordinates match your expectations. This is easily done when the regions in the database are CpG sites, by making a new Track List of the reference track and the imported database, for details see:

When constructing a custom methylation database it is important to avoid bias. This can take different forms.

The Predict Methylation Profile algorithm

The algorithm splits an input sample into a mixture of the types listed in the Known methylation level columns parameter. If there are $ n$ such types, then we define the fraction that is of type $ i$ to be $ f_i$ and impose the constraints $ 0 \leq f_i \leq 1$, $ \sum_i^n f_i \leq 1$.

The regions of the methylation database are first filtered to remove regions whose coverage is lower than specified in the Minimum coverage option.

If the option Merge nearby regions is enabled, then the remaining regions are merged if they meet all the following conditions:

  1. They are separated on the genome by $ <1000$ nt with no other region between them.
  2. The known methylation levels, $ k_{i,j}$ for pure type $ i$ and the two adjacent regions $ j$ and $ j-1$ satisfy $ \vert k_{i,j} - k_{i,j-1} \vert \leq 0.05$, $ i = 1,2 \dots n$
  3. The 95% confidence intervals of the methylation levels of the two regions overlap. The intervals are calculated using the Wilson score interval.

The $ n_r$ merged regions are then used for prediction using a maximum likelihood approach. In the underlying model, regions can be "good" regions or "bad" regions, such that $ n_r = n_{good} + n_{bad}$. The predicted composition is only consistent with the methylation levels of the good regions.

The total number of bad regions is exponentially distributed to ensure that predictions requiring many bad regions are disfavored. Specifically, $ n_{bad} \sim \exp(1)$. This means that the number of bad regions is a free parameter (though bound such that $ n_{bad}/n_{r} < 0.4$) that must be determined at the same time as the sample compositions. This explicit modeling of bad regions reflects the observation that not all regions in the database will generalize to new samples as the methylation database is constructed from a limited number of samples.

For a given prediction, $ f_i$, $ i = 1,2 \dots n$, the expected methylation level at site $ j$ is $ \beta_{expected} = \sum_i^n f_i * k_{i,j}$

The likelihood of a region with $ m$ methylated reads and coverage $ c$ being good is binomially distributed:

$\displaystyle p_{good}(m \vert c, \beta_{expected}) \sim \mathcal{B}(c, \beta_{expected})$ (11.1)

All bad regions have the same likelihood, $ p_{bad}$, which is a uniform distribution over the total number of regions, $ n_r$:

$\displaystyle p_{bad} = \frac{1}{1+ n_r}$ (11.2)

The bad regions that maximize the log-likelihood for a given prediction $ f_i$, $ i = 1,2 \dots n$ are found by ordering regions by $ \log p_{bad} - \log p_{good}$ in ascending order. Then up to the first $ n_{bad}$ regions with $ \log p_{bad} - \log p_{good} < 0$ are bad regions.

The total log-likelihood for a given prediction is then:

log-likelihood$\displaystyle = \sum_{bad} p_{bad} + \sum_{good} p_{good} - n_{bad}$ (11.3)

The prediction maximizing the total log-likelihood is found by numerical optimization.

95% confidence intervals for $ f_i$ are estimated by taking the 5th percentile and 95th percentile values calculated from $ 1000$ bootstrap iterations. In each iteration, the same number of regions as used for the initial prediction are randomly chosen with replacement from the list of regions used for that prediction, and the prediction is re-calculated on these.

The Predict Methylation Profile report

The predict methylation profile report includes the following information:

Note that the 95% confidence intervals only approximate the error due to the prediction being based on a selection of a small number of regions of the genome. Additional sources of error include:

As a rule-of-thumb, one should suspect the presence of these additional sources of error when $ \sum_i^n f_i \ll 1$.

The Predict Methylation Profile region track

The columns of the predict methylation profile region track indicate: