Maximum Likelihood Phylogeny
Maximum Likelihood Phylogeny is a statistically grounded method that evaluates many possible tree topologies to find the one with the highest likelihood. For nucleotide alignments, the maximum likelihood tree estimation is performed under the assumption of one of five substitution models: Jukes-Cantor, Felsenstein 81, Kimura 80, HKY, and GTR (also known as the REV model). This is the most advanced and time-consuming tree construction method of those provided by the CLC Genomics Workbench.
A substitution model must be specified when launching this tool. To identify the most suitable substitution model, use the Model Testing tool.
To launch Maximum Likelihood Phylogeny, go to:
Tools | Classical Sequence Analysis | Alignments and Trees | Maximum Likelihood Phylogeny
The tool accepts an alignment as input.
In the tool wizard, select the parameters for tree construction (figure 25.4):
Figure 25.4: Adjusting parameters for Maximum Likelihood Phylogeny.
- Start tree
- Tree construction method. Specify which distance-based method to use for creating the initial tree, Neighbor Joining or UPGMA:
- UPGMA. Assumes constant rate of evolution.
- Neighbor Joining. Well suited for trees with varying rates of evolution.
- Existing start tree. Alternatively, an existing tree can be used as a starting tree for the tree reconstruction. Click on the folder icon to the right of the text field to specify the desired starting tree.
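To illustrate what a distance-based start tree looks like, the following is a minimal sketch of UPGMA clustering. It is not the Workbench's implementation; the function name `upgma`, the nested-tuple tree representation, and the example distances are assumptions made for illustration only.

```python
import itertools

def upgma(names, matrix):
    """Build a rooted tree by UPGMA from a symmetric distance matrix.

    Leaves are returned as their names; internal nodes as tuples
    (left_subtree, right_subtree, node_height).
    """
    # cluster id -> (subtree, number of leaves in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in itertools.combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Constant-rate assumption: the new node sits at half the distance.
        height = dist.pop(pair) / 2.0
        # Distance to every remaining cluster is the size-weighted average.
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Hypothetical distances between four sequences A, B, C and D.
example_names = ["A", "B", "C", "D"]
example_matrix = [
    [0.0, 2.0, 6.0, 6.0],
    [2.0, 0.0, 6.0, 6.0],
    [6.0, 6.0, 0.0, 4.0],
    [6.0, 6.0, 4.0, 0.0],
]
example_tree = upgma(example_names, example_matrix)
```

Because every node is placed at half the distance between the clusters it joins, all leaves end up equally far from the root, which is the constant-rate assumption mentioned above. Neighbor Joining does not impose this constraint.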
- Select substitution model
- Nucleotide substitution model. Maximum likelihood tree estimation can be performed under the assumption of one of five nucleotide substitution models:
- Jukes-Cantor [Jukes and Cantor, 1969]
- Felsenstein 81 [Felsenstein, 1981]
- Kimura 80 [Kimura, 1980]
- HKY [Hasegawa et al., 1985]
- General Time Reversible (GTR) (also known as the REV model) [Yang, 1994a]
All models are time-reversible. In the Kimura 80 and HKY models, a transition/transversion ratio value may be set, which will be used as a starting value for optimization or as a fixed value, depending on the level of estimation chosen. For further details, see 25.4.2.
- Protein substitution model. Maximum likelihood tree estimation can be performed under the assumption of one of four peptide substitution models:
- Bishop-Friday [Bishop and Friday, 1985]
- Dayhoff (PAM) [Dayhoff et al., 1978]
- JTT [Jones et al., 1992]
- WAG [Whelan and Goldman, 2001]
The Bishop-Friday substitution model is similar to the Jukes-Cantor model for nucleotide sequences, i.e. it assumes equal amino acid frequencies and substitution rates. This is an unrealistic assumption and we therefore recommend using one of the remaining three models. The Dayhoff, JTT and WAG substitution models are all based on large-scale experiments where amino acid frequencies and substitution rates have been estimated by aligning thousands of protein sequences. For these models, the Maximum Likelihood Phylogeny tool does not estimate parameters, but simply uses those determined from these experiments.
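As an illustration of how the simpler nucleotide models differ, the following minimal sketch computes substitution probabilities along a single branch under the Jukes-Cantor and Kimura 80 models, using their standard closed-form expressions. The function names are hypothetical, and expressing the Kimura 80 parameter as the transition/transversion rate ratio `kappa` is an assumption; the Workbench may define its ratio differently.

```python
import math

def jc69_probs(d):
    """Jukes-Cantor: return (P(same base), P(one specific different base))
    after a branch of length d (expected substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def k80_probs(d, kappa):
    """Kimura 80: return (P(same), P(transition), P(one specific transversion)).

    kappa is the transition rate divided by the transversion rate
    (assumed parameterization, for illustration only).
    """
    e1 = math.exp(-4.0 * d / (kappa + 2.0))
    e2 = math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0))
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2
    transversion = 0.25 - 0.25 * e1
    return same, transition, transversion

# Example: a branch of 0.1 substitutions per site with kappa = 2.
same, ts, tv = k80_probs(0.1, kappa=2.0)
```

Each base has one transition partner and two transversion partners, so `same + ts + 2 * tv` equals 1. With `kappa = 1` the Kimura 80 probabilities reduce to those of Jukes-Cantor.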
- Rate variation
To enable variable substitution rates among individual nucleotide sites in the alignment, select the Include rate variation box. When selected, the discrete gamma model of Yang [Yang, 1994b] is used to model rate variation among sites. The number of categories used in the discretization of the gamma distribution as well as the gamma distribution parameter may be adjusted by the user (as the gamma distribution is restricted to have a mean of 1, there is only one parameter in the distribution).
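The discretization described above can be sketched as follows. This is a minimal illustration, not the Workbench's code: it splits a gamma distribution with mean 1 (shape and rate both equal to `alpha`) into equal-probability categories and represents each category by its mean rate. The function names are hypothetical, and representing categories by their means rather than their medians is an assumption.

```python
import math

def reg_lower_gamma(a, x, max_terms=500):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_cdf(rate, alpha):
    """CDF of a gamma distribution with shape alpha and mean 1."""
    return reg_lower_gamma(alpha, alpha * rate)

def gamma_quantile(p, alpha):
    """Invert gamma_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, alpha) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, categories):
    """Mean rate of each of the equal-probability categories."""
    bounds = [0.0] + [gamma_quantile(i / categories, alpha)
                      for i in range(1, categories)]
    # The integral of r * f(r) from 0 to b equals P(alpha + 1, alpha * b);
    # the last category extends to infinity, where this integral is 1.
    cumulative = [reg_lower_gamma(alpha + 1.0, alpha * b) for b in bounds] + [1.0]
    return [categories * (cumulative[i + 1] - cumulative[i])
            for i in range(categories)]

# Example: four categories with gamma distribution parameter 0.5.
rates = discrete_gamma_rates(0.5, 4)
```

Since each category has the same probability, the category rates average to 1, so including rate variation does not change the meaning of branch lengths. Smaller values of `alpha` give a wider spread of rates across categories.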
- Estimation
Estimation is done according to the maximum likelihood principle, that is, a search is performed for the values of the free parameters of the assumed model that result in the highest likelihood of the observed alignment [Felsenstein, 1981]. By ticking the Estimate substitution rate parameters box, maximum likelihood values of the free parameters in the rate matrix describing the assumed substitution model are found. If the Estimate topology box is selected, the space of tree topologies is searched for the topology that best explains the alignment. If the box is left unticked, the topology is kept fixed at that of the starting tree.
The Estimate Gamma distribution parameter box is active only if rate variation has been included in the model, in which case it allows estimation of the Gamma distribution parameter to be switched on or off. If the box is left unticked, the value is fixed at that given in the Rate variation part. In the absence of rate variation, estimation of substitution parameters and branch lengths is carried out according to the expectation-maximization algorithm [Dempster et al., 1977]. With rate variation, the maximization algorithm is performed. The topology space is searched according to the PHYML method [Guindon and Gascuel, 2003], allowing efficient search and estimation of large phylogenies. Branch lengths are given in terms of expected numbers of substitutions per nucleotide site.
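The maximum likelihood principle can be illustrated in the simplest possible setting: estimating the branch length between two aligned sequences under the Jukes-Cantor model. The sketch below is an assumption-laden toy, not the tool's algorithm. It uses a plain ternary search in place of expectation-maximization, and the function names and example counts are hypothetical.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a pairwise alignment with n_diff mismatching sites
    out of n_sites, for a branch length d (substitutions per site)."""
    p_diff = -0.25 * math.expm1(-4.0 * d / 3.0)  # one specific different base
    p_same = 1.0 - 3.0 * p_diff
    # The factor 0.25 is the equilibrium frequency of the base in sequence 1.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def ml_branch_length(n_sites, n_diff, lo=1e-9, hi=10.0, iterations=200):
    """Find the branch length with the highest likelihood by ternary search
    (a simple stand-in for the optimizers used by real software)."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (jc69_log_likelihood(m1, n_sites, n_diff)
                < jc69_log_likelihood(m2, n_sites, n_diff)):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def jc69_closed_form(n_sites, n_diff):
    """Closed-form Jukes-Cantor distance, for comparison."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * n_diff / n_sites)

# Example: 10 mismatches in an alignment of 100 sites.
numeric = ml_branch_length(100, 10)
analytic = jc69_closed_form(100, 10)
```

For this two-sequence case the numerical search and the closed-form distance agree to within the search precision. A full tree has no such closed form, which is why the parameters, branch lengths and topology have to be found by iterative search.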
In the next step of the wizard it is possible to perform bootstrapping (figure 25.5).
Figure 25.5: Adjusting bootstrapping parameters.
- Bootstrapping
- Perform bootstrap analysis. Check this option to perform a bootstrap analysis.
- Replicates. The number of replicates used in the bootstrap analysis. The default value (100 replicates) is usually enough to distinguish between reliable and unreliable nodes in the tree.
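The resampling step behind a bootstrap analysis can be sketched as follows. Each replicate is a new alignment of the same length, built by drawing columns from the original alignment with replacement. This is a minimal illustration under assumed names (`bootstrap_replicate`, `bootstrap_replicates`) and an assumed dictionary representation of the alignment, not the Workbench's implementation.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps sequence names to aligned sequences of equal length.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so each
    # resampled column is an intact column of the original alignment.
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, replicates=100, seed=0):
    """Generate a list of bootstrap alignments from a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(replicates)]

# Hypothetical toy alignment.
toy_alignment = {"seq1": "ACGTAC", "seq2": "ACGTTC", "seq3": "ACCTAC"}
toy_replicates = bootstrap_replicates(toy_alignment, replicates=5, seed=1)
```

In a typical bootstrap analysis, a tree is then estimated from each replicate, and the support for a node is reported as the fraction of replicate trees that contain it.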
