Create SNP Tree

The Create SNP Tree tool is inspired by [Kaas et al., 2014]. There are two ways to create a SNP tree: from the Result Metadata Table (see subsection 9.4) or by running the tool from the Toolbox. Note that you can only create a SNP tree if you have identified a common reference for the different strains you are trying to type, and have used it for read mapping and variant calling for each of these samples.

To create a SNP tree from the Toolbox:

        Microbial Genomics Module | Typing and Epidemiology (beta) | Create SNP Tree

Select the relevant read mappings as shown in figure 13.1.

Image stree1
Figure 13.1: Select read mappings to be included in the SNP tree analysis.

Alternatively, select data recursively by right-clicking the folder name and selecting Add folder contents (recursively) (figure 13.2), but remember to double-check that the files relevant for the downstream analysis are selected. An efficient alternative to both methods is to use the Quick filtering functionality of the Result Metadata Table to filter the data and initiate SNP tree creation from there.

Image recure_lowq
Figure 13.2: To select all sequence files in a folder, right-click the folder and select Add folder contents (recursively).

Select the variant tracks you want to use (figure 13.3). The variant tracks determine which positions to include in the SNP tree, and they must have the same reference as the previously selected read mappings. Normally you would select one variant track for each read mapping given in the input step, but that is not a requirement.

Image stree2
Figure 13.3: Select variant tracks and specify relevant parameters before generation of a SNP tree.

The following parameters may be specified before the SNP tree is generated (see figure 13.3):

The variant calls and read mapping results are used to determine the SNP positions used in the tree. The variant tracks serve only to determine which positions to include in the SNP tree: only the position and the type (SNP, and MNV if enabled) of each variant are used, whereas any information about reference and allele is ignored. The read mappings are then used to estimate the consensus sequence at these positions. Only a variant with a relative frequency above 50% (haploid organisms) is effectively taken into account.
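The 50% rule can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function name `consensus_base`, the representation of the reads covering a position as a string of bases, and the fallback to the reference base when no variant exceeds the threshold are all assumptions made for illustration.

```python
from collections import Counter


def consensus_base(read_bases, reference_base, min_fraction=0.5):
    """Estimate the consensus base at one position (hypothetical helper).

    read_bases: the bases observed in the reads covering the position.
    A non-reference base is reported only if its relative frequency is
    strictly above min_fraction; otherwise the reference base is returned
    (an assumption, not documented behavior).
    """
    counts = Counter(read_bases)
    total = sum(counts.values())
    if total == 0:
        # No coverage: nothing supports a variant here.
        return reference_base
    base, count = counts.most_common(1)[0]
    if base != reference_base and count / total > min_fraction:
        return base
    return reference_base


# Three of four reads support G, so G is the consensus.
print(consensus_base("GGGA", reference_base="A"))
# Exactly half the reads support G, which is not above 50%.
print(consensus_base("GGAA", reference_base="A"))
```

With a strict threshold, a variant supported by exactly half of the reads is not counted, which matches the "above 50%" wording.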

The initial list of variants is reduced as follows: All but one of the variants from the initial variant lists that fall within the specified pruning distance of each other (for example 10 nt) are ignored. Positions that are poorly covered or not covered at all in one or more read mappings (see "Minimum coverage required in each sample" and "Minimum coverage of average required") are removed. In addition, all SNPs that do not reach the minimum z-score are excluded.
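The first two reduction steps can be sketched in Python as follows. The sketch rests on several assumptions: the manual does not say which variant in a cluster is kept, so the leftmost one is kept here; "within the pruning distance" is read as "closer than the pruning distance"; and the function names and the coverage data layout are hypothetical. The average-coverage and z-score filters are left out because their exact definitions are not given in this section.

```python
def prune_positions(positions, pruning_distance):
    """Keep one position per cluster of nearby positions.

    Assumption: scanning left to right, a position is kept only if it is
    at least pruning_distance away from the last kept position.
    """
    kept = []
    for pos in sorted(set(positions)):
        if not kept or pos - kept[-1] >= pruning_distance:
            kept.append(pos)
    return kept


def filter_by_coverage(positions, coverage, min_coverage):
    """Drop positions covered by fewer than min_coverage reads in any sample.

    coverage maps sample name -> {position: read depth}; a position that
    is missing for a sample counts as depth 0.
    """
    return [
        pos
        for pos in positions
        if all(depth.get(pos, 0) >= min_coverage for depth in coverage.values())
    ]


candidates = prune_positions([100, 105, 112, 130], pruning_distance=10)
coverage = {
    "sample_1": {100: 30, 112: 4, 130: 25},
    "sample_2": {100: 28, 112: 40, 130: 22},
}
print(candidates)
print(filter_by_coverage(candidates, coverage, min_coverage=10))
```

In this example position 105 is pruned because it lies 5 nt from position 100, and position 112 is removed afterwards because only 4 reads cover it in `sample_1`.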

The Neighbour Joining method is used to create the tree. Branch lengths are based on the distance between samples. The distance between two samples is computed as "Number of input positions used where the consensus sequence is different" / "Number of input positions used". The distance is therefore a number between 0 (no difference found in the input positions used) and 1 (all input positions used were different). From the tree, one can compute the distance between two samples by summing the lengths of all branches connecting them.
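The distance formula can be written out as a short Python sketch. It assumes that each sample's consensus has already been reduced to one base per input position used, stored as a string of equal length for every sample. The names `snp_distance` and `distance_matrix` are hypothetical, and the tree construction itself is not shown.

```python
from itertools import combinations


def snp_distance(consensus_a, consensus_b):
    """Fraction of the input positions used at which two consensus sequences differ."""
    if len(consensus_a) != len(consensus_b):
        raise ValueError("both samples must cover the same positions")
    if not consensus_a:
        raise ValueError("at least one position is required")
    differing = sum(1 for a, b in zip(consensus_a, consensus_b) if a != b)
    return differing / len(consensus_a)


def distance_matrix(consensus_by_sample):
    """Pairwise distances for all samples, keyed by (sample, sample) pairs."""
    return {
        (s1, s2): snp_distance(consensus_by_sample[s1], consensus_by_sample[s2])
        for s1, s2 in combinations(sorted(consensus_by_sample), 2)
    }


samples = {"strain_A": "ACGT", "strain_B": "ACGA", "strain_C": "TCGA"}
for pair, dist in distance_matrix(samples).items():
    print(pair, dist)
```

Here `strain_A` and `strain_B` differ at one of the four positions, giving a distance of 0.25, while `strain_A` and `strain_C` differ at two positions, giving 0.5. A matrix of this kind is the usual input to a Neighbour Joining implementation.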


