Create K-mer Tree

The Create K-mer Tree tool can help identify the closest common reference across samples. The tool takes reads, single sequences, or sequence lists as input and creates a distance-based phylogenetic tree. If a sequence list has a read group, it is treated as a set of reads. Otherwise, the tool groups the sequences in the list by their "Assembly ID" annotation, or treats each sequence individually when no "Assembly ID" annotation has been assigned. To find out how to assign the Assembly ID annotation, please see Using the Assembly ID annotation.

There are two ways to initiate creation of a k-mer tree: either from the Result Metadata Table (see the section on Running analysis directly from the Result Metadata Table), or from the Toolbox.

To run the Create K-mer Tree tool from the Toolbox:

        Typing and Epidemiology | Create K-mer Tree

Input files can be specified step by step, as shown in figure 12.15, or selected recursively by right-clicking on the folder name and selecting Add folder contents (recursively). If you use the recursive option, double-check that only the files relevant for the downstream analysis are selected.

Image ktree1
Figure 12.15: Selection of reads, single sequences, or sequence lists to be included in the K-mer tree analysis.

Specify the following parameters (figure 12.16):

Image ktree2
Figure 12.16: Various parameters may be set before generation of a K-mer tree.

The K-mer trees are constructed using a Neighbour Joining method, which relies on a distance function: either Jaccard Distance or Feature Frequency Profile via Jensen-Shannon divergence (FFP). In both cases, the distance takes values between 0 (exactly the same k-mer distribution) and 1 (completely different k-mer distributions).
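To make the two distance functions more concrete, the following Python sketch computes a Jaccard distance on k-mer sets and a Jensen-Shannon divergence on k-mer frequency profiles for two short sequences. It is an illustration only, not the tool's implementation: the function names are hypothetical, and details such as strand handling or any further normalisation the tool may apply are assumptions left out here. A base-2 logarithm is assumed so that the divergence falls between 0 and 1.

```python
from collections import Counter
from math import log2


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard_distance(seq_a, seq_b, k):
    """1 - |A and B| / |A or B|, where A and B are the k-mer sets."""
    a, b = set(kmers(seq_a, k)), set(kmers(seq_b, k))
    union = a | b
    if not union:
        return 0.0  # no k-mers on either side: treat as identical
    return 1.0 - len(a & b) / len(union)


def profile(seq, k):
    """Relative frequency of each k-mer in the sequence."""
    counts = Counter(kmers(seq, k))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def jensen_shannon_divergence(seq_a, seq_b, k):
    """Jensen-Shannon divergence (base 2) between two k-mer profiles."""
    p, q = profile(seq_a, k), profile(seq_b, k)
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2  # mixture of the two profiles
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd
```

With this sketch, two identical sequences give 0 under both functions, and two sequences that share no k-mers give 1 (up to floating-point rounding for the divergence), matching the range described above.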

Branch lengths depend on the distance function used. Specifically, summing the lengths of all the branches on the path connecting two leaves gives the distance between the two organisms that the leaves represent.
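The relationship between branch lengths and pairwise distances can be illustrated with a small Python sketch. The tree, its node names, and its branch lengths are hypothetical values chosen for illustration, and the dictionary representation is an assumption rather than the format used by the tool. Note that, in general, a Neighbour Joining tree reproduces the input distances exactly only when those distances are additive; otherwise the path length is an approximation.

```python
def path_length(tree, start, end):
    """Sum branch lengths along the unique path between two nodes.

    `tree` maps each node to a dict of {neighbour: branch_length}.
    """
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == end:
            return dist
        for neighbour, length in tree[node].items():
            if neighbour != parent:  # do not walk back along the same branch
                stack.append((neighbour, node, dist + length))
    raise ValueError(f"no path between {start!r} and {end!r}")


# Hypothetical unrooted tree: three leaves joined at one internal node "n1".
tree = {
    "A": {"n1": 0.10},
    "B": {"n1": 0.15},
    "C": {"n1": 0.30},
    "n1": {"A": 0.10, "B": 0.15, "C": 0.30},
}

print(round(path_length(tree, "A", "B"), 2))  # 0.25
```

Here the distance between leaves A and B is the sum of the two branches that connect them through the internal node.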


