All of the k-mer based distance measures completely ignores the ordering of
the k-mers inside the input sequences. Hence, if the selected k value (the length of the sequences) is too
small, very distantly related organisms may be assigned a small
evolutionary distance (in the extreme case where k is
, two organisms will
be treated as being identical if the frequency of each
nucleotide/amino-acid is the same in the two corresponding
sequences). In the other extreme, the k-mers should have a length (k) that
is somewhat below the average distance between mismatches if the input
sequences were aligned (in the extreme case of k=the length of the
sequences, two organisms have a maximum distance if they are not
identical). Thus the selected k value should not be too large and not too
small. A general rule of thumb is to only use k-mer based distance
estimation for organisms that are not too distantly related.
Formal definition of distance. In the following, we give a more
formal definition of the three supported distance measures:
Euclidian-squared, Mahalanobis and Fractional common k-mer count. For
all three, we first
associate a point
to every input sequence
. Each point
has one coordinate for every possible length k sequence (e.g. if
represent nucleotide sequences, then
has
coordinates). The coordinate
corresponding to a length k sequence
has the value: ``number of
times
occurs as a subsequence in
''. Now for two sequences
and
, their evolutionary distance is defined as follows:
Here the standard deviations can be computed directly from a set of equilibrium frequencies for the different bases, see [Gentleman and Mullin, 1989].
Here
In experiments performed in [Höhl et al., 2007], the Mahalanobis distance measure seemed to be the best performing of the three supported measures.