Create Large MLST Scheme (beta)

The Create Large MLST Scheme tool can be used to create a scheme from scratch.

To run the Create Large MLST Scheme (beta) tool choose:

        Microbial Genomics Module | Databases | Large MLST | Create Large MLST Scheme (beta)

As input, the tool requires a set of complete isolate genomes in the form of one or more sequence lists or sequences. These genomes must be annotated with coding region (CDS) annotations. If these are not available, the Find Prokaryotic Genes tool (see Find Prokaryotic Genes) can be used to predict and annotate the coding regions.

Note that when working with genomes that span several chromosomes or contigs, it is necessary to declare which sequences belong to the same genome. This is done by ensuring that all sequences from the same genome have the same 'Assembly ID' and 'Latin name' annotations. Some tools, such as Download Microbial Reference Database, assign these annotations automatically. To assign Assembly ID annotations manually, please see Using the Assembly ID annotation.
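The grouping rule above can be illustrated with a minimal sketch. It is not the tool's implementation. The record layout and the field names `assembly_id`, `latin_name` and `name` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_genome(records):
    """Group sequence names into genomes keyed by (Assembly ID, Latin name)."""
    genomes = defaultdict(list)
    for rec in records:
        # Sequences sharing both annotations are treated as one genome.
        key = (rec["assembly_id"], rec["latin_name"])
        genomes[key].append(rec["name"])
    return dict(genomes)


# Hypothetical input: two chromosomes of one genome and one contig of another.
records = [
    {"name": "chr1", "assembly_id": "ASM_A", "latin_name": "Vibrio cholerae"},
    {"name": "chr2", "assembly_id": "ASM_A", "latin_name": "Vibrio cholerae"},
    {"name": "contig_7", "assembly_id": "ASM_B", "latin_name": "Vibrio cholerae"},
]
genomes = group_by_genome(records)
```

Here the two chromosomes end up in one genome because they share both annotations, while the contig with a different Assembly ID forms a genome of its own.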

Image lmlst_create_scheme_step1
Figure 18.1: Basic options for creating a Large MLST scheme.

After specifying the input, the second step is to set up the basic Large MLST Scheme creation parameters (figure 18.1).

The Create Large MLST Scheme tool works by extracting all annotated coding sequences (CDS) and clustering them into similar gene classes (loci). It is possible to specify which loci to include in the scheme: genes that are present in some genomes (Whole genome - must be present in at least 10% of the genomes), genes that are present in most genomes (Core genome - must be present in at least 90% of the genomes), or genes that are present in a user-specified Minimum fraction of the genomes.
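As a rough illustration of the presence thresholds, the following sketch keeps a locus when the fraction of genomes containing it reaches a given minimum. The function name `select_loci` and the data layout are assumptions made for this example. The tool's internal procedure may differ.

```python
def select_loci(presence, n_genomes, min_fraction):
    """Return the loci found in at least min_fraction of the genomes."""
    return sorted(
        locus
        for locus, genomes in presence.items()
        if len(genomes) / n_genomes >= min_fraction
    )


# Hypothetical presence table for 20 genomes: locus -> genomes containing it.
genome_ids = [f"g{i}" for i in range(20)]
presence = {
    "locus_a": set(genome_ids),       # present in 20 of 20 genomes
    "locus_b": set(genome_ids[:19]),  # present in 19 of 20 genomes
    "locus_c": set(genome_ids[:5]),   # present in 5 of 20 genomes
    "locus_d": set(genome_ids[:1]),   # present in 1 of 20 genomes
}

whole_genome = select_loci(presence, len(genome_ids), 0.10)
core_genome = select_loci(presence, len(genome_ids), 0.90)
```

With these made-up numbers, the 10% threshold keeps `locus_a`, `locus_b` and `locus_c`, while the 90% threshold keeps only `locus_a` and `locus_b`.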

The best results are obtained by supplying genomes with proper CDS annotations. The Handle genes without annotations option controls how genomes without CDS annotations are handled:

The Sequence type and locus parameters control the following filtering and naming options:

Image lmlst_create_scheme_step2
Figure 18.2: The allele grouping (clustering) options.

The Allele grouping parameters step specifies how the different genes (CDS annotations) are compared to each other. Diamond is used for this clustering. It is also possible to specify the Genetic code for the input samples.

The Minimum identity sets the minimum sequence identity that protein sequences must share to be grouped together. It is also possible to specify the sensitivity of the search (Standard search, Sensitive search, and More sensitive search). Increasing the sensitivity makes the search more thorough, but also much slower. The default for this parameter is Sensitive search.
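To make the idea of an identity threshold concrete, the sketch below computes percent identity for two protein sequences that are already aligned to the same length, and compares it to a minimum. This is only an illustration and does not reproduce how Diamond aligns or scores sequences. The identity definition (matching columns divided by alignment length), the percentage scale and the threshold of 80 are all assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical, non-gap residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def same_group(aligned_a, aligned_b, min_identity):
    """True if the pair meets the (hypothetical) minimum identity."""
    return percent_identity(aligned_a, aligned_b) >= min_identity


reference = "MKTAYIAKQR"
close_variant = "MKTAYIAKQK"    # one substitution
distant_variant = "MKSAYLAKQK"  # three substitutions

close_ok = same_group(reference, close_variant, 80.0)
distant_ok = same_group(reference, distant_variant, 80.0)
```

In this toy case the variant with one substitution passes the threshold and the variant with three substitutions does not.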

Image lmlst_create_scheme_step3
Figure 18.3: The functional annotation parameters.

It is possible to decorate the alleles with information about virulence or resistance. The information can be extracted from either a ShortBRED Marker database or a Nucleotide database. These databases can be accessed using the Download Resistance Database tool (e.g. QMI-AR for resistance or VFDB for virulence) and can be provided as input to the Create Large MLST Scheme tool at this step.

Image lmlst_create_scheme_step4
Figure 18.4: The clustering parameters.

The clustering parameters determine how the heatmap should be clustered. The heatmap cell values are the observed frequencies of a given allele compared to the other alleles in the same locus.
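The frequency calculation can be sketched as follows. This is a simplified illustration, assuming that frequencies are counted across sequence type profiles and that each profile maps a locus to an allele number. The names `allele_frequencies` and `profiles` are hypothetical, and the tool may count observations differently.

```python
from collections import Counter


def allele_frequencies(profiles):
    """Return {locus: {allele: frequency}} computed across all profiles."""
    counts = {}
    for profile in profiles.values():
        for locus, allele in profile.items():
            counts.setdefault(locus, Counter())[allele] += 1
    return {
        locus: {allele: n / sum(c.values()) for allele, n in c.items()}
        for locus, c in counts.items()
    }


# Hypothetical sequence type profiles: sequence type -> {locus: allele number}.
profiles = {
    "ST1": {"locus_a": 1, "locus_b": 1},
    "ST2": {"locus_a": 1, "locus_b": 2},
    "ST3": {"locus_a": 2, "locus_b": 3},
    "ST4": {"locus_a": 1, "locus_b": 3},
}
freqs = allele_frequencies(profiles)

# Value for one heatmap cell: frequency of the allele that ST3 carries at locus_b.
cell_value = freqs["locus_b"][profiles["ST3"]["locus_b"]]
```

Under these assumptions, the frequencies within each locus sum to 1, so a common allele gives a high cell value and a rare allele a low one.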

The possible cluster linkages are:

The possible distance measures are:

Note that for schemes with thousands of sequence types, the clustering may become very slow.

Image lmlst_create_scheme_step5
Figure 18.5: The minimum spanning tree parameters.

The following options are available when creating a minimum spanning tree: