Create Large MLST Scheme

The Create Large MLST Scheme tool can be used to create a scheme from scratch.

To run the Create Large MLST Scheme tool choose:

        Microbial Genomics Module | Databases | Large MLST | Create Large MLST Scheme

As input, the tool requires a set of complete isolate genomes in the form of one or more sequence lists or sequences. At least one of these genomes must be annotated with coding region (CDS) annotations. If these are not available, the Find Prokaryotic Genes tool (see Find Prokaryotic Genes) or Annotate with DIAMOND (see Annotate with DIAMOND) can be used to predict and annotate the coding regions.

In the first wizard step (figure 18.1), the grouping of sequences into genomic units can be controlled. This is necessary for genomes that span several chromosomes or contigs, so that the tool treats each such genome as one unit. The grouping is controlled by the Assembly grouping field:

Image create_large_mlst_annotation_grouping
Figure 18.1: Grouping the input into assemblies.

After specifying the input, the second step is to set up the basic Large MLST Scheme creation parameters (figure 18.2).

The Create Large MLST Scheme tool works by extracting all annotated coding sequences (CDS) and clustering them into similar gene classes (loci). It is possible to specify whether we are interested in the genes that are present in some genomes (Whole genome - must be present in at least 10% of all genomes), most genomes (Core genome - must be present in at least 90% of the genomes), or a user-specified Minimum fraction.
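The presence thresholds described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function name `select_loci`, the `presence` mapping (locus name to the set of genomes containing it), and the use of an inclusive "at least" comparison are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Preset fractions taken from the text; a user-specified
# "Minimum fraction" would simply be another value between 0 and 1.
WHOLE_GENOME = 0.10
CORE_GENOME = 0.90


def select_loci(presence: Dict[str, Set[str]],
                n_genomes: int,
                min_fraction: float) -> List[str]:
    """Return the loci present in at least min_fraction of all genomes.

    presence is a hypothetical mapping from locus name to the set of
    genome names in which that locus was found.
    """
    return sorted(
        locus
        for locus, genomes in presence.items()
        if len(genomes) / n_genomes >= min_fraction
    )


# Toy data: 10 genomes, three candidate loci.
genomes = [f"g{i}" for i in range(1, 11)]
presence = {
    "locus_A": set(genomes),       # present in 10/10 genomes
    "locus_B": set(genomes[:3]),   # present in 3/10 genomes
    "locus_C": set(),              # present in no genome
}

whole = select_loci(presence, len(genomes), WHOLE_GENOME)
core = select_loci(presence, len(genomes), CORE_GENOME)
```

In this toy example, `whole` keeps `locus_A` and `locus_B`, while `core` keeps only `locus_A`.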

Image lmlst_create_scheme_step1
Figure 18.2: Basic options for creating a large MLST scheme.

The best results are obtained by supplying genomes with proper CDS annotations. The Handle genes without annotations option controls how genomes without CDS annotations are handled, and whether an existing CDS may be overridden if a longer CDS from another genome exactly matches the genomic sequence.

The Allele grouping parameters step (figure 18.3) specifies how the different genes (CDS annotations) are compared to each other. DIAMOND is used for this clustering. The following can be specified:

Image lmlst_create_scheme_step2
Figure 18.3: The allele grouping (clustering) options.

Note that after clustering, length outliers in each cluster are removed by applying Tukey's fences with a multiplier of 1.5 times the interquartile range (IQR), while always allowing 5% length variation around the median. For example, for an allele cluster (locus) with allele lengths 51, 51, 51, 51, 53, the last allele is not removed: although it falls outside the 1.5 IQR fences (both the first and third quartiles are 51), it is still within 5% of the median. For allele lengths 48, 51, 51, 54, 63, only the first four alleles are retained.
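The length-outlier rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name `filter_length_outliers` is hypothetical, the quartiles are computed with the standard library's "inclusive" method (the tool's exact quartile definition is not stated here), and the fence and tolerance boundaries are treated as inclusive.

```python
import statistics
from typing import List


def filter_length_outliers(lengths: List[int],
                           k: float = 1.5,
                           tolerance: float = 0.05) -> List[int]:
    """Keep allele lengths inside Tukey's fences OR within
    +/- tolerance of the median length.

    Assumptions: "inclusive" quartile method, inclusive boundaries.
    """
    if len(lengths) < 2:
        # Quartiles are not meaningful for fewer than two values.
        return list(lengths)

    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    fence_low, fence_high = q1 - k * iqr, q3 + k * iqr

    median = statistics.median(lengths)
    band_low, band_high = median * (1 - tolerance), median * (1 + tolerance)

    return [
        length
        for length in lengths
        if fence_low <= length <= fence_high
        or band_low <= length <= band_high
    ]


# The two examples from the text.
kept_a = filter_length_outliers([51, 51, 51, 51, 53])
kept_b = filter_length_outliers([48, 51, 51, 54, 63])
```

For the first cluster the IQR is 0, so 53 lies outside the fences but is kept because it is within 5% of the median (51). For the second cluster the fences are 46.5 and 58.5, so 63 is the only length removed.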

It is possible to decorate the alleles with information about virulence or resistance. The information can be extracted from either a ShortBRED Marker database or a Nucleotide database. These databases can be accessed using the Download Resistance Database tool (e.g. QMI-AR for resistance or VFDB for virulence) and can be provided as input to the Create Large MLST Scheme tool at this step (figure 18.4).

Image lmlst_create_scheme_step3
Figure 18.4: The functional annotation parameters.