Example: Kraken2 (containerized external application)

This section describes a containerized external application for Kraken2. The example illustrates the use of a publicly available Docker image in combination with an "Included script". It also shows how reference databases can be accessed from an external application, in this case the pre-built Kraken 2/Bracken databases from https://benlangmead.github.io/aws-indexes/k2.

More information about the Kraken2 taxonomic sequence classifier can be found at https://github.com/DerrickWood/kraken2/wiki.

Using this external application, users select a single-end nucleotide sequence list and a public Kraken2 database to use in the analysis. The outputs include a Kraken2 report, a file containing other classification information, sequence lists for classified and unclassified sequences, and a Docker log. See Extending the Kraken2 external application for paired data for information about extending this external application configuration to handle paired sequences.

All activity takes place in a Docker container. No local installation of scripts or reference databases is done for this example.

Defining the Kraken2 command line and configuring the parameters

A command line and general configuration of a Kraken2 external application are shown in figure 16.30.

Image external_app_kraken2_command_definition
Figure 16.30: Configuration of a containerized external application for Kraken2.

The external application type is set to "Containerized: Docker". Thus, the information in the "Command line" field will be appended to the command specified in the Containerized execution environment settings for the CLC Server. The Docker image staphb/kraken2:2.1.2-no-db, hosted on Docker Hub, is a Kraken2 image that does not include a reference database. Further details about this image can be found at https://hub.docker.com/r/staphb/kraken2.

The parameters in curly brackets in the command line are substituted at run time, either with values specified by the user or values specified in the configuration. Here, this includes an "Included script", which contains commands run within the container, and parameters relating to the input data, results and Docker logging information.
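The substitution of curly-bracket parameters can be pictured with a minimal Python sketch. This is not how the CLC Server implements substitution. The function name, the example command line and the parameter names are hypothetical and are used only to illustrate the idea of replacing each placeholder with a configured or user-supplied value.

```python
import re


def substitute_parameters(command_line, values):
    """Replace each {name} placeholder in command_line with values[name].

    Raises KeyError if a placeholder has no configured value, so that a
    misconfigured command line fails early instead of running half-filled.
    """
    def lookup(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value configured for parameter {name!r}")
        return str(values[name])

    return re.sub(r"\{([^{}]+)\}", lookup, command_line)


# Hypothetical command line and values, for illustration only.
template = "staphb/kraken2:2.1.2-no-db bash {script} {reads} {database}"
example_values = {
    "script": "kraken2.sh",
    "reads": "reads.fastq",
    "database": "database-url-chosen-by-user",
}
print(substitute_parameters(template, example_values))
```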

Further details about parameters with values substituted at run time:

The remaining 4 parameters, described below, are configured with the type Output from CL. This is used for outputs generated by the application. How each output should be handled is part of the configuration. This can include configuring import, some types of post-processing as well as choosing not to do anything with that output. In this example, we choose to import each of the 4 outputs created by the kraken2 command line in the kraken2.sh script.
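To make the four outputs concrete, the following Python sketch assembles a kraken2 command line that writes a report, a per-read classification file, and classified and unclassified sequence files. The options shown (--db, --threads, --report, --output, --classified-out, --unclassified-out) are standard kraken2 options, but the function name, the file names and the thread count are assumptions made for illustration. The actual content of kraken2.sh may differ.

```python
import shlex


def kraken2_command(db_dir, reads, report, output, classified, unclassified,
                    threads=1):
    """Return the argument list for a single-end kraken2 run with four outputs."""
    return [
        "kraken2",
        "--db", db_dir,                     # directory holding the unpacked database
        "--threads", str(threads),
        "--report", report,                 # summary report per taxon
        "--output", output,                 # per-read classification information
        "--classified-out", classified,     # sequences that were classified
        "--unclassified-out", unclassified, # sequences that were not classified
        reads,                              # single-end input file
    ]


# Hypothetical file names, for illustration only.
cmd = kraken2_command("k2db", "reads.fastq", "report.txt", "kraken2_output.txt",
                      "classified.fastq", "unclassified.fastq", threads=4)
print(shlex.join(cmd))
```

Each of the four output paths would correspond to one parameter of type Output from CL in the configuration.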

Image external_app_kraken2_included_script
Figure 16.32: A script is defined that will be run in the Docker container. This version includes the commands to download and unpack the Kraken2 database specified by the user when launching the application.
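The download and unpack step performed by the script can be sketched in Python as follows. This is an illustration under assumptions, not the script shown in the figure: it assumes the database is distributed as a gzip-compressed tar archive, and the function name and destination directory are hypothetical.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def fetch_and_unpack(url, dest_dir):
    """Download a .tar.gz archive from url and extract it into dest_dir.

    Intended for archives from a trusted source only, since the archive
    content is extracted without further checks.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        # Stream the download to a temporary file to avoid holding it in memory.
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp.flush()
        tmp.seek(0)
        with tarfile.open(fileobj=tmp, mode="r:gz") as archive:
            archive.extractall(dest)
    return dest


# Usage, with database_url standing for the URL chosen by the user:
# db_dir = fetch_and_unpack(database_url, "k2db")
```

The returned directory could then be passed to kraken2 through its --db option.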

Image external_app_kraken2_wizard
Figure 16.33: The CSV enum parameter type used for the Kraken 2 database parameter results in a drop-down list of options in the Workbench wizard for the application.
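The idea of pairing each drop-down label with the value substituted into the command line can be sketched in Python. The two-line layout assumed here (values on the first line, display labels on the second) is an assumption for this sketch and may not match the layout the CLC Server expects. The URLs and labels are placeholders, not real database locations.

```python
import csv
import io


def parse_enum(csv_text):
    """Map each display label to the value substituted into the command line.

    Assumes two CSV lines: values first, then the matching display labels.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) != 2 or len(rows[0]) != len(rows[1]):
        raise ValueError("expected two CSV lines with the same number of fields")
    values, labels = rows
    return {label.strip(): value.strip() for label, value in zip(labels, values)}


# Placeholder URLs and labels, for illustration only.
enum_text = (
    "https://example.org/db_small.tar.gz,https://example.org/db_large.tar.gz\n"
    "Small database,Large database\n"
)
options = parse_enum(enum_text)
print(options["Small database"])
```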

Some other options for working with reference databases from an external application

In this example, a user specifies the database to use from a drop-down list, and that database is then obtained from a corresponding URL, as defined in a "CSV enum" type parameter. Some other methods that support the use of external files include:

Settings under the Stream handling tab

Under the Stream handling tab, we define how information sent to standard out and standard error should be handled. Depending on the application, information in these streams can be useful for troubleshooting.
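The distinction between the two streams can be illustrated with a short Python sketch that runs a command and captures standard out and standard error separately. This is a generic illustration and does not reflect how the CLC Server handles streams internally. The function name and the demonstration command are hypothetical.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run a command and return its exit code, standard out and standard error."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr


# Demonstration command that writes one line to each stream.
demo = [
    sys.executable, "-c",
    "import sys; print('result'); print('warning', file=sys.stderr)",
]
code, out, err = run_and_capture(demo)
print("exit code:", code)
print("standard out:", out.strip())
print("standard error:", err.strip())
```

Keeping the two streams apart makes it easier to see whether a failed run produced diagnostic messages, which many tools write to standard error.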

Image external_app_kraken2_stream_handling
Figure 16.34: Settings under the Stream handling tab for the Kraken2 external application.

Image external_app_kraken2_draw_workflow
Figure 16.35: A workflow containing the external application. In workflows, the outputs to collect can be specified. Here, all outputs except the unclassified sequences will be returned by the analysis.

Making the external application available for use

When this external application is saved, it becomes available to run on the CLC Server unless its status is set to Disabled.

To run external applications on the cloud, the CLC Cloud Module is needed. See https://resources.qiagenbioinformatics.com/manuals/clccloudmodule/current/index.php?manual=Using_external_applications_on_cloud_via_CLC_Workbench.html. The external applications need to be installed on the CLC Workbench that will be used for submitting jobs or for creating workflows that contain the external application. To do this, export the configuration(s) to an AWS S3 bucket accessible from the CLC Workbench. From the CLC Workbench, right-click on the configuration file in AWS S3 under the Remote Files tab, and choose the option Install External Applications. The external application(s) in the configuration file will be installed and made available from the External Applications Cloud folder under the Tools menu.

To export to a cloud location, a valid AWS Connection must be configured on the CLC Server, as described in AWS Connections in the CLC Server. To install external applications from an AWS S3 location, a valid AWS Connection must be configured in the CLC Workbench, as described at https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=AWS_Connections.html.

Reminder: If you plan to run external applications on the cloud via the CLC Server Command Line Tools, they must be included in a workflow, and that workflow must be installed on the CLC Server.

Of note when creating workflows: options are usually locked by default. To unlock parameters in a workflow element, double-click on the central part of the element, or right-click on it and choose the option Configure.... Then check the lock/unlock icon beside each setting. See figure 16.36.

Image external_app_kraken2_configure_workflow
Figure 16.36: By default, the database parameter will be locked in the workflow element. Configure the workflow element to unlock this parameter to allow users to select a database when launching the workflow.

Results from the Kraken2 external application

The outputs from Kraken2 and the standard error information written by Docker are available in CLC format when the application has finished. A log of the job is also available. If the external application was run on the cloud, a file called workflow-result.json will also be present among the results (figure 16.37).

Image external_app_kraken2_outputs
Figure 16.37: Outputs from a Kraken2 containerized external application after running it on AWS using functionality provided by the CLC Cloud Module.

These results will be in the location specified by the user when launching the application. If the job was run on a CLC Server, that will be in a CLC Server location. If the job was run on the cloud, the results will be in an AWS S3 location.

Interacting with files on AWS S3 via a CLC Workbench is described at https://resources.qiagenbioinformatics.com/manuals/clccloudmodule/current/index.php?manual=Working_with_AWS_S3_using_Remote_Files_tab.html. Interacting with files on AWS S3 via the CLC Server web interface is described in Browse AWS S3 locations.


