Example: MAFFT (containerized external application)
In this section, a containerized external application for MAFFT, an alignment program for amino acid or nucleotide sequences, is described. This example illustrates the use of a publicly available Docker image in combination with an "Included script" and extension of the external application command line beyond a single line.
More information about MAFFT is available from https://mafft.cbrc.jp/alignment/software/.
Using the resulting external application, a CLC software user specifies sequences to align and a location to save the alignment to. They can also optionally configure MAFFT settings. The alignment is run in a Docker container, and the results are then imported back into the CLC software.
Defining the MAFFT command line and configuring the parameters
The command line and general configuration of a MAFFT external application are shown in figure 15.44.
Figure 15.44: Configuration of a containerized external application for MAFFT. A publicly available Ubuntu image is pulled and all subsequent information in the command is run in the container created.
The external application type is set to "Containerized: Docker". Thus, the information in the "Command line" field will be appended to the command specified in the Containerized execution environment settings for the CLC Server.
The parameters in curly brackets are substituted at run time with the values specified.
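The substitution step can be pictured with a small shell sketch. This is only an illustration, not how the CLC Server implements it: the `fill_placeholder` function, the template string and the example values (`--auto`, `input.fa`, `alignment.fa`) are all hypothetical, and the template is not the exact command shown in figure 15.44.

```shell
# Hypothetical helper: replace one {parameter name} placeholder with a value.
fill_placeholder() {
    # $1 = template, $2 = parameter name, $3 = value
    # This simple sed call assumes the name and value contain no characters
    # that are special to sed here, such as "|", "&" or "\".
    printf '%s\n' "$1" | sed "s|{$2}|$3|g"
}

# Assumed template, using parameter names from this example.
template='mafft.bat {MAFFT settings} {Sequences to align} > {Alignment}'

cmd=$(fill_placeholder "$template" "MAFFT settings" "--auto")
cmd=$(fill_placeholder "$cmd" "Sequences to align" "input.fa")
cmd=$(fill_placeholder "$cmd" "Alignment" "alignment.fa")

printf '%s\n' "$cmd"
# prints: mafft.bat --auto input.fa > alignment.fa
```

In the real external application, the values come from what the user selects when launching the tool, and file-based parameters resolve to paths chosen by the server.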
Getting the MAFFT software into the external application
In this example, the steps to obtain and unpack MAFFT are run in the Docker container. There are other ways this can be done. Choices for how to get the MAFFT software running in the container include:
- Have in an included script the steps for downloading MAFFT from the software maintainer's site and unpacking it (as done in this example).
- Download MAFFT to a location accessible to the external application (e.g. your S3 bucket if you are running the external application on the cloud), and access it from there, either:
  - By copying it into the container, using a command in the included script, or
  - By providing the path to the MAFFT binary as the value for an "External file" parameter type.
A key difference between these 2 options is that users will not see, nor be able to alter, information in an included script, while values in an External file parameter are visible and configurable when launching the external application. Note though that when the external application is included in a workflow, you can define whether or not this option should be configurable by users (figure 15.49).
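As a rough illustration of the first option, an included script that downloads and unpacks MAFFT might look something like the sketch below. It is not the script shown in figure 15.45. The function names are hypothetical, `MAFFT_URL` is a placeholder that must be set to a package URL chosen from the MAFFT site, and the sketch assumes `curl` and `tar` are available in the container (a stock Ubuntu image may need a download tool installed first).

```shell
# Sketch of an included script that fetches and unpacks MAFFT.
# MAFFT_URL is a placeholder: set it to the package URL chosen from the MAFFT site.

fetch_mafft() {
    # $1 = URL of the MAFFT package, $2 = file name to save it as
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    curl -fsSL -o "$2" "$1"
}

unpack_mafft() {
    # $1 = gzipped tar archive, $2 = directory to unpack into
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
}

# Only download when a URL has been supplied; unpack only if the download worked.
if [ -n "${MAFFT_URL:-}" ]; then
    fetch_mafft "$MAFFT_URL" mafft.tgz && unpack_mafft mafft.tgz .
fi
```

For the second option, the download step would be replaced by copying the package from the accessible location into the container before unpacking it.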
Extending the command line with /bin/bash -c
In this example, the MAFFT software (mafft.bat) is called using a line in the external application command line field (after the semicolon), rather than being included within the script. This approach can make writing the script simpler, and may make it easier for external application authors to keep track of the roles of the various parameters being passed to the bioinformatics application.
For comparison, see the Kraken2 external application example for a case where all commands are contained in an included script.
Both approaches are equally valid.
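The way `/bin/bash -c` runs commands separated by a semicolon in order can be tried locally with the sketch below. The commands are harmless stand-ins: `setup_commands` represents the work done via the included script and `appended_command` represents the position held by the mafft.bat call. Both variable names and their contents are hypothetical, and `bash` is assumed to be on the PATH.

```shell
# Stand-in for the setup work done via the included script (hypothetical).
setup_commands='echo "step 1: download"; echo "step 2: unpack"'

# Stand-in for the command appended after the semicolon; in this manual's
# example, that position holds the mafft.bat call.
appended_command='echo "step 3: run the aligner"'

# Everything inside the quotes is a single argument to -c, so the appended
# command runs in the same shell, after the setup commands.
output=$(bash -c "$setup_commands; $appended_command")

printf '%s\n' "$output"
```

Because a semicolon does not check the exit status of what precedes it, the appended command runs even if a setup step fails. Using `&&` instead would stop at the first failure, which is a design choice for the application author.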
Further details about parameters with values substituted at run time:
- install-and-run-mafft.sh A script to be executed in the Docker container. The parameter type is set to "Included script" in the General configuration area.
The "Included script" contents are entered by clicking on the Edit contents button in the General configuration area. The activities defined in the script include downloading and unpacking MAFFT (figure 15.45).
Here, {install-and-run-mafft.sh} is specified as the argument to /bin/bash -c in the Command line field, allowing commands external to the included script to be run after those in the script have completed. Here, the mafft.bat command is appended this way.
- MAFFT settings A Text field where a user can configure parameters to pass to the MAFFT software. How this looks in a CLC Workbench is shown in figure 15.46. There is no validation of the information entered into a Text field when launching, so this approach assumes users know about the MAFFT program.
- Sequences to align MAFFT expects sequences in FASTA format. We thus specify that data will be selected from a CLC location, and exported to FASTA format.
- Alignment This defines the output from the external command. The output from MAFFT is in FASTA alignment format and we specify the name of the file to create. The external application then knows what importer to use to import the results into the CLC software.
See External command information for more information about external application parameter types.
Figure 15.45: A script is defined that will be run in the Docker container. It includes the steps needed to make MAFFT available to run in the container.
Figure 15.46: The wizard presented to Workbench users when they launch the MAFFT external application. They select the sequences they wish to align, and can, if they wish, edit the options being passed to MAFFT.
Settings under the Stream handling tab
Under the Stream handling tab, we define how information sent to standard out and standard error should be handled. The information in these streams can be useful for troubleshooting. The settings in this example are shown in figure 15.47.
- Standard out handling MAFFT reports its progress to standard out. This could be useful for troubleshooting, so we choose to collect this information and then import it using the plain text importer. The name of the file to write the information to is specified as "MAFFT-stdout.txt".
- Standard error handling Docker reports its progress to standard error. This information can be useful for troubleshooting. We thus indicate that execution should not be stopped when information is written to this stream. The name of the file to write the information to is specified as "Docker-log.txt", and we configure it to be imported using the plain text importer.
The base names of these files are used to name the output channels of the corresponding workflow element (figure 15.48).
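The idea of collecting the two streams into separate files can be sketched in plain shell, as below. This only mimics the concept and is not how the CLC Server handles streams internally. The `run_job` function is a hypothetical stand-in for a containerized run, and the file names are modeled on those in this example.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)

# Hypothetical stand-in for a containerized run that writes to both streams.
run_job() {
    echo "progress written to standard out"
    echo "log written to standard error" >&2
}

# Capture each stream in its own file.
run_job > "$workdir/MAFFT-stdout.txt" 2> "$workdir/Docker-log.txt"
status=$?

# Output on standard error does not by itself mean failure,
# so the exit status is checked separately.
echo "exit status: $status"
```

This mirrors the choice made above: text on standard error is kept for troubleshooting rather than treated as a reason to stop the job.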
A general note about names
External application names and the names of parameters that refer to inputs a user will select are presented to users and administrators in various places. The names of files containing standard out or standard error information are also visible. For example, here, the name "MAFFT" will be used in the External Applications section of the server web administrative interface, as well as in the CLC Workbench Toolbox and the corresponding workflow element (figure 15.48). The name "Sequences to align" will be used in the CLC Workbench wizard (figure 15.46) and in the corresponding workflow element.
Figure 15.47: Defining stream handling for the MAFFT containerized external application.
Figure 15.48: A workflow containing the external application. In workflows, the outputs to collect can be specified. Here, all the outputs are configured to be saved.
Making the external application available for use
When this external application is saved, it becomes available to run on the CLC Server unless its status is set to Disabled.
To run external applications on the cloud, the CLC Cloud Module is needed. See https://resources.qiagenbioinformatics.com/manuals/clccloudmodule/current/index.php?manual=Using_external_applications_on_cloud_via_CLC_Workbench.html. The external applications need to be installed on the CLC Workbench that will be used for submitting jobs or for creating workflows that contain the external application. To do this, export the configuration(s) to an AWS S3 bucket accessible from the CLC Workbench. From the CLC Workbench, right-click on the configuration file in AWS S3 under the Remote Files tab, and choose the option Install External Applications. The external application(s) in the configuration file will be installed and made available from the External Applications Cloud folder in the Toolbox.
To export to a cloud location, a valid AWS Connection must be configured on the CLC Server, as described in AWS Connections in the CLC Server. To install external applications from an AWS S3 location, a valid AWS Connection must be configured in the CLC Workbench, as described at https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=AWS_Connections.html.
Reminder: If you plan to run external applications on the cloud via the CLC Server Command Line Tools, they must be included in a workflow, and that workflow must be installed on the CLC Server.
Of note when creating workflows: options are usually locked by default. To unlock parameters in a workflow element, double-click on the central part of the element, or right-click on it and choose the option Configure.... Then check for the lock/unlock icon beside each setting. See figure 15.49.
Figure 15.49: By default, parameters are locked in workflow elements, as shown here. To allow users to configure the "MAFFT settings" option when launching this workflow, the parameter must be unlocked.
Results from the MAFFT external application
The alignment and the text files containing the standard out and standard error information are available in CLC format when the application has finished. A log of the job is also available. If the external application was run on the cloud, a file called workflow-result.json will also be present among the results (figure 15.50).
Figure 15.50: Outputs from a MAFFT containerized external application after running it on AWS using functionality provided by the CLC Cloud Module.
These results will be in the location specified by the user when launching the application. If the job was run on a CLC Server, that will be in a CLC Server location. If the job was run on the cloud, the results will be in an AWS S3 location.
Interacting with files on AWS S3 via a CLC Workbench is described at https://resources.qiagenbioinformatics.com/manuals/clccloudmodule/current/index.php?manual=Working_with_AWS_S3_using_Remote_Files_tab.html. Interacting with files on AWS S3 via the CLC Server web interface is described in Browse AWS S3 locations.