Here, we expand upon the configuration described in the Kraken2 example to create a Kraken2 external application for analyzing paired sequence data.
Updating the original configuration, which handled single end data, involves the following steps:
- Updating the external application command, changing settings where needed and adding any new parameters necessary.
- Linking outputs from Kraken2 to an NGS importer that can correctly import pairs of files containing paired end sequences into sequence lists in the CLC software.
- Editing the included script to update the Kraken2 command so that it handles paired data, and ensuring the script handles all the inputs and outputs as needed.
Below, we describe these steps in the order listed, but the order in which they are carried out does not matter.
The command and configuration shown in figure 15.39 takes into account how Kraken2 handles paired sequence data and how CLC software handles paired sequence data.
- Kraken2 expects paired data in 2 files, one containing the first member of each pair, and another containing the second member of each pair.
- Kraken2 returns paired sequence data in 2 files, one containing the first member of each pair, and another containing the second member of each pair.
- The FASTQ exporter can export paired sequence data to 2 files. (The FASTA exporter, used in the single end example, cannot.)
- Each input parameter in the external application general command represents a single input that a CLC software user will be prompted for. Here, a sequence list selected by the user will be exported to 2 FASTQ format files, which are then passed to the included script. We must handle this situation, where a single parameter in the command is associated with more than one file.
- The Illumina high-throughput sequence importer can import pairs of files as paired sequence data. Information in the filenames is used to determine which file contains the first member of a pair and which contains the second member of a pair.
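The filename-based pairing described in the last point can be illustrated with a small shell sketch. It assumes filenames ending in _R1.fastq for first members and _R2.fastq for second members. The pair_member function is a hypothetical helper for illustration only; it does not reproduce the full set of rules the Illumina importer applies.

```shell
# Hypothetical helper: report which member of a pair a file holds,
# based only on an assumed _R1.fastq / _R2.fastq filename ending.
pair_member() {
    case "$1" in
        *_R1.fastq) echo "first" ;;
        *_R2.fastq) echo "second" ;;
        *) echo "unknown" ;;
    esac
}

pair_member "classified_R1.fastq"   # prints "first"
pair_member "classified_R2.fastq"   # prints "second"
```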
Using this knowledge, we can adjust the external application accordingly:
- The Sequences to analyze parameter is configured to use the FASTQ exporter, to export paired sequence data to 2 files (figure 15.38).
- The Sequences to analyze parameter is used to pass 2 files to the included script, so that parameter is put in quotation marks in the Command line field (figure 15.39).
- For each set of paired sequences generated by Kraken2, there are 2 parameters in the external application command line, one for each file (figure 15.39).
The parameter names for outputs from the third party tool are not important, as they are not seen by end users. However, for transparency, the parameter names reflect our assumption that the data provided will be in forward-reverse orientation: Forward classified seqs, Reverse classified seqs, Forward unclassified seqs, and Reverse unclassified seqs.
- For the sequence output parameters, we supply filenames that match our desired handling of the FASTQ files by the Illumina high-throughput sequencing importer: files with the first member of each pair have names ending with _R1.fastq, and files with the second member of each pair have names ending with _R2.fastq. For details of how filenames are interpreted, please refer to http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual= llumina.html.
- We link the parameters for Kraken2 sequence outputs with the Illumina high-throughput sequencing importer. Details of how this is done are provided below.
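Because the Sequences to analyze parameter is quoted in the command line, the included script receives both exported FASTQ files as a single argument. The following is a minimal sketch of separating them. It assumes that the two paths arrive separated by whitespace and contain no spaces themselves. The split_pair function and the variable names are hypothetical; the script actually used is the one shown in figure 15.43.

```shell
# Hypothetical helper: split one quoted argument holding two file paths
# into forward and reverse read files.
# Assumption: the paths are separated by whitespace and contain no spaces.
split_pair() {
    # Intentionally unquoted: word splitting separates the two paths.
    set -- $1
    if [ "$#" -ne 2 ]; then
        echo "Expected 2 files, got $#" >&2
        return 1
    fi
    forward_reads=$1
    reverse_reads=$2
}

split_pair "/tmp/reads_1.fastq /tmp/reads_2.fastq"
echo "forward: $forward_reads"
echo "reverse: $reverse_reads"
```

Failing early when the argument does not contain exactly 2 files makes a misconfigured exporter (for example, one that writes a single file) easier to diagnose than a Kraken2 error further down the script.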
Figure 15.38: Clicking on the "Edit parameters" button for the "Sequences to analyze" parameter opens up a configuration window for the FASTQ exporter. The default settings are shown here, with the "Export paired sequence lists in two files" option enabled.
To handle the import of paired reads, a high-throughput sequence (NGS) importer is needed. Such importers require more configuration than the standard importers, i.e. those that can be directly selected in the "General configuration" area.
The steps to configure the import of the 2 FASTQ files containing the classified sequences output by Kraken2 are described below. The same steps then need to be done to configure the import of the unclassified sequences.
- Use the default setting: "No standard import or map to high-throughput sequencing importer" for the Forward classified seqs and Reverse classified seqs parameters.
- Click on the High-throughput sequencing import / Post-processing tab (below the general configuration area) to open it.
- Click on Add new and select an importer. Here, we select the Illumina importer (figure 15.40).
- Click on Edit and map parameters....
- Configure the importer.
Inputs: All parameters in the Command line that are configured as type "Output file from CL" are available for selection as inputs to the importer. Select the 2 for the classified sequences output: Forward classified seqs and Reverse classified seqs (figure 15.41).
Other importer settings: Keep the default settings for all the other parameters.
- Click on the OK button to finish and save the importer configuration.
The Forward classified seqs and Reverse classified seqs parameters in the "General configuration" area will be updated with the link to the Illumina importer (figure 15.42). If you don't see this immediately, save the external application configuration and re-open it.
The included script must be updated: the Kraken2 command needs to be adjusted, and changes made to the external application configuration need to be handled.
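Changes to the Kraken2 command:
- --paired has been added to indicate that the sequences being provided are paired.
- Filenames containing the # symbol, such as k2-unclassified#.txt, are given as values for the parameters that specify the sequence output files (--classified-out and --unclassified-out in Kraken2). The # symbol is a convention used by Kraken2 when working with paired data.
The Kraken2 command line is described in the Kraken2 manual.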
The other changes in the script are to handle the increased number of files being output by Kraken2 (figure 15.43).
Figure 15.43: The kraken2-paired.sh script, with the Kraken2 command updated to indicate that the data to analyze is paired, and with variables added to handle paired sequence data being passed to the script and out of the script.
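As a rough illustration of the Kraken2 changes described above, the sketch below builds and prints a paired Kraken2 command instead of running it, then shows how a filename containing # corresponds to 2 files. It is not the contents of kraken2-paired.sh. The database path, the input filenames, the k2-classified#.txt name, and the function names are assumptions made for illustration. The options --db, --paired, --classified-out and --unclassified-out are standard Kraken2 options.

```shell
# Sketch only: builds and prints a Kraken2 command instead of running it.
# All paths and the k2-classified#.txt name are illustrative assumptions.
db="/path/to/kraken2_db"
forward_reads="reads_R1.fastq"
reverse_reads="reads_R2.fastq"

kraken2_cmd() {
    echo kraken2 --db "$db" --paired \
        --classified-out "k2-classified#.txt" \
        --unclassified-out "k2-unclassified#.txt" \
        "$forward_reads" "$reverse_reads"
}

# Kraken2 replaces "#" with "_1" and "_2" when writing paired output,
# so each filename template corresponds to two files on disk.
expand_template() {
    prefix=${1%%#*}   # text before the "#"
    suffix=${1#*#}    # text after the "#"
    echo "${prefix}_1${suffix}"
    echo "${prefix}_2${suffix}"
}

kraken2_cmd
expand_template "k2-unclassified#.txt"
```

A script along these lines would then pass each of the resulting files on to the matching output parameter of the external application, so that the files arrive at the Illumina importer with the intended names.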