Batching workflows with more than one input changing per run
When a workflow contains multiple input elements (multiple bright green boxes), a Batch checkbox is available in each of the wizard steps for selecting input data. Checking that box for a given input step indicates that the data for that input should change in each batch run. Data selected for an input where the Batch checkbox is not checked are treated as a single set that is used for that workflow input in every batch run.
Where more than one input will change per batch run, batch units are defined using metadata. This is most easily explained using an example. Figure 10.38 shows a workflow with a Map Reads to Contigs element and two workflow input elements, Sample Reads and Reference Sequences. This workflow can be used to map particular sets of reads to particular references. In this example, the metadata is provided by two Excel files, one containing the information for the Sample Reads input data and one with information about the Reference Sequences input data.
The contents of Excel files that would work in this circumstance are shown in figure 10.39. Of particular note are:
- The first column of each of the Excel files contains the exact data file names for all the data that should be used for that input across all of the batch runs.
- At least one column in each file has the same name as a column in the other file. That column should contain the information needed to match the Sample Reads input data with the relevant Reference Sequences input data for each batch run.
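The second requirement can be checked before launching a batch. The following is a minimal sketch in Python, not part of the Workbench: the function name `shared_columns` and the header names `Read file` and `Reference file` are hypothetical and chosen only for illustration. Only `SRR_ID` and `Reference` are taken from the example in this section.

```python
def shared_columns(sample_header, reference_header):
    """Return the column names present in both headers, in sample-file order."""
    reference_names = set(reference_header)
    return [name for name in sample_header if name in reference_names]


# Hypothetical column headers for the two metadata files.
# The first column of each file holds the exact data file names.
sample_header = ["Read file", "SRR_ID", "Reference"]
reference_header = ["Reference file", "Reference"]

common = shared_columns(sample_header, reference_header)
if not common:
    raise ValueError("The metadata files have no column name in common")
```

If the list of shared names is empty, there is no column that could be used to match the data for the two inputs.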
Figure 10.38: A workflow with 2 inputs, where the Batch checkbox has been checked for both in the initial launch steps. Metadata is used to define the batch units since the correct inputs must be matched together for each run. Clicking on the plain folder icon brings up the option to import an external file, like an Excel file. The folder icon with the magnifying glass on it indicates that you can select an item from the Navigation Area, like a metadata table.
Figure 10.39: Two Excel files containing information about the data to be used in each batch run for the workflow shown in figure 10.38. With the settings selected there, the number of batch runs will be based on the Sample Reads input, and will equal the number of unique SRR_ID entries in the DrosophilaMultiReference.xlsx file. The correct reference sequence to map to is determined by matching information in the Reference column of each Excel file.
In the Workflow-level batch configuration area, the following are specified:
- The primary input. The input that determines the number of times the workflow should be run.
- The column in the metadata for that primary input specifying the group the data belongs to. Each group makes up a single batch unit.
- The column used to match data across the workflow inputs, so that the correct data for each input are included together in a given batch run. For example, it ensures that a given set of sample reads is mapped to the correct reference sequence. A column with this name must be present in each metadata file or table.
In the example in figure 10.38, Sample Reads is the primary input: we wish to run the workflow once for each sample, that is, once for each unique SRR_ID entry. The reference sequence to use for each of these batch runs is defined in a column called Reference, which is present in both the Excel file containing information about the samples and the Excel file containing information about the references.
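The logic described above can be sketched in Python. This is an illustration of the grouping and matching only, not the Workbench's own implementation. The function `build_batch_units`, the `file` key, and all file names and values in the sample data are hypothetical. Only the column names `SRR_ID` and `Reference` come from the example.

```python
from collections import defaultdict


def build_batch_units(primary_rows, other_rows, group_column, match_column):
    """Group the primary input into batch units and attach the matching data."""
    # Each distinct value in the group column defines one batch unit.
    groups = defaultdict(list)
    for row in primary_rows:
        groups[row[group_column]].append(row)

    # Index the data for the other input by the shared matching column.
    matches = defaultdict(list)
    for row in other_rows:
        matches[row[match_column]].append(row["file"])

    units = {}
    for group, rows in groups.items():
        match_values = sorted({row[match_column] for row in rows})
        units[group] = {
            "reads": [row["file"] for row in rows],
            "references": [
                name for value in match_values for name in matches.get(value, [])
            ],
        }
    return units


# Hypothetical metadata rows; "file" stands for the first column of each file.
sample_rows = [
    {"file": "SRR0001_1", "SRR_ID": "SRR0001", "Reference": "refA"},
    {"file": "SRR0001_2", "SRR_ID": "SRR0001", "Reference": "refA"},
    {"file": "SRR0002_1", "SRR_ID": "SRR0002", "Reference": "refB"},
]
reference_rows = [
    {"file": "refA_genome", "Reference": "refA"},
    {"file": "refB_genome", "Reference": "refB"},
]

units = build_batch_units(sample_rows, reference_rows, "SRR_ID", "Reference")
```

With this hypothetical data there are two batch units, one per unique SRR_ID value, and each unit is paired with the reference whose Reference value matches that of its reads.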