- Defining batch units for workflows with Iterate elements
- Configuring an Iterate element
- Configuring Collect and Distribute elements
Iterate and Collect and Distribute elements
Iterate and Collect and Distribute elements are used to control how data is grouped for analysis.
- Iterate elements are placed at the top of a branch of a workflow that should be run multiple times, using different inputs in each run. The sets of data to use in each run are referred to as "batch units" or, sometimes, "iteration units".
- Collect and Distribute elements are, optionally, placed downstream of an Iterate element, where they collect outputs from the upstream iteration block (see below) and distribute them as inputs to downstream analyses.
The steps between an Iterate element and a Collect and Distribute element are referred to as an "iteration block". The workflow in figure 13.41 contains a single iteration block (shaded in turquoise), where steps within that block are run once per batch unit. The Collect and Distribute element collects all the results from the iteration block and sends them as input to the next stage of the analysis (shaded in purple).
Figure 13.41: The roles of the Iterate and Collect and Distribute control flow elements are highlighted in the context of RNA-Seq and differential expression analyses. RNA-Seq Analysis lies downstream of an Iterate element, within an iteration block (shaded in turquoise). It will thus be run once per batch unit. Differential Expression for RNA-Seq lies immediately downstream of a Collect and Distribute element, and is sent all the expression results from the iteration block as input for a single analysis.
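The control flow described above can be pictured with a small Python sketch. This is only a conceptual illustration, not the CLC Genomics Workbench API: the function names `run_iteration_block`, `rna_seq_analysis` and `differential_expression`, and the sample names, are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def run_iteration_block(
    batch_units: Dict[str, List[str]],
    per_unit_step: Callable[[List[str]], str],
) -> List[str]:
    """Run the step once per batch unit (the role of the Iterate element)
    and gather every result (the role of Collect and Distribute)."""
    return [per_unit_step(inputs) for inputs in batch_units.values()]


def rna_seq_analysis(reads: List[str]) -> str:
    # Hypothetical stand-in for the per-sample analysis inside the iteration block.
    return "expression(" + "+".join(reads) + ")"


def differential_expression(expression_results: List[str]) -> str:
    # Hypothetical stand-in for the single downstream analysis that
    # receives all collected results at once.
    return f"DE over {len(expression_results)} samples"


# Assumed example data: three batch units, one per sample.
batch_units = {
    "sample_A": ["A_R1", "A_R2"],
    "sample_B": ["B_R1", "B_R2"],
    "sample_C": ["C_R1"],
}

collected = run_iteration_block(batch_units, rna_seq_analysis)  # runs 3 times
summary = differential_expression(collected)                    # runs once
```

The per-sample step runs three times, while the downstream step runs a single time on the collected list.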
Defining batch units for workflows with Iterate elements
Workflow elements downstream of an Iterate element are run once for each batch unit. Details about defining batch units when launching workflows are described at Running workflows in batch mode.
Running a workflow with a single Iterate element at the top of a workflow, no downstream Collect and Distribute element, and a single Input element is equivalent to running a similar workflow without the Iterate element in Batch mode. Setting up batch units in this situation is described in Batch processing.
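As a rough mental model, defining batch units from metadata amounts to grouping the inputs by the value in a chosen column. The sketch below assumes a hypothetical metadata layout (a list of rows with `file` and `ID` keys) and a hypothetical helper `define_batch_units`; it does not reflect how the Workbench stores metadata internally.

```python
from collections import defaultdict
from typing import Dict, List


def define_batch_units(rows: List[Dict[str, str]], column: str) -> Dict[str, List[str]]:
    """Group input files into batch units by the value in one metadata column."""
    units: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        units[row[column]].append(row["file"])
    return dict(units)


# Assumed example metadata: two files belong to sample s1, one to sample s2.
metadata_rows = [
    {"file": "s1_R1.fastq", "ID": "s1"},
    {"file": "s1_R2.fastq", "ID": "s1"},
    {"file": "s2_R1.fastq", "ID": "s2"},
]

units_by_id = define_batch_units(metadata_rows, "ID")
```

Here two batch units result, so steps downstream of the Iterate element would run twice.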
Configuring an Iterate element
In most cases, no configuration of Iterate elements is needed. See, for example, the RNA-Seq and Differential Gene Expression Analysis template workflow, distributed with the CLC Genomics Workbench (http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=RNA_Seq_Differential_Gene_Expression_Analysis_workflow.html).
For more involved situations, Iterate elements can be configured. The options available (figure 13.42) are:
- Number of coupled inputs: The number of separate inputs for each given iteration. These inputs are "coupled" in the sense that, for a given iteration, particular inputs are used together. This is useful, for example, when sets of sample reads should be mapped in the same way, but each set should be mapped to a particular reference (figure 13.43).
- Error handling: Specify what should happen if an error is encountered. The default is that the workflow should stop on any error. The alternative is to continue running the workflow if possible, potentially allowing later batches to be analyzed even if an earlier one fails.
- Metadata table columns: If the workflow is always run with metadata tables that have the same column structure, then it can be useful to set the value of the column titles here, so the workflow wizard will preselect them. The column titles must be specified in the same order as shown in the workflow wizard when running the workflow. Locking this parameter to a fixed value (i.e. not blank) will require the definition of batch units to be based on metadata. Locking this parameter to a blank value requires the definition of batch units to be based on the organization of input data (and not metadata).
- Primary input: If the number of coupled inputs is two or more, then the primary input (used to define the batch units) can be configured using this parameter.
Figure 13.42: The number of coupled inputs in this simple example is 2, allowing each set of sample reads to be mapped to a particular reference, rather than using the same reference for all iterations.
Figure 13.43: Reads can be mapped to specified contigs due to the 2 input channels of the Iterate element. Using this design, a single sequence list containing all the unmapped reads from all the initial inputs is generated. That would not be possible without the inclusion of the Iterate and Collect and Distribute elements.
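The coupled inputs and error handling options can be illustrated with a short Python sketch. It is a conceptual model only, under stated assumptions: `run_coupled_iterations`, `map_reads`, the `stop_on_error` flag and the sample and reference names are all hypothetical and are not part of the Workbench.

```python
from typing import Callable, Dict, List, Tuple


def run_coupled_iterations(
    coupled_inputs: Dict[str, Tuple[List[str], str]],
    step: Callable[[List[str], str], str],
    stop_on_error: bool = True,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `step` once per batch unit, using that unit's reads together
    with that unit's own reference (two coupled inputs)."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for unit, (reads, reference) in coupled_inputs.items():
        try:
            results[unit] = step(reads, reference)
        except Exception as exc:
            if stop_on_error:
                raise  # models the default: stop on any error
            failures[unit] = str(exc)  # models the alternative: continue with later batches
    return results, failures


def map_reads(reads: List[str], reference: str) -> str:
    # Hypothetical stand-in for a read mapping step.
    if not reads:
        raise ValueError("no reads supplied")
    return f"{len(reads)} read sets mapped to {reference}"


# Assumed example data: each sample is paired with its own reference.
coupled = {
    "s1": (["s1_R1", "s1_R2"], "ref_A"),
    "s2": ([], "ref_B"),  # this unit fails
    "s3": (["s3_R1"], "ref_C"),
}

results, failures = run_coupled_iterations(coupled, map_reads, stop_on_error=False)
```

With `stop_on_error=False`, the failing unit is recorded and the later unit is still processed; with the default, the first error ends the run.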
Configuring Collect and Distribute elements
By default, a Collect and Distribute element has one output channel. In this case, all results from the iteration block are collected and passed to downstream steps of the workflow.
More than one output channel can be configured by entering terms in a comma separated list in the Outputs field (figure 13.44). The number of terms determines the number of output channels. Connections between these output channels and input channels of downstream elements determine how data should be distributed in the following stage of the workflow.
If the Collect and Distribute element has more than one output channel, the path taken by a given element is determined by the value in the metadata column specified when launching the workflow. This column can be preconfigured in the Group by metadata column setting.
Figure 13.44: A comma separated list of terms in the Outputs field of the Collect and Distribute element defines the number of output channels and their names.
For example, when launching the workflow in figure 13.45, a metadata column called "Type" was specified for defining which samples were cases and which were controls. The iteration units were defined by the contents of the "ID" column (figure 13.46).
Figure 13.45: In this workflow, each case sample is analyzed against all of the control samples.
Figure 13.46: Contents of the metadata column "Type" define which samples are cases and which are controls. Iteration units are defined by the contents of the "ID" column.
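The routing of results to named output channels can be sketched in Python as follows. This loosely mirrors the case and control example above and is not the Workbench's implementation: the function `collect_and_distribute`, the data layout, and the choice to skip results whose metadata value matches no channel are assumptions made for illustration.

```python
from typing import Dict, List


def collect_and_distribute(
    results: Dict[str, str],
    metadata: Dict[str, Dict[str, str]],
    group_by: str,
    outputs: List[str],
) -> Dict[str, List[str]]:
    """Send each collected result to the output channel whose name matches
    the result's value in the `group_by` metadata column."""
    channels: Dict[str, List[str]] = {name: [] for name in outputs}
    for unit_id, result in results.items():
        value = metadata[unit_id][group_by]
        if value in channels:  # assumption: unmatched values are skipped
            channels[value].append(result)
    return channels


# Assumed example metadata: the "Type" column marks cases and controls.
sample_metadata = {
    "s1": {"Type": "case"},
    "s2": {"Type": "control"},
    "s3": {"Type": "control"},
    "s4": {"Type": "case"},
}
# One hypothetical result per iteration unit (units defined by ID).
unit_results = {unit_id: f"result_{unit_id}" for unit_id in sample_metadata}

channels = collect_and_distribute(
    unit_results, sample_metadata, group_by="Type", outputs=["case", "control"]
)

# Each case is then analyzed against all of the controls.
comparisons = [(case, channels["control"]) for case in channels["case"]]
```

The two terms in `outputs` play the role of the comma separated list in the Outputs field, giving two output channels.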