Advanced workflow batching
The execution of whole workflows, or of sections of workflows, can be finely controlled by combining metadata that describes the relationships between samples with control flow elements in the workflow design. Complex analysis goals that can be met in a straightforward manner include:
- Grouping the data into different subsets to be analyzed together in particular sections of a workflow. Groupings of data can be used in the following ways:
- Different groupings of data are used as inputs to different sections of the same workflow.
For example, an end-to-end RNA-Seq workflow can be designed in which the RNA-Seq Analysis tool is run once per sample and the expression results for all samples are then used as input to a single downstream tool, such as a statistical analysis tool. Or, given Illumina data originating from multiple lanes, QC could be run on the data from each lane individually, the results for each sample could then be merged and mapped to a relevant reference genome, and finally a single QC report could be created for the whole cohort. For details, see Batching part of a workflow and Multiple levels of batching.
- Different workflow inputs follow different paths through parts of a workflow. Based on metadata, samples can be distributed into groups that follow different analysis paths in some workflow sections, while being processed individually and identically in other sections of the same workflow.
For example, a single workflow could be used to analyze sets of tumor-normal paired samples, where each sample is processed in an identical way up until the comparison step, where the matching tumor (case) and normal (control) samples are used together in an analysis tool.
Configuring Collect and Distribute elements is central to the design of this kind of workflow. This is described in Control flow elements. Running such workflows is described in Running part of a workflow multiple times.
- Matching particular workflow inputs for each workflow run. Where more than one input to a workflow changes per run, the particular input data to use for each run can be defined using metadata. The simplest case is as described in Batching workflows with more than one input changing per run. However, more complex scenarios, such as when intermediate results should be merged or parts of the workflow should be run multiple times, can also be catered for, as described in Matching up inputs with each other and analyzing them together later in the workflow.
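The grouping ideas in the list above can be illustrated outside the software with a small, purely conceptual Python sketch. It is not the software's API: the metadata columns (`file`, `sample`, `patient`, `type`), the file names, and the helper functions `group_by` and `match_pairs` are all hypothetical, chosen only to show how one metadata table can define several levels of batching and the matching of tumor and normal samples.

```python
from collections import defaultdict

# Hypothetical metadata table: one row per input file (all values are made up).
metadata = [
    {"file": "s1_L001.fastq", "sample": "S1", "patient": "P1", "type": "tumor"},
    {"file": "s1_L002.fastq", "sample": "S1", "patient": "P1", "type": "tumor"},
    {"file": "s2_L001.fastq", "sample": "S2", "patient": "P1", "type": "normal"},
    {"file": "s3_L001.fastq", "sample": "S3", "patient": "P2", "type": "tumor"},
    {"file": "s4_L001.fastq", "sample": "S4", "patient": "P2", "type": "normal"},
]


def group_by(rows, key):
    """Group metadata rows by the value in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def match_pairs(rows, match_key="patient", role_key="type"):
    """For each value of match_key, list the samples found for each role."""
    pairs = {}
    for match_value, group in group_by(rows, match_key).items():
        roles = defaultdict(set)
        for row in group:
            roles[row[role_key]].add(row["sample"])
        pairs[match_value] = {role: sorted(s) for role, s in roles.items()}
    return pairs


# Level 1: every file is its own batch unit (for example, per-lane QC).
per_lane = [row["file"] for row in metadata]

# Level 2: files are collected per sample (for example, merge lanes, then map).
per_sample = {
    sample: [row["file"] for row in rows]
    for sample, rows in group_by(metadata, "sample").items()
}

# Level 3: all samples are collected once (for example, one cohort-wide report).
cohort = sorted(per_sample)

# Matching: tumor and normal samples are brought together per patient.
pairs = match_pairs(metadata)

if __name__ == "__main__":
    print(per_lane)
    print(per_sample)
    print(cohort)
    print(pairs)
```

In this sketch, the `sample` column plays the role of the metadata that decides which lane-level results are merged, and the `patient` column plays the role of the metadata that decides which tumor and normal samples are analyzed together.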