Running workflows in batch mode

Workflows can be run in batch mode by:

In both cases, after selecting all of the inputs to be used for all batches, the grouping of the data into batch units must be defined.

In the simplest case, where just one workflow input uses different data in each batch run, batch units can be defined based on metadata, or they can be derived from the organization of the data in a CLC software area, as described for launching analysis tools in Running the analysis and organizing the results.

For more complex scenarios, batch units are defined based on metadata. This is the case when more than one workflow input uses different data in each batch run, as described in Batching workflows with more than one input changing per run, or when only parts of the workflow should run in batches, as described in Advanced workflow batching.

Defining batch units based on metadata

When launching a workflow to run in batch mode, metadata can be provided in one of two formats: an Excel spreadsheet or a CLC metadata table.

  1. Information in an Excel spreadsheet can be used to define batch units for any workflow launched in batch mode. The metadata in that file is imported into the CLC software at the start of the workflow run.

    The data is matched with the metadata based on the contents of the first column of the Excel file. That column must contain either the exact names of the data files or a unique prefix, that is, at least enough of the first part of the file name to identify it uniquely. If data is selected from the Navigation Area, the column contents are matched against the data element names. If data is imported on the fly, the column contents are matched against the names of the files being imported. The full file name can include file extensions, but not the path to the data.

    For example, if a data element selected in the Navigation Area has the name Tumor_SRR719299_1 (paired) (Reads), then the first column could contain that name in full, or just enough of the first part of the name to uniquely identify it, for example, Tumor_SRR719299. Similarly, if a data file selected for on-the-fly import is at C:\Users\username\My Data\Tumor_SRR719299_1.fastq, the first column of the Excel spreadsheet could contain Tumor_SRR719299_1.fastq, or a prefix long enough to uniquely identify the file, e.g. Tumor_SRR719299.

    Providing metadata in an Excel format file is often the most convenient route, and it is the only option available if you are importing the data using the on-the-fly functionality when the workflow is started.

  2. A CLC metadata table with relevant data elements associated with it can be used when those data elements have been selected from the Navigation Area as inputs. How to create a metadata table is described in Importing metadata.
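The name matching described for Excel metadata in item 1 can be illustrated with a short Python sketch. This is not the CLC software's implementation. The function name match_key and the second and third file names are hypothetical, and the sketch assumes that a first-column entry is usable only when it equals a name exactly or is a prefix of exactly one name.

```python
from pathlib import PureWindowsPath


def match_key(key, names):
    """Return the single name identified by key, or None if no unique match exists.

    Illustrative only: an exact name match is preferred, otherwise the key
    must be a prefix of exactly one name.
    """
    if key in names:
        return key
    candidates = [name for name in names if name.startswith(key)]
    return candidates[0] if len(candidates) == 1 else None


# Only the file name is compared, never the path to the data.
path = r"C:\Users\username\My Data\Tumor_SRR719299_1.fastq"
file_name = PureWindowsPath(path).name  # "Tumor_SRR719299_1.fastq"

names = [
    file_name,
    "Tumor_SRR719301_1.fastq",   # hypothetical file
    "Normal_SRR719300_1.fastq",  # hypothetical file
]

print(match_key("Tumor_SRR719299_1.fastq", names))  # full name
print(match_key("Tumor_SRR719299", names))          # unique prefix
print(match_key("Tumor_SRR719", names))             # ambiguous prefix, prints None
```

In this sketch, the first two calls identify Tumor_SRR719299_1.fastq, while the third prefix is shared by two files and therefore identifies neither.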

Defining batch units based on the contents of an Excel format file is illustrated in figure 11.38. There, a workflow with a single input is being launched in batch mode. Eight files containing Illumina reads have been selected as input and the "Batch" checkbox has been ticked. In the step shown, the option to define batch units based on metadata is the only one available, as the data will be imported using the on-the-fly import functionality. An Excel format file containing metadata has been selected, and the column SRR_ID from that file has then been selected as the basis of the batch units.

Image single_input_batch_with_metadata
Figure 11.38: Configuring batch units using metadata. As the data will be imported from external files, the metadata defining the batch units must also be imported from an external file, in this case, an Excel file with the names of the data files in the first column. A column with information defining the grouping of the samples for analysis, the batch units, is then selected. Here, that is the SRR_ID column.

In the next step, a preview of the batch units is shown. The workflow will be run once for each row shown on the left side of the preview, with the input data grouped as shown in the right-hand column. See figure 11.39.

Image workflow_metadata_batch_overview
Figure 11.39: The Batch overview step of the wizard allows you to review the batch units configured. In the top image, a column called SRR_ID was selected, resulting in 8 batch units, and so 8 workflow runs, with the data from one input file used in each batch. In the lower image, a column defining different batch units was selected. There, the workflow would be run 3 times, with the input data grouped into 3 batches.
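
The effect of choosing different metadata columns can be sketched in Python as a simple grouping operation. This is an illustration of the idea only, not how the CLC software is implemented. The function batch_units, the column names Name and Group, and all row values are hypothetical; only the column name SRR_ID comes from the example above.

```python
from collections import defaultdict


def batch_units(rows, column):
    """Group data names by the value found in the chosen metadata column.

    Each distinct value becomes one batch unit, that is, one workflow run.
    """
    units = defaultdict(list)
    for row in rows:
        units[row[column]].append(row["Name"])
    return dict(units)


# Eight hypothetical metadata rows: one per input file.
groups = ["A", "A", "A", "B", "B", "B", "C", "C"]
rows = [
    {"Name": f"sample_{i}.fastq", "SRR_ID": f"ID_{i}", "Group": group}
    for i, group in enumerate(groups, start=1)
]

per_file = batch_units(rows, "SRR_ID")  # one file per batch unit
per_group = batch_units(rows, "Group")  # files pooled by group

print(len(per_file), "batch units when grouping by SRR_ID")
print(len(per_group), "batch units when grouping by Group")
```

With these hypothetical rows, a column holding a distinct value per file gives 8 batch units, and a column holding 3 distinct values gives 3, mirroring the two cases shown in the Batch overview step.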