CLC Genomics Workbench supports data from Illumina's Genome Analyzer, HiSeq 2000, NextSeq and MiSeq systems. Choosing the Illumina importer opens the dialog shown in figure 6.10. This data type can also be imported using the on-the-fly import functionality described in Launching workflows individually and in batches.

Figure 6.10: Importing data from Illumina systems.

File format

The Illumina importer accepts fastq format files, either uncompressed or compressed using gzip (.gz), zip (.zip) or bzip2 (.bz2).
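If you want to check a file outside the Workbench before importing it, the following minimal Python sketch reads a fastq file in any of the formats listed above and counts its records. It is not part of the Workbench and does not reflect how the importer works internally. The function names `iter_fastq_lines` and `count_fastq_records` are hypothetical, and the sketch assumes the common four-lines-per-record fastq layout and, for zip archives, a single fastq file per archive.

```python
import bz2
import gzip
import io
import zipfile
from pathlib import Path


def iter_fastq_lines(path):
    """Yield text lines from a plain, .gz, .bz2 or .zip fastq file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            # Assumption: the archive contains exactly one fastq file.
            member = archive.namelist()[0]
            with archive.open(member) as raw:
                for line in io.TextIOWrapper(raw, encoding="ascii"):
                    yield line
        return
    # Pick the opener from the file extension; fall back to plain text.
    opener = {".gz": gzip.open, ".bz2": bz2.open}.get(suffix, open)
    with opener(path, "rt") as handle:
        yield from handle


def count_fastq_records(path):
    """Count records, assuming four lines per record."""
    line_count = sum(1 for _ in iter_fastq_lines(path))
    return line_count // 4
```

Calling `count_fastq_records("reads.fastq.gz")` on a file of your own returns the number of reads it contains, which can then be compared with the number of reads reported after import.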

General options

The settings in the General options area of the dialog are:

Paired read information

When the Paired reads option is enabled, options in the "Paired read information" section of the dialog can be edited. Here, you specify the type of paired data, Paired-end (forward-reverse) or Mate-pair (reverse-forward), and the expected distance range.

In the Workbench, the only difference between paired-end (forward-reverse) and mate-pair (reverse-forward) data is the expected orientation of the reads: forward-reverse for paired-end data and reverse-forward for mate-pair data.

The paired read distance includes the full read sequence, i.e. it is measured from the beginning of the forward read to the beginning of the reverse read (figure 6.11). The distances are usually defined during the library preparation of your sequencing experiment, but if in doubt you can enter default values: for paired-end data, distances are typically between 1 and 1000 bp, while mate-pair reads typically have longer distances, between 1000 and 5000 bp (and sometimes up to 10000 bp). Note that the tools usually used subsequently to process Illumina reads (such as Map Reads to Reference or RNA-Seq Analysis) have an "Auto-detect paired distances" option that is enabled by default. As long as this option is used, mis-specifying the distances during import should have no consequences.
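To make the distance definition concrete, here is a minimal Python sketch for a forward-reverse pair. It is an illustration only, not the Workbench's own calculation. The function names are hypothetical, and the sketch assumes that both reads have the same length and that positions are 0-based leftmost coordinates on the reference.

```python
def paired_distance(forward_start, reverse_start, read_length):
    """Distance spanned by a forward-reverse pair, read sequence included.

    Assumption: positions are 0-based leftmost reference coordinates and
    both reads have the same length.
    """
    # The reverse read begins (5' end) at its rightmost reference position.
    fragment_end = reverse_start + read_length
    return fragment_end - forward_start


def within_expected_range(distance, minimum, maximum):
    """True if the distance lies inside the range entered at import."""
    return minimum <= distance <= maximum


# Two 150 bp reads whose leftmost positions are 250 bp apart.
distance = paired_distance(100, 350, 150)       # 400
print(within_expected_range(distance, 1, 1000)) # True

# Completely overlapping reads: the distance is the read length, not 0.
print(paired_distance(100, 100, 150))           # 150
```

Because the read sequences are included, a pair can never have a distance shorter than the read length under these assumptions.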

Read more about handling paired data.

Figure 6.11: Green lines represent forward reads, red lines represent reverse reads, and the blue line shows the distance spanned by the sequenced DNA fragment. Thus, if the reads overlap completely, the minimum distance is not 0 but the length of the overlap.

Illumina options

In the next wizard step, options are presented for how to handle the results. If you choose to Save the results, an option called "Create subfolders per batch unit" becomes available. When that option is checked, each sequence list is saved into a separate folder under the location selected for saving results. This can be useful for organizing subsequent analysis results and for batch processing.