Illumina

CLC Genomics Workbench supports data from Illumina's Genome Analyzer, HiSeq 2000, NextSeq and MiSeq systems.

To launch the Illumina importer, go to:

        Import | Illumina

This opens a dialog where files can be selected and import options specified (figure 7.9).

Figure 7.9: Importing data from Illumina systems.

Fastq (.fastq/.fq) files from Illumina can be imported. Uncompressed files as well as files compressed using gzip (.gz), zip (.zip) or bzip2 (.bz2) can be provided as input. The importer processes UMI information from the fastq read headers; see General notes on UMIs.

The drop-down menu of input file locations includes the option BaseSpace. When selected, an Access BaseSpace... button is presented. Clicking this opens a browser window, where your Illumina BaseSpace credentials can be entered. Once you have entered your credentials, granted the CLC Workbench the relevant access permissions and closed the browser window, you can select files from BaseSpace in the Illumina High-Throughput Sequencing Import wizard. Your BaseSpace credentials remain valid for your current CLC Workbench session. BaseSpace configuration options are available in Preferences; see Advanced preferences.

The General options are:

Default rules for determining pairs of files

First, the selected files are sorted based on the file names. Sorting is alphanumeric, except for files coming off the CASAVA1.8 pipeline, where pairs are organized according to their identifier and chunk number.

For example, CASAVA1.8 files with the base names ID_R1_001, ID_R1_002, ID_R2_001 and ID_R2_002 would be sorted in the order below. Files with "R1" in the name are assumed to contain the first sequences of the pairs, and files with "R2" in the name are assumed to contain the second sequences of the pairs.

  1. ID_R1_001
  2. ID_R2_001
  3. ID_R1_002
  4. ID_R2_002

In this example, the data in files ID_R1_001 and ID_R2_001 are treated as one pair, and the data in files ID_R1_002 and ID_R2_002 are treated as another pair.
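The ordering in this example can be reproduced with a small sketch. It is an illustration only, not the importer's own code: the function names and the regular expression are hypothetical, and it assumes base names of the form <identifier>_R<1|2>_<chunk number>.

```python
import re

# Assumed (hypothetical) pattern for CASAVA1.8-style base names:
# <identifier>_R<1|2>_<chunk number>
CASAVA_NAME = re.compile(r"^(?P<ident>.+)_R(?P<read>[12])_(?P<chunk>\d+)$")


def casava_sort_key(base_name):
    """Sort by identifier, then chunk number, then read number (R1 before R2)."""
    match = CASAVA_NAME.match(base_name)
    if match is None:
        raise ValueError(f"not a CASAVA1.8-style base name: {base_name!r}")
    return (match.group("ident"), int(match.group("chunk")), int(match.group("read")))


def prospective_pairs(base_names):
    """Sort the names and group neighbours two by two into prospective pairs."""
    ordered = sorted(base_names, key=casava_sort_key)
    return [tuple(ordered[i:i + 2]) for i in range(0, len(ordered), 2)]


if __name__ == "__main__":
    names = ["ID_R1_001", "ID_R1_002", "ID_R2_001", "ID_R2_002"]
    for pair in prospective_pairs(names):
        print(pair)
```

Sorting on the chunk number before the read number is what places ID_R2_001 directly after ID_R1_001; a plain alphanumeric sort would put both R1 files first.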

The file names are then used to check if each prospective file pair in this sorted list is valid. If the files in a pair follow the naming format:
<sample name>_L<at least one digit>_[R1|R2]_<at least one digit>,
then the files must contain the same sample name and lane information to be valid.

If a prospective file pair does not follow this format, but the first file name does contain "_R1" and the second file name does contain "_R2", then the file pair is still considered valid. Note that if "_R1" or "_R2" occur more than once in a filename, the last occurrence in the name is used.

No data will be imported from file pairs that are not considered valid with respect to the above requirements. For such file pairs, a message will be printed in the log.
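The validity rules above can be sketched as follows. This is one reading of the rules, not the importer's implementation: the names is_valid_pair and last_read_marker are hypothetical, and requiring R1 in the first file and R2 in the second for names in the strict format is an assumption.

```python
import re

# <sample name>_L<at least one digit>_[R1|R2]_<at least one digit>
STRICT_NAME = re.compile(r"^(?P<sample>.+)_L(?P<lane>\d+)_R(?P<read>[12])_\d+$")
# Literal "_R1" / "_R2" markers; note this simple pattern also matches
# inside longer tokens such as "_R10".
READ_MARKER = re.compile(r"_R([12])")


def last_read_marker(base_name):
    """Return 1 or 2 for the last "_R1"/"_R2" in the name, or None if absent."""
    markers = READ_MARKER.findall(base_name)
    return int(markers[-1]) if markers else None


def is_valid_pair(first, second):
    """Check a prospective (first, second) pair of file base names."""
    a, b = STRICT_NAME.match(first), STRICT_NAME.match(second)
    if a and b:
        # Both names follow the strict format: sample name and lane must agree.
        # Assumption: the first file must be R1 and the second R2.
        return (
            a.group("sample") == b.group("sample")
            and a.group("lane") == b.group("lane")
            and a.group("read") == "1"
            and b.group("read") == "2"
        )
    # Fallback: the first name contains "_R1" and the second contains "_R2",
    # using the last occurrence of a marker in each name.
    return last_read_marker(first) == 1 and last_read_marker(second) == 2


if __name__ == "__main__":
    print(is_valid_pair("S1_L001_R1_001", "S1_L001_R2_001"))
    print(is_valid_pair("S1_L001_R1_001", "S1_L002_R2_001"))
    print(is_valid_pair("run_R1", "run_R2"))
```

Running a check like this over file names before starting an import can help predict which pairs would be skipped and reported in the log.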

If the Join reads from different lanes option, in the Illumina options section of the dialog, is checked, then valid pairs of files with the same lane information in their file names will be imported into the same sequence list. If a valid pair of files does not contain the same lane information in the file names, then no data is imported from those files and a message is printed in the log.

Within each file, the first read of a pair will have a 1 somewhere in the information line. In most cases, this will be a /1 at the end of the read name. In some cases though (e.g. CASAVA1.8), there will be a 1 elsewhere in the information line for each sequence. Similarly, the second read of a pair will have a 2 somewhere in the information line: either a /2 at the end of the read name, or a 2 elsewhere in the information line.
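As an illustration of the two conventions, the sketch below extracts the read number from a fastq information line. It is not the importer's parser: the function name read_number is hypothetical, and the CASAVA1.8 case assumes the commonly used header layout in which the part after the space starts with <read>:<filter flag>:<control number>.

```python
import re

# Older convention: read name ends in /1 or /2.
SLASH_SUFFIX = re.compile(r"/([12])$")
# Assumed CASAVA1.8 convention: text after the first space starts with
# <read>:<Y|N>:<control number>, e.g. "1:N:0:ATCACG".
CASAVA_COMMENT = re.compile(r"^([12]):[YN]:\d+")


def read_number(info_line):
    """Return 1 or 2 for the read's position in its pair, or None if unknown."""
    name, _, comment = info_line.strip().lstrip("@").partition(" ")
    match = SLASH_SUFFIX.search(name)
    if match:
        return int(match.group(1))
    match = CASAVA_COMMENT.match(comment)
    if match:
        return int(match.group(1))
    return None


if __name__ == "__main__":
    print(read_number("@SEQ_ID/1"))
    print(read_number("@EAS139:136:FC706VJ:2:2104:15343:197393 2:Y:18:ATCACG"))
```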

The organization of the files can be customized using the Custom read structure field, described in Illumina options below.

Illumina options

In the next wizard step, options are presented for how to handle the results. If you choose to Save the results, an option called "Create subfolders per batch unit" becomes available. When that option is checked, each sequence list is saved into a separate folder under the location selected to save results to. This can be useful for organizing subsequent analysis results and for batch processing.


