Illumina

CLC Genomics Workbench supports data from Illumina's Genome Analyzer, HiSeq 2000, NextSeq and the MiSeq systems. Choosing the Illumina importer opens the dialog shown in figure 7.9. Fastq format files and fastq format files compressed using gzip (.gz), zip (.zip) or bzip2 (.bz2) can be imported.

This importer is also available when a workflow is launched and the Select file for import option is selected (also known as on-the-fly import). Choose "Illumina" from the drop down menu beside that option. See Launching workflows individually and in batches for further details.

For this importer, the drop down menu of input file locations includes the option BaseSpace. When selected, an Access BaseSpace... button is presented. Clicking on this opens a browser window, where your Illumina BaseSpace credentials can be entered. After doing that, granting the Workbench relevant access permissions and closing the browser window, you will be able to select files from BaseSpace in the Illumina High-Throughput Sequencing Import wizard. Your BaseSpace credentials remain valid for your current Workbench session.

No prior configuration is necessary to import from BaseSpace, but configuration options are available in Preferences (see Advanced preferences).

Figure 7.9: Importing data from Illumina systems.

General options for the Illumina importer

The settings in the General options area of the dialog are:

Default rules for determining pairs of files

First, the selected files are sorted based on the file names. Sorting is alphanumeric, except for files coming off the CASAVA1.8 pipeline, where pairs are organized according to their identifier and chunk number.

For example, CASAVA1.8 files with base names such as ID_R1_001, ID_R1_002, ID_R2_001 and ID_R2_002 would be sorted in the order below. Files with names containing "R1" are assumed to contain the first sequences of the pairs, and files with names containing "R2" are assumed to contain the second sequences of the pairs.

  1. ID_R1_001
  2. ID_R2_001
  3. ID_R1_002
  4. ID_R2_002

In this example, the data in files ID_R1_001 and ID_R2_001 are treated as a pair, and the data in files ID_R1_002 and ID_R2_002 are treated as another pair.
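The ordering described above can be illustrated with a short Python sketch. It is not the Workbench's implementation. The regular expression and the function name pair_casava_files are assumptions made for illustration, and the sketch only handles base names of the form shown in the example.

```python
import re
from collections import defaultdict

# Assumed pattern for base names such as "ID_R1_001":
# identifier, then R1 or R2, then the chunk number.
CASAVA_NAME = re.compile(r"^(?P<ident>.+)_R(?P<read>[12])_(?P<chunk>\d+)$")


def pair_casava_files(base_names):
    """Group base names into (R1, R2) pairs, ordered by identifier and chunk."""
    groups = defaultdict(dict)
    for name in base_names:
        match = CASAVA_NAME.match(name)
        if match is None:
            raise ValueError(f"not a CASAVA 1.8-style base name: {name}")
        key = (match["ident"], int(match["chunk"]))
        groups[key][match["read"]] = name

    pairs = []
    for key in sorted(groups):
        members = groups[key]
        if set(members) != {"1", "2"}:
            raise ValueError(f"incomplete pair for {key}")
        pairs.append((members["1"], members["2"]))
    return pairs


names = ["ID_R1_001", "ID_R1_002", "ID_R2_001", "ID_R2_002"]
for first, second in pair_casava_files(names):
    print(first, second)
```

Grouping by identifier and chunk number, rather than sorting the names alphanumerically, is what keeps ID_R1_001 next to ID_R2_001 in this sketch.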

The following checks are then carried out for each prospective pair of files to determine whether those files form a valid pair:

If the Join reads from different lanes option, in the Illumina options section of the dialog, is enabled, then valid pairs of files with the same lane information in their file names will be imported into the same sequence list. If a valid pair of files does not contain the same lane information in their names, then no data is imported from those files and a message is printed in the log.

Within each file, the first read of a pair will have a 1 somewhere in the information line. In most cases, this will be a /1 at the end of the read name. In some cases though (e.g. CASAVA1.8), there will be a 1 elsewhere in the information line for each sequence. Similarly, the second read of a pair will have a 2 somewhere in the information line - either a /2 at the end of the read name, or a 2 elsewhere in the information line.
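As a rough illustration of the read-number rule above, the following Python sketch extracts the 1 or 2 from an information line. It is an assumption-laden sketch, not the importer's actual logic: the function name read_number is hypothetical, and only two layouts are handled, a /1 or /2 suffix on the read name and the CASAVA 1.8 layout in which the read number is the first colon-separated field after the space.

```python
def read_number(info_line):
    """Return 1 or 2 for a FASTQ information line, or None if undetermined."""
    text = info_line.lstrip("@").strip()
    name, _, comment = text.partition(" ")

    # Older style: the read name ends in /1 or /2.
    if name.endswith("/1") or name.endswith("/2"):
        return int(name[-1])

    # CASAVA 1.8 style: the part after the space starts with the read number,
    # for example "1:N:0:ATCACG".
    first_field = comment.split(":", 1)[0]
    if first_field in ("1", "2"):
        return int(first_field)

    return None


print(read_number("@SEQ_ID/1"))
print(read_number("@EAS139:136:FC706VJ:2:2104:15343:197393 2:N:18:ATCACG"))
```

A quick check of this kind on the first information line of each file can help confirm which file holds the first reads and which holds the second reads before import.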

Discard read names. For high-throughput sequencing data, the names of individual reads are often irrelevant given the very large number of reads. This option allows you to discard read names to save disk space.

Note: If you do not choose to discard read names, you can quickly check that the imported data contains the expected pairs by looking at the first few sequence names of the imported sequence list in the CLC Genomics Workbench. The first two sequences should have the same name, except for a 1 or a 2 somewhere in the read name line.

Discard quality scores. Quality scores are visible in the mapping view and they are used during variant detection. If this is not relevant for your work, you can enable the Discard quality scores option. This can reduce disk space usage and memory consumption. Read more about the quality scores of Illumina data below.

Paired read information

When the Paired reads option is enabled, options in the "Paired read information" section of the dialog can be edited. Here, you specify the type of paired data, Paired-end (forward-reverse) or Mate-pair (reverse-forward), and the expected distance range.

In the Workbench, the only difference between paired-end (forward-reverse) and mate-pair (reverse-forward) data is the expected orientation of the reads: forward-reverse in the case of paired-end data and reverse-forward in the case of mate pairs.

The paired read distance includes the full read sequence, i.e. it is measured from the beginning of the forward read to the beginning of the reverse read (figure 7.10). The distances are usually defined during the library preparation of your sequencing experiment, but if in doubt, you can enter default values: for paired-end reads, distances are typically between 1 and 1000 bp, while mate-pair reads typically have longer distances, between 1000 and 5000 bp (and sometimes up to 10000 bp). Note that the tools usually used subsequently to process Illumina reads (such as Map Reads to Reference or RNA-Seq Analysis) have an "Auto-detect paired distances" option that is enabled by default. As long as this option is used, mis-specifying the distances during import should have no consequences.

Read more about handling paired data.

Figure 7.10: Green lines represent forward reads, red lines reverse reads, and in blue is shown the distance of the sequenced DNA fragment. Thus, if there is a complete overlap, the minimum distance will not be 0, but the length of the overlap.
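
The distance definition above can be made concrete with a small Python sketch for paired-end (forward-reverse) data. The coordinate convention (0-based leftmost mapped positions on a reference) and the function names are assumptions chosen for illustration, not something defined by the Workbench. The default range of 1 to 1000 bp is the paired-end default mentioned above.

```python
def paired_distance(fwd_start, rev_start, rev_length):
    """Distance spanned by a forward-reverse pair, including both reads.

    fwd_start:  leftmost reference position of the forward read (0-based)
    rev_start:  leftmost reference position of the reverse read (0-based)
    rev_length: length of the reverse read in bases

    The reverse read begins (5' end) at its rightmost position, so the
    distance runs from fwd_start to rev_start + rev_length.
    """
    return (rev_start + rev_length) - fwd_start


def within_range(distance, minimum=1, maximum=1000):
    """Check a distance against the expected distance range."""
    return minimum <= distance <= maximum


# Reads of 150 bp with the forward read at 100 and the reverse read at 350.
print(paired_distance(100, 350, 150))

# Complete overlap: both reads cover the same 150 positions, so the
# distance equals the overlap length rather than 0.
print(paired_distance(100, 100, 150))
```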

Illumina options

In the next wizard step, options are presented for how to handle the results. If you choose to Save the results, an option called "Create subfolders per batch unit" becomes available. When that option is checked, each sequence list is saved into a separate folder under the location selected to save results to. This can be useful for organizing subsequent analysis results and for batch processing.


