Illumina

CLC Genomics Workbench supports data from Illumina's Genome Analyzer, HiSeq 2000, NextSeq, and MiSeq systems. Choosing the Illumina import opens the dialog shown in figure 6.10. This data type can also be imported using the on-the-fly import functionality described in the section Launching workflows individually and in batches.

Figure 6.10: Importing data from Illumina systems.

File format

The accepted file formats are fastq and fastq data compressed in gzip format (*.gz). Paired data in either of these formats can be imported.

Note that fastq files contain information specifying whether each read has passed a quality filter. If you check Remove failed reads, the reads that failed the filter are ignored during import.

For fastq files, the read header contains a filter flag, where Y means the read failed the quality filter and N means it passed. In this example, the read has not passed the quality filter:

    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
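The flag convention above can be checked outside the Workbench with a few lines of Python. This is a minimal sketch, not the Workbench's own import code: the function name `failed_quality_filter` is hypothetical, and the sketch assumes the header layout of the example above, where the part after the space is a colon-separated list whose second field is the Y/N flag.

```python
def failed_quality_filter(header: str) -> bool:
    """Return True if the header's filter flag is 'Y' (read failed the filter)."""
    # Assumed layout: "@<instrument and position fields> <read>:<flag>:<number>:<index>"
    parts = header.strip().split(" ")
    if len(parts) < 2:
        raise ValueError("header has no second, space-separated part: " + header)
    fields = parts[1].split(":")
    if len(fields) < 2 or fields[1] not in ("Y", "N"):
        raise ValueError("no Y/N filter flag found in: " + header)
    return fields[1] == "Y"


if __name__ == "__main__":
    example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
    print(failed_quality_filter(example))  # True: the flag is Y, so the read failed
```

A check like this can be handy for estimating how many reads Remove failed reads would discard before running the import.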

Note! In the Illumina pipeline 1.5-1.7, the letter B in the quality score has a special meaning: it is used as a trim indicator. This means that when Illumina pipeline 1.5-1.7 is selected, reads are automatically trimmed when a B is encountered at either end of a read in the input file. This also happens if you choose to discard quality scores during import.
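To illustrate the effect of this trimming, the sketch below removes runs of B from both ends of a quality string and drops the matching bases. It is only an approximation under stated assumptions: the function name `trim_b_ends` is hypothetical, and the exact trimming rule used by the Workbench may differ from this simple end-trimming.

```python
from typing import Tuple


def trim_b_ends(sequence: str, quality: str) -> Tuple[str, str]:
    """Trim leading and trailing positions whose quality character is 'B'."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    start = 0
    end = len(quality)
    # Skip the run of 'B' at the start of the read.
    while start < end and quality[start] == "B":
        start += 1
    # Skip the run of 'B' at the end of the read.
    while end > start and quality[end - 1] == "B":
        end -= 1
    return sequence[start:end], quality[start:end]


if __name__ == "__main__":
    print(trim_b_ends("ACGTACGT", "BBffffBB"))  # ('GTAC', 'ffff')
```

Note that a B in the middle of a read is left untouched in this sketch; only the ends are trimmed.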

General Options

The General options to the left are:

Paired read information

These options become available if you selected the option "Paired reads" in the General options. First, it is very important to select the correct nature of the paired reads being imported: Paired-end (forward-reverse) or Mate-pair (reverse-forward). Second, you have to specify the Minimum and Maximum distances for your pairs. The paired read distance includes the full read sequences, which means that it is measured from the beginning of the forward read to the beginning of the reverse read (figure 6.11). The distances are usually defined during the library preparation of your sequencing experiment, but if in doubt you can enter default values: paired-end distances are typically between 1 and 1000 bp, while mate-pair reads typically have longer distances of 1000-5000 bp (and sometimes up to 10000 bp). Note that the tools usually used subsequently to process Illumina reads (such as Map Reads to Reference or RNA-Seq Analysis) have an "Auto-detect paired distances" option that is enabled by default. As long as this option is used, mis-specifying the distances during import should have no consequences.

Figure 6.11: Green lines represent forward reads, red lines reverse reads, and in blue is shown the distance of the sequenced DNA fragment. Thus, if there is a complete overlap, the minimum distance will not be 0, but the length of the overlap.
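The distance definition can be made concrete with a small calculation. The following sketch assumes a paired-end (forward-reverse) pair placed on a reference, with each read described by its leftmost coordinate; the function names and parameters are hypothetical and are not part of the Workbench.

```python
def paired_distance(forward_start: int, reverse_start: int, reverse_length: int) -> int:
    """Distance from the beginning of the forward read to the beginning of the
    reverse read, including both full read sequences.

    forward_start and reverse_start are leftmost reference coordinates. The
    reverse read is sequenced from the opposite strand, so its beginning is at
    its rightmost coordinate.
    """
    return reverse_start + reverse_length - forward_start


def within_range(distance: int, minimum: int, maximum: int) -> bool:
    """Check a distance against the Minimum and Maximum distance settings."""
    return minimum <= distance <= maximum


if __name__ == "__main__":
    # Two 100 bp reads with a gap between them.
    print(paired_distance(100, 350, 100))  # 350
    # Complete overlap: the distance is the read length, not 0.
    print(paired_distance(100, 100, 100))  # 100
    print(within_range(350, 1, 1000))      # True
```

The second call shows why the minimum distance for fully overlapping reads equals the length of the overlap rather than 0.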

Illumina options

Subsequent analysis can then be executed in batch on all the files, and results can be compared at the end.

Click Next to adjust how to handle the results. We recommend choosing Save in order to save the results directly to a folder, since you will probably want to save them anyway before proceeding with your analysis. There is also an option to put the imported data into a separate folder. This can be handy for organizing subsequent analysis results and for batch processing.


