Quality scores in the Illumina platform
The quality scores in the FASTQ format come in different versions. You can read more about the FASTQ format at http://en.wikipedia.org/wiki/FASTQ_format. When you choose to import Illumina data and click Next, an option at the bottom of the dialog lets you select the quality score scheme (see figure 6.6).
Figure 6.6: Selecting the quality score scheme.
There are five options:
- Automatic. With this option, the Workbench attempts to detect the quality score format automatically. Sometimes this is not possible, and you have to specify the format yourself. In cases where the Workbench is unable to determine the format, the file is usually in one of the Illumina Pipeline formats. If the quality score information contains any of the characters ; < = > or ?, it is the old Illumina Pipeline format (ASCII values 59 to 63).
- NCBI/Sanger or Illumina 1.8 and later. Uses a Phred scale encoded with ASCII 33 to 93. This is the standard for FASTQ files, except for the early Illumina data formats (Illumina adopted it with version 1.8 of the Illumina Pipeline).
- Illumina Pipeline 1.2 and earlier. Uses a Solexa/Illumina scale (-5 to 40) encoded with ASCII 59 to 104. The Workbench automatically converts these quality scores to the Phred scale on import to ensure a common scale for analyses across data sets from different platforms (see the details on the conversion next to the sample below).
- Illumina Pipeline 1.3 and 1.4. Uses a Phred scale encoded with ASCII 64 to 104.
- Illumina Pipeline 1.5 to 1.7. Uses a Phred scale encoded with ASCII 64 to 104. Values 0 (@) and 1 (A) are no longer used. Value 2 (B) has a special meaning and is used as a trim clipping indicator. This means that when Illumina Pipeline 1.5 to 1.7 is selected, reads are trimmed when a B is encountered in the input file, provided the Trim reads option is checked.
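To illustrate the B-based trimming described for Illumina Pipeline 1.5 to 1.7, here is a minimal Python sketch. It assumes that the B characters form a run at the end of the quality string and that trimming removes that run together with the matching bases. The function name `trim_trailing_b` is hypothetical, and this is not necessarily how the Workbench implements the Trim reads option.

```python
def trim_trailing_b(sequence: str, quality: str) -> tuple[str, str]:
    """Remove the trailing run of 'B' quality characters and the matching bases.

    Assumption: in Illumina Pipeline 1.5 to 1.7 files, 'B' (value 2 at
    ASCII offset 64) marks the low-quality tail of a read.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    kept = len(quality.rstrip("B"))  # number of positions before the B run
    return sequence[:kept], quality[:kept]


# Hypothetical read whose last three positions are flagged with 'B'.
seq, qual = trim_trailing_b("ACGTACGTAC", "ffffdd`BBB")
print(seq, qual)  # ACGTACG ffffdd`
```

A read whose quality string contains no trailing B is returned unchanged.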
Small samples of three of these file types are shown below. The names of the reads have no influence on the quality score format:
NCBI/Sanger Phred scores:
@SRR001926.1 FC00002:7:1:111:750 length=36
TTTTTGTAAGGAGGGGGGTCATCAAAATTTGCAAAA
+SRR001926.1 FC00002:7:1:111:750 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIFIIII'IB<IH
@SRR001926.7 FC00002:7:1:110:453 length=36
TTATATGGAGGCTTTAAGAGTCATAGGTTGTTCCCC
+SRR001926.7 FC00002:7:1:110:453 length=36
IIIIIIIIIII:'III?=IIIIII+&III/3I8F/&
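As an illustration of how such a quality line maps to Phred scores, the following Python sketch subtracts the ASCII offset from each character (33 for NCBI/Sanger, 64 for the Phred-scaled Illumina Pipeline formats) and turns a score into an error probability. The function names are hypothetical and are not part of the Workbench.

```python
def decode_phred(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores by subtracting the ASCII offset."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the ASCII offset; wrong encoding?")
    return scores


def error_probability(phred: int) -> float:
    """Probability that a base call with the given Phred score is wrong."""
    return 10 ** (-phred / 10)


# Last six quality characters of the first NCBI/Sanger read in the sample.
print(decode_phred("'IB<IH"))  # [6, 40, 33, 27, 40, 39]
print(error_probability(40))
```

With offset 33, the character I (ASCII 73) decodes to Phred 40, which corresponds to an error probability of about 1 in 10,000.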
Illumina Pipeline 1.2 and earlier (note the question mark at the end of line 4 - this is one of the values that are unique to the old Illumina pipeline format):
@SLXA-EAS1_89:1:1:672:654/1
GCTACGGAATAAAACCAGGAACAACAGACCCAGCA
+SLXA-EAS1_89:1:1:672:654/1
cccccccccccccccccccc]c``cVcZccbSYb?
@SLXA-EAS1_89:1:1:657:649/1
GCAGAAAATGGGAGTGAAAATCTCCGATGAGCAGC
+SLXA-EAS1_89:1:1:657:649/1
ccccccccccbccbccb``cccbcccZcc`^bR^`
The formulas used for converting the special Solexa-scale quality scores to Phred-scale are based on the definitions Q_phred = -10 log10(p) and Q_solexa = -10 log10(p / (1 - p)), where p is the probability that the base call is incorrect.
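Eliminating the error probability from the standard Phred and Solexa definitions gives Q_phred = 10 log10(10^(Q_solexa / 10) + 1). The Python sketch below applies that relation to a quality string in the Illumina Pipeline 1.2 and earlier encoding (ASCII offset 64). Rounding to the nearest integer is an assumption made for the example, since the rounding used by the Workbench is not stated here, and the function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa-scale quality score to the Phred scale."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def decode_solexa_as_phred(quality: str, offset: int = 64) -> list[int]:
    """Decode a Solexa-encoded quality string and convert each score to Phred.

    Assumption: scores are rounded to the nearest integer.
    """
    return [round(solexa_to_phred(ord(ch) - offset)) for ch in quality]


# ';' is Solexa -5, '?' is -1, '@' is 0 and 'h' is 40.
print(decode_solexa_as_phred(";?@h"))  # [1, 3, 3, 40]
```

The two scales differ noticeably only for low scores. For high scores the converted value is almost identical to the original, for example Solexa 40 maps to Phred 40.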
A sample of the quality scores of the Illumina Pipeline 1.3 and 1.4:
@HWI-E4_9_30WAF:1:1:8:178
GCCAGCGGCGCAAAATGNCGGCGGCGATGACCTTC
+HWI-E4_9_30WAF:1:1:8:178
babaaaa\ababaaaaREXabaaaaaaaaaaaaaa
@HWI-E4_9_30WAF:1:1:8:1689
GATGGAGATCTCGACCTNATAGGTGCCCTCATCGG
+HWI-E4_9_30WAF:1:1:8:1689
aab`]_aaaaaaaaaa[ER`abaaa\aaaaaaaa[
Note that it is not possible to tell from the data itself that this sample is not in the Illumina Pipeline 1.2 and earlier format, since the two formats share the ASCII range 64 to 104.
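The ambiguity can be made concrete with a small range check. The Python sketch below lists every scheme whose ASCII range, as given in this section, covers all characters seen in a set of quality lines. It is only a heuristic for illustration, not the detection algorithm used by the Workbench. The function name is hypothetical, and the lower bound of 66 for Illumina Pipeline 1.5 to 1.7 is an assumption derived from the statement that values 0 and 1 are no longer used.

```python
# (scheme name, lowest ASCII code, highest ASCII code), taken from the list above.
SCHEMES = [
    ("NCBI/Sanger or Illumina 1.8 and later", 33, 93),
    ("Illumina Pipeline 1.2 and earlier", 59, 104),
    ("Illumina Pipeline 1.3 and 1.4", 64, 104),
    ("Illumina Pipeline 1.5 to 1.7", 66, 104),  # assumption: '@' and 'A' unused
]


def guess_quality_encodings(quality_lines: list[str]) -> list[str]:
    """Return the names of all schemes whose ASCII range covers the observed characters."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return []
    lowest, highest = min(codes), max(codes)
    return [name for name, lo, hi in SCHEMES if lo <= lowest and highest <= hi]


# First quality line of the Illumina Pipeline 1.3 and 1.4 sample: three schemes fit.
print(guess_quality_encodings([r"babaaaa\ababaaaaREXabaaaaaaaaaaaaaa"]))

# First quality line of the Illumina Pipeline 1.2 sample: the '?' leaves one candidate.
print(guess_quality_encodings(["cccccccccccccccccccc]c``cVcZccbSYb?"]))
```

When more than one scheme remains, as with the first call, the format has to be specified manually.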
To learn more about ASCII values, please see http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters.