Color space file formats

The .csfasta file format is often used for color space data. That format looks like this:

#picked reads from data/reads/SHIRAZ_20080320_MP_2_Sample1_F3.csfasta.original, panel range: 600 - 609
>600_50_31_F3
T2222002113300322132112231
>600_50_63_F3
T2330133212130133221033110
>600_50_100_F3
T0130001131012310201000101
>600_50_170_F3
T1002312103033121321233103
>600_50_174_F3
T0330022330332000323031121
>600_50_241_F3
T2103103103100212123030011
>600_50_256_F3
T0301131010233311200223332
>600_50_329_F3
T1303211033112301303220000
>600_50_342_F3
T2100003012212000310130111
...

The format is very similar to the fasta file format, except that it allows one or more lines starting with # before the first sequence. Each sequence is specified as a nucleotide followed by the colors encoded as numbers, where 0 is blue, 1 is green, 2 is yellow, and 3 is red. So the sequence:

[Image: colorspace10]

would be coded like this in a .csfasta file:

>sequence
T3122013131

The T is the nucleotide that is known from the primer and the numbers indicate the colors. Because the T came from the primer, it is not part of the sequenced DNA molecule. Thus, this letter should be ignored when analyzing the read. So this sequence would look like this in .fasta format:

>sequence
ACTCCATGCA

So there is one nucleotide for each experimentally determined color (i.e., each number in the .csfasta file).
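
The conversion above can be sketched in a few lines of Python. The transition table is an assumption in the sense that this section does not spell it out: it is the standard SOLiD two-base encoding, in which each color describes the step from one base to the next. The function name decode_colors is hypothetical, and the sketch does not handle missing color calls.

```python
# Standard SOLiD two-base encoding (not listed in this section):
# each color maps the previous base to the next base.
# 0 keeps the base, 1 swaps A<->C and G<->T,
# 2 swaps A<->G and C<->T, 3 swaps A<->T and C<->G.
TRANSITIONS = {
    "0": {"A": "A", "C": "C", "G": "G", "T": "T"},
    "1": {"A": "C", "C": "A", "G": "T", "T": "G"},
    "2": {"A": "G", "G": "A", "C": "T", "T": "C"},
    "3": {"A": "T", "T": "A", "C": "G", "G": "C"},
}


def decode_colors(cs_read):
    """Translate a color space read (primer base + colors) to nucleotides.

    The leading primer base is used only to decode the first color;
    it is not included in the returned sequence.
    """
    base = cs_read[0]
    bases = []
    for color in cs_read[1:]:
        base = TRANSITIONS[color][base]
        bases.append(base)
    return "".join(bases)


print(decode_colors("T3122013131"))  # prints ACTCCATGCA
```

Note that each decoded base depends on the one before it, so the primer base is needed to start the chain even though it does not appear in the result.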

The .csfasta format does not contain any significant information that is not also present in a standard fasta file of the same sequences. The only extra information is the last nucleotide of the primer, which is not useful in later analyses.

So from the viewpoint of software programs analyzing read data, color space is just another file format for reads, along with .fasta, .fastq, .sff, etc. Thus, in the Assembly Cell programs, the color space options for assembly have no connection to file formats. You can choose to assemble SOLiD data in .csfasta format without using the color space options for assembly, and you can also choose to assemble reads in a normal .fasta file using the color space assembly options.
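
To illustrate how little separates .csfasta from fasta at the parsing level, here is a minimal reader sketch in Python. It is not how the Assembly Cell programs read the format; the function name read_csfasta is hypothetical, and the sketch assumes each read sits on a single line after its header, as in the example at the top of this section. For simplicity it skips lines starting with # wherever they occur, not only before the first sequence.

```python
import io


def read_csfasta(handle):
    """Yield (name, primer_base, colors) for each record in a .csfasta stream."""
    name = None
    for line in handle:
        line = line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            # Header line: remember the read name without the '>'.
            name = line[1:]
        elif name is not None:
            # Sequence line: first character is the primer base,
            # the rest are the color calls.
            yield name, line[0], line[1:]
            name = None


# Two records taken from the example at the top of this section.
sample = io.StringIO(
    "# comment line\n"
    ">600_50_31_F3\n"
    "T2222002113300322132112231\n"
    ">600_50_63_F3\n"
    "T2330133212130133221033110\n"
)
records = list(read_csfasta(sample))
for name, primer, colors in records:
    print(name, primer, colors)
```

Apart from the comment handling and the split of the primer base from the colors, this is the same loop one would write for a single-line fasta file.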