Color space file formats

The .csfasta file format is often used for color space data. That format looks like this:

#picked reads from data/reads/SHIRAZ_20080320_MP_2_Sample1_F3.csfasta.original, panel range: 600 - 6

So, it is very similar to the fasta file format. It does, however, allow one or more lines starting with # before the first sequence. Each sequence is specified as a nucleotide followed by the colors encoded as numbers, where 0 is blue, 1 is green, 2 is yellow, and 3 is red. So the sequence:

[Image: colorspace10]

Would be coded like this in a .csfasta file:


The T is the nucleotide that is known from the primer and the numbers indicate the colors. Because the T came from the primer, it is not part of the sequenced DNA molecule. Thus, this letter should be ignored when analyzing the read. So this sequence would look like this in .fasta format:

So there is one nucleotide for each experimentally determined color (i.e. each number in the .csfasta file).
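The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Assembly Cell programs. It assumes the usual SOLiD two-base encoding, which the text above does not spell out: with A=0, C=1, G=2, T=3, each color is the XOR of the codes of two adjacent nucleotides. The function names read_csfasta and colors_to_bases are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: standard SOLiD two-base encoding, where with
# A=0, C=1, G=2, T=3 a color equals code(previous) XOR code(next).
BASES = "ACGT"


def read_csfasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, color_read) pairs; lines starting with '#' are skipped."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or '#' comment line
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def colors_to_bases(color_read: str) -> str:
    """Decode a read such as 'T0123' and drop the leading primer base."""
    prev = BASES.index(color_read[0].upper())  # last base of the primer
    out = []
    for color in color_read[1:]:
        if color not in "0123":
            raise ValueError(f"unexpected color call: {color!r}")
        prev ^= int(color)  # next base = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)  # one nucleotide per color
```

For a made-up read T0123 (not the example from this page), colors_to_bases returns TGAT: four colors give four nucleotides, and the primer's T is not part of the output.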

The .csfasta file does not contain any significant information that is not also present in a standard fasta file of the same sequences. The only extra information is the last nucleotide of the primer, which is not useful in later analyses.
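To illustrate this point, the sketch below goes in the opposite direction: given a nucleotide sequence and the last base of the primer, it produces the corresponding color string. It again assumes the usual SOLiD two-base encoding (A=0, C=1, G=2, T=3, color = XOR of adjacent codes), which is not stated on this page. The name bases_to_colors and the default primer base are hypothetical.

```python
# Assumption: standard SOLiD two-base encoding, where with
# A=0, C=1, G=2, T=3 a color equals code(previous) XOR code(next).
BASES = "ACGT"


def bases_to_colors(sequence: str, primer_base: str = "T") -> str:
    """Encode nucleotides as a .csfasta-style read, e.g. 'TGAT' -> 'T0123'."""
    prev = BASES.index(primer_base.upper())
    colors = []
    for base in sequence.upper():
        cur = BASES.index(base)  # raises ValueError for anything but A, C, G, T
        colors.append(str(prev ^ cur))
        prev = cur
    return primer_base.upper() + "".join(colors)
```

The only input besides the nucleotide sequence is the primer base, which matches the statement that this base is the only extra information in the .csfasta file.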

So from the viewpoint of software programs analyzing read data, color space is just another file format for reads, along with .fasta, .fastq, .sff, etc. Thus, in the Assembly Cell programs, the color space options for assembly have no connection to file formats. You can choose to assemble SOLiD data in .csfasta format without using the color space options for assembly, and you can also choose to assemble reads in a normal .fasta file using the color space assembly options.