Specifying the data for the assembly and how it should be used

The parameters described in this section can be used multiple times within a single clc_assembler command.

A setting given with one of these flags applies to all inputs that follow it, until the flag is given again with a different value. For example, if you state that reads are to be used for guidance only, all datasets entered from that point in the command onward are used only for guidance, until you indicate that reads should again be used at all stages of the assembly. Please see the examples section below for further details.
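The "applies until you specify otherwise" behaviour can be illustrated with a small Python sketch. This is only a hypothetical model written for this explanation, not the clc_assembler parser: the function name assign_settings and the file names are assumptions, and only the -g and -q flags described in this section are modelled.

```python
# Hypothetical model of a "sticky" option. It is NOT the clc_assembler parser;
# it only shows how a setting applies to every input file that follows it.

def assign_settings(tokens):
    """Return a list of (file_name, fragment_mode) tuples."""
    fragment_mode = "use"  # documented default for -g
    assigned = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-g", "--fragmentmode"):
            # The new value takes effect for the files that come after it.
            fragment_mode = tokens[i + 1]
            i += 2
        elif token in ("-q", "--reads"):
            # Marks that read file names follow; carries no value of its own.
            i += 1
        else:
            # Anything else is treated as a read file name in this sketch.
            assigned.append((token, fragment_mode))
            i += 1
    return assigned


# Made-up file names, used purely for illustration.
tokens = ["-q", "a.fastq", "-g", "ignore", "b.fastq", "c.fastq",
          "-g", "use", "d.fastq"]
result = assign_settings(tokens)
for name, mode in result:
    print(name, mode)
```

In this model, b.fastq and c.fastq both inherit "ignore" from the single -g flag that precedes them, and d.fastq returns to "use".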

Some of these options are most easily explained by considering the de novo assembly as consisting of four steps:

  1. Create fragments from reads.
  2. Connect fragments based on all reads to form a graph.
  3. Optimize the graph based on all reads.
  4. Optimize the graph based on paired reads.

-p <par> / --paired <par>

Here, par is a set of parameters that indicates the pair status of your data and, for paired data, the relative orientations of and expected distances between the members of each pair. Options after the -p flag are also used to indicate whether paired data are in two files and whether any particular data should be used only for its paired distance information during the final phase of the assembly.

Paired status

Data are assumed to be paired by default. To indicate that the data contain single reads, that is, that the data are not paired, give the single string value "no": enter "-p no" followed by the name of the read file.

Paired information

For paired data, par consists of four strings: <mode> <distance_mode> <min_dist> <max_dist>

mode indicates the relative orientation of the reads in a pair. It can be ff, fb, bf or bb, which stand for "forward-forward", "forward-reverse", "reverse-forward" and "reverse-reverse" reads, respectively (see footnote 5.2).

distance_mode indicates the point on each read of a pair from which the distance you provide is measured. The options are ss, se, es or ee, which mean "start-start", "start-end", "end-start" and "end-end", respectively.

<min_dist> and <max_dist> give the minimum and maximum distances allowed between the members of a pair. The points on the reads between which this distance is measured are specified by the distance_mode described above.

So, -p fb ss 180 250 would indicate that the reads are inverted and pointing towards each other, that the distance range includes the sequences of both the reads as well as the fragment between them, and that the distance range is between 180 and 250 bases.
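To make the four distance modes concrete, here is a minimal Python sketch for an fb pair. It rests on assumptions that are not stated in this manual: the first letter of distance_mode is taken to refer to the first read and the second letter to the second read, the first read starts at the left edge of the sequenced fragment, and the second read starts at the right edge. The function names pair_distance and in_range are hypothetical.

```python
# Sketch of the distance modes for an "fb" pair, under the assumptions named
# in the text above. Positions are counted in bases along the fragment.

def pair_distance(fragment_len, read1_len, read2_len, distance_mode):
    """Distance between the chosen points on the two reads of an fb pair."""
    # Read 1 points right: it starts at 0 and ends at read1_len.
    read1 = {"s": 0, "e": read1_len}
    # Read 2 points left: it starts at the right edge and ends further left.
    read2 = {"s": fragment_len, "e": fragment_len - read2_len}
    first, second = distance_mode  # e.g. "ss" -> ("s", "s")
    return read2[second] - read1[first]


def in_range(distance, min_dist, max_dist):
    """True if the distance lies within the inclusive range."""
    return min_dist <= distance <= max_dist


# A 200-base fragment sequenced with two 50-base reads.
distances = {m: pair_distance(200, 50, 50, m) for m in ("ss", "se", "es", "ee")}
print(distances)
print(in_range(distances["ss"], 180, 250))
```

With ss the distance spans both reads and the sequence between them, which matches the description of -p fb ss 180 250 above; with ee only the sequence between the two reads is counted.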

How data should be used in the assembly

-p d <mode> <distance_mode> <min_dist> <max_dist>

An additional option, d, can be added to the -p flag to indicate that the reads should only be used in the fourth step of the assembly, as listed at the top of this section. That is, the paired distance information associated with these reads is used when optimizing the graph. The sequence information itself is not used towards the assembly result. This is useful for data such as SOLiD, where the reads are quite short and may contain many errors. Such reads would not, by themselves, be of great help in building and optimizing the graph initially, but the paired information associated with them can be valuable during the final graph optimization stage.
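The forms that the -p parameters can take ("no", four strings, or d followed by four strings) can be summarized as a small validation sketch in Python. This is an illustration only and does not reproduce how clc_assembler checks its arguments; the function name parse_paired, the returned dictionary keys, and the rule that min_dist must not exceed max_dist are assumptions.

```python
# Illustrative validator for the values that may follow -p, based only on the
# forms described in this section. Not the tool's own argument handling.

MODES = {"ff", "fb", "bf", "bb"}
DISTANCE_MODES = {"ss", "se", "es", "ee"}


def parse_paired(params):
    """Turn the strings that follow -p into a dictionary of settings."""
    params = list(params)
    if params == ["no"]:
        return {"paired": False}

    distance_only = False
    if params and params[0] == "d":
        # Reads are used only for their paired distance information.
        distance_only = True
        params = params[1:]

    if len(params) != 4:
        raise ValueError("expected <mode> <distance_mode> <min_dist> <max_dist>")
    mode, distance_mode, min_text, max_text = params
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    if distance_mode not in DISTANCE_MODES:
        raise ValueError("unknown distance_mode: " + distance_mode)

    min_dist, max_dist = int(min_text), int(max_text)
    if min_dist > max_dist:
        # Assumption of this sketch: an empty range is treated as an error.
        raise ValueError("min_dist is larger than max_dist")

    return {
        "paired": True,
        "distance_only": distance_only,
        "mode": mode,
        "distance_mode": distance_mode,
        "min_dist": min_dist,
        "max_dist": max_dist,
    }


print(parse_paired("fb ss 180 250".split()))
print(parse_paired("d fb ss 180 250".split()))
print(parse_paired(["no"]))
```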

Please also refer to the information about the -g option in the section below.

-g <mode> / --fragmentmode <mode>

You might choose to use this option with longer reads that are expected to contain many errors, for example, in the case of 454 reads.

Here, mode can be one of two values: "use" and "ignore". By default, all data are run with "-g use".

Providing "-g ignore" before the name of a read set indicates that this read set, and all others after it in the command up to the point where "-g use" is entered, is not used in the first step of the assembly, as described above, where fragments are generated from the reads. In other words, read sets following "-g ignore" in the command are used in steps 2, 3 and 4 as listed above, where the graph is determined and optimized.
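The effect of "-p d" and "-g ignore" on the four steps listed at the top of this section can be summarized in a short Python sketch. The function name steps_using_reads is hypothetical. The sketch also assumes that the d option takes precedence when both settings are given, and it lists step 4 for all read sets even though that step draws on paired reads; neither point is spelled out in this section.

```python
# Summary of which assembly steps a read set takes part in, following the
# descriptions of "-p d" and "-g ignore" in this section.

STEPS = {
    1: "Create fragments from reads",
    2: "Connect fragments based on all reads to form a graph",
    3: "Optimize the graph based on all reads",
    4: "Optimize the graph based on paired reads",
}


def steps_using_reads(fragment_mode="use", distance_only=False):
    """Return the step numbers in which a read set is used."""
    if distance_only:
        # "-p d": only the paired distance information is used, in step 4.
        # Assumption: this takes precedence over the -g setting.
        return [4]
    if fragment_mode == "ignore":
        # "-g ignore": skipped when fragments are generated from the reads.
        return [2, 3, 4]
    return [1, 2, 3, 4]


for label, steps in [
    ("default", steps_using_reads()),
    ("-g ignore", steps_using_reads(fragment_mode="ignore")),
    ("-p d ...", steps_using_reads(distance_only=True)),
]:
    print(label, [STEPS[s] for s in steps])
```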

-n / --no-scaffolding

Pair distance information is used for the creation of contig sequences, but no scaffolding (i.e. making associations between contigs) is performed.

-q / --reads

This flag indicates that the arguments that follow are read file names. Read files can be in fasta, fastq or sff format.

-i <file1> <file2> / --interleave <file1> <file2>

To input paired read data that is in two files, where one read of each pair is in one file and the other of each pair is in the second file, the pair of file names should be provided after the -i flag.

Read files for paired data entered without the -i flag are assumed to contain interleaved pairs. That is, the first sequence is the first member of a pair, the second sequence is the second member of that same pair, the third sequence is the first member of the second pair, the fourth sequence is the second member of the second pair, and so on.
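The interleaved ordering described above can be sketched in Python. The sketch works on plain lists of read names rather than on real fasta, fastq or sff files, and the function names interleave and split_pairs are hypothetical; it only illustrates the ordering, not how clc_assembler reads its input.

```python
# Illustration of interleaved pair ordering using lists of read names.

def interleave(first_reads, second_reads):
    """Merge two per-mate lists into one interleaved list."""
    if len(first_reads) != len(second_reads):
        raise ValueError("both lists must contain the same number of reads")
    interleaved = []
    for first, second in zip(first_reads, second_reads):
        # Members of the same pair sit next to each other.
        interleaved.extend([first, second])
    return interleaved


def split_pairs(interleaved):
    """Recover (first, second) pairs from an interleaved list."""
    if len(interleaved) % 2 != 0:
        raise ValueError("an interleaved list must hold an even number of reads")
    return list(zip(interleaved[0::2], interleaved[1::2]))


# Made-up read names: "/1" and "/2" mark the two members of a pair.
merged = interleave(["pair1/1", "pair2/1"], ["pair1/2", "pair2/2"])
print(merged)
print(split_pairs(merged))
```

An odd number of sequences cannot form complete pairs, which is why the sketch rejects such input.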

Mixed files, that is, single input files that contain both paired and unpaired reads, cannot be used as input with the clc_assembler command, unless the intention is to treat all reads as single reads.



Footnotes

5.2 The letter "b" in the mode strings refers to the word "backwards", but the more common word to use to describe this relative orientation is "reverse".