The clc_sequence_info Program

The clc_sequence_info program gives some basic information about the sequences in a FASTA file:

File                           data/paired.fasta
Number of sequences                 47356
Residue counts:
  Total                          11114027
Sequence length:
  Minimum                             170
  Maximum                             240
  Average                             234.69
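
To make the fields in this report concrete, the following Python sketch derives the same kind of summary (number of sequences, total residues, minimum, maximum and average length) from FASTA text. It is an illustration only, not the implementation of clc_sequence_info; the function names read_fasta_lengths and summarize and the two-record example input are hypothetical.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Return the residue count of every record in a FASTA stream."""
    lengths = []
    current = None  # stays None until the first '>' header is seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            # Sequence data may be wrapped over several lines.
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Compute the summary fields shown in the report above."""
    total = sum(lengths)
    return {
        "count": len(lengths),
        "total": total,
        "min": min(lengths),
        "max": max(lengths),
        "average": total / len(lengths),
    }


# Hypothetical two-record input; the first record is wrapped over two lines.
example = StringIO(">read1\nACGTAC\nGT\n>read2\nACGT\n")
stats = summarize(read_fasta_lengths(example))
print(stats)
```

The average is simply the total residue count divided by the number of sequences, which is how the figures in the report relate to each other (11114027 / 47356 is approximately 234.69).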

Using the `-r' option includes counts of the different types of nucleotides, with all ambiguous nucleotides counted as N's. The `-a' option used together with the `-r' option reports the counts for amino acids instead.
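
The counting rule for nucleotides can be sketched in Python as follows. This is a minimal illustration, assuming that only A, C, G and T (in either case) are treated as unambiguous and that every other symbol is added to the N count; the function name count_nucleotides is hypothetical and the actual program may differ in details.

```python
from collections import Counter


def count_nucleotides(sequence):
    """Count A, C, G and T; fold every other symbol into N (assumed rule)."""
    counts = Counter({"A": 0, "C": 0, "G": 0, "T": 0, "N": 0})
    for base in sequence.upper():
        if base in ("A", "C", "G", "T"):
            counts[base] += 1
        else:
            # Ambiguity codes such as R or Y are counted as N.
            counts["N"] += 1
    return dict(counts)


# Hypothetical input containing the ambiguity codes R and Y and one N.
counts = count_nucleotides("ACGTRYNacgt")
print(counts)
```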

The lengths of the sequences can be printed or summarized using the `-l' and `-k' options, respectively.

It is also possible to get various sequence length statistics. Using the `-n' option, the N50 value of the sequences is calculated. The N50 value is the largest length L such that the sequences of length L or longer together make up at least 50% of the total length of all sequences. This is useful for getting a quick quality overview of a de novo assembly.
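
The N50 definition can be made concrete with a short Python sketch. It follows the usual way of computing N50 (sort the lengths from longest to shortest and accumulate until half of the total is reached) and is not taken from the clc_sequence_info source; the function name n50 and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty set of sequences")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the accumulated sum
        # reaches at least half of the total.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths with a total of 300.
contigs = [100, 80, 60, 40, 20]
result = n50(contigs)
print(result)
```

In this example the two longest contigs (100 and 80) together cover 180 of 300 residues, so the N50 is 80: the longest contig alone covers less than half of the total.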

Use the `-c' option to exclude all sequences under a certain length from the statistics. This is sometimes useful for analyzing de novo assembly results, where short sequences may not be of interest.
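
The effect of such a length cutoff can be illustrated with a small Python sketch. It assumes that sequences strictly shorter than the cutoff are dropped and that sequences exactly at the cutoff are kept; the function name filter_by_min_length, the cutoff of 100 and the contig lengths are hypothetical.

```python
def filter_by_min_length(lengths, cutoff):
    """Keep only lengths that are at least `cutoff` (assumed boundary rule)."""
    return [n for n in lengths if n >= cutoff]


# Hypothetical assembly with a few very short contigs.
contigs = [1500, 900, 120, 80, 45]
kept = filter_by_min_length(contigs, 100)

average_all = sum(contigs) / len(contigs)
average_kept = sum(kept) / len(kept)
print(kept, average_all, average_kept)
```

Dropping the two short contigs raises the average length from 529.0 to 840.0, which shows how strongly short sequences can influence summary statistics for an assembly.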

Further details are available in Options for All Programs.