The clc_sequence_info Program
The clc_sequence_info program gives some basic information about the sequences in a fasta file:
File                   data/paired.fasta
Number of sequences    47356
Residue counts:
  Total                11114027
Sequence length:
  Minimum              170
  Maximum              240
  Average              234.69
Using the `-r' option includes counts of the different types of nucleotides, with all ambiguous nucleotides counted as N's. Using the `-a' option together with the `-r' option produces the counts for amino acids.
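To make the counting rule concrete, the following is a minimal Python sketch of tallying nucleotides with every ambiguity code folded into N. It is an illustration only, not the clc_sequence_info implementation: the function name `count_nucleotides` is hypothetical, and treating every symbol other than A, C, G and T (in either case) as N is an assumption.

```python
from collections import Counter


def count_nucleotides(sequences):
    """Tally A, C, G, T and N over all sequences.

    Assumption: any symbol other than A, C, G or T (case-insensitive)
    is treated as ambiguous and counted as N.
    """
    counts = Counter({base: 0 for base in "ACGTN"})
    for seq in sequences:
        for symbol in seq.upper():
            counts[symbol if symbol in "ACGT" else "N"] += 1
    return dict(counts)


# R and Y are ambiguity codes, so they are counted as N here.
example = count_nucleotides(["ACGTRY", "ggnn"])
print(example)
```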
The lengths of the sequences can be printed or summarized using the `-l' and `-k' options, respectively.
It is also possible to get various sequence length statistics. Using the `-n' option, the N50 value of the sequences is calculated. The N50 value is the largest sequence length such that the sequences of that length or longer together account for at least 50% of the total length of all sequences. This is useful for getting a quick overview of the quality of a de novo assembly.
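The N50 definition can be sketched in a few lines of Python. This is a minimal illustration of the standard calculation, not the code used by clc_sequence_info; the function name `n50` and the example lengths are made up for the purpose.

```python
def n50(lengths):
    """Return the largest length L such that sequences of length L or
    longer together cover at least 50% of the total length."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length until
    # at least half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no sequences given


# Hypothetical contig lengths, total 20: the two longest (6 + 5 = 11)
# already cover at least half, so the N50 is 5.
contig_lengths = [2, 3, 4, 5, 6]
print(n50(contig_lengths))
```

Because the value is driven by the longest sequences, a larger N50 generally indicates a more contiguous assembly.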
Use the `-c' option to exclude all sequences shorter than a given length from the statistics. This is sometimes useful for analyzing de novo assembly results, where short sequences may not be of interest.
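The effect of such a cutoff on the length statistics can be illustrated with a short Python sketch. The function name `length_summary` and its `min_length` parameter are hypothetical, and the assumption here is that sequences exactly at the cutoff are kept, since only those under it are described as disregarded.

```python
def length_summary(lengths, min_length=0):
    """Summarize sequence lengths, ignoring sequences shorter than min_length.

    Assumption: a sequence whose length equals min_length is kept.
    """
    kept = [n for n in lengths if n >= min_length]
    if not kept:
        return None  # nothing left after filtering
    return {
        "count": len(kept),
        "total": sum(kept),
        "minimum": min(kept),
        "maximum": max(kept),
        "average": sum(kept) / len(kept),
    }


# Hypothetical lengths: the 50-residue sequence is dropped by the cutoff.
summary = length_summary([50, 170, 240, 230], min_length=100)
print(summary)
```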
Further details are available in Options for All Programs.