Find open reading frames

Find Open Reading Frames identifies open reading frames (ORFs) in sequences, and can be used as a rudimentary gene finder.

During translation of a transcript, protein is generated from the first start codon to the stop codon; internal start codons are translated to their respective amino acids. Correspondingly, Find Open Reading Frames always reports ORFs using the first possible start codon and ignores internal start codons.
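The first-start-codon rule described above can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function name `find_orfs`, the default start codon set, the restriction to the forward strand, and the choice to drop ORFs that lack a stop codon are all assumptions made for illustration.

```python
# Hypothetical sketch of first-start-codon ORF detection (forward strand only).
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, start_codons=("ATG",), stop_codons=STOP_CODONS):
    """Return ORFs as (start, end) pairs, 0-based and end-exclusive.

    Each ORF runs from the first start codon after the previous stop codon
    up to and including the next in-frame stop codon. ORFs still open at
    the end of the sequence are dropped (an assumption of this sketch).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None  # first start codon seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon in start_codons:
                    orf_start = i
            elif codon in stop_codons:
                orfs.append((orf_start, i + 3))
                orf_start = None
            # A start codon inside an open ORF falls through here and is
            # ignored: it is just another translated codon.
    return sorted(orfs)


# The second ATG (position 8) is internal and does not start a new ORF.
print(find_orfs("CCATGAAAATGCCCTAGGG"))
```

In this example the sequence contains two in-frame ATG codons before the TAG stop codon, and only the ORF starting at the first one is reported.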

Identified ORFs are shown as annotations on the sequence. Different genetic codes are available, but it is also possible to manually specify start codons.

In one analysis, Find Open Reading Frames can process a maximum of 100,000 sequences or 50 million base pairs. Sequences may be provided to the tool as individual sequences or as sequence lists.

To run Find Open Reading Frames:

        Toolbox | Classical Sequence Analysis | Nucleotide Analysis | Find Open Reading Frames

This opens the dialog displayed in figure 19.7.

Figure 19.7: Select a sequence or a sequence list as input.

If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.

Next, specify which parameters should be used (figure 19.8).

Figure 19.8: Set parameters for identification of open reading frames.

Using open reading frames to find genes is a fairly simple approach that is likely to produce false positives, that is, predicted genes that are not real. Setting a relatively high minimum ORF length reduces the number of false positive predictions, but short genes may then be missed (see figure 19.9).
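The trade-off can be seen in a small Python sketch that applies different minimum lengths to a list of candidate ORFs. The function name `filter_by_min_length`, the coordinate convention, and the example coordinates are hypothetical, and whether the tool counts the stop codon towards the length is not stated here, so the sketch simply counts every codon in the span.

```python
# Hypothetical sketch: filter candidate ORFs by a minimum length in codons.
def filter_by_min_length(orfs, min_codons):
    """Keep ORFs, given as (start, end) pairs with an exclusive end,
    that span at least min_codons codons (stop codon included)."""
    return [(start, end) for start, end in orfs
            if (end - start) // 3 >= min_codons]


# Made-up candidates spanning 30, 100 and 400 codons.
candidates = [(0, 90), (100, 400), (500, 1700)]

for threshold in (20, 100, 300):
    kept = filter_by_min_length(candidates, threshold)
    print(threshold, len(kept), kept)
```

Raising the threshold removes short candidates regardless of whether they are spurious ORFs or real short genes, which is why the minimum length should be chosen with the expected gene sizes in mind.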

Figure 19.9: The first 12,000 positions of the E. coli sequence NC_000913 downloaded from GenBank. The blue (dark) annotations are the genes while the yellow (brighter) annotations are the ORFs with a length of at least 100 amino acids. On the positive strand around position 11,000, a gene starts before the ORF. This is due to the use of the standard genetic code rather than the bacterial code. This particular gene starts with CTG, which is a start codon in bacteria. Two short genes are entirely missing, while a handful of open reading frames do not correspond to any of the annotated genes.

Click Finish to start the tool.

Finding open reading frames is often a good first step in annotating sequences such as cloning vectors or bacterial genomes. For eukaryotic genes, ORF determination is often less helpful because the algorithm does not take intron/exon structure into account.