In all living cells, hereditary material such as DNA is transcribed to mRNA, which is subsequently translated into proteins. This is, of course, a simplification, but it describes in general terms how a cell maintains the steady production of proteins it needs to survive. In bioinformatic analysis of proteins it is sometimes useful to know the underlying DNA sequence, for example in order to find the genomic location of the gene. The translation of a protein sequence back to DNA/RNA is therefore of particular interest, and is called reverse translation or back-translation.
In 1968 the Nobel Prize in Physiology or Medicine was awarded to Robert W. Holley, Har Gobind Khorana and Marshall W. Nirenberg for their interpretation of the genetic code (http://nobelprize.org/medicine/laureates/1968/). The genetic code maps each of the 64 possible codons to one of 20 amino acids or to a stop signal. It is therefore straightforward to translate a DNA/RNA sequence into a specific protein. The reverse direction is ambiguous, however: because the genetic code is degenerate, several different codons may code for the same amino acid. This can be seen in the table below. Since the discovery of the genetic code, it has been found that some organisms (and organelles) use genetic codes that differ from the "standard genetic code". Moreover, the amino acid alphabet is no longer limited to 20 amino acids. The 21st amino acid, selenocysteine, is encoded by a 'UGA' codon, which is normally a stop codon. The translation machinery discriminates between a selenocysteine and a stop codon. Selenocysteine is a very rare amino acid.
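The table below shows the Standard Genetic Code, which is the default translation table. Each entry gives the codon followed by the one-letter and three-letter codes of the amino acid; "* Ter" marks a stop (termination) codon, and "i" marks a codon that can also act as an initiation codon.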
|TTT F Phe||TCT S Ser||TAT Y Tyr||TGT C Cys|
|TTC F Phe||TCC S Ser||TAC Y Tyr||TGC C Cys|
|TTA L Leu||TCA S Ser||TAA * Ter||TGA * Ter|
|TTG L Leu i||TCG S Ser||TAG * Ter||TGG W Trp|
|CTT L Leu||CCT P Pro||CAT H His||CGT R Arg|
|CTC L Leu||CCC P Pro||CAC H His||CGC R Arg|
|CTA L Leu||CCA P Pro||CAA Q Gln||CGA R Arg|
|CTG L Leu i||CCG P Pro||CAG Q Gln||CGG R Arg|
|ATT I Ile||ACT T Thr||AAT N Asn||AGT S Ser|
|ATC I Ile||ACC T Thr||AAC N Asn||AGC S Ser|
|ATA I Ile||ACA T Thr||AAA K Lys||AGA R Arg|
|ATG M Met i||ACG T Thr||AAG K Lys||AGG R Arg|
|GTT V Val||GCT A Ala||GAT D Asp||GGT G Gly|
|GTC V Val||GCC A Ala||GAC D Asp||GGC G Gly|
|GTA V Val||GCA A Ala||GAA E Glu||GGA G Gly|
|GTG V Val||GCG A Ala||GAG E Glu||GGG G Gly|
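The degeneracy visible in the table can be made explicit by inverting it. The following is a minimal Python sketch, not the implementation of any particular tool: the names `CODON_TABLE`, `translate` and `reverse_table` are chosen here for illustration, and the table is built from a compact 64-letter string listing the amino acids in the same T, C, A, G order as the table above.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# One letter per codon, ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Forward table: codon -> one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA or RNA sequence codon by codon (trailing bases ignored)."""
    dna = sequence.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def reverse_table(codon_table):
    """Invert the table: amino acid -> list of all codons that encode it."""
    reverse = defaultdict(list)
    for codon, aa in codon_table.items():
        reverse[aa].append(codon)
    return dict(reverse)


REVERSE_TABLE = reverse_table(CODON_TABLE)

print(translate("ATGGCGTAA"))    # forward translation is unambiguous
print(REVERSE_TABLE["A"])        # four codons for alanine
print(len(REVERSE_TABLE["L"]))   # leucine has six
```

Forward translation is a plain dictionary lookup, whereas the inverted table returns a list of candidates for most amino acids; choosing among those candidates is the problem that reverse translation has to solve.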
To resolve the ambiguities of reverse translation, you can define how the codon selection is prioritized, e.g.:
- Choose a codon randomly.
- Select the most frequent codon in a given organism.
- Randomize a codon, but with respect to its frequency in the organism.
As an example, suppose we want to translate an alanine to a corresponding codon. Four different codons can be used for this reverse translation: GCU, GCC, GCA or GCG. Picking any one of them at random will give an alanine.
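The random choice for alanine can be sketched in a few lines of Python. The function name `pick_random_codon` is hypothetical, and the seeded generator is used only to make the example repeatable.

```python
import random

# The four alanine codons named in the text (RNA alphabet).
ALANINE_CODONS = ["GCU", "GCC", "GCA", "GCG"]


def pick_random_codon(codons, rng):
    """Return one codon, each candidate being equally likely."""
    return rng.choice(codons)


rng = random.Random(0)  # fixed seed for a repeatable illustration
print(pick_random_codon(ALANINE_CODONS, rng))
```

Whichever codon is drawn, translating it again yields alanine, so the protein sequence is preserved even though the nucleotide sequence is arbitrary.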
The most frequent codon coding for alanine in E. coli is GCG, which encodes 33.7% of all alanines. Then come GCC (25.5%), GCA (20.3%) and finally GCU (15.3%). The data are retrieved from the Codon usage database, see below. Always picking the most frequent codon does not necessarily give the best answer.
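As a small illustration of the "most frequent codon" strategy, the sketch below uses the E. coli alanine figures quoted above. The names `ALANINE_USAGE_E_COLI` and `most_frequent_codon` are assumptions made for this example only.

```python
# Alanine codon usage in E. coli, as quoted in the text.
ALANINE_USAGE_E_COLI = {"GCG": 33.7, "GCC": 25.5, "GCA": 20.3, "GCU": 15.3}


def most_frequent_codon(usage):
    """Return the codon with the highest usage value."""
    return max(usage, key=usage.get)


best = most_frequent_codon(ALANINE_USAGE_E_COLI)
print(best)  # GCG

# Back-translating four alanines this way repeats the same codon every time.
print(best * 4)  # GCGGCGGCGGCG
```

The second line of output hints at one drawback of this strategy: every alanine becomes GCG, so the codon distribution of the resulting sequence no longer resembles that of the organism.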
By selecting codons from a distribution of calculated codon frequencies, the DNA sequence obtained after the reverse translation holds the correct (or nearly correct) codon distribution. Keep in mind that, due to the degeneracy of the genetic code, the obtained DNA sequence is not necessarily identical to the original one that encoded the protein.
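Frequency-weighted selection can be sketched with Python's `random.choices`, which treats the supplied weights as relative values, so they need not sum to 100. This is an illustrative sketch under stated assumptions: `CODON_WEIGHTS` and `back_translate` are hypothetical names, only the alanine weights come from the text, and methionine and tryptophan are included with a placeholder weight because each has a single codon in the standard code.

```python
import random
from collections import Counter

# Relative codon weights per amino acid. The alanine figures are the E. coli
# values quoted in the text; M and W have one codon each, so the weight 1.0
# is only a placeholder.
CODON_WEIGHTS = {
    "A": {"GCG": 33.7, "GCC": 25.5, "GCA": 20.3, "GCU": 15.3},
    "M": {"AUG": 1.0},
    "W": {"UGG": 1.0},
}


def back_translate(protein, codon_weights, rng):
    """Draw one codon per residue with probability proportional to its weight."""
    codons = []
    for residue in protein:
        table = codon_weights[residue]
        codon = rng.choices(list(table), weights=list(table.values()), k=1)[0]
        codons.append(codon)
    return codons


rng = random.Random(42)  # fixed seed for a repeatable illustration
print("".join(back_translate("MAAW", CODON_WEIGHTS, rng)))

# Over many alanines, the observed fractions approach the normalized weights.
n = 20000
counts = Counter(back_translate("A" * n, CODON_WEIGHTS, rng))
total = sum(CODON_WEIGHTS["A"].values())
for codon, weight in CODON_WEIGHTS["A"].items():
    print(codon, round(counts[codon] / n, 3), "expected", round(weight / total, 3))
```

Running the sketch twice with different seeds gives different nucleotide sequences for the same protein, which illustrates why the result need not match the original gene.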
To obtain the best possible result from the reverse translation, one should use the codon frequency table from the correct organism or a closely related species. The codon usage of the mitochondrial chromosome is often different from that of the nuclear chromosome(s); mitochondrial codon frequency tables should therefore only be used when working specifically with mitochondria.
Other useful resources
- The Genetic Code at NCBI
- Codon usage database
- Wikipedia on the genetic code