Calculation of sequence logos

A comprehensive walk-through of the calculation of the information content in sequence logos is beyond the scope of this document but can be found in the original paper by [Schneider and Stephens, 1990]. In brief, the conservation at each position is defined as $ R_{seq}$, the difference between the maximal entropy ($ S_{max}$) and the observed entropy of the residue distribution ($ S_{obs}$):

$ \displaystyle R_{seq}=S_{max}-S_{obs}=\log_2N-\bigg(-\sum_{n=1}^Np_n\log_2p_n\bigg) $

Here $ p_n$ is the observed frequency of the amino acid residue or nucleotide with symbol $ n$ at a particular position, and $ N$ is the number of distinct symbols in the sequence alphabet: 20 for proteins and four for DNA/RNA. The maximal sequence information content per position is therefore $ \log_2 4=2~bits$ for DNA/RNA and $ \log_2 20 \approx 4.32~bits$ for proteins.
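As an illustration of the formula, the following minimal Python sketch computes $ R_{seq}$ for a single gap-free alignment column. The function name `information_content` and the representation of a column as a plain string are assumptions made for this example; the sketch is not the implementation used to draw the logos.

```python
import math
from collections import Counter

def information_content(column, alphabet_size):
    """Return R_seq = log2(N) - S_obs in bits for one gap-free column.

    column        -- string with one symbol per aligned sequence (hypothetical input format)
    alphabet_size -- N, e.g. 4 for DNA/RNA or 20 for proteins
    """
    total = len(column)
    counts = Counter(column)
    # Observed entropy S_obs = -sum(p_n * log2(p_n)) over the symbols present.
    # Symbols with p_n = 0 contribute nothing and are simply absent from counts.
    s_obs = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Maximal entropy S_max = log2(N).
    return math.log2(alphabet_size) - s_obs

print(information_content("AAAA", 4))  # fully conserved DNA column: 2.0 bits
print(information_content("ACGT", 4))  # all four nucleotides equally frequent: 0.0 bits
```

A fully conserved column reaches the maximum of $ \log_2 N$ bits, while a column in which all symbols are equally frequent carries no information.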

The original implementation by Schneider does not handle sequence gaps.

We have slightly modified the algorithm so an estimated logo is presented in areas with sequence gaps.

If amino acid residues or nucleotides are found in only some of the sequences at a position that otherwise contains gaps, we have chosen to show each such residue with a height given by the fraction of sequences in which it occurs. For example, if one position in an alignment of ten sequences contains nine gaps and a single alanine (A), the A represented in the logo has a height of 0.1.
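The fraction described above can be sketched as follows. This is only one reading of the text: it assumes that gaps are written as `-` and that the fraction is taken over all sequences in the column, gaps included. The function name `residue_fractions` is hypothetical, and how the fraction is combined with the information content elsewhere in the logo is not specified here.

```python
from collections import Counter

def residue_fractions(column, gap_chars="-"):
    """Return, for each residue in a column, the fraction of sequences containing it.

    column    -- string with one symbol per aligned sequence (hypothetical input format)
    gap_chars -- characters treated as gaps (assumed to be '-')
    """
    total = len(column)
    if total == 0:
        return {}
    # Count only real residues; gaps still count towards the total number of sequences.
    counts = Counter(ch for ch in column if ch not in gap_chars)
    return {residue: n / total for residue, n in counts.items()}

# Nine gaps and a single alanine, as in the example above.
print(residue_fractions("-----A----"))  # {'A': 0.1}
```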

Other useful resources
The website of Tom Schneider
http://www-lmmb.ncifcrf.gov/~toms/

WebLogo
http://weblogo.berkeley.edu/

[Crooks et al., 2004]