Pfam domain search

With CLC Genomics Workbench you can perform a search for Pfam domains on protein sequences. The Pfam database at is a large collection of multiple sequence alignments that covers approximately 9318 protein domains and protein families [Bateman et al., 2004]. Based on the individual domain alignments, profile HMMs have been developed. These profile HMMs can be used to search for domains in unknown sequences.

Many proteins have a unique combination of domains which can be responsible, for instance, for the catalytic activities of enzymes. Pfam was initially developed to aid the annotation of the C. elegans genome. Annotating unknown sequences based on pairwise alignment methods by simply transferring annotation from a known protein to the unknown partner does not take domain organization into account [Galperin and Koonin, 1998]. An unknown protein may be annotated wrongly, for instance, as an enzyme if the pairwise alignment only finds a regulatory domain.

Using the Pfam search option in CLC Genomics Workbench, you can search for domains in sequence data which otherwise do not carry any annotation information. The Pfam search option adds all found domains onto the protein sequence which was used for the search. If domains of no relevance are found they can easily be removed as described in Removing annotations. Setting a lower cutoff value will result in fewer domains.

In CLC Genomics Workbench we have implemented our own HMM algorithm for prediction of the Pfam domains. Thus, we do not use the original HMM implementation,
HMMER for domain prediction. We find the most probable state path/alignment through each profile HMM by the Viterbi algorithm and based on that we derive a new null model by averaging over the emission distributions of all Mand I states that appear in the state path (M is a match state and I is an insert state). From that model we now arrive at an additive correction to the original bit-score, like it is done in the original HMMER algorithm.

In order to conduct the Pfam search:

        Select a protein sequence | Toolbox in the Menu Bar | Classical Sequence Analysis (Image gene_and_protein_analysis) | Protein Analysis (Image proteinanalyses)| Pfam Domain Search (Image pfam)

   or  right-click a protein sequence | Toolbox | Classical Sequence Analysis (Image gene_and_protein_analysis) | Protein Analysis (Image proteinanalyses)| Pfam Domain Search (Image pfam)

Image pfam_step2
Figure 16.16: Setting parameters for Pfam domain search.

If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.

You can perform the analysis on several protein sequences at a time. This will add annotations to all the sequences and open a view for each sequence. Click Next to adjust parameters (see figure 16.16).