There are a number of options that can be configured when using BLAST search programs.
Setting these options to relevant values can have a great
impact on the search result. A few of the key settings are described briefly below.
The expect value (E-value) describes the number of hits one can expect to see matching the query by chance when searching against a database of a given size. An E-value of 1 means that, in a search like the one just run, you could expect to see one match with the same score purely by chance, that is, a match that is not homologous to the query sequence. When looking for very similar sequences in a database, it is often beneficial to use a very low E-value threshold.
E-values depend on the query sequence length and the database size. Short, identical sequences may therefore have a high E-value and
may be regarded as "false positive" hits. This is often seen when
searching for short primer regions, small domain regions, etc. Below are
some comments on what one could infer from results with E-values in particular ranges.
- E-value < 10e-100: Identical sequences. You will get long alignments across the entire query and hit sequence.
- 10e-100 < E-value < 10e-50: Almost identical sequences. A long stretch of the query matches the hit sequence.
- 10e-50 < E-value < 10e-10: Closely related sequences; could be a domain match or similar.
- 10e-10 < E-value < 1: Could be a true homolog, but it is a gray area.
- E-value > 1: Proteins are most likely not related.
- E-value > 10: Hits are most likely not related unless the query sequence is very short.
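The dependence of the E-value on query length and database size can be illustrated with the standard relation between a bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n is the total database length. The following is a minimal sketch only: the function name `expected_hits` is hypothetical, and the sketch uses raw lengths, whereas BLAST itself works with corrected (effective) lengths, so the numbers will not match a real BLAST report exactly.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    Assumption: raw lengths are used; BLAST applies effective-length
    corrections that are not modelled here.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same 20-residue hit with the same bit score becomes less
# significant as the database grows.
for database_length in (1_000_000, 1_000_000_000):
    e_value = expected_hits(bit_score=40.0, query_length=20,
                            database_length=database_length)
    print(f"database length {database_length}: E-value ~ {e_value:.3g}")
```

A short query can only reach a modest bit score even when it matches perfectly, which is why a short identical hit may still receive a high E-value in a large database.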
For blastp it is possible to specify gap costs for the chosen
substitution matrix. Only a limited number of options are available for
these parameters. The open gap cost is the price of introducing a gap
in the alignment, and the extension gap cost is the price of every
extension past the initial opening gap.
Increasing the gap costs will result in alignments with fewer gaps.
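The gap cost description above can be sketched as a small function. This is an illustration under stated assumptions, not the scoring code of any BLAST implementation: the function name `gap_penalty` is hypothetical, and it follows the wording above, where the open cost covers the first gap position and the extension cost is charged for each further position. Conventions differ between tools, and some charge the extension cost for every gap position including the first.

```python
def gap_penalty(gap_length: int, open_cost: int, extension_cost: int) -> int:
    """Total penalty for one gap of `gap_length` positions.

    Assumption: the open cost pays for the first gap position and the
    extension cost is charged for every additional position.
    """
    if gap_length <= 0:
        return 0
    return open_cost + (gap_length - 1) * extension_cost


# With higher costs, the same gap removes more from the alignment score,
# so alignments containing gaps become less favorable.
for open_cost, extension_cost in ((9, 1), (11, 1), (11, 2)):
    penalty = gap_penalty(5, open_cost, extension_cost)
    print(f"open={open_cost}, extension={extension_cost}: 5-position gap costs {penalty}")
```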
It is possible to set different filter options before running a
BLAST search. Low-complexity regions have a very simple composition
compared to the rest of the sequence and may result in problems
during the BLAST search [Wootton and Federhen, 1993]. A low-complexity
region of a protein can, for example, look like this: 'fftfflllsss', which
in this case is part of a signal peptide. In the output of
the BLAST search, low-complexity regions will be marked in lowercase
gray characters (default setting). A low-complexity region cannot
be considered a significant match; thus, disabling the
low-complexity filter is likely to generate more hits to sequences which
are not truly related.
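To give a feel for what "simple composition" means, the sketch below scores a sequence window by its Shannon entropy and lowercases residues in low-entropy windows. This is not the filtering algorithm BLAST uses; it is a simplified stand-in, and the function names, the window size of 12 and the threshold of 2.2 bits are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of `segment`."""
    total = len(segment)
    if total == 0:
        return 0.0
    counts = Counter(segment.lower())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence: str, window: int = 12, threshold: float = 2.2) -> str:
    """Lowercase every residue covered by a window whose entropy is below `threshold`.

    Assumption: `window` and `threshold` are illustrative values only.
    """
    seq = sequence.upper()
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(ch.lower() if m else ch for ch, m in zip(seq, masked))


# The example region from the text has a much simpler composition than
# a stretch in which every residue is different.
print(shannon_entropy("fftfflllsss"))
print(shannon_entropy("acdefghiklm"))
```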
Changing the word size has a great impact on the seeded sequence
space as described above. However, the word size can be changed to find
sequence matches which would otherwise not be found using the
default parameters. For instance, the word size can be decreased when
searching for primers or short nucleotide sequences. For blastn a suitable
setting would be to decrease the default word size of 11 to 7,
increase the E-value significantly (1000) and turn off the
low-complexity filter.
For blastp a similar approach can be used. Decrease the word size to
2, increase the E-value and use a more stringent substitution
matrix, e.g. a PAM30 matrix.
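For readers who run the NCBI BLAST+ command-line tools rather than a graphical client, the short-sequence settings above could be expressed roughly as follows. This is a sketch under assumptions: the helper names, file names and database names are hypothetical, the commands are only built and not executed, and the blastp E-value is left as a parameter because the text does not give a number. BLAST+ also offers the presets `-task blastn-short` and `-task blastp-short`, and with a non-default matrix the gap costs may need to be set to values supported for that matrix.

```python
from typing import List


def short_blastn_command(query: str, database: str, evalue: float = 1000) -> List[str]:
    """Build a blastn command line for short nucleotide queries such as primers."""
    return [
        "blastn",
        "-query", query,
        "-db", database,
        "-word_size", "7",       # decreased from the default of 11
        "-evalue", str(evalue),  # raised so that short hits are still reported
        "-dust", "no",           # turn off low-complexity filtering
    ]


def short_blastp_command(query: str, database: str, evalue: float) -> List[str]:
    """Build a blastp command line for short protein queries."""
    return [
        "blastp",
        "-query", query,
        "-db", database,
        "-word_size", "2",
        "-evalue", str(evalue),
        "-matrix", "PAM30",      # more stringent matrix for short queries
        "-seg", "no",            # turn off low-complexity filtering
    ]


# Hypothetical file and database names, shown for illustration only.
print(" ".join(short_blastn_command("primers.fasta", "my_nucleotide_db")))
print(" ".join(short_blastp_command("peptides.fasta", "my_protein_db", evalue=1000)))
```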
The BLAST search programs at the NCBI adjust settings automatically when short sequences are used for searches, and there is a dedicated page, Primer-BLAST, for searching for primer sequences; see https://blast.ncbi.nlm.nih.gov/Blast.cgi.
For protein BLAST searches, a default substitution matrix is
provided. If you are looking at distantly related proteins, you
should either choose a high-numbered PAM matrix or a low-numbered
BLOSUM matrix. The default scoring
matrix for blastp is BLOSUM62.