The NCBI BLAST web pages and the BLAST command line tool offer a
number of different options which can be changed in order to obtain
the best possible result. Changing these parameters can have a great
impact on the search result. It is beyond the scope of this document
to comment on every available option; only the options that directly
affect the search result are covered.
The expect value (E-value) threshold can be changed in order to limit
the number of hits to the most significant ones. The lower the
E-value, the better the hit. The E-value is dependent on the length
of the query sequence and the size of the database. For example, an
alignment with an E-value of 0.05 has roughly a 5 in 100 chance of
occurring by chance alone.
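The dependence on query length and database size can be made concrete with the standard Karlin-Altschul relation between a normalized bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n the database length. This relation is not stated in the text above and is brought in here as an assumption. BLAST also applies effective-length corrections, which this minimal sketch ignores, and the function names are illustrative only.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    Simplification (assumption): raw lengths are used instead of the
    edge-corrected "effective" lengths that BLAST computes internally.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one hit this good by chance: 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    small_db = expect_value(bit_score=40, query_length=100, database_length=1_000_000)
    large_db = expect_value(bit_score=40, query_length=100, database_length=2_000_000)
    # Same alignment score, twice the database size: the E-value doubles.
    print(small_db, large_db)
    # For small E-values the probability is close to the E-value itself.
    print(chance_probability(0.05))
```

Under these assumptions, the same alignment becomes less significant simply because the database grows, which is why E-values from searches against different databases are not directly comparable.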
Because E-values depend so strongly on query sequence length and
database size, short identical sequences may receive high E-values and
may be regarded as "false positive" hits. This is often seen when
searching for short primer regions, small domain regions, etc. The
default threshold for the E-value on the BLAST web page is 10.
Increasing this value will most likely generate more hits. Below are
some rules of thumb which can be used as a guide but should be
considered with common sense.
- E-value < 10e-100: Identical sequences. You will get long alignments across the entire query and hit sequence.
- 10e-100 < E-value < 10e-50: Almost identical sequences. A long stretch of the query protein is matched to the database.
- 10e-50 < E-value < 10e-10: Closely related sequences; could be a domain match or similar.
- 10e-10 < E-value < 1: Could be a true homologue, but it is a gray area.
- E-value > 1: Proteins are most likely not related.
- E-value > 10: Hits are most likely junk unless the query sequence is very short.
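The rules of thumb above can be written as a small lookup, shown here as a sketch rather than a definitive classifier. The function name is hypothetical, and because the list uses strict inequalities, the handling of values that fall exactly on a boundary is an assumption (they are assigned to the lower, more significant category). Note that the literal `10e-100` is read by Python as 1e-99.

```python
def interpret_e_value(e_value):
    """Map an E-value to the rule-of-thumb categories listed above.

    Assumption: a value exactly on a boundary falls into the lower
    (more significant) category, since the list leaves this open.
    """
    if e_value > 10:
        return "most likely junk unless the query sequence is very short"
    if e_value > 1:
        return "most likely not related"
    if e_value > 10e-10:
        return "possible true homologue, gray area"
    if e_value > 10e-50:
        return "closely related, could be a domain match"
    if e_value > 10e-100:
        return "almost identical"
    return "identical"


if __name__ == "__main__":
    for value in (0.0, 1e-60, 1e-20, 0.01, 5, 50):
        print(value, "->", interpret_e_value(value))
```

As the text says, such thresholds are a guide only; for very short queries a hit in the "junk" range may still be a genuine match.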
For blastp it is possible to specify gap costs for the chosen
substitution matrix. Only a limited number of options are available
for these parameters. The open gap cost is the price of introducing a
gap in the alignment, and the extension gap cost is the price of every
extension past the initial opening gap. Increasing the gap costs will
result in alignments with fewer gaps.
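A small worked example may help show why higher gap costs lead to fewer gaps. The sketch below follows the description above, charging the open cost once per gap and the extension cost for every position past the first. The function names and the cost values are illustrative assumptions, and some implementations also charge the extension cost for the first gapped position.

```python
def gap_cost(gap_length, open_cost, extend_cost):
    """Cost of one gap: open cost plus one extension cost per extra position."""
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return open_cost + (gap_length - 1) * extend_cost


def total_gap_cost(gap_lengths, open_cost, extend_cost):
    """Sum the cost of several independent gaps in one alignment."""
    return sum(gap_cost(k, open_cost, extend_cost) for k in gap_lengths)


if __name__ == "__main__":
    # Three gapped positions, arranged as one long gap or as three short gaps.
    print(total_gap_cost([3], open_cost=11, extend_cost=1))        # 13
    print(total_gap_cost([1, 1, 1], open_cost=11, extend_cost=1))  # 33
    # Raising the open cost penalizes the scattered gaps far more.
    print(total_gap_cost([3], open_cost=15, extend_cost=1))        # 17
    print(total_gap_cost([1, 1, 1], open_cost=15, extend_cost=1))  # 45
```

With these illustrative values, an alignment needs a larger gain in matched residues to justify each additional gap, so raising the open cost pushes the search toward alignments with fewer, longer gaps.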
It is possible to set different filter options before running the
BLAST search. Low-complexity regions have a very simple composition
compared to the rest of the sequence and may result in problems
during the BLAST search [Wootton and Federhen, 1993]. A low-complexity
region of a protein can, for example, look like this: 'fftfflllsss',
which in this case is part of a signal peptide. In the output of
the BLAST search, low-complexity regions will be marked in lowercase
gray characters (default setting). A match within a low-complexity
region cannot be regarded as significant; thus, disabling the
low-complexity filter is likely to generate more hits to sequences
which are not truly related.
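To give a feel for what "simple composition" means, the sketch below computes the Shannon entropy of the residue composition of a sequence. This is only a simplified illustration and not the filtering algorithm that BLAST uses. The function name and the comparison sequence are made up for the example.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    low_complexity = "fftfflllsss"  # example region from the text above
    diverse = "mkvtaqdrwhe"         # hypothetical region, 11 distinct residues
    print(shannon_entropy(low_complexity))
    print(shannon_entropy(diverse))
```

The low-complexity example uses only four different residues and therefore has a clearly lower entropy than the diverse sequence of the same length, which is the kind of compositional bias a low-complexity filter is designed to mask.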
Changing the word size has a great impact on the seeded sequence
space, as described above. However, the word size can be changed to
find sequence matches which would otherwise not be found using the
default parameters. For instance, the word size can be decreased when
searching for primers or short nucleotide sequences. For blastn a suitable
setting would be to decrease the default word size of 11 to 7,
increase the E-value significantly (1000) and turn off the
low-complexity filter.
For blastp a similar approach can be used: decrease the word size to
2, increase the E-value and use a more stringent substitution
matrix, e.g. a PAM30 matrix.
Fortunately, the optimal search options for finding short, nearly
exact matches can already be found on the BLAST web pages.
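For the command line tool, the short-query settings described above could be assembled roughly as follows. This is a sketch under stated assumptions: the option names (`-word_size`, `-evalue`, `-dust`, `-matrix`) are those of the NCBI BLAST+ applications, the file and database names are hypothetical, and the blastp E-value is a placeholder because the text gives no number. Switching off the DUST filter for blastn is also an assumption, based on the low-complexity filter being the only filter discussed here. The commands are only built, not executed.

```python
def short_blastn_args(query_path, database):
    """Argument list for a blastn search tuned for short queries."""
    return [
        "blastn",
        "-query", query_path,
        "-db", database,
        "-word_size", "7",   # decreased from the default of 11
        "-evalue", "1000",   # increased significantly
        "-dust", "no",       # assumption: low-complexity filter switched off
    ]


def short_blastp_args(query_path, database, evalue):
    """Argument list for a blastp search tuned for short queries."""
    return [
        "blastp",
        "-query", query_path,
        "-db", database,
        "-word_size", "2",
        "-evalue", str(evalue),  # caller chooses; the text gives no value
        "-matrix", "PAM30",      # more stringent substitution matrix
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    print(" ".join(short_blastn_args("primers.fasta", "my_nucleotide_db")))
    print(" ".join(short_blastp_args("peptides.fasta", "my_protein_db", 1000)))
```

The resulting lists can be passed to `subprocess.run` on a system where BLAST+ is installed and the named database exists.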
For protein BLAST searches, a default substitution matrix is
provided. If you are looking at distantly related proteins, you
should either choose a high-numbered PAM matrix or a low-numbered
BLOSUM matrix. The default scoring matrix for blastp is BLOSUM62.