BLAST against local data
Running a local BLAST search
To run a BLAST search locally, go to:
Tools | BLAST | BLAST
A keyboard shortcut is available: Ctrl+Shift+L (Windows) or ⌘+Shift+L (Mac).
When the query involves just a region of a single sequence, launching the tool directly from an open view of the sequence or sequence list may be preferable:
select the region of interest in the sequence | right-click on the selected region | BLAST Selection Against Local Data
Specify the query sequences
After launching BLAST from the Tools menu, select one or more sequences or sequence lists to use as queries (figure 16.2). All query sequences must be of the same type, either DNA or protein.
Figure 16.2: Specify one or more query sequences or sequence lists for the BLAST search.
Specify the search type
In the next wizard step, the type of BLAST search to run is specified (figure 16.3).
Figure 16.3: Specify the type of search to run and the database to search.
BLAST search types for nucleotide query sequences:
- Nucleotide query against nucleotide database. When this type of search is specified, the type of blastn search to run can be chosen in the next wizard step. Megablast, which is designed to find very similar sequences (e.g. >95% similarity), is the default. Further details are given below.
- Translated query against peptide database (blastx). DNA query sequences, translated in six frames, are used to search the selected peptide database. The genetic code used to translate the query sequences is specified in the next wizard step.
- Translated query against a translated database (tblastx). DNA query sequences, translated in six frames, are used to search the selected nucleotide database, the entries of which are also translated in six frames. The genetic codes used to translate the query and the database are specified individually in the next wizard step. This type of search is computationally intensive.
BLAST search types for peptide query sequences:
- Peptide query against peptide database (blastp)
- Peptide query against translated DNA database (tblastn). Peptide query sequences are used to search a nucleotide database, which is translated in six frames using the genetic code specified in the following wizard step. This type of search is computationally intensive.
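The six-frame translation used by the blastx, tblastx and tblastn search types can be illustrated with a short Python sketch. This is not the Workbench's or BLAST's own implementation: the function names are hypothetical, and the standard genetic code (NCBI translation table 1) is assumed, whereas the wizard lets a different code be chosen.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons
# enumerated in T, C, A, G order. This is an assumption for the sketch.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translate("ATGGCC").items():
        print(frame, peptide)
```

Each DNA sequence yields six peptide sequences, which is one reason the translated search types are more computationally intensive than blastn or blastp.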
Specify what target to search against
In the same wizard step (figure 16.3), the target to search against is specified. This can be:
- Sequences. Sequences or sequence lists from the Workbench Navigation Area.
A temporary BLAST database is created from these, which is used in the search. This option is generally recommended only for searches against relatively small datasets. Note that hit sequences will not be retrievable directly from the BLAST results as the temporary BLAST database is deleted after the search is completed.
- BLAST Database. A BLAST database from a designated BLAST database folder. Only databases relevant for the selected search type are listed.
Managing, creating and obtaining BLAST databases is described in the following sections.
Refine the BLAST search options
In the following wizard step, search settings can be refined (figure 16.4). The options available depend on the type of search being run. These are described briefly below. Most are described in more detail at https://blast.ncbi.nlm.nih.gov/doc/blast-topics/blastsearchparams.html.
Figure 16.4: The settings for the local BLAST search can be customized. The settings available depend on the type of search being run.
- Optimization. This section is shown for searches against a nucleotide database. The default is to optimize for finding highly similar sequences (megablast), in which case the word size, gap cost and match scoring settings are locked. When searching for somewhat similar sequences (blastn), word size, gap costs and match scoring can be adjusted (see figure 16.4).
- Choose genetic code. A drop-down list of genetic codes is available when DNA query sequences will be translated before searching (blastx, tblastx) and when the translation of a DNA database will be searched (tblastn, tblastx) (see figure 16.5).
Figure 16.5: For tblastn searches, a nucleotide database is selected, but the search is run against a translation of that database in 6 frames. The genetic code to use for the translation is selected from a drop-down list.
- Max number of hit sequences. The maximum number of database sequence matches to include in the BLAST report.
- Expect. Matches with an Expect value (E-value) greater than the value in this field will not be reported.
Values lower than 1 can be entered as decimals or in scientific notation. For example, 0.001, 1e-3 and 10e-4 are equivalent and acceptable values.
An E-value is a statistical measure representing the number of hits of a given quality, or better, that you would expect to see purely by chance in a database of the same size as the one being searched. That is, the higher the E-value, the higher the chance that the match is not due to biological similarity with the query sequence. Details of how E-values are calculated can be found at the NCBI: https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.
- Mask low complexity regions. Mask segments of the query sequence with low compositional complexity. This can reduce the number of hits reported that are statistically significant but biologically uninteresting (e.g. hits against common acidic-, basic- or proline-rich regions).
- Word Size. BLAST starts off by finding word-sized matches between the query and database sequences and then initiating extensions from these matches. Thus the sensitivity and speed of a blastn search can be tuned by increasing or decreasing the word size.
For nucleotide-nucleotide searches an exact match of a word is required before an extension is initiated. The word size for megablast is locked at 28, but the word size for blastn searches can be adjusted.
For searches against peptide databases (including translated nucleotide), non-exact word matches are taken into account based upon the similarity between words. Word sizes of 2 or 3 are common for such searches.
- Match/mismatch. When searching against a nucleotide database, scoring includes assigning a positive value when a base in the query matches a base in the database sequence (match) and a negative value when the bases do not match (mismatch). Match/mismatch scores are locked for megablast.
- Matrix. When searching against peptide databases (including translated nucleotide databases), scoring makes use of an amino acid substitution matrix. A selection of matrices is available from the drop-down list, with BLOSUM62 set as the default. See https://www.ncbi.nlm.nih.gov/books/NBK279684/#appendices.BLAST_Substitution_Matrices for further details.
- Gap Cost. The cost to open a gap (existence) and the cost to extend a gap (extension) in the alignment between the query and target sequences. See https://www.ncbi.nlm.nih.gov/books/NBK279684/#appendices.BLASTN_rewardpenalty_values for further information about gap costs.
- Filter out redundant results. Enabling this option culls HSPs on a per subject sequence basis by removing HSPs that are completely enveloped by another HSP.
- Number of threads. Specify the number of threads to use. Using more threads can reduce the runtime for large databases.
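To illustrate how the Word Size setting described above affects a nucleotide-nucleotide search, the following Python sketch finds exact word matches ("seeds") between a query and a database sequence. It is a simplified illustration only: the function name and example sequences are hypothetical, and the real BLAST implementation uses optimized lookup structures and then extends seeds into alignments, which is not shown here.

```python
from collections import defaultdict


def find_exact_seeds(query, subject, word_size):
    """Return sorted (query_pos, subject_pos) pairs of exact word matches."""
    if word_size < 1:
        raise ValueError("word_size must be at least 1")

    # Index every word of length `word_size` in the query by its position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the subject and record each position where a query word recurs.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return sorted(seeds)


if __name__ == "__main__":
    query = "ACGTACGTGA"      # hypothetical example sequences
    subject = "TTACGTGACC"
    for word_size in (4, 7):
        seeds = find_exact_seeds(query, subject, word_size)
        print(word_size, len(seeds), seeds)
```

With these example sequences, a smaller word size yields more seeds, and therefore more extensions to evaluate, than a larger one. This is the trade-off between sensitivity and speed that the setting controls.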
Click on Finish to launch the analysis.
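When choosing a value for the Expect field, it can help to confirm that different notations denote the same number and to see how the threshold acts on hits. The Python sketch below is illustrative only: the function name, hit names and E-values are hypothetical and are not output from the Workbench.

```python
def filter_hits_by_expect(hits, expect_threshold):
    """Keep hits whose E-value does not exceed the threshold.

    `hits` maps a hit name to its E-value. Hits with an E-value greater
    than `expect_threshold` are dropped, mirroring the Expect setting.
    """
    return {
        name: evalue
        for name, evalue in hits.items()
        if evalue <= expect_threshold
    }


if __name__ == "__main__":
    # Decimal and scientific notation describe the same threshold.
    notations = ["0.001", "1e-3", "10e-4"]
    print({text: float(text) for text in notations})

    # Hypothetical hits, for illustration only.
    example_hits = {"hit_A": 2e-50, "hit_B": 0.0005, "hit_C": 0.2}
    print(filter_hits_by_expect(example_hits, float("1e-3")))
```

Lowering the threshold makes the report more stringent, since fewer chance matches pass the filter.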
