README for stand-alone BLAST

$Date: 2004/02/04 15:44:35 $

This document provides information on stand-alone BLAST. Topics covered are
setting up stand-alone BLAST, command-line options for stand-alone BLAST,
and a release history of the different versions.

NCBI provides binaries for the following platforms:

Apple MacOS 9 (powerpc)
Apple MacOS X (powerpc)
DEC/Compaq/HP OSF1 5.1 (alpha)
FreeBSD 4.5 (ia32)
HP HPUX 11 (hppa, ia64)
IBM AIX 5.1 (power4, powerpc)
Linux (kernel 2.4, glibc 2.2.5) (ia32, ia64, amd64)
Microsoft Windows 2000 (ia32)
SGI IRIX 6.5 (mips)
Sun Solaris 7 (ia32)
Sun Solaris 8 (sparc)

We will attempt to produce binaries for other platforms upon request.

Stand-alone binaries are available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/

Please remember to FTP in binary mode.

Setting up Standalone BLAST for UNIX:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST
executable for the UNIX platform.

1) Download the UNIX binary, uncompress and untar the file. It is
suggested that you do this in a separate directory, perhaps called
"blast".

2) Create a .ncbirc file. In order for Standalone BLAST to operate, you
will need a .ncbirc file that contains the following lines:

[NCBI]
Data="path/data/"

Where "path/data/" is the path to the location of the Standalone BLAST
"data" subdirectory. For example:

Data=/root/blast/data

The data subdirectory should automatically appear in the directory where
the downloaded file was extracted. Please note that in many cases it may
be necessary to give the entire path, including the machine name and/or
the network you are located on. Your systems administrator can help
you if you do not know the entire path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory that you
call the Standalone BLAST program from or in your home directory.

3) Format your BLAST database files. The main advantage of Standalone
BLAST is to be able to create your own BLAST databases. This can be done
with any file of FASTA formatted protein or nucleotide sequences. If you
are interested in creating your own database files you should refer to
the sections "Non-redundant defline syntax" and "Appendix 1: Sequence
Identifier Syntax" of the README in the BLAST database directory
(ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA
description available from the BLAST search pages
(http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should first download one of the NCBI
databases and run a search against it.

In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/)
you will find the downloadable BLAST database files. For your first
search we recommend downloading something relatively small like
ecoli.nt.Z (1349 Kb). This is a FASTA formatted file of nucleotide
sequences which is also compressed. Once uncompressed, you will need to
format the database using the 'formatdb' program which comes with your
Standalone BLAST executable. The list of arguments for this program and
all other BLAST programs is located at the end of the README in the
Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executables/).
You can also get these arguments by running each of the BLAST programs
(formatdb, blastall etc.) with a single hyphen as the argument (Example:
formatdb -). For this document we are just going to show you the basic
commands for formatting the database and running your first search.

To format the ecoli.nt database run the following from the command
line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to
perform the searches and produce results. The ecoli.nt file is not
needed once formatdb has finished, and you can delete it.

Next create a test nucleotide file to run against the new database. It
may be easier to 'cheat' here and just extract a portion of a
nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search enter the following command from the UNIX
command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone
BLAST directory.
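
The 'cheat' suggested above, taking a stretch of a sequence that is
already in the database, could be scripted roughly as follows. This is
only a sketch in Python and is not part of the BLAST distribution; the
function name, the default length and the 70-column line width are
illustrative assumptions.

```python
def first_record_slice(fasta_text, length=500, title="Test"):
    """Return a FASTA record holding the first `length` residues of the
    first sequence found in `fasta_text` (a hypothetical helper)."""
    pieces = []
    collected = 0
    seen_defline = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_defline:
                break                  # second record reached: stop
            seen_defline = True
        elif seen_defline and line:
            pieces.append(line)
            collected += len(line)
            if collected >= length:
                break                  # enough sequence collected
    sequence = "".join(pieces)[:length]
    if not sequence:
        raise ValueError("no sequence data found")
    # Wrap the sequence at 70 columns (an assumed, conventional width).
    rows = [sequence[i:i + 70] for i in range(0, len(sequence), 70)]
    return ">" + title + "\n" + "\n".join(rows) + "\n"
```

One possible use is to read the contents of ecoli.nt (before that file
is deleted), pass them to first_record_slice, and write the returned
text to test.txt.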

Now you are ready to create your own databases and run BLAST searches.
For more information you should refer to the Standalone BLAST README
(ftp://ftp.ncbi.nih.gov/blast/executables/) and the BLAST literature.
This will give you some idea of all the programs BLAST supports and the
use of different parameters for increasing or decreasing the stringency
of your results.

If you have any questions please send them to the
blast-help@ncbi.nlm.nih.gov e-mail address.

Setting up Standalone BLAST for Windows
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST
executable.

1) Download and extract the Standalone BLAST Windows binary
blastcz.exe. We suggest doing this in its own directory, perhaps called
blast. This is a 'self-extracting' archive, so all you need to do is run
it, either from a Command Prompt (DOS Prompt) or by selecting "Run"
from the Windows "Start" button and browsing to the blastcz.exe file.

2) Create an ncbi.ini file. In order for Standalone BLAST to operate,
you will need an ncbi.ini file that contains the following
lines:

[NCBI]
Data="C:\path\data\"

Where "C:\path\data\" is the path to the location of the Standalone
BLAST "data" subdirectory. For example:

Data=C:\blast\data

This data subdirectory should automatically appear in the directory
where the downloaded file was extracted.

Make sure that your ncbi.ini file is in the Windows or WINNT directory
on your machine. Note: If you already have an ncbi.ini file on your
machine from installing other NCBI software (Network Entrez, Sequin etc.)
you can skip this step. However, if you see the following error
message, you should rename the old ncbi.ini file to something like
ncbi.bak and follow the instructions in step 2 above.

Abrupt: code=1
FATAL ERROR: FindPath failed.

3) Format your BLAST database files. The main advantage of Standalone
BLAST is to be able to create your own BLAST databases. This can be done
with any file of FASTA formatted
protein or nucleotide sequences. If you are interested in creating your
own database you should refer to the sections "Non-redundant defline
syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in
the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can
also refer to the FASTA description available from the BLAST search
pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should first download one of the NCBI
databases and run a search against it.

In the BLAST database FTP directory ftp://ftp.ncbi.nih.gov/blast/db/
you will find the downloadable BLAST database files. For your first
search we recommend downloading something relatively small like
ecoli.nt.Z (1349 Kb). This is a FASTA formatted file of nucleotide
sequences which is also compressed. (If you do not have a copy of UNIX
"uncompress" for your Windows PC contact NCBI Info at
info@ncbi.nlm.nih.gov).

Once uncompressed, you will now need to format the database using the
'formatdb' program which comes with your Standalone BLAST executable.
The list of arguments for this program and all other BLAST programs is
located at the end of the README in the Standalone BLAST FTP directory
(ftp://ftp.ncbi.nih.gov/blast/executables/). You can also get these
arguments by running each of the BLAST programs (formatdb, blastall
etc.) with a single hyphen as the argument (Example: formatdb -). For
this document we are just going to show you the basic commands for
formatting the database and running your first search.

To format the ecoli.nt database run the following from the command
line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to
perform the searches and produce results. The ecoli.nt file can be
removed once formatdb has been run.

Next create a test nucleotide file to run against the new database. It
may be easier to 'cheat' here and just extract a portion of a
nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt containing the same test sequence
that is listed in the UNIX section above.

To run the first search, enter the following command:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone
BLAST directory. Now you are ready to create your own databases and run
BLAST searches. For more information you should refer to the Standalone
BLAST README (ftp://ftp.ncbi.nih.gov/blast/executables/) and the BLAST
literature. This will give you some idea of all the programs BLAST
supports and the use of different parameters for increasing or
decreasing the stringency of your results.
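
The configuration file from step 2 (.ncbirc on UNIX, ncbi.ini on
Windows) has the same two-line layout on both platforms, so it could
be generated by a small script. The following Python sketch is not part
of the BLAST distribution; the function names are hypothetical, and the
check that the data directory exists is an added convenience rather
than something BLAST itself requires of the script.

```python
import os


def ncbi_config_text(data_dir):
    """Return the [NCBI] section described in step 2 as a string."""
    return "[NCBI]\nData=" + data_dir + "\n"


def write_ncbi_config(config_path, data_dir):
    """Write the configuration file, refusing to point it at a
    directory that does not exist (a likely cause of FindPath errors)."""
    if not os.path.isdir(data_dir):
        raise FileNotFoundError("data directory not found: " + data_dir)
    with open(config_path, "w") as handle:
        handle.write(ncbi_config_text(data_dir))
    return config_path
```

For example, write_ncbi_config(".ncbirc", "/root/blast/data") on UNIX,
or the same call with the path of ncbi.ini and C:\blast\data on
Windows.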

If you have any questions please send them to the
blast-help@ncbi.nlm.nih.gov e-mail address.

SGI Note:
---------

SGI recommends the following threads patches on IRIX6 systems:

   For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)
   For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)
   For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations:
-----------------------

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if
it can read the entire BLAST database into memory, then keep on using it
there. Resources consumed reading a database into memory can easily
outweigh the cost of a BLAST search, so that the memory of a machine is
normally more important than the CPU speed. This means that one should have
sufficient memory for the largest BLAST database one will use, then run all
the searches against this database in serial, then run queries against
another database in serial. This guarantees that the database will be read
into memory only once. As of Aug. 1997 the EST FASTA file is about 500 Meg,
which translates to about 170-200 Meg of BLAST database. At least another
100-200 Meg should be allowed for memory consumed by the actual BLAST
program. All of the FASTA databases together are about 1.5 Gig, the BLAST
databases produced from this will probably be about another Gig or so. 4 Gig
of disk space, to make room for software and output, is probably a pretty
good bet.
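
The advice to run all searches against one database before moving on
to the next can be sketched as a simple reordering of a job list. This
Python fragment is illustrative only; the (database, query file) tuple
layout and the function name are assumptions, not part of BLAST.

```python
from itertools import groupby


def order_by_database(jobs):
    """Group (database, query_file) pairs so that each database appears
    once, with its queries kept in their original relative order."""
    # sorted() is stable, so queries for one database keep their order.
    ordered = sorted(jobs, key=lambda job: job[0])
    return [(database, [query for _, query in group])
            for database, group in groupby(ordered, key=lambda job: job[0])]
```

Running the queries of each group back to back means a database only
has to be read into memory once per group.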

OSF1 and limit
--------------

Some OSF1 users have encountered "out of memory" problems when running searches
even though there seems to be plenty of memory on the machine and the search
runs well on other platforms. The error message would look something like:

[blastall] FATAL ERROR: CoreLib [001.000] gi|509180|emb|X71670.1|MMP17SAR: Failed to allocate 480 bytes

Often it is sufficient to simply raise the "datasize" limit, which specifies
the maximum allowed heap size. The "datasize" limit can be changed by executing:

limit datasize unlimited

Note that this change only applies to the current session, so it is advisable to place
this command in some file sourced at startup, such as .login or .cshrc.
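
When BLAST is launched from a script rather than from an interactive
csh session, a comparable effect might be obtained from Python's
standard resource module. This is a sketch under assumptions: it raises
the soft data-segment limit only as far as the hard limit allows (which
may or may not be "unlimited"), and the function name is hypothetical.

```python
import resource


def raise_datasize_limit():
    """Raise the soft data-segment limit to the hard limit for the
    current process and return the resulting (soft, hard) pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_DATA, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_DATA)
```

The function can be passed as the preexec_fn argument of
subprocess.run on UNIX so that the limit is raised in the child process
just before the BLAST program starts.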

BLAST OPTIONS
-------------

Formatdb
--------

There is now a separate document describing formatdb (README.formatdb). Please
refer to it for information on formatting FASTA files for BLAST searches.

Blastall
--------

Blastall may be used to perform all five flavors of blast comparison. One
may obtain the blastall options by executing 'blastall -' (note the dash). A
typical use of blastall, here a blastn search (nucl. vs. nucl.) of a file
called QUERY, would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed
against the 'nr' database. If a protein vs. protein search is desired,
then 'blastn' should be replaced with 'blastp' etc.

Some of the most commonly used blastall options are:

blastall arguments:

-p Program Name [String]

Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

-d Database [String]
default = nr

        The database specified must first be formatted with formatdb.
        Multiple database names (bracketed by quotations) will be accepted.
        An example would be

-d "nr est"

        which will search both the nr and est databases, presenting the results as if one
        'virtual' database consisting of all the entries from both were searched.   The
        statistics are based on the 'virtual' database of nr and est.

-i Query File [File In]
default = stdin

The query should be in FASTA format. If multiple FASTA entries are in the input
file, all queries will be searched.

-e Expectation value (E) [Real]
default = 10.0

-o BLAST report Output File [File Out] Optional
default = stdout

-F Filter query sequence (DUST with blastn, SEG with others) [String]
default = T

         BLAST 2.0 and 2.1 use the dust low-complexity filter for blastn and seg for the
         other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit
         and are accessed automatically.

If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever).

         This option also takes a string as an argument. One may use such a
         string to change the specific parameters of seg or invoke other filters.
         Please see the "Filtering Strings" section (below) for details.

-S Query strands to search against database (for blast[nx], and tblastx). 3 is both, 1 is top, 2 is bottom [Integer]
default = 3

-T Produce HTML output [T/F]
default = F

-l Restrict search of database to list of GI's [String] Optional

This option specifies that only a subset of the database should be
searched, determined by the list of gi's (i.e., NCBI identifiers) in a
file. One can obtain a list of gi's for a given Entrez query from
http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should
be in the same directory as the database, or in the directory that
BLAST is called from.

-U Use lower case filtering of FASTA sequence [T/F] Optional
default = F

This option specifies that any lower-case letters in the input FASTA file
should be masked.
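
For scripted use, the options listed above can be assembled into an
argument list before the program is started. The Python sketch below is
not part of BLAST; the function and parameter names are hypothetical,
and it covers only a few of the options described in this section.

```python
PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_args(program, query, databases=("nr",), output=None,
                  evalue=None, filter_query=None, html=False):
    """Build a blastall argument list from a subset of its options."""
    if program not in PROGRAMS:
        raise ValueError("unknown program: " + program)
    if not databases:
        raise ValueError("at least one database name is required")
    # Several names are joined with spaces into ONE argument, which is
    # what the quotation marks in -d "nr est" achieve in a shell.
    args = ["blastall", "-p", program, "-d", " ".join(databases),
            "-i", query]
    if output is not None:
        args += ["-o", output]
    if evalue is not None:
        args += ["-e", str(evalue)]
    if filter_query is not None:
        args += ["-F", filter_query]
    if html:
        args += ["-T", "T"]
    return args
```

A list built this way can be handed to subprocess.run without going
through a shell, so no extra quoting of the database names is needed.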

Documentation for PSI-TBLASTN

PSI-TBLASTN is a variant of blastall that searches a protein query
sequence against a nucleotide sequence database using a position
specific matrix created by PSI-BLAST. The nucleotide sequence database
is dynamically translated in all reading frames during a PSI-TBLASTN
search. Using a position specific matrix may enable finding more
distantly related sequences.

Programs:
blastpgp [takes a protein query and performs a PSI-BLAST search to
create a position specific matrix using a protein
database]

blastall [reads the position specific matrix and performs the
PSI-TBLASTN search]

Usage:
A user would typically run blastpgp to create and save a position
specific matrix, followed by a run of blastall for PSI-TBLASTN search.

blastpgp must be executed with the -C option followed by a file name in
which to save the position specific score matrix.

blastall with the "-p psitblastn" option executes a PSI-TBLASTN search;
the -R option must be followed by the name of the file that contains the
position specific score matrix. All other options that apply when
using "blastall -p tblastn ..." also apply when using "blastall -p
psitblastn ...", but there are some restrictions on parameters: 1) The
query must be the same as the one used in blastpgp for creating the
position specific matrix. 2) By default, blastpgp has filtering off
(-F F) and blastall has filtering on (-F T). To ensure consistent
usage of the blastpgp/psitblastn combination, the -F option should be
explicitly set in one or the other run.

Example:
One may run PSI-BLAST to create and save a position specific score matrix
as follows:

blastpgp -d nr -i ff.chd -j 2 -C ff.chd.ckp

The position specific score matrix is saved in ff.chd.ckp. Then, using
this matrix, one may run a PSI-TBLASTN search:

blastall -i ff.chd -d yeast -p psitblastn -R ff.chd.ckp

Note that this allows the score matrix to be constructed using one
database (nr in the example) and then used to search a second database
(yeast in the example). Even if the two database names are the same,
blastpgp uses the protein version while "blastall -p psitblastn" uses
the DNA version.
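
The two-step procedure and its restrictions (same query in both runs,
consistent filtering) can be captured in a small helper. The Python
sketch below is illustrative only; the function name and parameters are
hypothetical. It sets -F explicitly in both runs, which is one way of
meeting the consistency requirement described above.

```python
def psitblastn_commands(query, protein_db, nucleotide_db, checkpoint,
                        rounds=2, filter_query="F"):
    """Return (blastpgp_args, blastall_args) for a PSI-TBLASTN search.
    The same query, checkpoint file and -F value are used in both."""
    build_matrix = ["blastpgp", "-d", protein_db, "-i", query,
                    "-j", str(rounds), "-C", checkpoint,
                    "-F", filter_query]
    search = ["blastall", "-i", query, "-d", nucleotide_db,
              "-p", "psitblastn", "-R", checkpoint,
              "-F", filter_query]
    return build_matrix, search
```

Called with the values from the example above ("ff.chd", "nr", "yeast"
and "ff.chd.ckp"), it yields the two command lines shown there, with
-F F appended to each.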

Blastpgp
--------

Blastpgp performs gapped blastp searches and can be used to perform
iterative searches in psi-blast and phi-blast mode. See the PSI-Blast and
PHI-BLAST sections (below) for a description of this binary. The options may be
obtained by executing 'blastpgp -'.

-T Produce HTML output [T/F]
default = F

-Q Output File for PSI-BLAST Matrix in ASCII [File Out] Optional

Bl2seq
------

Bl2seq performs a comparison between two sequences using either the blastn or
blastp algorithm. Both sequences must be either nucleotides or proteins.
The options may be obtained by executing 'bl2seq -'.

-i First sequence [File In]
-j Second sequence [File In]
-p Program name: blastp, blastn, blastx. For blastx 1st argument should be nucleotide [String]
    default = blastp
-g Gapped [T/F]
    default = T
-o alignment output file [File Out]
    default = stdout
-d theor. db size (zero is real size) [Integer]
    default = 0
-a SeqAnnot output file [File Out] Optional
-G Cost to open a gap (zero invokes default behavior) [Integer]
    default = 0
-E Cost to extend a gap (zero invokes default behavior) [Integer]
    default = 0
-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior) [Integer]
    default = 0
-W Wordsize (zero invokes default behavior) [Integer]
    default = 0
-M Matrix [String]
    default = BLOSUM62
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
    default = -3
-r Reward for a nucleotide match (blastn only) [Integer]
    default = 1
-F Filter query sequence (DUST with blastn, SEG with others) [String]
    default = T
-e Expectation value (E) [Real]
    default = 10.0
-S Query strands to search against database (blastn only). 3 is both, 1 is top, 2 is bottom [Integer]
    default = 3
-T Produce HTML output [T/F]
    default = F

Fastacmd
--------

There is now a separate document describing fastacmd (README.fastacmd). Please
refer to it for information on using this tool.

Filtering Strings
-----------------

         The -F argument can take a string as input specifying that seg should be
         run with certain values or that other non-standard filters should be used.
         This section describes this syntax.

The seg options can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.

         A coiled-coil filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4 (1991))
         and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked
         by specifying:

-F "C"

         There are three parameters for this: window, cutoff (prob of a coiled-coil), and
         linker (distance between two coiled-coil regions that should be linked
         together). These are now set to

         window: 22
         cutoff: 40.0
         linker: 32

One may also change the coiled-coil parameters in a manner analogous to
that of seg:

-F "C 28 40.0 32" will change the window to 28.

One may also run both seg and the coiled-coil filter together by using a ";":

-F "C;S"

Filtering by dust may also be specified by:

-F "D"

         It is possible to specify that the masking should only be done during
         the process of building the initial words by starting the filtering
         command with 'm', e.g.:

-F "m S"

         which specifies that seg (with default arguments) should be used for masking,
         but that the masking should only be done when the words are being built.
         This masking option is available with all filters.

         If the -U option (to mask any lower-case sequence in the input FASTA file) is used and
         one does not wish any other filtering, but does wish to mask when building the lookup tables
         then one should specify:

-F "m"

This is the only case where "m" should be specified alone.
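
The filter-string syntax described in this section can be generated
programmatically. The following Python sketch is an illustration, not a
BLAST API; the function names are hypothetical, and it only produces
the forms that appear in the examples above (it deliberately refuses to
combine "m" with several filters, since that form is not shown here).

```python
def filter_spec(letter, *params):
    """Return one filter: 'S' (seg), 'C' (coiled-coil) or 'D' (dust),
    optionally followed by its parameters."""
    if letter not in ("S", "C", "D"):
        raise ValueError("unknown filter letter: " + letter)
    return " ".join([letter] + [str(p) for p in params])


def filter_string(*specs, mask_lookup_only=False):
    """Join filters with ';' and optionally add the leading 'm'."""
    if mask_lookup_only and len(specs) > 1:
        raise ValueError("'m' with several filters is not covered here")
    body = ";".join(specs)
    if mask_lookup_only:
        # A bare "m" is meant only for use together with -U.
        return "m " + body if body else "m"
    if not body:
        raise ValueError("at least one filter is required")
    return body
```

For instance, filter_string(filter_spec("S", 10, 1.0, 1.5)) gives the
seg example from the start of this section, and the result can be
passed as the value of -F.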

PSI-Blast
---------

The blastpgp program can do an iterative search in which
sequences found in one round of searching are used to build
a score model for the next round of searching. In this usage,
the program is called Position-Specific Iterated BLAST, or PSI-BLAST.
As explained in the accompanying paper, the BLAST algorithm is
not tied to a specific score matrix. Traditionally, it has been
implemented using an AxA substitution matrix where A is the alphabet size.
PSI-BLAST instead uses a QxA matrix, where Q is the length of the query
sequence; at each position the cost of a letter depends on the position
w.r.t. the query and the letter in the subject sequence.
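
The difference between an AxA substitution matrix and a QxA
position-specific matrix can be shown with a toy ungapped alignment.
The Python sketch below uses a made-up two-letter alphabet and made-up
scores purely for illustration; it is not taken from the BLAST source.

```python
def score_with_substitution_matrix(query, subject, matrix):
    """AxA scoring: the score depends on the pair of letters only."""
    return sum(matrix[q][s] for q, s in zip(query, subject))


def score_with_position_matrix(subject, pssm):
    """QxA scoring: row i belongs to query position i, so the score
    depends on the position and on the subject letter."""
    return sum(pssm[i][s] for i, s in enumerate(subject))


# Toy AxA matrix over the alphabet {A, C} (invented numbers).
toy_matrix = {"A": {"A": 4, "C": -1}, "C": {"A": -1, "C": 9}}
query = "AC"

# Copying one matrix row per query letter gives a QxA matrix that
# reproduces ordinary substitution scoring exactly.
pssm_from_matrix = [toy_matrix[letter] for letter in query]

# A position-specific matrix may instead hold different scores for the
# same letter at different query positions (invented numbers).
toy_pssm = [{"A": 5, "C": -2}, {"A": -3, "C": 7}]
```

With these definitions, scoring the subject "AA" against
pssm_from_matrix gives the same result as the substitution matrix,
which illustrates that AxA scoring is a special case of QxA scoring.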

The position-specific matrix for round i+1 is built from a constrained
multiple alignment among the query and the sequences found with
sufficiently low e-value in round i. The top part of the output for
each round distinguishes the sequences into: sequences found
previously and used in the score model, and sequences not used in the
score model. The output currently includes lots of diagnostics
requested by users at NCBI. To skip quickly from the output of
one round to the next, search for the string "producing", which is
part of the header for each round and likely does not appear elsewhere
in the output. PSI-BLAST "converges" and stops if all sequences
found at round i+1 below the e-value threshold were already in
the model at the beginning of the round.
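
The convergence rule can be expressed as a short loop. The Python
sketch below is only a schematic of the rule as stated above; the
search_round callback is hypothetical and stands in for one full round
of searching, returning the identifiers found below the e-value
threshold.

```python
def iterate_until_converged(search_round, max_rounds):
    """Run up to `max_rounds` rounds. `search_round(model)` must return
    the identifiers found below the threshold when searching with the
    model built from `model`. Returns (model, rounds_run, converged)."""
    model = set()                      # round 1 starts from the query alone
    for round_number in range(1, max_rounds + 1):
        hits = set(search_round(frozenset(model)))
        if hits <= model:
            # Nothing found that was not already in the model: converged.
            return model, round_number, True
        model = hits                   # the next model is built from these hits
    return model, max_rounds, False
```

The max_rounds argument plays the role of the maximum number of rounds;
the loop stops earlier if the convergence condition is met.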

There are several blastpgp parameters specifically for PSI-BLAST:
-j   is the maximum number of rounds (default 1; i.e., regular BLAST)
-h   is the e-value threshold for including sequences in the
     score matrix model (default 0.001)
-c   is the "constant" used in the pseudocount formula specified in the
     paper (default 10)

The -C and -R flags provide a "checkpointing" facility whereby
a score model can be stored and later reused.
   -C stores the query and frequency count ratio matrix in a
                  file
   -R restarts from a file stored previously.
When using -R, it is required that the query specified on the command line
match exactly the query in the restart file.
The checkpoint files are stored in a byte-encoded (not human readable)
format, so as to prevent roundoff error between writing and reading
the checkpoint.
Users who also develop their own sequence analysis software may wish
to develop their own scoring systems. For this purpose the code
in posit.c that writes out the checkpoint can be easily adapted to
write out scoring systems derived by other algorithms in such
a way that PSI-BLAST can read the files in later.
The checkpoint structure is general in the sense that it can handle
any position-specific matrix that fits in the Karlin-Altschul
statistical framework for BLAST scoring.

The -B flag provides a way to jump start PSI-BLAST from a master-slave
multiple alignment computed outside PSI-BLAST. The multiple alignment
must include the query sequence as one of the sequences, but it need
not be the first sequence. The multiple alignment must be specified
in a format that is derived from Clustal, but without some headers and
trailers. See the example below. The rules are as follows.
Suppose the multiple alignment has N sequences. It
may be presented in 1 or more blocks, where each block presents a
range of columns from the multiple alignment. E.g., the first block
might have columns 1-60, the second block might have columns 61-95,
the third block might have columns 96-128. Each block should have N
rows, 1 row per sequence. The sequences should be in the same order
in every block. Blocks are separated by 1 or more blank lines.
Within a block there are no blank lines, and each line consists of 1
sequence identifier followed by some white space followed by
characters (and gaps) for that sequence in the multiple alignment. In
each column, all letters must be in upper case, or all letters must be
in lower case. Upper case means that this column is to be given
position-specific scores. Lower-case means to use the underlying
matrix (specified by -M) for this column; e.g., if the query sequence
has an 'l' residue in the column, then the standard scores for
matching an L are used in the column.
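
Before handing an alignment file to -B, the rules above can be checked
with a short script. The Python sketch below is not part of BLAST and
its function names are hypothetical; it also assumes that all rows of
a block have the same length, which the rules imply (each block is a
range of columns) but do not state outright.

```python
def read_alignment_blocks(text):
    """Split the text into blocks of (identifier, residues) rows;
    blocks are separated by one or more blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            parts = line.split(None, 1)
            if len(parts) != 2:
                raise ValueError("row without sequence data: " + line)
            current.append((parts[0], parts[1].strip()))
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def check_alignment(text, query_id):
    """Return (number of sequences, number of blocks) or raise
    ValueError if one of the rules is violated."""
    blocks = read_alignment_blocks(text)
    if not blocks:
        raise ValueError("empty alignment")
    names = [name for name, _ in blocks[0]]
    if query_id not in names:
        raise ValueError("query sequence missing from alignment")
    for block in blocks:
        if [name for name, _ in block] != names:
            raise ValueError("rows differ in number or order between blocks")
        rows = [residues for _, residues in block]
        if len(set(len(row) for row in rows)) != 1:
            raise ValueError("rows of unequal length within a block")
        for column in zip(*rows):
            letters = [c for c in column if c.isalpha()]   # gaps are ignored
            if (any(c.isupper() for c in letters)
                    and any(c.islower() for c in letters)):
                raise ValueError("column mixes upper and lower case")
    return len(names), len(blocks)
```

A call such as check_alignment(open("align1").read(), "26SPS9_Hs")
would then report the number of sequences and blocks, or explain which
rule the file breaks.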

A sample usage would be:

blastpgp -i seq1 -B align1 -j 2 -d nr

where seq1 is the query
      align1 is the alignment file
      -j 2 indicates to do 2 rounds
      -d nr indicates to use the nr database

The example files
seq1
align1
copied below were kindly supplied by L. Aravind from a paper
he and Chris Ponting published in Protein Science:

Aravind L, Ponting CP, Homologues of 26S proteasome subunits
are regulators of transcription and translation, Protein Science
7(1998) 1250-1254.

L. Aravind (aravind@ncbi.nlm.nih.gov) was the first user
and helped define how -B should work. Y. Wolf (wolf@ncbi.nlm.nih.gov)
helped design a more flexible input format for the alignments.
If you like how -B works, let them know.
If you do not like how -B works, complain to
A. Schaffer(schaffer@helix.nih.gov) who did the implementation.

seq1
----
> 26SPS9_Hs
IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA
LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL
SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP

align1
------
26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr
F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk
YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae
YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr
FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld
COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet
644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky
YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl
eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------
T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------
YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl
KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec
F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl
Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih

26SPS9_Hs     sladfekaltdy-----------------------------------------------------------------------------------
F57B9_Ce      rslkdfqvafgsf----------------------------------------------------------------------------------
YDL097c_Sc    aynnrslldfntalkqy------------------------------------------------------------------------------
YMJ5_Ce       krslkdfvkalaeh---------------------------------------------------------------------------------
FUS6_ARATH    vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------
COS41.8_Ci    eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------
644879        kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------
YPR108w_Sc    yasdyasyfpyllety-------------------------------------------------------------------------------
eif-3p110_Hs -----------------------------------------------------------------------------------------------
T23D8.4_Ce    -----------------------------------------------------------------------------------------------
YD95_Sp       ylcdysgffrtladve-------------------------------------------------------------------------------
KIAA0107_Hs   rysvffqslavv-----------------------------------------------------------------------------------
F49C12.8_Hs   esyydchydrffiqlaale----------------------------------------------------------------------------
Int-6_Mm      wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk

26SPS9_Hs     ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce      ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc    ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce       ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH    ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ
COS41.8_Ci    ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET
644879        ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ
YPR108w_Sc    ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce    ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP
YD95_Sp       ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs   ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs   ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm      lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV

PHI-Blast
---------

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search
program that combines matching of regular expressions
with local alignments surrounding the match.
The most important features of the program have been
incorporated into the BLAST software framework
partly for user convenience and partly so that
PHI-BLAST may be combined seamlessly with PSI-BLAST.
Other features that do not fit into the BLAST framework
will be released later as a separate program and/or
separate Web page query options.

One very restrictive way to identify protein motifs
is by regular expressions that must contain each instance
of the motif. The PROSITE database is a compilation of
restricted regular expressions that describe protein motifs.
Given a protein sequence S and a regular expression pattern P
occurring in S, PHI-BLAST helps answer the question:
What other protein sequences both contain an occurrence of P
and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences
because it filters out those cases where the pattern occurrence is
probably random and not indicative of homology.
PHI-BLAST may be preferable to other flavors of BLAST because
it is faster and because it allows the user to express
a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the
algorithms in:

R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.
S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method
very similar to (and much of the same code as) gapped BLAST.
However, the method of evaluating statistical significance is different, and
is described below.

In stand-alone mode, typical PHI-BLAST usage looks like:
blastpgp -i query_file -k pattern_file -p patseedp

where -i is followed by the file containing the query in FASTA format,
-k is followed by the file containing the pattern in the syntax given below,
and "patseedp" indicates the mode of usage and does not name a file.

The syntax for the query sequence is FASTA format as for all other
BLAST queries. The syntax for patterns follows the rules of
PROSITE and is documented in detail below.
The specified pattern is not required to be in the PROSITE list.
Most of the other BLAST flags can be used with PHI-BLAST.
One important exception is that PHI-BLAST requires gapped
alignments (i.e. forbids -g F in the flags) because ungapped
alignments do not make sense for almost all patterns in PROSITE.

There is a second mode of PHI-BLAST usage that is important when
the specified pattern occurs more than once in the query.
In this case, the user may be interested in restricting the
search for local alignments to a subset of the pattern occurrences.
This can be done with a search that looks like:
blastpgp -i query_file -k pattern_file -p seedp

in which case the use of the "seedp" option requires the user to
specify the location(s) of the interesting pattern occurrence(s)
in the pattern file. The syntax for how to specify pattern
occurrences is below. When there are multiple pattern occurrences in the
query it may be important to decide how many are of interest because
the E-value for matches is effectively multiplied by the number
of interesting pattern occurrences.

The PHI-BLAST Web page supports only the "patseedp" option.

PHI-BLAST is integrated with PSI-BLAST. In the command-line
mode, PSI-BLAST can be invoked by using the -j option, as usual.
When this is done as:
blastpgp -i query_file -k pattern_file -p patseedp -j rounds

then the first round of searching uses PHI-BLAST and all subsequent
rounds use PSI-BLAST.
In the Web page setting, the user must explicitly invoke one round
at a time, and the PHI-BLAST Web page provides the option to
initiate a PSI-BLAST round with the PHI-BLAST results.
To describe a combined usage, use the term "PHI-PSI-BLAST"
(Pattern-Hit Initiated, Position-Specific Iterated BLAST).
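
The invocations described above can be assembled programmatically. The
following Python sketch only builds the argument list; it does not run
blastpgp. The function name and the file names are hypothetical, and only
the -i, -k, -p, and -j flags mentioned in this section are used. Other
flags, such as the database to search, are left out.

```python
def phi_blast_command(query_file, pattern_file, mode="patseedp", rounds=None):
    """Build an argument list for a PHI-BLAST run of blastpgp.

    mode   -- "patseedp" or "seedp", the two usage modes described above
    rounds -- if given, passed with -j so that later rounds use PSI-BLAST
    """
    if mode not in ("patseedp", "seedp"):
        raise ValueError("mode must be 'patseedp' or 'seedp'")
    argv = ["blastpgp", "-i", query_file, "-k", pattern_file, "-p", mode]
    if rounds is not None:
        argv += ["-j", str(rounds)]
    return argv


# Hypothetical file names, for illustration only.
print(" ".join(phi_blast_command("query.fasta", "pattern.txt", rounds=3)))
```

The resulting list can be handed to subprocess.run() once the remaining
options have been added.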

Determining statistical significance.

When a query sequence Q matches a database sequence D in PHI-BLAST,
it is useful to subdivide Q and D into 3 disjoint pieces
Qleft Qpattern Qright
Dleft Dpattern Dright

The substrings Qpattern and Dpattern contain the pattern specified
in the pattern file. The pieces Qpattern and Dpattern are aligned
and that alignment is displayed as part of the PHI-BLAST output,
but the score for that alignment is mostly ignored.
The "reduced" score r of an alignment is the sum of the scores obtained
by aligning Qleft with Dleft and by aligning Qright with Dright.

The expected number of alignments with a reduced score >= x
is given by:
C * N * (Lambda*x + 1) * e^(-Lambda*x)
where:

C and Lambda are "constants" depending on the score matrix and the
gap costs.
N is (number of occurrences of pattern in database) * (number of
occurrences of pattern in Q)
e is the base of the natural logarithm.
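
As an illustration of the formula above, the following Python sketch
computes the reduced score and the expected number of alignments. The
function names are hypothetical, and the numeric values of C and Lambda
are made up for the example; the real values depend on the score matrix
and gap costs and are not given in this document.

```python
import math


def reduced_score(left_score, right_score):
    """Reduced score r: score of Qleft/Dleft plus score of Qright/Dright."""
    return left_score + right_score


def phi_blast_expect(x, C, lam, db_occurrences, query_occurrences):
    """Expected number of alignments with reduced score >= x.

    N is the number of pattern occurrences in the database multiplied by
    the number of pattern occurrences in the query.
    """
    N = db_occurrences * query_occurrences
    return C * N * (lam * x + 1.0) * math.exp(-lam * x)


# Illustrative values only: C=0.5 and Lambda=0.3 are assumptions.
r = reduced_score(30, 25)
print(phi_blast_expect(r, C=0.5, lam=0.3, db_occurrences=10, query_occurrences=2))
```

Because N is a product, doubling the number of pattern occurrences of
interest in the query doubles the E-value for the same reduced score.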

It is important to understand that this method of computing
the statistical significance of a PHI-BLAST alignment is mathematically
different from the method used for BLAST and PSI-BLAST alignments.
However, both methods provide E-values, so the E-values are
displayed with a similar output syntax.

Rules for pattern syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the conventions
of PROSITE. When using the stand-alone program, it
is permissible to have multiple patterns in a file separated
by a blank line between patterns. When using the Web page,
only one pattern is allowed per query.

Valid protein characters for PHI-BLAST patterns:
ABCDEFGHIKLMNPQRSTVWXYZU

Valid DNA characters for PHI-BLAST patterns:
ACGT

Other useful delimiters:
    [ ]    means any one of the characters enclosed in the brackets;
           e.g., [LFYT] means one occurrence of L or F or Y or T
    -      means nothing (this is a spacer character used by PROSITE)
    x      with nothing following means any residue
    x(5)   means 5 positions in which any residue is allowed (and similarly
           for any other single number in parentheses after x)
    x(2,4) means 2 to 4 positions where any residue is allowed,
           and similarly for any other two numbers separated by a comma;
           the first number should be < the second number
    >      can occur only at the end of a pattern and means nothing;
           it may occur before a period
           (another spacer used by PROSITE)
    .      may be used at the end of the pattern and means nothing

When using the stand-alone program, the pattern should
be in a file, with the first line starting:
ID
followed by 2 spaces and a text string giving the pattern a name.

There should also be a line starting
PA
followed by 2 spaces followed by the pattern description.

All other PROSITE codes in the first two columns are allowed,
but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE.

ID   CNMP_BINDING_2; PATTERN.
AC   PS00889;
DT   OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   Cyclic nucleotide-binding domain signature 2.
PA   [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR   /RELEASE=32,49340;
NR   /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=1; /PARTIAL=1;
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting
ID
gives the pattern a name.
The lines starting
AC, DT, DE, NR, NR, CC
are relevant to PROSITE users, but irrelevant to PHI-BLAST.
These lines are tolerated, but ignored by PHI-BLAST.

The line starting
     PA
describes the pattern as:
      one of LIVMF
followed by
      G
followed by
      E
followed by
      any single character
followed by
      one of GAS
followed by
      one of LIVM
followed by
      any 5 to 11 characters
followed by
      R
followed by
      one of STAQ
followed by
      A
followed by
      any single character
followed by
      one of LIVMA
followed by
      any single character
followed by
      one of STACV

In this case the pattern ends with a period.
It can end with nothing after the last specifying symbol
or any number of > signs or periods or combination thereof.
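
To make the pattern rules concrete, the following Python sketch translates
a pattern written in the syntax above into a regular expression for
Python's re module. This is an illustration only: PHI-BLAST itself uses
the pattern-matching algorithms cited earlier, not Python regular
expressions, and the function name prosite_to_regex is hypothetical.
Following the rules in this document, trailing > and . characters are
treated as meaning nothing, and a count in parentheses is accepted only
after x.

```python
import re

# One pattern element: x, x(n), x(n,m), a single residue, or a [ ] set.
_ELEMENT = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?|[A-Z]|\[[A-Z]+\]")


def prosite_to_regex(pattern):
    """Translate a PHI-BLAST style pattern into a Python regex string."""
    # Any combination of '>' and '.' at the end means nothing.
    core = pattern.strip().rstrip(">.")
    parts = []
    for element in core.split("-"):  # '-' is only a spacer
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        if not element.startswith("x"):
            parts.append(element)  # residue or [ ] set is valid regex as is
            continue
        low, high = match.groups()
        if low is None:
            parts.append(".")                          # x: any residue
        elif high is None:
            parts.append(".{%s}" % low)                # x(5)
        else:
            parts.append(".{%s,%s}" % (low, high))     # x(2,4)
    return "".join(parts)


print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>."))  # [KRHQSA][DENQ]EL
```

The PA line from the PROSITE example above translates to
[LIVMF]GE.[GAS][LIVM].{5,11}R[STAQ]A.[LIVMA].[STACV] under this sketch.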

Here is another example, illustrating the use of an HI line.

ID ER_TARGET; PATTERN.
PA [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)

In this example, the HI lines specify that the pattern
occurs twice, once from positions 19 through 22 in the
sequence and once from positions 201 through 204 in the
sequence.
These specifications are relevant when stand-alone PHI-BLAST is
used with the
seedp
option, in which the interesting occurrences of the pattern
in the sequence are specified. In this case the
HI lines specify which occurrence(s) of the pattern
should be used to find good alignments.

In general, the seedp option is more useful than the
standard patseedp option ONLY when the
pattern occurs K > 1 times in the sequence AND
the user is interested in matching to J < K of those
occurrences.
Then using the HI lines enables the user to specify which
occurrences are of interest.
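
The file layout described above (ID, PA, and optional HI lines, with
patterns separated by blank lines) can be read with a few lines of Python.
The sketch below is not the parser used by blastpgp; the function name and
the dictionary layout are hypothetical. As an assumption, a pattern that
spans several PA lines is joined into one string.

```python
def parse_pattern_file(text):
    """Return one dict per pattern with its ID, PA, and HI information."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line separates patterns
            continue
        fields = line.split(None, 1)
        code = fields[0]
        rest = fields[1].strip() if len(fields) > 1 else ""
        if current is None:
            current = {"id": None, "pattern": "", "hits": []}
            records.append(current)
        if code == "ID":
            current["id"] = rest
        elif code == "PA":
            current["pattern"] += rest  # assumption: join continued PA lines
        elif code == "HI":
            start, end = rest.strip("()").split()
            current["hits"].append((int(start), int(end)))
        # Other PROSITE codes (AC, DT, DE, NR, CC, ...) are ignored.
    return records


sample = """ID ER_TARGET; PATTERN.
PA [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)

ID   CNMP_BINDING_2; PATTERN.
AC   PS00889;
PA   [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
"""
records = parse_pattern_file(sample)
print(records[0]["hits"])  # [(19, 22), (201, 204)]
```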

Additional functionality related to PHI-BLAST.

PHI-BLAST takes as input both a sequence and a pattern occurring in
that sequence, and searches a sequence database for
other sequences containing the same pattern and having a good alignment.
One may be interested in asking two related, simpler questions:

1. Given a sequence and a database of patterns, which patterns occur
in the sequence and where?

2. Given a pattern and a sequence database, which sequences contain the
pattern and where?

These queries can be answered with software closely related to PHI-BLAST,
but they do not fit into the output framework of BLAST because the
answers are simple lists without alignments and with no notion of
statistical significance.

The NCBI toolbox includes another program, currently called
seedtop
to answer the two queries above.

Query 1 can be asked with:
seedtop -i sequence_file -k pattern_file -p patmatchp

Query 2 can be asked with:
seedtop -d database -k pattern_file -p patternp

The -k argument is used similarly in all queries and the file
format is always the same. The standard pattern database is
PROSITE, but others (or a subset) can be used.
There are plans afoot to offer the patmatchp query (number 1) on
the PHI-BLAST web page or in its vicinity, but this would
be restricted to having PROSITE as the pattern database.
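
The kind of answer Query 1 produces, a simple list of patterns and
positions, can be imitated in Python for small inputs. The sketch below is
not seedtop and does not reproduce its output format; it uses Python's re
module with patterns already written as regular expressions. All names and
the sample sequence are made up. Positions are 1-based and inclusive, in
the same style as the HI lines. A lookahead is used so that overlapping
occurrences are reported, with one occurrence reported per start position.

```python
import re


def find_occurrences(regex, sequence):
    """Return 1-based (start, end) pairs, including overlapping matches."""
    lookahead = re.compile("(?=(%s))" % regex)
    return [(m.start(1) + 1, m.end(1)) for m in lookahead.finditer(sequence)]


def patterns_in_sequence(patterns, sequence):
    """Map each pattern name that occurs in the sequence to its positions."""
    hits = {}
    for name, regex in patterns.items():
        found = find_occurrences(regex, sequence)
        if found:
            hits[name] = found
    return hits


# Hypothetical pattern set and sequence, for illustration only.
patterns = {"ER_TARGET": "[KRHQSA][DENQ]EL", "TWO_CYS": "C..C"}
print(patterns_in_sequence(patterns, "MKDELAAAHDEL"))
# {'ER_TARGET': [(2, 5), (9, 12)]}
```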

References

     Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,
     David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),
     "Protein sequence similarity searches using patterns as seeds", Nucleic
     Acids Res. 26:3986-3990.

     Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
     Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
     "Gapped BLAST and PSI-BLAST: a new generation of protein database
     search programs", Nucleic Acids Res. 25:3389-3402.

     Karlin, Samuel and Stephen F. Altschul (1990). Methods for
     assessing the statistical significance of molecular sequence
     features by using general scoring schemes. Proc. Natl. Acad.
     Sci. USA 87:2264-68.

     Karlin, Samuel and Stephen F. Altschul (1993). Applications
     and statistics for multiple high-scoring segments in molecu-
     lar sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

     Schaffer, Alejandro A., L. Aravind, Thomas L. Madden, Sergei Shavirin,
     John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
     Improving PSI-BLAST Protein Database Search Sensitivity with Composition-Based
     Statistics and Other Refinements. Nucleic Acids Res. 29:2994-3005.

Release History
---------------
Notes for the 2.2.8 release:

* Correction to tblastx alignment computation

Notes for 2.2.7 release (2/2/04):

* Standalone BLAST is now available for amd64-linux.

* formatdb now restricts volume sizes to 1G on 32-bit platforms
for performance reasons.

* The -A option has been removed from formatdb, that is, all databases
will be created with ASN.1 deflines.

* tblastn query concatenation now works correctly on 64-bit platforms.

* The wwwblast source code has been merged into the C toolkit tree and
is no longer distributed with the binaries.

Notes for 2.2.6 release (4/9/03):

Enhancements:

1.) A -B option now exists for blastall that specifies the concatenation of queries
for blastn and tblastn. This option is still experimental and subject to change.
It is not supported with XML, ASN.1, or tabular output.

2.) Text and binary SeqAligns can now be produced in place of the standard BLAST report by
using (respectively) "-m 10" or "-m 11".

Bug fixes:

1.) A problem with an integer "rollover" in formatdb has been fixed. When the volume size was
selected with the -v option, the specified number of bases could become negative, and the option
was then ignored.

2.) A problem in the statistics of the BLAST output footer was fixed. This was a double-counting of
the number of extensions performed.

3.) A problem that caused the target and query sequences to be reversed in tblastx XML output has been fixed.

4.) A memory corruption problem in the formatting of the tabular output has been fixed.

5.) An unstable sorting problem in the results for tblastx searches has been fixed.

6.) A spurious error message about a file called "taxdb.bti" has been suppressed.

7.) A problem with the number of hits returned in XML mode being double what they should be has been fixed.

8.) The fastacmd return values have been corrected: fastacmd now returns 0 on success and 1 on error.

posted on 2004-12-17 21:10 tony
