Trying out Sphinx - a full-text search tool

A while back I built a contact-book search feature where the query had to OR together conditions across a pile of columns; a full-text search engine would have made it much simpler. I don't know whether Oracle's built-in full-text search is powerful enough for that, but it's worth a try.
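To make the pain concrete, here is a sketch of the difference. (The contacts table, its columns, and the contacts_index index name are hypothetical, purely for illustration.)

-- multi-field OR search in plain SQL: one condition per column,
-- and leading-wildcard LIKEs defeat ordinary B-tree indexes
SELECT id, name
FROM contacts
WHERE name LIKE '%smith%'
   OR email LIKE '%smith%'
   OR company LIKE '%smith%'
   OR remark LIKE '%smith%';

-- the same search through a full-text engine (SphinxQL):
-- a single MATCH covers every indexed field at once
SELECT id FROM contacts_index WHERE MATCH('smith');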

In the full-text search world, Lucene's reputation is thunderous and most people have at least heard of it, but Sphinx is a worthy rising star, and paired with MySQL it is hard to beat. Sphinx supports not only today's popular databases but also plain-text sources.

Today let's play with Sphinx + MySQL on Windows; I'll skip the MySQL installation.

Installing Sphinx

Like a lot of open-source software, Sphinx is a "green", unzip-and-run install on Windows: download the latest release, Sphinx 2.0.4-release, from the official site and unzip it to D:\MyTemp\sphinx.
The unzipped directory contains an example.sql file, which populates MySQL's default test database with sample data. Import it with phpMyAdmin or the mysql command line:

mysql -u test < D:\MyTemp\sphinx\example.sql
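For reference, example.sql creates a small documents table in the test database, roughly along the lines below (reconstructed from the sample config and the search output further down; check example.sql itself for the exact DDL), and inserts four sample rows:

CREATE TABLE test.documents
(
    id         INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    group_id   INTEGER NOT NULL,
    group_id2  INTEGER NOT NULL,
    date_added DATETIME NOT NULL,
    title      VARCHAR(255) NOT NULL,
    content    TEXT NOT NULL
);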

Configuring Sphinx

The unzipped directory ships with two configuration templates for reference: sphinx.conf.in and sphinx-min.conf.in.

Let me briefly annotate sphinx.conf.in, since its option coverage is the more complete of the two:

#
# Sphinx configuration file sample
#
# WARNING! While this sample file mentions all available options,
# it contains (very) short helper descriptions only. Please refer to
# doc/sphinx.html for details.
#
 
#############################################################################
## data source definition
#############################################################################
## index source ##
source src1
{
 # data source type. mandatory, no default value
 # known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
 type   = mysql # data source type
 
 #####################################################################
 ## SQL settings (for 'mysql' and 'pgsql' types)
 #####################################################################
 
 # some straightforward parameters for SQL source types
 sql_host  = localhost # MySQL server host
 sql_user  = test # MySQL user name
 sql_pass  = # MySQL password
 sql_db   = test # MySQL database name
 sql_port  = 3306 # optional, default is 3306
 
 # UNIX socket name
 # optional, default is empty (reuse client library defaults)
 # usually '/var/lib/mysql/mysql.sock' on Linux
 # usually '/tmp/mysql.sock' on FreeBSD
 #
 # sql_sock  = /tmp/mysql.sock
 
 
 # MySQL specific client connection flags
 # optional, default is 0
 #
 # mysql_connect_flags = 32 # enable compression
 
 # MySQL specific SSL certificate settings
 # optional, defaults are empty
 #
 # mysql_ssl_cert  = /etc/ssl/client-cert.pem
 # mysql_ssl_key  = /etc/ssl/client-key.pem
 # mysql_ssl_ca  = /etc/ssl/cacert.pem
 
 # MS SQL specific Windows authentication mode flag
 # MUST be in sync with charset_type index-level setting
 # optional, default is 0
 #
 # mssql_winauth  = 1 # use currently logged on user credentials
 
 
 # MS SQL specific Unicode indexing flag
 # optional, default is 0 (request SBCS data)
 #
 # mssql_unicode  = 1 # request Unicode data from server
 
 
 # ODBC specific DSN (data source name)
 # mandatory for odbc source type, no default value
 #
 # odbc_dsn  = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
 # sql_query  = SELECT id, data FROM documents.csv
 
 
 # ODBC and MS SQL specific, per-column buffer sizes
 # optional, default is auto-detect
 #
 # sql_column_buffers = content=12M, comments=1M
 
 
 # pre-query, executed before the main fetch query
 # multi-value, optional, default is empty list of queries
 #
 # sql_query_pre  = SET NAMES utf8 # connection charset; pay close attention here: when Chinese text can't be found, a database charset mismatch is usually the culprit
 # sql_query_pre  = SET SESSION query_cache_type=OFF
 
 
 # main document fetch query
 # mandatory, integer document ID field MUST be the first selected column
 # the SQL statement that fetches the data to be indexed
 sql_query  = \
  SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
  FROM documents
 
 
 # joined/payload field fetch query
 # joined fields let you avoid (slow) JOIN and GROUP_CONCAT
 # payload fields let you attach custom per-keyword values (eg. for ranking)
 #
 # syntax is FIELD-NAME 'from'  ( 'query' | 'payload-query' ); QUERY
 # joined field QUERY should return 2 columns (docid, text)
 # payload field QUERY should return 3 columns (docid, keyword, weight)
 #
 # REQUIRES that query results are in ascending document ID order!
 # multi-value, optional, default is empty list of queries
 #
 # sql_joined_field = tags from query; SELECT docid, CONCAT('tag',tagid) FROM tags ORDER BY docid ASC
 # sql_joined_field = wtags from payload-query; SELECT docid, tag, tagweight FROM tags ORDER BY docid ASC
 
 
 # file based field declaration
 #
 # content of this field is treated as a file name
 # and the file gets loaded and indexed in place of a field
 #
 # max file size is limited by max_file_field_buffer indexer setting
 # file IO errors are non-fatal and get reported as warnings
 #
 # sql_file_field  = content_file_path
 
 
 # range query setup, query that must return min and max ID values
 # optional, default is empty
 #
 # sql_query will need to reference $start and $end boundaries
 # if using ranged query:
 #
 # sql_query  = \
 # SELECT doc.id, doc.id AS group, doc.title, doc.data \
 # FROM documents doc \
 # WHERE id>=$start AND id<=$end
 #
 # sql_query_range  = SELECT MIN(id),MAX(id) FROM documents
 
 
 # range query step
 # optional, default is 1024
 #
 # sql_range_step  = 1000
 
 
 # unsigned integer attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # optional bit size can be specified, default is 32
 #
 # sql_attr_uint  = author_id
 # sql_attr_uint  = forum_id:9 # 9 bits for forum_id
 sql_attr_uint  = group_id # attribute used for filtering / conditional queries
 
 # boolean attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # equivalent to sql_attr_uint with 1-bit size
 #
 # sql_attr_bool  = is_deleted
 
 
 # bigint attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # declares a signed (unlike uint!) 64-bit attribute
 #
 # sql_attr_bigint  = my_bigint_id
 
 
 # UNIX timestamp attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # similar to integer, but can also be used in date functions
 #
 # sql_attr_timestamp = posted_ts
 # sql_attr_timestamp = last_edited_ts
 sql_attr_timestamp = date_added
 
 # string ordinal attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # sorts strings (bytewise), and stores their indexes in the sorted list
 # sorting by this attr is equivalent to sorting by the original strings
 #
 # sql_attr_str2ordinal = author_name
 
 
 # floating point attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # values are stored in single precision, 32-bit IEEE 754 format
 #
 # sql_attr_float  = lat_radians
 # sql_attr_float  = long_radians
 
 
 # multi-valued attribute (MVA) attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # MVA values are variable length lists of unsigned 32-bit integers
 #
 # syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
 # ATTR-TYPE is 'uint' or 'timestamp'
 # SOURCE-TYPE is 'field', 'query', or 'ranged-query'
 # QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
 # RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
 #
 # sql_attr_multi  = uint tag from query; SELECT docid, tagid FROM tags
 # sql_attr_multi  = uint tag from ranged-query; \
 # SELECT docid, tagid FROM tags WHERE id>=$start AND id<=$end; \
 # SELECT MIN(docid), MAX(docid) FROM tags
 
 
 # string attribute declaration
 # multi-value (an arbitrary number of these is allowed), optional
 # lets you store and retrieve strings
 #
 # sql_attr_string  = stitle
 
 
 # wordcount attribute declaration
 # multi-value (an arbitrary number of these is allowed), optional
 # lets you count the words at indexing time
 #
 # sql_attr_str2wordcount = stitle
 
 
 # combined field plus attribute declaration (from a single column)
 # stores column as an attribute, but also indexes it as a full-text field
 #
 # sql_field_string = author
 # sql_field_str2wordcount = title
 
 
 # post-query, executed on sql_query completion
 # optional, default is empty
 #
 # sql_query_post  =
 
 
 # post-index-query, executed on successful indexing completion
 # optional, default is empty
 # $maxid expands to max document ID actually fetched from DB
 #
 # sql_query_post_index = REPLACE INTO counters ( id, val ) \
 # VALUES ( 'max_indexed_id', $maxid )
 
 
 # ranged query throttling, in milliseconds
 # optional, default is 0 which means no delay
 # enforces given delay before each query step
 sql_ranged_throttle = 0
 
 # document info query, ONLY for CLI search (ie. testing and debugging)
 # optional, default is empty
 # must contain $id macro and must fetch the document by that id
 sql_query_info  = SELECT * FROM documents WHERE id=$id
 
 # kill-list query, fetches the document IDs for kill-list
 # k-list will suppress matches from preceding indexes in the same query
 # optional, default is empty
 #
 # sql_query_killlist = SELECT id FROM documents WHERE edited>=@last_reindex
 
 
 # columns to unpack on indexer side when indexing
 # multi-value, optional, default is empty list
 #
 # unpack_zlib  = zlib_column
 # unpack_mysqlcompress = compressed_column
 # unpack_mysqlcompress = compressed_column_2
 
 
 # maximum unpacked length allowed in MySQL COMPRESS() unpacker
 # optional, default is 16M
 #
 # unpack_mysqlcompress_maxsize = 16M
 
 
 #####################################################################
 ## xmlpipe2 settings
 #####################################################################
 
 # type   = xmlpipe
 
 # shell command to invoke xmlpipe stream producer
 # mandatory
 #
 # xmlpipe_command  = cat @CONFDIR@/test.xml
 
 # xmlpipe2 field declaration
 # multi-value, optional, default is empty
 #
 # xmlpipe_field  = subject
 # xmlpipe_field  = content
 
 
 # xmlpipe2 attribute declaration
 # multi-value, optional, default is empty
 # all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
 #
 # xmlpipe_attr_timestamp = published
 # xmlpipe_attr_uint = author_id
 
 
 # perform UTF-8 validation, and filter out incorrect codes
 # avoids XML parser choking on non-UTF-8 documents
 # optional, default is 0
 #
 # xmlpipe_fixup_utf8 = 1
}
 
 
# inherited source example
#
# all the parameters are copied from the parent source,
# and may then be overridden in this source definition
source src1throttled : src1
{
 sql_ranged_throttle = 100
}
 
#############################################################################
## index definition
#############################################################################
 
# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
# the index
index test1
{
 # index type
 # optional, default is 'plain'
 # known values are 'plain', 'distributed', and 'rt' (see samples below)
 # type   = plain
 
 # document source(s) to index
 # multi-value, mandatory
 # document IDs must be globally unique across all sources
 source   = src1 # declare the index source
 
 # index files path and file name, without extension
 # mandatory, path must be writable, extensions will be auto-appended
 path   = @CONFDIR@/data/test1 # path and base file name for the index files
 
 # document attribute values (docinfo) storage mode
 # optional, default is 'extern'
 # known values are 'none', 'extern' and 'inline'
 docinfo   = extern # docinfo (document attribute) storage mode
 
 # memory locking for cached data (.spa and .spi), to prevent swapping
 # optional, default is 0 (do not mlock)
 # requires searchd to be run from root
 mlock   = 0 # memory-lock cached data
 
 # a list of morphology preprocessors to apply
 # optional, default is empty
 #
 # builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
 # 'soundex', and 'metaphone'; additional preprocessors available from
 # libstemmer are 'libstemmer_XXX', where XXX is algorithm code
 # (see libstemmer_c/libstemmer/modules.txt)
 #
 # morphology  = stem_en, stem_ru, soundex
 # morphology  = libstemmer_german
 # morphology  = libstemmer_sv
 morphology  = none # morphology (has no effect on Chinese)
 
 # minimum word length at which to enable stemming
 # optional, default is 1 (stem everything)
 #
 # min_stemming_len = 1 # minimum word length at which stemming applies
 
 
 # stopword files list (space separated)
 # optional, default is empty
 # contents are plain text, charset_table and stemming are both applied
 #
 # stopwords  = @CONFDIR@/data/stopwords.txt
 
 
 # wordforms file, in "mapfrom > mapto" plain text format
 # optional, default is empty
 #
 # wordforms  = @CONFDIR@/data/wordforms.txt
 
 
 # tokenizing exceptions file
 # optional, default is empty
 #
 # plain text, case sensitive, space insensitive in map-from part
 # one "Map Several Words => ToASingleOne" entry per line
 #
 # exceptions  = @CONFDIR@/data/exceptions.txt
 
 
 # minimum indexed word length
 # default is 1 (index everything)
 min_word_len  = 1
 
 # charset encoding type
 # optional, default is 'sbcs'
 # known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
 charset_type  = sbcs # data encoding
 
 # charset definition and case folding rules "table"
 # optional, default value depends on charset_type
 #
 # defaults are configured to include English and Russian characters only
 # you need to change the table to include additional ones
 # this behavior MAY change in future versions
 #
 # 'sbcs' default value is
 # charset_table  = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
 #
 # 'utf-8' default value is
 ##### charset table. Note: with the default tables below, Sphinx splits Chinese text into single
 ##### characters (per-character indexing); for real Chinese word segmentation you need a tokenizer plugin such as coreseek or sfc
 # charset_table  = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
 
 
 # ignored characters list
 # optional, default value is empty
 #
 # ignore_chars  = U+00AD
 
 
 # minimum word prefix length to index
 # optional, default is 0 (do not index prefixes)
 #
 # min_prefix_len  = 0 # minimum prefix length
 
 
 # minimum word infix length to index
 # optional, default is 0 (do not index infixes)
 #
 # min_infix_len  = 0 # minimum infix length
 
 
 # list of fields to limit prefix/infix indexing to
 # optional, default value is empty (index all fields in prefix/infix mode)
 #
 # prefix_fields  = filename
 # infix_fields  = url, domain
 
 
 # enable star-syntax (wildcards) when searching prefix/infix indexes
 # search-time only, does not affect indexing, can be 0 or 1
 # optional, default is 0 (do not use wildcard syntax)
 #
 # enable_star  = 1
 
 
 # expand keywords with exact forms and/or stars when searching fit indexes
 # search-time only, does not affect indexing, can be 0 or 1
 # optional, default is 0 (do not expand keywords)
 #
 # expand_keywords  = 1
 
 
 # n-gram length to index, for CJK indexing
 # only supports 0 and 1 for now, other lengths to be implemented
 # optional, default is 0 (disable n-grams)
 #
 # ngram_len  = 1 # split length for non-alphabetic (e.g. CJK) data
 
 
 # n-gram characters list, for CJK indexing
 # optional, default is empty
 #
 # ngram_chars  = U+3000..U+2FA1F
 
 
 # phrase boundary characters list
 # optional, default is empty
 #
 # phrase_boundary  = ., ?, !, U+2026 # horizontal ellipsis
 
 
 # phrase boundary word position increment
 # optional, default is 0
 #
 # phrase_boundary_step = 100
 
 
 # blended characters list
 # blended chars are indexed both as separators and valid characters
 # for instance, AT&T will result in 3 tokens ("at", "t", and "at&t")
 # optional, default is empty
 #
 # blend_chars  = +, &, U+23
 
 
 # blended token indexing mode
 # a comma separated list of blended token indexing variants
 # known variants are trim_none, trim_head, trim_tail, trim_both, skip_pure
 # optional, default is trim_none
 #
 # blend_mode  = trim_tail, skip_pure
 
 
 # whether to strip HTML tags from incoming documents
 # known values are 0 (do not strip) and 1 (do strip)
 # optional, default is 0
 html_strip  = 0
 
 # what HTML attributes to index if stripping HTML
 # optional, default is empty (do not index anything)
 #
 # html_index_attrs = img=alt,title; a=title;
 
 
 # what HTML elements contents to strip
 # optional, default is empty (do not strip element contents)
 #
 # html_remove_elements = style, script
 
 
 # whether to preopen index data files on startup
 # optional, default is 0 (do not preopen), searchd-only
 #
 # preopen   = 1
 
 
 # whether to keep dictionary (.spi) on disk, or cache it in RAM
 # optional, default is 0 (cache in RAM), searchd-only
 #
 # ondisk_dict  = 1
 
 
 # whether to enable in-place inversion (2x less disk, 90-95% speed)
 # optional, default is 0 (use separate temporary files), indexer-only
 #
 # inplace_enable  = 1
 
 
 # in-place fine-tuning options
 # optional, defaults are listed below
 #
 # inplace_hit_gap  = 0 # preallocated hitlist gap size
 # inplace_docinfo_gap = 0 # preallocated docinfo gap size
 # inplace_reloc_factor = 0.1 # relocation buffer size within arena
 # inplace_write_factor = 0.1 # write buffer size within arena
 
 
 # whether to index original keywords along with stemmed versions
 # enables "=exactform" operator to work
 # optional, default is 0
 #
 # index_exact_words = 1
 
 
 # position increment on overshort (less than min_word_len) words
 # optional, allowed values are 0 and 1, default is 1
 #
 # overshort_step  = 1
 
 
 # position increment on stopword
 # optional, allowed values are 0 and 1, default is 1
 #
 # stopword_step  = 1
 
 
 # hitless words list
 # positions for these keywords will not be stored in the index
 # optional, allowed values are 'all', or a list file name
 #
 # hitless_words  = all
 # hitless_words  = hitless.txt
 
 
 # detect and index sentence and paragraph boundaries
 # required for the SENTENCE and PARAGRAPH operators to work
 # optional, allowed values are 0 and 1, default is 0
 #
 # index_sp   = 1
 
 
 # index zones, delimited by HTML/XML tags
 # a comma separated list of tags and wildcards
 # required for the ZONE operator to work
 # optional, default is empty string (do not index zones)
 #
 # index_zones  = title, h*, th
}
 
 
# inherited index example
#
# all the parameters are copied from the parent index,
# and may then be overridden in this index definition
index test1stemmed : test1
{
 path   = @CONFDIR@/data/test1stemmed
 morphology  = stem_en
}
 
 
# distributed index example
#
# this is a virtual index which can NOT be directly indexed,
# and only contains references to other local and/or remote indexes
index dist1
{
 # 'distributed' index type MUST be specified
 type   = distributed
 
 # local index to be searched
 # there can be many local indexes configured
 local   = test1
 local   = test1stemmed
 
 # remote agent
 # multiple remote agents may be specified
 # syntax for TCP connections is 'hostname:port:index1,[index2[,...]]'
 # syntax for local UNIX connections is '/path/to/socket:index1,[index2[,...]]'
 agent   = localhost:9313:remote1
 agent   = localhost:9314:remote2,remote3
 # agent   = /var/run/searchd.sock:remote4
 
 # blackhole remote agent, for debugging/testing
 # network errors and search results will be ignored
 #
 # agent_blackhole  = testbox:9312:testindex1,testindex2
 
 
 # remote agent connection timeout, milliseconds
 # optional, default is 1000 ms, ie. 1 sec
 agent_connect_timeout = 1000
 
 # remote agent query timeout, milliseconds
 # optional, default is 3000 ms, ie. 3 sec
 agent_query_timeout = 3000
}
 
 
# realtime index example
#
# you can run INSERT, REPLACE, and DELETE on this index on the fly
# using MySQL protocol (see 'listen' directive below)
index rt
{
 # 'rt' index type must be specified to use RT index
 type   = rt
 
 # index files path and file name, without extension
 # mandatory, path must be writable, extensions will be auto-appended
 path   = @CONFDIR@/data/rt
 
 # RAM chunk size limit
 # RT index will keep at most this much data in RAM, then flush to disk
 # optional, default is 32M
 #
 # rt_mem_limit  = 512M
 
 # full-text field declaration
 # multi-value, mandatory
 rt_field  = title
 rt_field  = content
 
 # unsigned integer attribute declaration
 # multi-value (an arbitrary number of attributes is allowed), optional
 # declares an unsigned 32-bit attribute
 rt_attr_uint  = gid
 
 # RT indexes currently support the following attribute types:
 # uint, bigint, float, timestamp, string
 #
 # rt_attr_bigint  = guid
 # rt_attr_float  = gpa
 # rt_attr_timestamp = ts_added
 # rt_attr_string  = author
}
 
#############################################################################
## indexer settings
#############################################################################
######### indexer configuration #####
indexer
{
 # memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
 # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
 mem_limit  = 32M # memory limit
 
 # maximum IO calls per second (for I/O throttling)
 # optional, default is 0 (unlimited)
 #
 # max_iops  = 40
 
 
 # maximum IO call size, bytes (for I/O throttling)
 # optional, default is 0 (unlimited)
 #
 # max_iosize  = 1048576
 
 
 # maximum xmlpipe2 field length, bytes
 # optional, default is 2M
 #
 # max_xmlpipe2_field = 4M
 
 
 # write buffer size, bytes
 # several (currently up to 4) buffers will be allocated
 # write buffers are allocated in addition to mem_limit
 # optional, default is 1M
 #
 # write_buffer  = 1M
 
 
 # maximum file field adaptive buffer size
 # optional, default is 8M, minimum is 1M
 #
 # max_file_field_buffer = 32M
}
 
#############################################################################
## searchd settings
#############################################################################
############ the Sphinx search daemon ########
searchd
{
 # [hostname:]port[:protocol], or /unix/socket/path to listen on
 # known protocols are 'sphinx' (SphinxAPI) and 'mysql41' (SphinxQL)
 #
 # multi-value, multiple listen points are allowed
 # optional, defaults are 9312:sphinx and 9306:mysql41, as below
 #
 # listen   = 127.0.0.1
 # listen   = 192.168.0.1:9312
 # listen   = 9312
 # listen   = /var/run/searchd.sock
 listen   = 9312 # listen port; starting with this release Sphinx uses 9312, officially assigned by IANA (earlier versions defaulted to 3312)
 listen   = 9306:mysql41
 
 # log file, searchd run info is logged here
 # optional, default is 'searchd.log'
 log   = @CONFDIR@/log/searchd.log # daemon log; when Sphinx misbehaves this is usually where the answers are, including problems during index rotation
 
 # query log file, all search queries are logged here
 # optional, default is empty (do not log queries)
 query_log  = @CONFDIR@/log/query.log # client query log; analyze this file if you want statistics on search keywords
 
 # client read timeout, seconds
 # optional, default is 5
 read_timeout  = 5 # read timeout
 
 # request timeout, seconds
 # optional, default is 5 minutes
 client_timeout  = 300
 
 # maximum amount of children to fork (concurrent searches to run)
 # optional, default is 0 (unlimited)
 max_children  = 30 # maximum number of concurrent searchd child processes
 
 # PID file, searchd process ID file name
 # mandatory
 pid_file  = @CONFDIR@/log/searchd.pid # process ID file
 
 # max amount of matches the daemon ever keeps in RAM, per-index
 # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
 # default is 1000 (just like Google)
 max_matches  = 1000 # maximum number of matches returned per query
 
 # seamless rotate, prevents rotate stalls if precaching huge datasets
 # optional, default is 1
 seamless_rotate  = 1 # seamless rotation; usually needed when doing incremental (delta) indexing
 
 # whether to forcibly preopen all indexes on startup
 # optional, default is 1 (preopen everything)
 preopen_indexes  = 1
 
 # whether to unlink .old index copies on successful rotation.
 # optional, default is 1 (do unlink)
 unlink_old  = 1
 
 # attribute updates periodic flush timeout, seconds
 # updates will be automatically dumped to disk this frequently
 # optional, default is 0 (disable periodic flush)
 #
 # attr_flush_period = 900
 
 
 # instance-wide ondisk_dict defaults (per-index value take precedence)
 # optional, default is 0 (precache all dictionaries in RAM)
 #
 # ondisk_dict_default = 1
 
 
 # MVA updates pool size
 # shared between all instances of searchd, disables attr flushes!
 # optional, default size is 1M
 mva_updates_pool = 1M
 
 # max allowed network packet size
 # limits both query packets from clients, and responses from agents
 # optional, default size is 8M
 max_packet_size  = 8M
 
 # crash log path
 # searchd will (try to) log crashed query to 'crash_log_path.PID' file
 # optional, default is empty (do not create crash logs)
 #
 # crash_log_path  = @CONFDIR@/log/crash
 
 
 # max allowed per-query filter count
 # optional, default is 256
 max_filters  = 256
 
 # max allowed per-filter values count
 # optional, default is 4096
 max_filter_values = 4096
 
 
 # socket listen queue length
 # optional, default is 5
 #
 # listen_backlog  = 5
 
 
 # per-keyword read buffer size
 # optional, default is 256K
 #
 # read_buffer  = 256K
 
 
 # unhinted read size (currently used when reading hits)
 # optional, default is 32K
 #
 # read_unhinted  = 32K
 
 
 # max allowed per-batch query count (aka multi-query count)
 # optional, default is 32
 max_batch_queries = 32
 
 
 # max common subtree document cache size, per-query
 # optional, default is 0 (disable subtree optimization)
 #
 # subtree_docs_cache = 4M
 
 
 # max common subtree hit cache size, per-query
 # optional, default is 0 (disable subtree optimization)
 #
 # subtree_hits_cache = 8M
 
 
 # multi-processing mode (MPM)
 # known values are none, fork, prefork, and threads
 # optional, default is fork
 #
 workers   = threads # for RT to work
 
 
 # max threads to create for searching local parts of a distributed index
 # optional, default is 0, which means disable multi-threaded searching
 # should work with all MPMs (ie. does NOT require workers=threads)
 #
 # dist_threads  = 4
 
 
 # binlog files path; use empty string to disable binlog
 # optional, default is build-time configured data directory
 #
 # binlog_path  = # disable logging
 # binlog_path  = @CONFDIR@/data # binlog.001 etc will be created there
 
 
 # binlog flush/sync mode
 # 0 means flush and sync every second
 # 1 means flush and sync every transaction
 # 2 means flush every transaction, sync every second
 # optional, default is 2
 #
 # binlog_flush  = 2
 
 
 # binlog per-file size limit
 # optional, default is 128M, 0 means no limit
 #
 # binlog_max_log_size = 256M
 
 
 # per-thread stack size, only affects workers=threads mode
 # optional, default is 64K
 #
 # thread_stack   = 128K
 
 
 # per-keyword expansion limit (for dict=keywords prefix searches)
 # optional, default is 0 (no limit)
 #
 # expansion_limit  = 1000
 
 
 # RT RAM chunks flush period
 # optional, default is 0 (no periodic flush)
 #
 # rt_flush_period  = 900
 
 
 # query log file format
 # optional, known values are plain and sphinxql, default is plain
 #
 # query_log_format  = sphinxql
 
 
 # version string returned to MySQL network protocol clients
 # optional, default is empty (use Sphinx version)
 #
 # mysql_version_string = 5.0.37
 
 
 # trusted plugin directory
 # optional, default is empty (disable UDFs)
 #
 # plugin_dir   = /usr/local/sphinx/lib
 
 
 # default server-wide collation
 # optional, default is libc_ci
 #
 # collation_server  = utf8_general_ci
 
 
 # server-wide locale for libc based collations
 # optional, default is C
 #
 # collation_libc_locale = ru_RU.UTF-8
 
 
 # threaded server watchdog (only used in workers=threads mode)
 # optional, values are 0 and 1, default is 1 (watchdog on)
 #
 # watchdog    = 1
 
 
 # SphinxQL compatibility mode (legacy columns and their names)
 # optional, default is 0 (SQL compliant syntax and result sets)
 #
 # compat_sphinxql_magics = 1
}
 
# --eof--

For actual use, I modify the config file sphinx-min.conf.in directly:

#
# Minimal Sphinx configuration sample (clean, simple, functional)
#
 
source src1
{
 type   = mysql
 
 sql_host  = localhost
 sql_user  = root
 sql_pass  = 123456
 sql_db   = test
 sql_port  = 3306 # optional, default is 3306
 
 sql_query  = \
  SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
  FROM documents
 
 sql_attr_uint  = group_id
 sql_attr_timestamp = date_added
 
 sql_query_info  = SELECT * FROM documents WHERE id=$id
}
 
 
index test1
{
 source   = src1
 path   = D:/MyTemp/sphinxdata/data/test1
 docinfo   = extern
 charset_type  = sbcs
}
 
 
index testrt
{
 type   = rt
 rt_mem_limit  = 32M
 
 path   = D:/MyTemp/sphinxdata/data/testrt
 charset_type  = utf-8
 
 rt_field  = title
 rt_field  = content
 rt_attr_uint  = gid
}
 
 
indexer
{
 mem_limit  = 32M
}
 
 
searchd
{
 listen   = 9312
 listen   = 9306:mysql41
 log   = D:/MyTemp/sphinxdata/log/searchd.log
 query_log  = D:/MyTemp/sphinxdata/log/query.log
 read_timeout  = 5
 max_children  = 30
 pid_file  = D:/MyTemp/sphinxdata/log/searchd.pid
 max_matches  = 1000
 seamless_rotate  = 1
 preopen_indexes  = 1
 unlink_old  = 1
 workers   = threads # for RT to work
 binlog_path  = D:/MyTemp/sphinxdata/data
}

Save it as sphinx.conf.
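One caveat before running anything: as far as I can tell, indexer and searchd will not create missing directories, so create the data and log directories referenced in the config first:

mkdir D:\MyTemp\sphinxdata\data
mkdir D:\MyTemp\sphinxdata\log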

Building the index

bin/indexer.exe is the program that builds indexes; see its command-line help for the full list of options.

Run:

indexer --config D:\MyTemp\sphinx\sphinx.conf test1

The output:

Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
 
using config file 'D:\MyTemp\sphinx\sphinx.conf'...
indexing index 'test1'...
collected 4 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 4 docs, 193 bytes
total 0.316 sec, 609 bytes/sec, 12.63 docs/sec
total 2 reads, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
total 9 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
Error in my_thread_global_end(): 1 threads didn't exit
The final line, "Error in my_thread_global_end(): 1 threads didn't exit", comes from the MySQL client library on Windows and, as far as I know, is harmless here.
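Two indexer flags worth knowing: --all builds every index defined in the config file instead of naming them one by one, and --rotate is needed when re-indexing while searchd is already running (it builds new index files and signals searchd to swap them in, which is where seamless_rotate comes in):

indexer --config D:\MyTemp\sphinx\sphinx.conf --all
indexer --config D:\MyTemp\sphinx\sphinx.conf --rotate test1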

 

Starting the search daemon

Run:

searchd --config D:\MyTemp\sphinx\sphinx.conf

The output:

WARNING: compat_sphinxql_magics=1 is deprecated; please update your application
and config
listening on all interfaces, port=9312
listening on all interfaces, port=9306
precaching index 'test1'
precaching index 'testrt'
precached 2 indexes in 0.012 sec
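Running searchd in a console window is fine for testing; on Windows it can also register itself as a system service so it starts with the machine (run from an administrator prompt; the service name below is just an example):

searchd --install --config D:\MyTemp\sphinx\sphinx.conf --servicename SphinxSearch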

 

Testing a query

Open example.sql to see what data is in the table; let's search for the word 'document'. The command:

search -c D:\MyTemp\sphinx\sphinx.conf document

The result:

Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
 
using config file 'D:\MyTemp\sphinx\sphinx.conf'...
index 'test1': query 'document ': returned 2 matches of 2 total in 0.000 sec
 
displaying matches:
1. document=1, weight=1557, group_id=1, date_added=Thu May 24 09:54:32 2012
        id=1
        group_id=1
        group_id2=5
        date_added=2012-05-24 09:54:32
        title=test one
        content=this is my test document number one. also checking search within
 phrases.
2. document=2, weight=1557, group_id=1, date_added=Thu May 24 09:54:32 2012
        id=2
        group_id=1
        group_id2=6
        date_added=2012-05-24 09:54:32
        title=test two
        content=this is my test document number two
 
words:
1. 'document': 2 documents, 2 hits
 
index 'testrt': search error: failed to open D:/MyTemp/sphinxdata/data/testrt.sp
h: No such file or directory.
Error in my_thread_global_end(): 1 threads didn't exit
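The testrt error at the end is expected: the search CLI reads index files straight off disk, and an RT index's files only exist once searchd is managing it, so RT indexes have to be queried through searchd itself. Since the config includes listen = 9306:mysql41, any stock MySQL client can do that over the SphinxQL protocol. A quick sketch (the document id and sample text are arbitrary):

mysql -h 127.0.0.1 -P 9306

SELECT * FROM test1 WHERE MATCH('document');

INSERT INTO testrt (id, title, content, gid) VALUES (5, 'rt test', 'my first realtime document', 1);
SELECT * FROM testrt WHERE MATCH('realtime');

The INSERT lands directly in the RT index's RAM chunk, so the SELECT right after it already finds the row; no indexer run is involved, which is exactly why the config sets workers = threads.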