Testing Sphinx 1.10

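With Sphinx 1.10 compiled, let's see how it is actually used.
Broadly speaking, a retrieval system comes down to two processes: building an index and searching it.
Because we do not plan to use the MySQL engine part and instead feed data in through the XML interface, the steps here differ quite a bit from the official documentation.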
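To make the two phases concrete, here is a toy inverted index in Python. It only illustrates the idea and is not how Sphinx implements indexing; the names build_index and search and the sample texts are made up for this sketch.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Indexing phase: map every lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Search phase: return the sorted ids of documents containing every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical sample documents, loosely based on the test data used in this post.
docs = {
    1234: "this entry must be handled properly",
    1235: "in-document field order must not matter",
}
index = build_index(docs)
print(search(index, "must"))
print(search(index, "must matter"))
```

The index is built once and then answers many queries, which is the same split as running indexer first and querying searchd afterwards.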

1. Configuring Sphinx

cd /usr/local/sphinx/etc
sudo cp sphinx.conf.dist sphinx.conf

# edit the configuration file
sudo vim sphinx.conf

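# XML data source configuration
source src1
{
    # must be xmlpipe2: test.xml uses the <sphinx:docset> format with an embedded schema
    type			= xmlpipe2
    # the XML data source is located at /usr/local/sphinx/var/test.xml
    xmlpipe_command		= cat /usr/local/sphinx/var/test.xml

    # xmlpipe2 field and attr definitions can be written as a schema inside test.xml, so they are omitted here

    # this setting is optional
    xmlpipe_fixup_utf8	= 1
}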

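# index configuration
index test1
{
    # index type: plain, distributed, or rt (real-time)
    type	= plain
    # bind this index to the data source src1 configured above
    source			= src1
    # where the index is stored; path is a file-name prefix without extension, so the files are written under var/data as test1.*
    path			= /usr/local/sphinx/var/data/test1
    # store document info (attributes) externally
    docinfo			= extern
    # memory locking; keep the default
    mlock			= 0
    # morphology preprocessors, such as stemmers for normalization
    morphology		= none
    # the XML data source must be UTF-8
    charset_type		= utf-8
}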
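Since the XML source has to be UTF-8, it can help to check the file before indexing. The following is a minimal sketch in Python; the helper names are invented for illustration and are not part of Sphinx.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_file(path):
    """Read a file as raw bytes and report where UTF-8 decoding first fails, if anywhere."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())


# Usage (path taken from the configuration above):
#   offset = check_file("/usr/local/sphinx/var/test.xml")
#   offset is None when the whole file decodes as UTF-8
```

A file saved in another encoding such as GBK would typically fail this check at its first non-ASCII character.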

# search configuration

The data source, test.xml, is attached below:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>

<sphinx:document id="1234">
	<content>this is the main content <![CDATA[[and this<cdata> entry must be handled properly by xml parser lib]]></content>
	<published>1012325463</published>
	<subject>note how field/attr tag can be in <b class="red">randomized</b> order. Test link <a href="http://soh0.info">搜狐百科</a></subject>
	<misc>some undeclared element</misc>
</sphinx:document>

<sphinx:document id="1235">
	<subject>another subject</subject>
	<content>here comes another document, and i am given to understand,	that in-document field order must not matter,sir</content>
	<published>1012325467</published>
</sphinx:document>

</sphinx:docset>
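Because xmlpipe_command simply runs a command and reads the XML from its standard output, the static file could be replaced by a script that generates the stream. Below is a rough Python sketch of such a generator; build_docset and its argument layout are assumptions made for this example, not a Sphinx API. It escapes text content and skips elements that are not declared in the schema, which avoids the kind of "unknown field/attribute" warning that the undeclared misc element triggers.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(fields, attrs, docs):
    """Return an xmlpipe2 document stream as a string.

    fields: list of full-text field names
    attrs:  list of (name, type) pairs
    docs:   list of (doc_id, {element_name: text}) pairs
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        out.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, attr_type in attrs:
        out.append("<sphinx:attr name=%s type=%s/>" % (quoteattr(name), quoteattr(attr_type)))
    out.append("</sphinx:schema>")

    declared = set(fields) | {name for name, _ in attrs}
    for doc_id, values in docs:
        out.append("<sphinx:document id=%s>" % quoteattr(str(doc_id)))
        for key, text in values.items():
            if key not in declared:
                continue  # drop elements the schema does not declare
            out.append("<%s>%s</%s>" % (key, escape(str(text)), key))
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    import sys

    sys.stdout.write(build_docset(
        ["subject", "content"],
        [("published", "timestamp")],
        [(1235, {"subject": "another subject",
                 "content": "field order must not matter",
                 "published": 1012325467})],
    ))
```

If this were saved under a hypothetical name such as gen_docset.py, the source block could point xmlpipe_command at that script instead of cat.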

2. Building the index

# build only the test1 index
sudo /usr/local/sphinx/bin/indexer test1
# output
Sphinx 1.10-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/sphinx/etc/sphinx.conf'...
indexing index 'test1'...
WARNING: source 'src1': unknown field/attribute 'misc'; ignored (line=15, pos=1, docid=0)
WARNING: source 'src1': unexpected string 'some undeclared element' (line=15, pos=7) inside <sphinx:document>
collected 2 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 2 docs, 264 bytes
total 0.000 sec, 328767 bytes/sec, 2490.66 docs/sec
total 3 reads, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
total 9 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg

3. Searching
First, start the search daemon:

# start the daemon
sudo /usr/local/sphinx/bin/searchd

# output
Sphinx 1.10-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/sphinx/etc/sphinx.conf'...
listening on all interfaces, port=9312
precaching index 'test1'
precached 1 indexes in 0.000 sec

Then run a test search:

# test search for the word "must"
/usr/local/sphinx/bin/search must

# search results
Sphinx 1.10-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/sphinx/etc/sphinx.conf'...
index 'test1': query 'must ': returned 2 matches of 2 total in 0.000 sec

displaying matches:
1. document=1234, weight=1356, published=Wed Jan 30 01:31:03 2002, author_id=1
2. document=1235, weight=1356, published=Wed Jan 30 01:31:07 2002, author_id=1

words:
1. 'must': 2 documents, 2 hits

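Overall, getting started is fairly easy, but using it well is considerably more complicated. How to index Chinese documents, for example, still needs careful study: the built-in tokenizer performs poorly, and I am still not comfortable relying on a modified build such as CoreSeek.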

3 thoughts on "Testing Sphinx 1.10"

  1. coder4 Post author

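    @志达: Actually, I don't think C++ necessarily performs better than Java, but Sphinx does support distributed search, so it can be scaled out whenever performance falls short. Sphinx already has about 10 years of history; running TB-scale data on a single machine is routine, and the larger Sphinx clusters already hold more than 16 billion documents. What I like most about it is how few resources it needs, which suits environments such as a VPS. Modified versions of it can even run on mobile phones...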

  2. coder4 Post author

    @志达: So Solr supports distributed search too. I was ill-informed, haha.
