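Lucene in Action, Chapter 3: Search

Main classes

IndexSearcher: the central class for searching an index.

Query (and its concrete subclasses): describes the logical query and is passed to IndexSearcher's search method.

QueryParser: converts a human-entered query string into a Query object.

TopDocs: holds the top-scoring documents; returned by IndexSearcher's search method.

ScoreDoc: one entry in TopDocs. It holds only a reference to the Document (its document ID) together with the score, not the Document itself.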

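3.1 Implementing a simple search feature

Complex queries can be expressed either as a query string in Lucene's syntax or as a combination of Query objects. IndexSearcher always takes a Query; QueryParser is what turns a query string into one.

Term

Searches for a term in one specific field (Field).

A simple search example: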

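import junit.framework.TestCase;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

// TestUtil is the helper class from the book's sample code.
public class BasicSearchingTest extends TestCase {
  public void testTerm() throws Exception {
    Directory dir = TestUtil.getBookIndexDirectory();      // open the book index
    IndexSearcher searcher = new IndexSearcher(dir, true); // true = read-only

    Term t = new Term("subject", "ant");
    Query query = new TermQuery(t);
    TopDocs docs = searcher.search(query, 10);
    assertEquals("JDwA", 1, docs.totalHits);  // one book has the subject "ant"

    t = new Term("subject", "junit");
    docs = searcher.search(new TermQuery(t), 10);
    assertEquals(2, docs.totalHits);          // two books have the subject "junit"

    searcher.close();
  }
}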

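Constructing the Term (a field name plus the exact term text) is the key step.

QueryParser

QueryParser converts a query string (a String) into a Query object and supports operators such as OR, + and -.

Its constructor is: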

QueryParser(Version matchVersion, String field, Analyzer analyzer)

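matchVersion: these notes simply pass Version.LUCENE_CURRENT.

field: the default field to search when a term does not name one.

analyzer: used to process the text of the query string. On the search side, the analyzer is applied only by QueryParser; Query objects built directly through the API are not analyzed.

QueryParser's parse method parses the string and produces the Query object: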

public Query parse(String query) throws ParseException

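If parsing fails, a ParseException is thrown; otherwise the Query object is returned.

If the query string contains several terms, they are combined with OR by default.

Common query-string expressions:

java : documents containing the term java in the default field

java junit or java OR junit : documents containing java or junit in the default field

+java +junit or java AND junit : documents containing both java and junit in the default field

title:ant : documents whose title field contains ant

title:extreme -subject:sports or title:extreme AND NOT subject:sports : documents whose title field contains extreme and whose subject field does not contain sports

title:"junit in action" : documents whose title contains the exact phrase "junit in action"

java* : terms beginning with java, for example javascript or java.net

java~ : terms similar to java, for example lava

lastmodified:[1/1/04 TO 12/31/04] : documents whose lastmodified field falls between the two dates

In short, the syntax is powerful, and every query string above can also be built by combining Query objects.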

3.2 Using IndexSearcher

Creating an IndexSearcher takes three steps:

Directory dir = FSDirectory.open(new File("/path/xxx"));

IndexReader reader = IndexReader.open(dir);

IndexSearcher searcher = new IndexSearcher(reader);

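IndexReader wraps the low-level index access. Opening a reader is expensive, so a reader should be reused across searches.

Once opened, however, a reader does not see later changes to the index. To pick them up, call reopen:

reopen reuses as much of the existing reader as it can. If the index has not changed it returns the same instance; otherwise it returns a new IndexReader, so the result has to be checked.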

IndexReader newReader = reader.reopen();
if (reader != newReader) {
  reader.close();
  reader = newReader;
  searcher = new IndexSearcher(reader);
}

Performing a search

IndexSearcher offers many search methods; any of the following can be used.

TopDocs search(Query query, int n)

TopDocs search(Query query, Filter filter, int n)

TopFieldDocs search(Query query, Filter filter, int n, Sort sort)

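TopDocs

Most search methods return a TopDocs holding the results, already sorted by descending score. It exposes three members:

totalHits: the number of Documents that matched the query

scoreDocs: the array of individual results (each carrying a score and a document ID)

getMaxScore(): the highest score among the matches

Paging through results

Two approaches are common in Lucene:

1. Fetch a large number of results on the first query and serve pages from that set as the user requests them.

2. Re-run the query for every page.

Because HTTP is a stateless protocol, re-running the query avoids having to store per-user state and is usually the better choice. Each time the user moves to a later page, increase n and display only the last page's worth of the returned results.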
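The re-query approach can be sketched without any Lucene classes. This is a minimal illustration, not code from the book: PagingSketch, hitsNeeded and pageOf are hypothetical names, and the list of integers stands in for the document IDs that a search with a larger n would return.

```java
import java.util.ArrayList;
import java.util.List;

final class PagingSketch {
    /** Number of hits to request (the "n" argument) so that a zero-based page is covered. */
    static int hitsNeeded(int page, int pageSize) {
        return (page + 1) * pageSize;
    }

    /** Returns the slice of the top hits that belongs to the given zero-based page. */
    static <T> List<T> pageOf(List<T> topHits, int page, int pageSize) {
        int from = Math.min(page * pageSize, topHits.size());
        int to = Math.min(from + pageSize, topHits.size());
        return new ArrayList<>(topHits.subList(from, to));
    }

    public static void main(String[] args) {
        // Stand-in for the document IDs of a search that matched 25 documents.
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            hits.add(i);
        }
        System.out.println(hitsNeeded(2, 10));   // n to request for the third page
        System.out.println(pageOf(hits, 2, 10)); // the last, partial page
    }
}
```

Requesting (page + 1) * pageSize hits and discarding the leading pages keeps the server stateless, at the cost of slightly more work for deep pages.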

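"Real-time" search

The key to near-real-time search is not to build a fresh Directory -> IndexReader chain yourself for every search, but to use the following:

IndexWriter.getReader(): returns a reader that sees the writer's latest changes immediately, without first committing the index.

IndexReader newReader = reader.reopen(): reuses the existing reader and is much cheaper than open. Note that if newReader != reader, the old reader must be closed.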

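3.3 Understanding the score

Lucene uses a score to measure how well a Document matches a Query.

The scoring formula

For a detailed derivation of the score, see "Lucene打分公式的数学推导" (A Mathematical Derivation of the Lucene Scoring Formula):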

http://topic.csdn.net/u/20100308/21/3386acef-d853-4738-9941-2a8b0ee157ca.html

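The factors play the following roles:

tf(t in d): term frequency, how often Term t occurs in document d

idf(t): inverse document frequency, based on how many documents contain Term t; the fewer documents contain it, the higher the value

norm(t, d): the normalization factor, which combines three values:

Document boost: the larger this value, the more important the document.

Field boost: the larger this value, the more important the field.

lengthNorm(field) = (1.0 / Math.sqrt(numTerms)): the more Terms a field contains (the longer the document), the smaller this value; the shorter the field, the larger the value.

boost(t.field in d): an additional boost

coord(q, d): the coordination factor. A document that matches more of the query's Terms scores higher than one that matches fewer, which gives an AND-like preference.

queryNorm(q): computed from the sum of the squared weights of the query terms. It does not affect ranking; it only makes scores from different queries comparable.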
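For reference, the factors combine roughly as follows. The formula is written from my recollection of the Javadoc of Lucene 3.x's Similarity class rather than from these notes, so treat the exact notation as an assumption; in particular the per-term boost is written boost(t) here, and docBoost, fieldBoost and numTerms are shorthand names for the three values listed under norm(t, d).

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t \text{ in } d)\cdot\mathrm{idf}(t)^{2}\cdot
  \mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Big)

\mathrm{norm}(t,d) = \mathrm{docBoost}(d)\cdot\mathrm{fieldBoost}(f)\cdot\mathrm{lengthNorm}(f),
\qquad
\mathrm{lengthNorm}(f) = \frac{1}{\sqrt{\mathrm{numTerms}(f)}}
```

As a quick check of lengthNorm: a field with 4 Terms gets 1/\sqrt{4} = 0.5, while a field with 16 Terms gets 1/\sqrt{16} = 0.25, so the shorter field contributes more per matching term.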

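A document's position in the results can be raised with a Boost, and the scoring itself can be customized by extending Similarity.

Using explain to understand a score

Although the formula is complicated, the built-in explain() method shows how a particular score was computed.

Explanation explanation = searcher.explain(query, docId);

The returned Explanation lists every step of the score calculation in detail.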

3.4 Lucene's Query types

TermQuery

Searches for a term in a given field.
    IndexSearcher searcher = new IndexSearcher(TestUtil.getBookIndexDirectory());

    Term t = new Term("isbn", "1930110995");
    Query query = new TermQuery(t);
    TopDocs docs = searcher.search(query, 10);
    assertEquals("JUnit in Action", 1, docs.totalHits);

    searcher.close();

TermRangeQuery

Terms in the index are stored in lexicographic order, so Lucene can easily search a range of terms.

Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
TermRangeQuery query = new TermRangeQuery("title2", "d", "j", true, true);

TopDocs matches = searcher.search(query, 100);
assertEquals(3, matches.totalHits);
searcher.close();
dir.close();

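The two true arguments say whether the end points d and j themselves are included.

A Collator can also be supplied to control how terms are compared against the range bounds, but performance is then very poor.

NumericRangeQuery

Similar to TermRangeQuery, except that the range is over numeric values.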

Directory dir = TestUtil.getBookIndexDirectory();
IndexSearcher searcher = new IndexSearcher(dir);
// pub date of TTC was October 1988
NumericRangeQuery query = NumericRangeQuery.newIntRange("pubmonth",
                                                            198805,
                                                            198810,
                                                            true,
                                                            true);

TopDocs matches = searcher.search(query, 100);
assertEquals(1, matches.totalHits);
searcher.close();
dir.close();

PrefixQuery

Prefix search: matches only documents containing terms that begin with the given string.

IndexSearcher searcher = new IndexSearcher(TestUtil.getBookIndexDirectory());

// search for programming books, including subcategories
Term term = new Term("category",                              //#A
                         "/technology/computers/programming");    //#A
PrefixQuery query = new PrefixQuery(term);                    //#A

TopDocs matches = searcher.search(query, 10);                 //#A
int programmingAndBelow = matches.totalHits;

// only programming books, not subcategories
matches = searcher.search(new TermQuery(term), 10);           //#B
int justProgramming = matches.totalHits;

assertTrue(programmingAndBelow > justProgramming);
searcher.close();

BooleanQuery

Combines other Query objects with AND, OR and NOT logic.

public void add(Query query, BooleanClause.Occur occur)

The occur argument selects AND, OR or NOT:

AND: set occur to Occur.MUST

OR: set occur to Occur.SHOULD

NOT: set occur to Occur.MUST_NOT
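To make the three Occur values concrete, here is a toy model of their matching semantics over plain sets of terms. It is only a sketch under my own assumptions: OccurSketch and its local Occur enum are hypothetical stand-ins, not Lucene's classes, and scoring is ignored entirely.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class OccurSketch {
    // Mirrors the three names in BooleanClause.Occur; this enum is local to the sketch.
    enum Occur { MUST, SHOULD, MUST_NOT }

    /** Decides whether a document, given as its set of terms, satisfies the clauses. */
    static boolean matches(Map<String, Occur> clauses, Set<String> docTerms) {
        boolean hasMust = false;
        boolean anyShouldHit = false;
        for (Map.Entry<String, Occur> clause : clauses.entrySet()) {
            boolean hit = docTerms.contains(clause.getKey());
            switch (clause.getValue()) {
                case MUST:
                    hasMust = true;
                    if (!hit) return false; // a required term is missing
                    break;
                case MUST_NOT:
                    if (hit) return false;  // a prohibited term is present
                    break;
                case SHOULD:
                    anyShouldHit |= hit;
                    break;
            }
        }
        // Without any MUST clause, at least one SHOULD clause has to match;
        // a query made only of MUST_NOT clauses therefore matches nothing.
        return hasMust || anyShouldHit;
    }

    public static void main(String[] args) {
        Map<String, Occur> clauses = new LinkedHashMap<>();
        clauses.put("java", Occur.MUST);
        clauses.put("sports", Occur.MUST_NOT);
        Set<String> doc = new HashSet<>(Arrays.asList("java", "junit"));
        System.out.println(matches(clauses, doc));
    }
}
```

The final return statement is the part worth noting: SHOULD clauses are optional once a MUST clause is present, and prohibited clauses alone never select any document.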

PhraseQuery

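PhraseQuery matches documents in which several terms occur close to one another.

slop expresses the allowed "distance" between the terms; setting the slop on a PhraseQuery controls how loosely the terms may be positioned.

For example, given this Field:

doc.add(new Field("field", "the quick brown fox jumped over the lazy dog", Field.Store.YES, Field.Index.ANALYZED));

two adjacent terms in their original order always match, whatever the slop (even 0):

PhraseQuery query = new PhraseQuery();

query.add(new Term("field", "quick"));

query.add(new Term("field", "brown"));

This query matches the document.

By contrast, brown followed by quick is the reverse of the order in the document (order matters). Swapping two adjacent terms takes two moves, so this query matches only when slop is at least 2.

The following assertions show results for multi-term PhraseQuery instances (matched is the book's helper that builds a PhraseQuery from the given terms and slop and reports whether it hits):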

    assertFalse("not close enough",
        matched(new String[] {"quick", "jumped", "lazy"}, 3));

    assertTrue("just enough",
        matched(new String[] {"quick", "jumped", "lazy"}, 4));

    assertFalse("almost but not quite",
        matched(new String[] {"lazy", "jumped", "quick"}, 7));

    assertTrue("bingo",
        matched(new String[] {"lazy", "jumped", "quick"}, 8));

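slop is effectively a move distance: the number of position moves needed to rearrange the query's terms so that they line up with the document.

WildcardQuery: wildcard search

Query query = new WildcardQuery(new Term("contents", "?ild*"));

WildcardQuery can perform poorly. The longer the literal prefix (the part before the first * or ?), the fewer terms have to be enumerated; at the other extreme, a pattern that starts with a wildcard forces a scan of every term.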
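In Lucene's wildcard syntax, ? matches exactly one character and * matches zero or more. The sketch below illustrates only that matching rule by translating a pattern into a java.util.regex expression; it is not how Lucene evaluates a WildcardQuery (Lucene enumerates index terms), and WildcardSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");          // zero or more characters
            } else if (c == '?') {
                regex.append('.');           // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"wild", "mildew", "ild", "child"}) {
            System.out.println(term + " -> " + matches("?ild*", term));
        }
    }
}
```

With the pattern ?ild* from the example, wild and mildew match, while ild (no character for the ?) and child (the second character is not i) do not.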

FuzzyQuery: fuzzy search

It is based on the edit distance (Levenshtein distance): the number of character deletions, insertions, or substitutions required to transform one string to the other string.

For example:

indexSingleFieldDocs(new Field[] { new Field("contents",
                                                 "fuzzy",
                                                 Field.Store.YES,
                                                 Field.Index.ANALYZED),
                                       new Field("contents",
                                                 "wuzzy",
                                                 Field.Store.YES,
                                                 Field.Index.ANALYZED)
                                     });

    IndexSearcher searcher = new IndexSearcher(directory);
    Query query = new FuzzyQuery(new Term("contents", "wuzza"));
    TopDocs matches = searcher.search(query, 10);
    assertEquals("both close enough", 2, matches.totalHits);
    assertTrue("wuzzy closer than fuzzy",
               matches.scoreDocs[0].score != matches.scoreDocs[1].score);

    Document doc = searcher.doc(matches.scoreDocs[0].doc);

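With FuzzyQuery, the query term wuzza matches both wuzzy and fuzzy.

FuzzyQuery does not take a "distance" directly; it takes a similarity threshold between 0 and 1.

For example, the constructor:

FuzzyQuery(Term term, float minimumSimilarity, int prefixLength)

A candidate term's similarity is computed as roughly 1 - editDistance / min(length of the query term, length of the candidate term), and the candidate matches when that similarity exceeds minimumSimilarity. prefixLength is the number of leading characters that must match exactly.

FuzzyQuery enumerates all the Terms in the index (for that field), which is expensive.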
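The arithmetic behind the wuzza example can be checked with a small stand-alone sketch. It assumes the similarity is 1 - editDistance / min(lengths), which is my reading of how Lucene 3.x normalizes the distance, and it ignores prefixLength; FuzzySketch and its method names are hypothetical.

```java
final class FuzzySketch {
    /** Classic dynamic-programming Levenshtein edit distance, keeping two rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in the assumed form: 1 - distance / length of the shorter string. */
    static double similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) {
            return query.length() == candidate.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(query, candidate) / shorter;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("wuzza", "wuzzy")); // one substitution
        System.out.println(editDistance("wuzza", "fuzzy")); // two substitutions
        System.out.println(similarity("wuzza", "wuzzy") > similarity("wuzza", "fuzzy"));
    }
}
```

Under this assumption wuzzy (one substitution out of five characters) is more similar to wuzza than fuzzy (two substitutions), and both stay above a threshold of 0.5, which is consistent with the test expecting two hits with different scores.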

MatchAllDocsQuery

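MatchAllDocsQuery matches every Doc in the index. By default every match is scored with a constant Boost of 1.0, and the score can optionally be derived from the Boost recorded for a given field.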

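3.5 QueryParser

The Query API can build powerful queries, but a Query does not have to be assembled entirely through the API. It can also be produced along the path

String -> QueryParser.parse -> Query.

For example: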

+pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache)

Inside a query string, the following characters must be escaped by putting a '\' in front of them:

\ + - ! ( ) : ^ ] { } ~ * ?
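Lucene's QueryParser has a static escape(String) method for this. The stand-alone sketch below only illustrates the idea of backslash-escaping; EscapeSketch is a hypothetical name, and its character set is the list above plus '[' and '"', which I have added as an assumption.

```java
final class EscapeSketch {
    // The characters listed above, plus '[' and '"' (added here as an assumption).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    /** Prefixes every special query-syntax character with a backslash. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2"));
        System.out.println(escape("junit"));
    }
}
```

Escaping is meant for user input that should be searched literally; applying it to a string that intentionally uses operators would disable those operators.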

Query.toString() returns the query-string representation of any Query object.

For example, a BooleanQuery (named query here) with these two clauses:

query.add(new FuzzyQuery(new Term("field", "kountry")),
BooleanClause.Occur.MUST);

query.add(new TermQuery(new Term("title", "western")),
BooleanClause.Occur.SHOULD);

produces this toString output:

+field:kountry~0.5 title:western

Note: FuzzyQuery's default minimum similarity is 0.5.

TermQuery

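QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
                                     "subject", analyzer);
Query query = parser.parse("computers");
System.out.println("term: " + query);

This prints:

term: subject:computers

that is, the field:term form.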

TermRangeQuery

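Query strings for range searches:

title2:[K TO N] (square brackets: both end points included)

title2:{K TO Mindstorms} (curly braces: both end points excluded)

NumericRangeQuery and date ranges

QueryParser cannot by itself parse a String into a NumericRangeQuery or a date range query; this has to be implemented in a subclass of QueryParser (see 6.3.3 and 6.3.4).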

Prefix and wildcard queries

Query q = new QueryParser(Version.LUCENE_CURRENT,
"field", analyzer).parse("PrefixQuery*");

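By default the resulting query's String form is

prefixquery*

that is, the term is lowercased by default.

Lowercasing can be switched off like this (qp being the QueryParser instance):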

qp.setLowercaseExpandedTerms(false);

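Boolean queries

AND, OR and NOT must be written in uppercase.

By default a space means OR:

abc xyz => abc OR xyz

The default operator can be changed:

parser.setDefaultOperator(QueryParser.AND_OPERATOR);

after which

abc xyz => abc AND xyz

The shorthand + and - forms can be used as well:

a AND b == +a +b

a OR b == a b

a AND NOT b == +a -b

Note that a NOT clause needs at least one non-NOT clause alongside it; NOT word on its own cannot be used to find all Docs that do not contain word.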

PhraseQuery

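Enclosing terms in double quotes "" in the query string creates a PhraseQuery.

The double quotes are required.

For example, the unquoted input

This is Some Phrase*

is parsed as separate clauses on the individual terms, not as a phrase,

whereas the quoted input (written \"...\" inside a Java string literal) produces a PhraseQuery:

\"This is Some Phrase*\"

The quoted text is analyzed, so the * is not treated as a wildcard, and with an analyzer that removes stop words (such as StandardAnalyzer) This and is are filtered out.

~N after the closing quote sets the slop, for example:

\"sloppy phrase\"~5 sets the slop to 5 (for the PhraseQuery)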

FuzzyQuery

A ~ after a Term requests a fuzzy search, i.e. a FuzzyQuery.

For example:

Query query = parser.parse("kountry~");

or, with an explicit minimum similarity:

query = parser.parse("kountry~0.7");

MatchAllDocsQuery

*:* denotes MatchAllDocsQuery, i.e. it matches every Document.

Grouping

Query query = new QueryParser(
        Version.LUCENE_CURRENT,
        "subject",
        analyzer).parse("(agile OR extreme) AND methodology");

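Field selection

Boosting a single Term

Appending ^ and a float to a Term sets that Term's Boost, for example:

junit^2.0 testing

This sets the Boost of junit to 2.0 while testing keeps the default of 1.0.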

四号程序员

Keep It Simple and Stupid