2010-01-07 96 views

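Lucene SpanNearQuery partial match

Given a document {"foo", "bar", "baz"}, I want to match it using a SpanNearQuery with the tokens {"baz", "extra"}.

However, this fails to match.

How do I get around this?

Sample test (using Lucene 2.9.1), with the following results: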

  • givenSingleMatch - PASS
  • givenTwoMatches - PASS
  • givenThreeMatches - PASS
  • givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.document.Field; 
import org.apache.lucene.index.IndexReader; 
import org.apache.lucene.index.IndexWriter; 
import org.apache.lucene.index.Term; 
import org.apache.lucene.search.IndexSearcher; 
import org.apache.lucene.search.TopDocs; 
import org.apache.lucene.search.spans.SpanNearQuery; 
import org.apache.lucene.search.spans.SpanQuery; 
import org.apache.lucene.search.spans.SpanTermQuery; 
import org.apache.lucene.store.RAMDirectory; 
import org.apache.lucene.util.Version; 
import org.junit.After; 
import org.junit.Assert; 
import org.junit.Before; 
import org.junit.Test; 

import java.io.IOException; 

public class SpanNearQueryTest { 

    private RAMDirectory directory = null; 

    private static final String BAZ = "baz"; 
    private static final String BAR = "bar"; 
    private static final String FOO = "foo"; 
    private static final String TERM_FIELD = "text"; 

    @Before 
    public void given() throws IOException { 
     directory = new RAMDirectory(); 
     IndexWriter writer = new IndexWriter(
       directory, 
       new StandardAnalyzer(Version.LUCENE_29), 
       IndexWriter.MaxFieldLength.UNLIMITED); 

     Document doc = new Document(); 
     doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED)); 
     doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED)); 
     doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED)); 

     writer.addDocument(doc); 
     writer.commit(); 
     writer.optimize(); 
     writer.close(); 
    } 

    @After 
    public void cleanup() { 
     directory.close(); 
    } 

    @Test 
    public void givenSingleMatch() throws IOException { 

     SpanNearQuery spanNearQuery = new SpanNearQuery(
       new SpanQuery[] { 
         new SpanTermQuery(new Term(TERM_FIELD, FOO)) 
       }, Integer.MAX_VALUE, false); 

     TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100); 

     Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length); 
    } 

    @Test 
    public void givenTwoMatches() throws IOException { 

     SpanNearQuery spanNearQuery = new SpanNearQuery(
       new SpanQuery[] { 
         new SpanTermQuery(new Term(TERM_FIELD, FOO)), 
         new SpanTermQuery(new Term(TERM_FIELD, BAR)) 
       }, Integer.MAX_VALUE, false); 

     TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100); 

     Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length); 
    } 

    @Test 
    public void givenThreeMatches() throws IOException { 

     SpanNearQuery spanNearQuery = new SpanNearQuery(
       new SpanQuery[] { 
         new SpanTermQuery(new Term(TERM_FIELD, FOO)), 
         new SpanTermQuery(new Term(TERM_FIELD, BAR)), 
         new SpanTermQuery(new Term(TERM_FIELD, BAZ)) 
       }, Integer.MAX_VALUE, false); 

     TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100); 

     Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length); 
    } 

    @Test 
    public void givenSingleMatch_andExtraTerm() throws IOException { 

     SpanNearQuery spanNearQuery = new SpanNearQuery(
       new SpanQuery[] { 
         new SpanTermQuery(new Term(TERM_FIELD, BAZ)), 
         new SpanTermQuery(new Term(TERM_FIELD, "EXTRA")) 
       }, 
       Integer.MAX_VALUE, false); 

     TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100); 

     Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length); 
    } 
} 

Note: all of the tokens are in a single field. Thanks to danben for pointing out the missing information. – 2010-01-07 22:13:01

Answers

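SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters) - you could use the following SpanQuery: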

new SpanNearQuery(new SpanQuery[] { 
    new SpanTermQuery(new Term(FIELD, "lucene")), 
    new SpanTermQuery(new Term(FIELD, "doug"))}, 
    5, 
    true); 

(Image: http://www.lucidimagination.com/blog/wp-content/uploads/2009/07/spanquery-dia1.png)

In this sample text, Lucene is within 3 of Doug.
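To make the "within N positions, in order" idea concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It assumes a simplified model in which slop is the number of token positions between the two terms; the class name `OrderedNearSketch`, the method `matchesInOrder`, and the sample sentence in the usage note are all made up for illustration, and Lucene's real span matching is more involved.

```java
import java.util.List;

// Simplified illustration only: this is NOT Lucene's implementation.
final class OrderedNearSketch {

    /**
     * Returns true if `first` occurs before `second` with at most `slop`
     * other token positions between them (an assumed, simplified notion of slop).
     */
    static boolean matchesInOrder(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // Only look to the right of `first`, because order matters here.
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && (j - i - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For a hypothetical token list such as ["lucene", "was", "created", "by", "doug"], three positions separate the two terms, so under this model a slop of 5 or 3 would match, a slop of 2 would not, and swapping the two terms would never match.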

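But for your example, the only match I can see is that both your query and the target document contain "CD" (I am assuming that all of these terms are in a single field). In that case, you don't need any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that both contain the same term in the same field.

Edit 3 - in response to the latest comment: you cannot use SpanNearQuery for anything other than what it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case or expected results are (feel free to post them), but in the last case, if you only want to know whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.

Edit 4 - now that you have posted your use case, I understand what you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.

So, in text form the query would look like: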
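The difference between the two behaviours can be sketched in plain Java, without Lucene. With a slop of Integer.MAX_VALUE, as in the question's tests, the distance constraint is effectively disabled, so an unordered span-near query boils down to "every term must be present in the field", whereas a boolean query made of optional clauses needs only one of them. The class `PartialMatchSketch` and its two methods are hypothetical names for this illustration, not part of any Lucene API.

```java
import java.util.List;

// Simplified illustration only: models "all terms required" vs "any term is enough".
final class PartialMatchSketch {

    /** Span-near style with unlimited slop: every query term must occur in the document. */
    static boolean allTermsPresent(List<String> docTokens, List<String> queryTerms) {
        return docTokens.containsAll(queryTerms);
    }

    /** Boolean "should" style: at least one query term must occur in the document. */
    static boolean anyTermPresent(List<String> docTokens, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

Under this model, the document ["foo", "bar", "baz"] satisfies the all-terms check for {"foo", "bar", "baz"} but not for {"baz", "extra"}, which mirrors the failing givenSingleMatch_andExtraTerm test; the any-term check accepts {"baz", "extra"}.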

BAZ OR EXTRA OR "BAZ EXTRA"~100^5 

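(As an example - this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents in which the terms "BAZ" and "EXTRA" occur within 100 places of each other; adjust the distance and the boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's OK, because in the next section I show you how to build it using the API.)

Programmatically, you would construct it as follows: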

BooleanQuery top = new BooleanQuery(); // must be declared as BooleanQuery: Query has no add() method

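// Construct the terms since they will be used more than once.
// Note: TermQuery and SpanTermQuery are not analyzed, so the term text must match
// the indexed form (StandardAnalyzer lowercases tokens, e.g. "baz" rather than "BAZ").
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");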

// Add each term as "should" since we want a partial match 
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD); 
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD); 

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only 
// if BAZ and EXTRA occur within 100 places of each other. The final parameter means 
// that BAZ must occur before EXTRA. 
SpanNearQuery spanQuery = new SpanNearQuery(
           new SpanQuery[] { new SpanTermQuery(bazTerm), 
               new SpanTermQuery(extraTerm) }, 
           100, true); 

// Give it a boost of 5 since it is more important that the words are together 
spanQuery.setBoost(5f); 

// Add it as "should" since we want a match even when we don't have proximity 
top.add(spanQuery, BooleanClause.Occur.SHOULD); 
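The effect of combining optional term clauses with a boosted proximity clause can be illustrated with a toy scorer in plain Java. This is only a sketch under loud assumptions: one point per matching term and a flat bonus of 5 for in-order proximity are invented numbers, the class `BoostedShouldSketch` is hypothetical, and Lucene's real scoring formula is different. The point is only the ranking behaviour: partial matches are still returned, and documents with both terms close together rank higher.

```java
import java.util.List;

// Toy scoring model only: NOT Lucene's scoring formula.
final class BoostedShouldSketch {

    // Assumed flat bonus, standing in for the boost of 5 on the span clause.
    static final float PROXIMITY_BOOST = 5f;

    /** True if `first` occurs before `second` with at most `slop` positions between them. */
    static boolean nearInOrder(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && (j - i - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /** A score of 0 means "no clause matched", i.e. the document would not be returned. */
    static float score(List<String> tokens, String a, String b, int slop) {
        float score = 0f;
        if (tokens.contains(a)) {
            score += 1f; // optional clause for the first term
        }
        if (tokens.contains(b)) {
            score += 1f; // optional clause for the second term
        }
        if (nearInOrder(tokens, a, b, slop)) {
            score += PROXIMITY_BOOST; // optional, boosted proximity clause
        }
        return score;
    }
}
```

In this model a document containing only "baz" still gets a non-zero score, while one containing "baz" followed by "extra" within the slop gets both term points plus the bonus.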

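Hope that helps! In the future, try to start by posting exactly what results you expect - even if it is obvious to you, it may not be to the reader, and being explicit avoids a lot of back and forth.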

The inline image explaining the distance is a nice touch. – Brian 2010-01-07 20:56:57

That's what I initially thought. However, the relevant documents are not returned from my search. – 2010-01-07 22:11:37

Perhaps you can post some code that shows how you are searching? – danben 2010-01-07 22:40:13