2010-05-30 70 views

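Lucene DuplicateFilter question

Why doesn't DuplicateFilter work together with other filters? For example, if I slightly modify the DuplicateFilterTest test, it looks as if the filter is not applied to the output of the other filters but prunes the results first: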

public void testKeepsLastFilter() throws Throwable {
    DuplicateFilter df = new DuplicateFilter(KEY_FIELD);
    df.setKeepMode(DuplicateFilter.KM_USE_LAST_OCCURRENCE);

    Query q = new ConstantScoreQuery(new ChainedFilter(new Filter[] {
        new QueryWrapperFilter(tq),
        // new QueryWrapperFilter(new TermQuery(new Term("text", "out"))), // works correctly: it is the last document
        new QueryWrapperFilter(new TermQuery(new Term("text", "now"))) // why doesn't this work? It is the third document, but the hit count is 0
    }, ChainedFilter.AND));

    // These variants return no hits either:
    // ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, df), new QueryWrapperFilter(new TermQuery(new Term("text", "now"))), 1000).scoreDocs;
    // ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, new QueryWrapperFilter(new TermQuery(new Term("text", "now")))), df, 1000).scoreDocs;

    ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;

    assertTrue("Filtered searching should have found some matches", hits.length > 0);
    for (int i = 0; i < hits.length; i++) {
        Document d = searcher.doc(hits[i].doc);
        String url = d.get(KEY_FIELD);
        // Walk every document that shares this key to find the last one
        TermDocs td = reader.termDocs(new Term(KEY_FIELD, url));
        int lastDoc = 0;
        while (td.next()) {
            lastDoc = td.doc();
        }
        assertEquals("Duplicate urls should return last doc", lastDoc, hits[i].doc);
    }
}

Answer


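DuplicateFilter independently builds a filter that selects, for each key, either the first or the last occurrence among all documents in the index that contain that key. The resulting filter can be cached with minimal memory overhead.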

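Your second filter independently selects some other set of documents, and the two selections may not agree: if the document that matches your query is not the one DuplicateFilter kept for its key, the intersection is empty. Filtering duplicates relative to an arbitrary subset of the documents would probably require a field cache to perform well, and that is where things get expensive RAM-wise.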
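To make the mismatch concrete, here is a minimal sketch in plain Java that models each filter as a `java.util.BitSet` and does not use Lucene at all. The class `DuplicateFilterSketch`, the helper `keepLastPerKey`, and the four-document index in which every document shares the key `url1` are all hypothetical, chosen only for illustration; this is not how DuplicateFilter is implemented internally.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class DuplicateFilterSketch {

    // Hypothetical helper: for each key, keep only the last document
    // among `candidates` that carries that key.
    public static BitSet keepLastPerKey(String[] keyOfDoc, BitSet candidates) {
        Map<String, Integer> lastDocForKey = new HashMap<>();
        for (int doc = candidates.nextSetBit(0); doc >= 0; doc = candidates.nextSetBit(doc + 1)) {
            lastDocForKey.put(keyOfDoc[doc], doc); // later docs overwrite earlier ones
        }
        BitSet kept = new BitSet(keyOfDoc.length);
        for (int doc : lastDocForKey.values()) {
            kept.set(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed toy index: docs 0..3 share one key, and only doc 2 contains "now".
        String[] keyOfDoc = {"url1", "url1", "url1", "url1"};
        BitSet allDocs = new BitSet(4);
        allDocs.set(0, 4);
        BitSet matchesNow = new BitSet(4);
        matchesNow.set(2);

        // Independent filters: de-duplicate over the whole index (keeps doc 3),
        // then intersect with the term filter (doc 2). Nothing survives.
        BitSet independent = keepLastPerKey(keyOfDoc, allDocs);
        independent.and(matchesNow);

        // De-duplicating only within the matching subset keeps doc 2.
        BitSet withinSubset = keepLastPerKey(keyOfDoc, matchesNow);

        System.out.println("independent filters, ANDed: " + independent);   // {}
        System.out.println("de-duplicated within subset: " + withinSubset); // {2}
    }
}
```

Note that the second variant needs the key of every candidate document (the `keyOfDoc` array here), which is roughly the role a field cache would play and is where the memory cost mentioned above comes from.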