2012-02-28 56 views
0

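Lucene: counting matching terms

I am searching in a field with Lucene_35. I would like to know how many of the words in my query match the field. For example, my field is "JavaServer Faces (JSF) is a Java-based Web application framework intended to simplify development integration of web-based user interfaces.", my query is "java/JSF/framework/doesnotexist", and I want the result 3, since only "java", "JSF" and "framework" occur in the field. Here is a simple example of what I have so far: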

public void explain(String document, String queryExpr) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, document); // helper not shown in the question; presumably indexes the text as "title"
    w.close();
    String queryExpression = queryExpr;
    Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(queryExpression);

    System.out.println("Query: " + queryExpression);
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs topDocs = searcher.search(q, 10);
    // Loop over the hits actually returned; totalHits can exceed scoreDocs.length.
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        ScoreDoc match = topDocs.scoreDocs[i];
        System.out.println("match.score: " + match.score);
        Explanation explanation = searcher.explain(q, match.doc); //#1
        System.out.println("----------");
        Document doc = searcher.doc(match.doc);
        System.out.println(doc.get("title"));
        System.out.println(explanation.toString());
    }
    searcher.close();
    reader.close(); // the searcher does not close a reader that was passed in
}

The output with the parameters above is:

0.021505041 = (MATCH) product of: 
    0.028673388 = (MATCH) sum of: 
    0.0064956956 = (MATCH) weight(title:java in 0), product of: 
     0.2709602 = queryWeight(title:java), product of: 
     0.30685282 = idf(docFreq=1, maxDocs=1) 
     0.8830299 = queryNorm 

....

 0.033902764 = (MATCH) fieldWeight(title:framework in 0), product of: 
     1.4142135 = tf(termFreq(title:framework)=2) 
     0.30685282 = idf(docFreq=1, maxDocs=1) 
     0.078125 = fieldNorm(field=title, doc=0) 
    0.75 = coord(3/4) 

I want to get this 3/4 as the result.

Regards!

+0

How is this related to Lucene? – jpountz 2012-02-28 17:00:26

+0

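Sorry jpountz, what do you mean? I am using LUCENE_35 with a RAMDirectory index. I have now realized that there is a coord factor that gives me exactly what I need, but I don't know how to get at it. – 2012-02-28 19:14:07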

+0

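Your question did not mention Lucene, so I was not sure how it related to Lucene. Could you edit your question to add more details? What does your index structure look like? Do you want your documents sorted by number of matches? – jpountz 2012-02-28 19:26:17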

Answer

7

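You can achieve this by overriding Lucene's DefaultSimilarity with the following method definitions:

  • computeNorm(field, state) -> state.getBoost()
  • tf(freq) -> freq == 0 ? 0 : 1
  • idf(docFreq, numDocs) -> 1
  • coord(overlap, maxOverlap) -> 1 / maxOverlap
  • queryNorm(sumOfSquaredWeights) -> 1

This way, a document's final score ends up being the coord factor (1 / maxOverlap) multiplied by the number of matching terms.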
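As a rough check of that claim, the overrides can be substituted into the practical scoring formula given in the Lucene 3.x Similarity documentation. This sketch assumes every clause is an unboosted SHOULD term query and that the field boost is 1, so boost and norm are both 1; terms that do not occur in the document have tf = 0 and drop out of the sum.

```latex
\begin{aligned}
\mathrm{score}(q,d)
  &= \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
     \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot
     \mathrm{boost}(t)\cdot\mathrm{norm}(t,d) \\
  &= \frac{1}{\mathrm{maxOverlap}}\cdot 1\cdot
     \sum_{t \in q \cap d} 1\cdot 1^{2}\cdot 1\cdot 1 \\
  &= \frac{\mathrm{overlap}}{\mathrm{maxOverlap}}
\end{aligned}
```

For the example in the question (three of four query terms present), this gives 3/4 = 0.75.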

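    // Imports this snippet needs (Lucene 3.5):
    import java.util.Arrays;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Similarity;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // The statements below belong inside a method that declares `throws IOException`.
    Directory dir = new RAMDirectory(); 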
    
    Similarity similarity = new DefaultSimilarity() {
        @Override
        public float computeNorm(String fld, FieldInvertState state) {
            return state.getBoost();
        }

        @Override
        public float coord(int overlap, int maxOverlap) {
            return 1f / maxOverlap;
        }

        @Override
        public float idf(int docFreq, int numDocs) {
            return 1f;
        }

        @Override
        public float queryNorm(float sumOfSquaredWeights) {
            return 1f;
        }

        @Override
        public float tf(float freq) {
            return freq == 0f ? 0f : 1f;
        }
    };
    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_35, 
        new WhitespaceAnalyzer(Version.LUCENE_35)); 
    iwConf.setSimilarity(similarity); 
    IndexWriter iw = new IndexWriter(dir, iwConf); 
    Document doc = new Document(); 
    Field field = new Field("text", "", Store.YES, Index.ANALYZED); 
    doc.add(field); 
    for (String value : Arrays.asList("a b c", "c d", "a b d", "a c d")) { 
        field.setValue(value); 
        iw.addDocument(doc); 
    } 
    iw.commit(); 
    iw.close(); 
    
    IndexReader ir = IndexReader.open(dir); 
    IndexSearcher searcher = new IndexSearcher(ir); 
    searcher.setSimilarity(similarity); 
    BooleanQuery q = new BooleanQuery(); 
    q.add(new TermQuery(new Term("text", "a")), Occur.SHOULD); 
    q.add(new TermQuery(new Term("text", "b")), Occur.SHOULD); 
    q.add(new TermQuery(new Term("text", "d")), Occur.SHOULD); 
    
    TopDocs topDocs = searcher.search(q, 100); 
    System.out.println(topDocs.totalHits + " results"); 
    ScoreDoc[] scoreDocs = topDocs.scoreDocs; 
    for (int i = 0; i < scoreDocs.length; ++i) { 
        int docId = scoreDocs[i].doc; 
        float score = scoreDocs[i].score; 
        System.out.println(ir.document(docId).get("text") + " -> " + score); 
        System.out.println(searcher.explain(q, docId)); 
    } 
    ir.close(); 
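
With this similarity, each score equals the number of matched terms divided by the number of query clauses, so the integer count can be recovered by multiplying back. The following is a minimal sketch rather than part of the original answer: the class `MatchCount` and the method `matchedTerms` are hypothetical names, and it assumes every clause is an unboosted SHOULD term query on a field with boost 1. In Lucene 3.5 the clause count should be available as `q.getClauses().length` on the BooleanQuery.

```java
public class MatchCount {

    /**
     * Hypothetical helper: turns a score produced by the counting
     * Similarity into the number of matched query terms.
     * Assumes score == overlap / maxOverlap (unboosted SHOULD clauses).
     */
    static int matchedTerms(float score, int maxOverlap) {
        // Rounding absorbs the float error introduced by the division.
        return Math.round(score * maxOverlap);
    }

    public static void main(String[] args) {
        int clauses = 3; // the example query has three clauses: a, b, d

        // Scores the example index would yield under the assumptions above.
        System.out.println("a b c -> " + matchedTerms(2f / 3f, clauses));
        System.out.println("c d   -> " + matchedTerms(1f / 3f, clauses));
        System.out.println("a b d -> " + matchedTerms(3f / 3f, clauses));
        System.out.println("a c d -> " + matchedTerms(2f / 3f, clauses));
    }
}
```

For the query in the question, which has four terms, a score of 0.75 would map back to 3 matched terms in the same way.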
    
    +0

    Thanks a lot, jpountz! Could you tell me how to retrieve the result in my case? This is my first day with Lucene, sorry :) – 2012-02-28 21:09:30

    +0

    Hi Toss, I updated my answer with more details. – jpountz 2012-03-01 09:59:40

    +0

    Thanks, jpountz! – 2012-03-07 21:24:13