What is the best practice for combining analyzers in Lucene?

I have a situation where I am using the StandardAnalyzer in Lucene to index text strings, as follows:

public void indexText(String suffix, boolean includeStopWords) {   
    StandardAnalyzer analyzer = null; 


    if (includeStopWords) { 
     analyzer = new StandardAnalyzer(Version.LUCENE_30); 
    } 
    else { 

     // Get Stop_Words to exclude them. 
     Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();  
     analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords); 
    } 

    try { 

     // Index text. 
     Directory index = new RAMDirectory(); 
     IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);    
     this.addTextToIndex(w, this.getTextToIndex()); 
     w.close(); 

     // Read index. 
     IndexReader ir = IndexReader.open(index); 
     Text_TermVectorMapper ttvm = new Text_TermVectorMapper(); 

     int docId = 0; 

     ir.getTermFreqVector(docId, PropertiesFile.getProperty(text), ttvm); 

     // Set output. 
     this.setWordFrequencies(ttvm.getWordFrequencies()); 
     ir.close(); // close the reader; the writer was already closed above
    } 
    catch(Exception ex) { 
     logger.error("Error message\n", ex); 
    } 
} 

private void addTextToIndex(IndexWriter w, String value) throws IOException { 
    Document doc = new Document(); 
    // Use the same field name that indexText() reads the term vector from.
    doc.add(new Field(PropertiesFile.getProperty(text), value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc); 
}

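This works well, but I would like to combine it with the SnowballAnalyzer as well.

The class also has the following constructor, which initializes two instance variables: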

public Text_Indexer(String textToIndex) { 
    this.textToIndex = textToIndex; 
    this.wordFrequencies = new HashMap<String, Integer>(); 
}

Can anyone tell me how best to achieve this with the code above?

Thanks

Mr Morgan.

Source

2011-03-05 mr morgan

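Lucene provides the org.apache.lucene.analysis.Analyzer base class, which you can extend if you want to write your own analyzer.
For an example, have a look at org.apache.lucene.analysis.standard.StandardAnalyzer, a class that extends Analyzer.

Then, in YourAnalyzer, you chain StandardAnalyzer and SnowballAnalyzer by using the filters those analyzers use, like this: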

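TokenStream result = new StandardFilter(tokenStream); 
result = new StopFilter(true, result, stopSet); // stop words are removed by StopFilter
result = new SnowballFilter(result, "English"); // SnowballFilter takes a stemmer name, not a stop set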

Then, in your existing code, you can construct the IndexWriter with your own analyzer implementation, which chains the standard and Snowball filters.
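To make the chaining idea concrete without depending on a particular Lucene version, here is a minimal, library-free sketch. It is not Lucene code: the `Tokens` interface, the stage methods and the class name are all made up for illustration, and `toyStem` is a crude suffix stripper standing in for a real Snowball stemmer. The stages only loosely mirror a tokenizer, LowerCaseFilter, StopFilter and SnowballFilter; the point is just that each stage wraps the previous one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of filter chaining; none of these types exist in Lucene.
final class FilterChainSketch {

    /** A stream of tokens; next() returns null when the stream is exhausted. */
    interface Tokens {
        String next();
    }

    /** Stand-in for a tokenizer: splits the text on whitespace. */
    static Tokens whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return () -> null;
        }
        Iterator<String> it = Arrays.asList(trimmed.split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Wraps a stream and lower-cases every token. */
    static Tokens lowerCase(Tokens in) {
        return () -> {
            String t = in.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    /** Wraps a stream and skips tokens found in the stop set. */
    static Tokens stop(Tokens in, Set<String> stopWords) {
        return () -> {
            String t = in.next();
            while (t != null && stopWords.contains(t)) {
                t = in.next();
            }
            return t;
        };
    }

    /** Toy stemmer (assumption: NOT Snowball); strips a trailing "ing" or "s". */
    static Tokens toyStem(Tokens in) {
        return () -> {
            String t = in.next();
            if (t == null) {
                return null;
            }
            if (t.endsWith("ing") && t.length() > 5) {
                return t.substring(0, t.length() - 3);
            }
            if (t.endsWith("s") && t.length() > 3) {
                return t.substring(0, t.length() - 1);
            }
            return t;
        };
    }

    /** Builds the chain once and drains it, as a single combined analyzer would. */
    static List<String> analyze(String text, Set<String> stopWords) {
        Tokens result = whitespace(text);
        result = lowerCase(result);
        result = stop(result, stopWords);
        result = toyStem(result);
        List<String> out = new ArrayList<>();
        for (String t = result.next(); t != null; t = result.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(analyze("The Walking Dogs", stopWords)); // prints [walk, dog]
    }
}
```

The order of the stages matters: lower-casing comes before the stop filter so that "The" matches the stop word "the", and stemming comes last so that stop words are compared in their unstemmed form.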

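Completely off-topic:
I think you will eventually need to set up your own custom way of handling requests. This is already implemented inside Solr.

First, write your own search component by extending SearchComponent and defining it in solrconfig.xml, like this: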

<searchComponent name="yourQueryComponent" class="org.apache.solr.handler.component.YourQueryComponent"/>

Then write your search handler (request handler) by extending SearchHandler, and define it in solrconfig.xml:

<requestHandler name="YourRequestHandlerName" class="org.apache.solr.handler.component.YourRequestHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">1000</int>
        <str name="fl">*</str>
        <str name="version">2.1</str>
    </lst>

    <arr name="components">
        <str>yourQueryComponent</str>
        <str>facet</str>
        <str>mlt</str>
        <str>highlight</str>
        <str>stats</str>
        <str>debug</str>
    </arr>
</requestHandler>

Then, when you send a URL query to Solr, just include the additional parameter qt=YourRequestHandlerName, which will cause your request handler to be used for that request.
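As a rough illustration of what such a query URL could look like, here is a small sketch that builds one with the qt parameter. The base URL, the /select path and the example query are assumptions about a typical local Solr setup, not something taken from the configuration above, and the class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; adjust the base URL and path to your own Solr installation.
final class SolrQtUrlSketch {

    /** URL-encodes one query parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Builds a select URL that asks Solr to route the query to the named handler. */
    static String buildSelectUrl(String baseUrl, String query, String handlerName) {
        return baseUrl + "/select?q=" + encode(query) + "&qt=" + encode(handlerName);
    }

    public static void main(String[] args) {
        // Assumed local Solr address; "YourRequestHandlerName" matches the config above.
        System.out.println(buildSelectUrl(
                "http://localhost:8983/solr", "text:lucene analyzer", "YourRequestHandlerName"));
    }
}
```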

More about SearchComponents.
More about RequestHandlers.

Source

2011-03-09 16:41:23

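The SnowballAnalyzer provided by Lucene already uses StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter and SnowballFilter. So it sounds like it does exactly what you want (everything StandardAnalyzer does, plus Snowball stemming).

If not, you can easily build your own analyzer by combining whichever tokenizer and TokenStreams you want.

Source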

2011-03-10 12:37:25 Avi

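In the end, I rearranged the program code to call SnowballAnalyzer as an option. The output is then indexed through StandardAnalyzer.

It works and it is fast, but if I can do everything with just one analyzer, I will revisit my code.

Thanks to mbonaci and Avi.

Source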

2011-03-12 10:31:07
