Using Lucene's spell checker

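I am trying to write a spelling corrector using the Lucene spell checker. I would like to give it a single text file containing blog text. The problem is that it only works when my dictionary file has one sentence/word per line. Also, the suggest API returns results without giving any weight to the number of occurrences. Here is the source code: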

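import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SpellCorrector {

    private SpellChecker spellChecker = null;

    public SpellCorrector() {
        try {
            File file = new File("/home/ubuntu/spellCheckIndex");
            Directory directory = FSDirectory.open(file);

            spellChecker = new SpellChecker(directory);

            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
            // Should I format this file with one sentence/word per line?
            spellChecker.indexDictionary(
                new PlainTextDictionary(new File("/home/ubuntu/main.dictionary")), config, true);
        } catch (IOException e) {
            // Report the failure instead of swallowing it silently.
            System.err.println("Could not build the spell-check index: " + e.getMessage());
        }
    }

    public String correct(String query) {
        if (spellChecker == null) {
            return null;
        }
        try {
            String[] suggestions = spellChecker.suggestSimilar(query, 5);
            // This returns suggestions based not on how often they occur
            // but on the order in which they occurred.
            if (suggestions != null && suggestions.length != 0) {
                return suggestions[0];
            }
        } catch (IOException e) {
            return null;
        }
        return null;
    }
}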

Do I need to change anything?


2013-03-15 Global Warrior

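Regarding your first question: that sounds like the expected, documented dictionary format; see the PlainTextDictionary API. If you want to pass in arbitrary text, you might want to index it and use a LuceneDictionary, or possibly a HighFrequencyDictionary, depending on your needs.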
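If building a full index is more than you need, a different route from the one suggested above is to preprocess the blog text into the one-word-per-line layout yourself. The following is only a minimal sketch: the class DictionaryFileBuilder and its methods are hypothetical names, not part of Lucene, and the tokenization is deliberately crude (it splits on anything that is not a letter, so "don't" becomes "don" and "t").

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: flattens free text into a
// one-word-per-line file of the kind the question's dictionary needs.
final class DictionaryFileBuilder {

    // Returns the distinct words of the text, lowercased and sorted.
    static SortedSet<String> extractWords(String text) {
        SortedSet<String> words = new TreeSet<>();
        // Split on every run of characters that are not letters.
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Reads blogText as UTF-8 and writes one word per line to dictionaryFile.
    static void writeDictionary(Path blogText, Path dictionaryFile) throws IOException {
        String text = new String(Files.readAllBytes(blogText), StandardCharsets.UTF_8);
        Files.write(dictionaryFile, extractWords(text), StandardCharsets.UTF_8);
    }
}
```

Note that this keeps each word once and so throws away occurrence counts; if popularity matters, indexing the text as described above is probably the better fit.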

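The SpellChecker suggests replacements based on the similarity between words (based on Levenshtein distance), before any other concern. If you want it to suggest only more popular terms, you should pass a SuggestMode to SpellChecker.suggestSimilar. This ensures that the suggested matches are at least as strong, popularity-wise, as the word they are intended to replace.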
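To make the similarity-first ranking concrete, here is a minimal sketch of an edit-distance score. It assumes a normalized similarity of the form 1 - distance / max(length); the class EditDistanceSketch is a hypothetical name and this is a textbook dynamic-programming version, not Lucene's own code.

```java
// Illustration only: a plain dynamic-programming edit distance.
final class EditDistanceSketch {

    // Number of single-character inserts, deletes, or substitutions
    // needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalized similarity in [0, 1]; 1 means the strings are identical.
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0f : 1.0f - (float) distance(a, b) / longest;
    }
}
```

Under a score like this, a rare word one edit away from the query outranks a common word two edits away, which matches the behaviour described in the question.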

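If you must override how Lucene decides on the best matches, you can do that with SpellChecker.setComparator, creating your own Comparator on SuggestWords. Since SuggestWord exposes freq to you, it should be easy to order the found matches by popularity.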
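A frequency-first ordering might look roughly like the sketch below. To keep it runnable without Lucene on the classpath, it uses a hypothetical stand-in class, Suggestion, with the same three pieces of data (string, score, freq); FrequencyFirstComparator and FrequencyRankingDemo are likewise made-up names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Lucene's SuggestWord, so the sketch runs on its own.
final class Suggestion {
    final String string; // the suggested word
    final float score;   // similarity to the misspelled word
    final int freq;      // how often the word occurs

    Suggestion(String string, float score, int freq) {
        this.string = string;
        this.score = score;
        this.freq = freq;
    }
}

// "Greater" means "better": higher frequency wins, similarity breaks ties.
final class FrequencyFirstComparator implements Comparator<Suggestion> {
    @Override
    public int compare(Suggestion a, Suggestion b) {
        int byFreq = Integer.compare(a.freq, b.freq);
        if (byFreq != 0) {
            return byFreq;
        }
        return Float.compare(a.score, b.score);
    }
}

final class FrequencyRankingDemo {
    // Returns the suggested strings, best (most frequent) first.
    static List<String> rank(List<Suggestion> found) {
        List<Suggestion> sorted = new ArrayList<>(found);
        sorted.sort(new FrequencyFirstComparator().reversed());
        List<String> result = new ArrayList<>();
        for (Suggestion s : sorted) {
            result.add(s.string);
        }
        return result;
    }
}
```

With Lucene itself, the same ordering would presumably be written as a Comparator<SuggestWord> and handed to SpellChecker.setComparator; it is worth checking which direction the comparator is expected to sort in before relying on it.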


2013-03-15 15:36:29 femtoRgon
