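Counting word frequency in a Lucene index

Can someone help me find the frequency of a word across an entire Lucene index?
For example, if document A contains the word (B) 3 times and document C contains it 2 times, I want a method that returns 5, the frequency of the word (B) across the whole Lucene index.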

2010-11-12 Ehsan

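What size of index are you looking at? Depending on that, you might want to use Hadoop for this, or use a simple index parser that collects the word frequencies in a map. – anirvan 2010-11-12 18:23:06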
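
To make the "collect the word frequencies in a map" idea concrete, here is a minimal sketch in plain Java. It does not read a Lucene index. It assumes the document texts are already available as strings, and the class and method names (`WordFrequencySketch`, `countWords`) are made up for illustration. The tokenization (lowercasing, splitting on non-word characters) is also an assumption and will not match what a real Lucene analyzer does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class WordFrequencySketch {

    // Counts how often each word occurs across all given document texts.
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Naive tokenization: lowercase, then split on runs of non-word characters.
            for (String word : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Document A contains "b" 3 times, document C contains it 2 times.
        List<String> docs = Arrays.asList("b x b y b", "b z b");
        System.out.println(countWords(docs).get("b")); // prints 5
    }
}
```

For a small index this kind of in-memory map is probably enough; for a very large one, the map may not fit in memory, which is presumably why Hadoop is mentioned.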

This has been asked several times:

2010-11-12 19:47:40 Xodarap

Assuming you are working with Lucene 3.x:

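IndexReader ir = IndexReader.open(dir);
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
// next() moves to the next document containing the term;
// freq() is the term's frequency within that document
while (termDocs.next()) {
    count += termDocs.freq();
}
termDocs.close();
ir.close();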

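A few notes:

dir is an instance of Lucene's Directory class. RAM-based and file-system-based indexes are created differently; see the Lucene documentation for details.

"your_field" is the field in which to search for the term. If you have multiple fields, you can run the procedure for each of them, or, when indexing the documents, you can create a special field (e.g. "_content") that holds the concatenated values of all the other fields.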
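
As a rough illustration of the "_content" idea, the following plain-Java sketch builds the concatenated value from a document's other fields. The class and method names (`ContentFieldSketch`, `concatenateFields`) are hypothetical. Joining with a single space is an assumption, and adding the resulting string to the Lucene document as the "_content" field is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class ContentFieldSketch {

    // Joins the values of all fields, in the map's iteration order, into one
    // string that could be indexed as a catch-all field such as "_content".
    static String concatenateFields(Map<String, String> fields) {
        return String.join(" ", fields.values());
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "lucene index");
        fields.put("body", "term frequency");
        System.out.println(concatenateFields(fields)); // prints "lucene index term frequency"
    }
}
```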

2010-11-12 19:48:21 ffriend

Too bad, 'TermDocs' is not in Lucene 5.3.1, which is what I use :( – 2016-11-24 19:02:00

Using Lucene 3.4

An easy way to get the count, but you need two arrays :-/

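int[] docs = new int[1000]; // receives the document numbers
int[] freqs = new int[1000]; // receives the term frequency in each of those documents
// read() returns the number of entries it wrote into the arrays
int count = indexReader.termDocs(term).read(docs, freqs);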

Be careful: if you use read(), you cannot use next() afterwards, because after read() you are already at the end of the enumeration:

int[] docs = new int[1000]; 
int[] freqs = new int[1000]; 
TermDocs td = indexReader.termDocs(term); 
int count = td.read(docs, freqs); 
while (td.next()){ // always false, already at the end of the enumeration
}
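
One detail worth spelling out: in Lucene 3.x, `TermDocs.read(docs, freqs)` returns the number of entries it filled in (that is, the number of documents), not the summed frequency. To get the total the question asks for, the filled part of `freqs` still has to be added up. A minimal sketch of that step in plain Java, with a hypothetical helper name (`sumFrequencies`) and assuming all matching documents fit into the arrays:

```java
// Hypothetical helper, not part of Lucene.
final class FreqSumSketch {

    // Sums the first `entriesRead` values of `freqs`, i.e. the part of the
    // array that a read(docs, freqs) call actually filled in.
    static long sumFrequencies(int[] freqs, int entriesRead) {
        long total = 0;
        for (int i = 0; i < entriesRead; i++) {
            total += freqs[i];
        }
        return total;
    }
}
```

With the arrays above this would be called as `sumFrequencies(freqs, count)`. If the term occurs in more documents than the arrays can hold, a single call is not enough and `read()` would have to be repeated until it returns 0.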

2013-07-17 11:12:27 Oliver
