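Is it possible to iterate through documents stored in a Lucene index?

I have some documents stored in a Lucene index with a docId field. I want to get all docIds stored in the index. There is one more issue: there are about 300,000 documents, so I would prefer to fetch the docIds in chunks of 500. Is it possible to do so?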

Source

2010-02-22 Eugeniu Torica

IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    // skip document numbers that belong to deleted documents
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");

    // do something with docId here...
}

Source

2010-02-23 21:15:28 bajafresh4life

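What happens if the reader.isDeleted(i) check is missing? – 2010-02-24 16:16:36

Without the isDeleted() check, you will output the IDs of documents that were previously deleted. – bajafresh4life 2010-02-25 03:34:51

To complete the comment above: index changes are committed when the index is reopened, so reader.isDeleted(i) is necessary to ensure that a document is valid. – 2011-02-24 11:29:05

Document numbers (or ids) are consecutive numbers from 0 to IndexReader.maxDoc() - 1. These numbers are not persistent and are valid only for the opened IndexReader. You can check whether a document is deleted with the IndexReader.isDeleted(int documentNumber) method.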
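
For the chunking part of the question, the same walk over document numbers can collect live entries in fixed-size batches. The following is only a minimal sketch in plain Java with no Lucene dependency: `isDeleted` is a stand-in predicate for a check such as reader.isDeleted(i), and the class name `DocIdBatcher` and method `liveDocBatches` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class DocIdBatcher {
    /**
     * Walks document numbers 0..maxDoc-1, skips the deleted ones, and groups
     * the remaining numbers into batches of at most batchSize.
     * isDeleted is an assumed stand-in for a call like reader.isDeleted(i).
     */
    public static List<List<Integer>> liveDocBatches(int maxDoc, IntPredicate isDeleted, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>(batchSize);
        for (int i = 0; i < maxDoc; i++) {
            if (isDeleted.test(i)) {
                continue; // deleted slot, nothing to emit
            }
            current.add(i);
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // last, possibly shorter, batch
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical index: 12 document numbers, 3 and 7 deleted, batches of 5.
        for (List<Integer> batch : liveDocBatches(12, i -> i == 3 || i == 7, 5)) {
            System.out.println(batch);
        }
    }
}
```

With a real reader, each batch would presumably be processed (for example by loading the docId field of each document) before moving on to the next one, so that not all 300,000 entries are held in memory at once.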

Source

2010-02-22 19:09:38 Yaroslav

Lucene 4

Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;

    Document doc = reader.document(i);
}

For details, see LUCENE-2600 on this page: https://lucene.apache.org/core/4_0_0/MIGRATE.html

Source

2013-08-28 22:45:07 bcoughlan

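This was rolled back by another user, but the original editor was correct: liveDocs can be null. – bcoughlan 2013-11-01 15:24:49

If you use .document(i) as in the examples above and skip deleted documents, be careful when using this method to paginate results. For example, you have a list with 10 documents per page and need the documents for page 6. Your input might be something like offset = 60, count = 10 (documents from 60 to 70).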

IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");
}

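You will run into problems with deleted documents, because you should not start from offset = 60 but from offset = 60 + the number of deleted documents that appear before 60.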
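
One way to picture the correction is to measure the offset in live documents instead of raw document numbers. The sketch below is an illustration only, written in plain Java without Lucene: a `java.util.BitSet` stands in for the live-docs bits, a `null` value is treated as "no deletions" (as noted for Lucene 4 above), and the names `LiveDocPager` and `page` are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class LiveDocPager {
    /**
     * Returns the document numbers for one page, where offset and count are
     * counted in live documents rather than raw document numbers.
     * liveDocs == null is treated as "nothing has been deleted".
     */
    public static List<Integer> page(BitSet liveDocs, int maxDoc, int offset, int count) {
        List<Integer> result = new ArrayList<>();
        int liveSeen = 0; // live documents encountered so far
        for (int i = 0; i < maxDoc && result.size() < count; i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // deleted document, does not count towards the offset
            }
            if (liveSeen++ < offset) {
                continue; // live, but still before the requested page
            }
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical index: 10 document numbers, 1 and 4 deleted.
        BitSet live = new BitSet(10);
        live.set(0, 10);
        live.clear(1);
        live.clear(4);
        System.out.println(page(live, 10, 2, 3));
    }
}
```

This scans from document number 0 on every call, so the cost grows with the offset; for deep pages that may or may not be acceptable depending on index size.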

An alternative I found that avoids counting the deleted documents is something like this:

IndexSearcher is = getIndexSearcher(); // new IndexSearcher(indexReader)
// get all results without any conditions attached.
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);

TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);

TopDocs topDocs = topCollector.topDocs(offset, count);

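Note: replace the text between [[ ]] with your own values. I ran this on a large index with 1.5 million entries and got 10 random results in less than a second. Agreed, it is slower, but at least you can ignore deleted documents if you need pagination.

Source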

2015-04-30 08:53:04 andreyro

There is also a query class named MatchAllDocsQuery, which I think can be used in this case:

Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);

Source

2016-01-21 08:05:01
