Why doesn't my indexer search Persian files?

I am using Lucene 3 to index some .txt files, as shown below. Why does searching the index fail for Persian text?

public static void main(String[] args) throws Exception { 

    String indexDir = "file input"; 
    String dataDir = "file input"; 
    long start = System.currentTimeMillis(); 

    indexer indexer = new indexer(indexDir); 
    int numIndexed, cnt; 
    try { 
     numIndexed = indexer.index(dataDir, new TextFilesFilter()); 

     cnt = indexer.getHitCount("mycontents", "شهردار"); 
     System.out.println("count of search in contents: " + cnt); 
    } finally { 
     indexer.close(); 
    } 
    long end = System.currentTimeMillis(); 
    System.out.println("Indexing " + numIndexed + " files took " 
      + (end - start) + " milliseconds"); 

}

The getHitCount function returns the expected hit count for English words, but returns zero for Persian words!

public int getHitCount(String fieldName, String searchString) 
     throws IOException, ParseException { 

    IndexSearcher searcher = new IndexSearcher(directory); 

    Term t = new Term(fieldName, searchString); 
    Query query = new TermQuery(t); 

    int hitCount = searcher.search(query, 1).totalHits; 
    searcher.close(); 
    return hitCount; 
}

How do I set UTF-8 in my project? I am using NetBeans and created a simple Java project. I just need a simple search over files!

This is my indexer class:

private IndexWriter writer; 
private Directory directory; 

public indexer(String indexDir) throws IOException { 
    directory = FSDirectory.open(new File(indexDir)); 
    writer = new IndexWriter(directory, 
      new StandardAnalyzer(
        Version.LUCENE_30), 
      true, 
      IndexWriter.MaxFieldLength.UNLIMITED); 
} 

public void close() throws IOException { 
    writer.close(); 
} 

public int index(String dataDir, FileFilter filter) 
     throws Exception { 
    File[] files = new File(dataDir).listFiles(); 
    for (File f : files) { 
     if (!f.isDirectory() 
       && !f.isHidden() 
       && f.exists() 
       && f.canRead() 
       && (filter == null || filter.accept(f))) { 
      indexFile(f); 
     } 
    } 
    return writer.numDocs(); 
} 

private static class TextFilesFilter implements FileFilter { 

    public boolean accept(File path) { 
     return path.getName().toLowerCase() 
       .endsWith(".txt"); 
    } 
} 

protected Document getDocument(File f) throws Exception { 
    Document doc = new Document(); 
    doc.add(new Field("mycontents", new FileReader(f))); 
    doc.add(new Field("filename", f.getName(), 
      Field.Store.YES, Field.Index.NOT_ANALYZED)); 
    doc.add(new Field("fullpath", f.getCanonicalPath(), 
      Field.Store.YES, Field.Index.NOT_ANALYZED)); 
    return doc; 
} 

private void indexFile(File f) throws Exception { 
    System.out.println("Indexing " + f.getCanonicalPath()); 
    Document doc = getDocument(f); 
    writer.addDocument(doc); 
}

Source

2016-02-05 NASRIN

Can we see your indexer class? It looks like something you implemented yourself. – Niklas

@Niklas I have edited my question. – NASRIN

This may help: http://stackoverflow.com/questions/23030329/lucene-encoding-java – Niklas

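I suspect the problem is not Lucene's handling of encodings as such, but FileReader. From the FileReader documentation:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.

In this case the default character encoding is probably not appropriate.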
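
To see which default the JVM is actually using, something like the following sketch may help. The class name DefaultCharsetCheck is only illustrative; Charset.defaultCharset() and the file.encoding system property are standard Java. The printed value depends on the operating system, locale, and JVM version, so no particular output is assumed here.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Name of the platform default charset, which FileReader(File) relies on.
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // The system property usually reflects the same setting.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
    }
}
```

If this prints something other than UTF-8 while the indexed files are UTF-8, the Persian text is likely being decoded incorrectly before Lucene ever sees it.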

Instead of:

doc.add(new Field("mycontents", new FileReader(f)));

try (assuming the files being indexed are UTF-8 encoded):

doc.add(new Field("mycontents", new InputStreamReader(new FileInputStream(f), "UTF8")));
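
The effect of the charset can be checked without Lucene at all. The sketch below is a standalone illustration, not part of the original code: the class Utf8ReaderDemo and its helper methods are hypothetical names. It writes the Persian search word from the question to a temporary UTF-8 file, then decodes the file once as UTF-8 and once as ISO-8859-1 (standing in for an unsuitable default charset). Unicode escapes are used so the source compiles regardless of the encoding of the .java file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

public class Utf8ReaderDemo {
    // Reads the whole file as text, decoding its bytes with the given charset.
    public static String readAll(File f, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader in = new InputStreamReader(new FileInputStream(f), cs);
        try {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }

    // Writes the text to a temporary file encoded as UTF-8.
    public static File writeUtf8Sample(String text) throws IOException {
        File f = File.createTempFile("persian", ".txt");
        f.deleteOnExit();
        Writer out = new OutputStreamWriter(new FileOutputStream(f), Charset.forName("UTF-8"));
        try {
            out.write(text);
        } finally {
            out.close();
        }
        return f;
    }

    public static void main(String[] args) throws IOException {
        // The Persian search word from the question, written as escapes.
        String word = "\u0634\u0647\u0631\u062F\u0627\u0631";
        File f = writeUtf8Sample(word);

        String decodedAsUtf8 = readAll(f, Charset.forName("UTF-8"));
        String decodedAsLatin1 = readAll(f, Charset.forName("ISO-8859-1"));

        System.out.println("UTF-8 decode matches: " + decodedAsUtf8.equals(word));
        System.out.println("ISO-8859-1 decode matches: " + decodedAsLatin1.equals(word));
    }
}
```

Only the UTF-8 decode reproduces the original word. With the wrong charset each two-byte Persian character turns into two unrelated characters, so the terms that end up in the index can never match a Persian query term.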

Source

2016-02-05 17:03:22 femtoRgon
