Using Tesseract from Tika: the result contains only newlines

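I am trying to parse a PNG file containing scanned text using Apache Tika and Tesseract for Windows.

Running Tesseract from the command line recognizes the text correctly, but the content returned by Tika contains only newline characters ("\n").

Here is my code: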

ByteArrayInputStream inputstream = new ByteArrayInputStream(document.getFileContent()); 
byte[] content = document.getFileContent(); 
Parser parser = new AutoDetectParser(); 
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); //to process long files 
Metadata metadata = new Metadata(); 

ParseContext parseContext = new ParseContext(); 
TesseractOCRConfig config = new TesseractOCRConfig(); 
config.setTesseractPath("C:\\Program Files (x86)\\Tesseract-OCR"); 
config.setTessdataPath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"); 
config.setMaxFileSizeToOcr(Integer.MAX_VALUE); 
parseContext.set(TesseractOCRConfig.class, config); 
parseContext.set(Parser.class, parser); 

parser.parse(inputstream, handler, metadata, parseContext); 

String contentString = handler.toString(); 
System.out.println(contentString);

I tried debugging and found that TesseractOCRParser.doOcr() is supposed to start a process that executes a command like this:

tesseract C:\Users\admin\AppData\Local\Temp\apache-tika-6655676641285964446.tmp C:\Users\admin\AppData\Local\Temp\apache-tika-2151149415666715558.tmp -l eng -psm 1 txt

However, it looks like the process never runs. If I run the same command from another session, the recognized content appears.
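One way to narrow this down is to launch the command above from plain Java and look at its exit code and error output, since a process started from the JVM may not see the same environment as an interactive session. The following is only a minimal sketch using the standard library: the class name `TesseractProbe`, its method names, and the `extraEnv` parameter are hypothetical, and the argument list is copied from the command shown above rather than from Tika itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika.
class TesseractProbe {

    // Builds the same argument list as the command line quoted in the question.
    // "-psm" (single dash) is the spelling used in that command.
    static List<String> buildCommand(String tesseractExe, String input, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractExe);
        cmd.add(input);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add("eng");
        cmd.add("-psm");
        cmd.add("1");
        cmd.add("txt");
        return cmd;
    }

    // Runs the command with optional extra environment variables.
    // stderr is merged into stdout so that error messages are captured too.
    static int run(List<String> cmd, Map<String, String> extraEnv, StringBuilder output)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().putAll(extraEnv);
        pb.redirectErrorStream(true);
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                output.append(line).append(System.lineSeparator());
            }
        }
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: TesseractProbe <tesseract-exe> <input-image> <output-base>");
            return;
        }
        StringBuilder out = new StringBuilder();
        int exit = run(buildCommand(args[0], args[1], args[2]),
                Collections.emptyMap(), out);
        System.out.println("exit code: " + exit);
        System.out.print(out);
    }
}
```

A non-zero exit code or an error message in the captured output would suggest that the process does start but fails, rather than not running at all.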

Source

2017-03-08 Serge Iroshnikov

Have you tried [Tika's troubleshooting guide for this kind of problem](https://wiki.apache.org/tika/Troubleshooting%20Tika#Wrong_Content_Extracted)? – Gagravarr

I found that the problem was this line:

config.setTessdataPath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");

Omit this line, and the parser will find the correct path on its own.
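One plausible explanation, stated here as an assumption rather than something confirmed in this thread: the configured tessdata path is handed to the Tesseract process as `TESSDATA_PREFIX`, and Tesseract 3.x appends `tessdata` to that value itself, so pointing it at the `tessdata` folder makes Tesseract look one level too deep. The sketch below only checks that directory layout with the standard library; the class `TessdataLayoutCheck` and its methods are hypothetical, and the lookup rule it encodes is the assumption just described.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or Tesseract.
class TessdataLayoutCheck {

    // Assumption: Tesseract 3.x resolves language data as
    // <TESSDATA_PREFIX>/tessdata/<lang>.traineddata
    static Path expectedTrainedData(Path tessdataPrefix, String lang) {
        return tessdataPrefix.resolve("tessdata").resolve(lang + ".traineddata");
    }

    // True if the language file exists where the rule above says it would be read.
    static boolean languageFound(Path tessdataPrefix, String lang) {
        return Files.isRegularFile(expectedTrainedData(tessdataPrefix, lang));
    }

    public static void main(String[] args) {
        // Default taken from the paths in the question; pass another path as args[0].
        Path install = Paths.get(args.length > 0
                ? args[0]
                : "C:\\Program Files (x86)\\Tesseract-OCR");
        // Compare the install directory with its tessdata subfolder as prefix.
        Path[] candidates = { install, install.resolve("tessdata") };
        for (Path prefix : candidates) {
            System.out.println(prefix + " -> " + expectedTrainedData(prefix, "eng")
                    + " exists=" + languageFound(prefix, "eng"));
        }
    }
}
```

If only the install directory reports `exists=true`, that would be consistent with the fix above: leaving out `setTessdataPath` lets the lookup start from the install directory.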

Source

2017-03-14 08:38:03
