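Python tesseract results give unwanted extra line gaps between sentences

I am using tesseract to perform some OCR operations, and I have written a simple Python wrapper for it. The problem is that the final text file contains unwanted blank lines between sentences, and I need to remove them programmatically. For example: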

1 tbsp peanut or corn oil, plus a little 
extra for Cooking the scallops 

2 tbsp bottled mild or medium Thai 
green curry paste 
2 tbsp water 

2 tsp light soy sauce

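Note the blank lines between some of the entries; those are what I need to remove. If you have run into a similar problem, please share some tips. Thanks.

Here is the wrapper: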

from PIL import Image
import subprocess
import os
from wand.image import Image  # note: this shadows the PIL Image imported above
import markdown2
from textblob import TextBlob

import util
import errors

tesseract_exe = "tesseract"  # Name of executable to be called at command line
scratch_text_name_root = "temp"  # Leave out the .txt extension
cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation
pagesegmode = "-psm 0"


def call_tesseract(input_file, output_file):
    args = [tesseract_exe, input_file, output_file, pagesegmode]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode != 0:
        errors.check_for_errors()


def retrieve_text(scratch_text_name_root):
    # tesseract appends .txt to the output base name
    with open(scratch_text_name_root + '.txt') as inf:
        return inf.read()


def write_to_file(filename, string):
    with open(filename, 'w') as out:
        out.write(string)


def image_to_string(filename):
    try:
        call_tesseract(filename, scratch_text_name_root)
        text = retrieve_text(scratch_text_name_root)
    finally:
        # Remove the scratch file that tesseract wrote (temp.txt)
        try:
            os.remove(scratch_text_name_root + '.txt')
        except OSError:
            pass
    return text


filename = "book/0001.bin.png"
text = image_to_string(filename)
print("writing to file")
write_to_file("0002.bin.txt", text)

Source

2016-02-05 Anay Bose

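I don't know why tesseract gives you these blank lines, but a simple workaround may help:

Just delete the blank lines. There are many ways to do this; for example, see here: https://stackoverflow.com/a/3711884/4175009

Or here:

https://stackoverflow.com/a/2369474/4175009

Both of these solutions assume that you read the file line by line.

I like this solution because you can apply it directly to your finished string, and it handles the operating-system differences in line endings (\n, \n\r, \r\n).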
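As a rough illustration of the line-by-line approach, the sketch below copies a text file while skipping empty or whitespace-only lines. The function name `strip_blank_lines_in_file` and the demo file names are made up for this example and are not taken from the linked answers.

```python
import os
import tempfile


def strip_blank_lines_in_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping empty or whitespace-only lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # line.strip() is empty for blank and whitespace-only lines
            if line.strip():
                dst.write(line)


# Demo on a throwaway file (hypothetical file names)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "ocr.txt")
dst_path = os.path.join(tmp_dir, "ocr_clean.txt")
with open(src_path, "w") as f:
    f.write("2 tbsp water \n\n2 tsp light soy sauce\n")

strip_blank_lines_in_file(src_path, dst_path)
with open(dst_path) as f:
    result = f.read()
print(result)
```

This variant keeps each surviving line exactly as it was, including trailing spaces, and never holds the whole file in memory.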
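A minimal sketch of the string-based approach, assuming the goal is simply to drop every empty or whitespace-only line. The helper name `remove_blank_lines` is hypothetical, and this is not necessarily the exact code from the linked answer.

```python
def remove_blank_lines(text):
    """Return text with empty and whitespace-only lines removed."""
    # str.splitlines() recognises \n, \r\n and \r (among other line
    # boundaries), so the line-ending convention of the input does not matter.
    kept = [line for line in text.splitlines() if line.strip()]
    # Rejoin with a plain \n; a trailing newline in the input is not preserved.
    return "\n".join(kept)


sample = (
    "1 tbsp peanut or corn oil, plus a little \r\n"
    "extra for Cooking the scallops \r\n"
    "\r\n"
    "2 tbsp water \n"
    "\n"
    "2 tsp light soy sauce"
)
cleaned = remove_blank_lines(sample)
print(cleaned)
```

In the wrapper from the question, this could be applied to the string returned by `image_to_string` before it is passed to `write_to_file`.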

Source

2016-02-05 18:37:58 Entwicklerpages

Thanks for the links. Excellent suggestions. I upvoted you. Thanks again for your time.
