2017-02-19 182 views
1

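TL;DR: How do I select a paragraph in an image without including the adjacent (top and bottom) paragraphs? How do I split an image into clean paragraphs in Python/OpenCV?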

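I have a set of scanned images, each a single column of text, such as this one. The images are all black and white, already rotated, noise-reduced, and have their white space trimmed.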

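What I want to do is split each of these images into paragraphs. My initial idea was to measure the average brightness of each pixel row to find the gaps between text lines, then try to select a rectangle starting at that line that matches the indentation and measure the brightness of that rectangle. But this seems rather cumbersome. In addition, the lines are sometimes slightly slanted (up to about 10 px of vertical difference in the extreme cases), so lines sometimes overlap. So I would like to select all the letters of a paragraph and use them to draw a box around that paragraph. I got this using this method, but I don't know how to proceed from there. Select every letter rectangle that starts within n pixels of the left edge, and try to include every rectangle that starts no further left than first_rectangle_x - offset? But then what?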
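The row-brightness idea described above can be sketched roughly as follows. This is a minimal illustration on a synthetic image, not a tested solution for real scans: the helper name `text_row_runs` and the cutoff of 250 are assumptions made for the example, and slanted lines would need extra handling.

```python
import numpy as np

def text_row_runs(gray, max_mean=250):
    """Return [start, end] (inclusive) row ranges whose mean brightness is below max_mean."""
    dark = gray.mean(axis=1) < max_mean            # True for rows that contain ink
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], dark, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate between run start and (exclusive) run end.
    return [[int(s), int(e) - 1] for s, e in zip(edges[::2], edges[1::2])]

# Synthetic page: white background with two dark "text lines".
page = np.full((40, 100), 255, dtype=np.uint8)
page[5:12, 10:90] = 0     # first line, rows 5..11
page[20:27, 30:90] = 0    # second (indented) line, rows 20..26
runs = text_row_runs(page)
print(runs)
```

Each returned pair is one candidate text line; the gaps between consecutive pairs are the inter-line spaces.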

Answer

2

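This is specific to the paragraph structure of the attached image. I don't know whether you need a more general solution, but that would probably require additional work: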

import cv2 
import numpy as np 
import matplotlib.pyplot as plt 

image = cv2.imread('paragraphs.png', cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError('paragraphs.png could not be read')

# find lines by horizontally blurring the image and thresholding 
blur = cv2.blur(image, (91,9)) 
b_mean = np.mean(blur, axis=1)/256 

# hist, bin_edges = np.histogram(b_mean, bins=100) 
# threshold = bin_edges[66] 
threshold = np.percentile(b_mean, 66) 
t = b_mean > threshold 
# Get the image row numbers that contain text. Text is dark, so these are
# the rows whose blurred mean is NOT above the threshold. A text line is a
# consecutive group of such rows, defined by its first and last row numbers.
tix = np.where(~t)[0]
lines = [] 
start_ix = tix[0] 
for ix in range(1, tix.shape[0]):
    if tix[ix] == tix[ix-1] + 1:
        continue
    # identified gap between lines, close previous line and start a new one 
    end_ix = tix[ix-1] 
    lines.append([start_ix, end_ix]) 
    start_ix = tix[ix] 
end_ix = tix[-1] 
lines.append([start_ix, end_ix]) 

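# For each text line, find the x position of its first dark pixel (its indent).
l_starts = []
max_x = min(500, image.shape[1])  # do not scan past the image width
for line in lines:
    xx = max_x
    for x in range(max_x):
        # +1 because the end row of a line is inclusive
        col = image[line[0]:line[1] + 1, x]
        if np.min(col) < 64:
            xx = x
            break
    l_starts.append(xx)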

median_ls = np.median(l_starts) 

paragraphs = [] 
p_start = lines[0][0] 

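# A line indented more than twice the median indent starts a new paragraph.
for ix in range(1, len(lines)):
    if l_starts[ix] > median_ls * 2:
        p_end = lines[ix][0] - 10
        paragraphs.append([p_start, p_end])
        p_start = lines[ix][0]
# Close the final paragraph, which the loop above never appends.
paragraphs.append([p_start, lines[-1][1]])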

p_img = np.array(image) 
n_cols = p_img.shape[1] 
for paragraph in paragraphs: 
    cv2.rectangle(p_img, (5, paragraph[0]), (n_cols - 5, paragraph[1]), (128, 128, 0), 5) 

cv2.imwrite('paragraphs_out.png', p_img) 

Input/Output

enter image description here

+0

Thanks, this works great for most images, though there are exceptions: http://imgur.com/a/z0836. So I will indeed have to do some tinkering, but that's fine :) – MrVocabulary

+0

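But could you explain to me what the first few lines of the code do? I'm having a hard time understanding what you did there with the histogram. – MrVocabulary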

+1

Sure, I'll add comments. The histogram was used for visualization and was left in there. You can use the percentile instead –
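As a side note on the two thresholds that appear in the code: `bin_edges[66]` from a 100-bin histogram and `np.percentile(b_mean, 66)` are generally not the same number. The first lies 66% of the way between the minimum and maximum value, while the second is the value below which 66% of the rows fall. A small sketch with made-up row means (the array below is purely hypothetical) shows the difference:

```python
import numpy as np

# Hypothetical row means: one dark row, the rest bright.
b_mean = np.array([0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 1.0])

hist, bin_edges = np.histogram(b_mean, bins=100)
edge_threshold = bin_edges[66]              # 66% of the way from min to max
pct_threshold = np.percentile(b_mean, 66)   # 66% of the values lie at or below this

print(edge_threshold, pct_threshold)
```

Which one separates text rows from gaps better depends on how much of the page is text, so the value 66 may need adjusting per image set.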