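Converting unstructured OCR text into properly structured text

I am using Microsoft MODI from VB6 to OCR an image. (I know about other OCR tools such as Tesseract, but I have found MODI to be more accurate than the others.)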

The image to be OCRed looks like this:

[image]

and the text I get back from OCR looks like this:

Text1 
Text2 
Text3 
Number1 
Number2 
Number3

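The problem here is that the correspondence with the text in the opposite column is not preserved. How can I map Number1 to Text1?

This is the only solution I can think of.

MODI provides the coordinates of every OCRed word, like this: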

LeftPos = Img.Layout.Words(0).Rects(0).Left 
TopPos = Img.Layout.Words(0).Rects(0).Top

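So to line up the words on the same row, we can match each word's TopPos and then sort by LeftPos, which gives us the complete line. So I loop through all the words and store their text, together with their left and top, in a MySQL table, and then run this query: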

SELECT group_concat(word ORDER BY `left` SEPARATOR ' ') 
FROM test_copy 
GROUP BY `top`

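My problem is that the top position is not exactly the same for every word; there will obviously be a difference of a few pixels.

I tried adding DIV 5 to merge words that fall within a 5-pixel range, but that does not work in some cases. I also tried doing it in node.js by computing a tolerance for each word and then sorting by LeftPos, but I still feel this is not the best approach.

Update: the js code does the job, except in the case where Number1 is off by 5 pixels and Text2 has no counterpart on its line.

Is there a better way to do this?

Source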
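To illustrate why fixed buckets such as `top DIV 5` can fail, here is a minimal Python sketch. The `(text, left, top)` tuple layout, the function names and the sample coordinates are all made up for illustration. Two words whose tops differ by only 2 pixels can still fall on opposite sides of a bucket boundary, whereas sorting by top and starting a new line only when the gap to the previous word exceeds a tolerance keeps them together.

```python
from typing import Dict, List, Tuple

# Hypothetical layout for one OCRed word: (text, left, top)
Word = Tuple[str, int, int]


def group_by_bucket(words: List[Word], size: int = 5) -> List[str]:
    """Group words by top // size, like GROUP BY top DIV 5 would."""
    buckets: Dict[int, List[Tuple[int, str]]] = {}
    for text, left, top in words:
        buckets.setdefault(top // size, []).append((left, text))
    # Within a bucket sort by left; buckets are emitted top to bottom.
    return [" ".join(t for _, t in sorted(b)) for _, b in sorted(buckets.items())]


def group_by_gap(words: List[Word], tolerance: int = 3) -> List[str]:
    """Sort by top; start a new line when the gap to the previous top is too big."""
    lines: List[List[Tuple[int, str]]] = []
    prev_top = None
    for text, left, top in sorted(words, key=lambda w: w[2]):
        if prev_top is None or top - prev_top > tolerance:
            lines.append([])
        lines[-1].append((left, text))
        prev_top = top
    return [" ".join(t for _, t in sorted(line)) for line in lines]


sample = [("Text1", 10, 4), ("Number1", 200, 6),
          ("Text2", 10, 30), ("Number2", 200, 31)]
print(group_by_bucket(sample))  # tops 4 and 6 land in different buckets
print(group_by_gap(sample))     # tops 4 and 6 stay on the same line
```

Note that the gap-based version compares each word with the previous one, so a long chain of slightly drifting tops could in principle merge two neighbouring lines; it is a sketch of the idea, not a robust line detector.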

2014-02-26 Салман

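Are 'Text1' and 'Number1' always present (no gaps or missing values)? Does the OCR software return the "Words" in any particular order to begin with? – tcarvin

No, anything can be in there: blanks, special characters and so on. Once the words are lined up into rows, I have other logic to parse out the meaningful information. I am not sure about the order, but since we sort by LeftPos it does not matter anyway. The problem is with TopPos: words with a top of 4-6 (allowing a tolerance of 3) should be placed on the same line. Thanks for reading the whole question :). –

I am not 100% sure how you identify the words that sit in the "left" column, but once such a word is identified, you can find the other words on its line by projecting not just the top coordinate but the whole rectangle (top and bottom) across the page and determining its overlap (intersection) with the other words. Note the area marked in red below.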

Horizontal projection

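You can use this to detect, within a tolerance, whether something is on the same line. If something overlaps by only one pixel, it probably belongs to the line below or above. But if it overlaps 50% or more of Text1, it is probably on the same line.

Example SQL to find all the words on a "line", based on the line's top and bottom coordinates: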
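The overlap rule can be sketched in a few lines of Python. The function names and the 0.5 default are illustrative assumptions; the sketch also assumes image coordinates where y grows downward, so `bottom` is greater than `top`. The ratio is measured against the height of the anchor word (Text1 in the example).

```python
def vertical_overlap_ratio(anchor_top: int, anchor_bottom: int,
                           word_top: int, word_bottom: int) -> float:
    """Fraction of the anchor's height covered by the word's vertical range."""
    height = anchor_bottom - anchor_top
    if height <= 0:
        return 0.0  # degenerate anchor rectangle, nothing to compare against
    overlap = min(anchor_bottom, word_bottom) - max(anchor_top, word_top)
    return max(0, overlap) / height


def on_same_line(anchor_top: int, anchor_bottom: int,
                 word_top: int, word_bottom: int,
                 threshold: float = 0.5) -> bool:
    """True when the word covers at least `threshold` of the anchor's height."""
    ratio = vertical_overlap_ratio(anchor_top, anchor_bottom, word_top, word_bottom)
    return ratio >= threshold


# Anchor spans 100..120. A word spanning 105..125 shares 15 of its 20 pixels,
# while a word spanning 119..140 shares only 1 pixel.
print(on_same_line(100, 120, 105, 125))
print(on_same_line(100, 120, 119, 140))
```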

select
    word.id, word.Top, word.Left, word.Right, word.Bottom
from
    word
where
    -- two vertical ranges overlap when each one starts before the other
    -- ends; this also catches a word that is taller than the anchor word
    -- and extends both above and below it
    word.Top <= @leftColWordBottom
    and word.Bottom >= @leftColWordTop
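For anyone who wants to try a query like this without a MySQL server, here is a small sketch against an in-memory SQLite database using Python's standard `sqlite3` module. The table layout, the helper names and the sample rectangles are assumptions for illustration. The `WHERE` clause uses the general range-overlap test (each range starts before the other ends), so a word taller than the anchor is matched as well.

```python
import sqlite3


def make_sample_db() -> sqlite3.Connection:
    """Create an in-memory table of word rectangles (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE word (id INTEGER PRIMARY KEY, text TEXT, '
        '"left" INTEGER, "top" INTEGER, "right" INTEGER, "bottom" INTEGER)'
    )
    conn.executemany(
        'INSERT INTO word (text, "left", "top", "right", "bottom") '
        'VALUES (?, ?, ?, ?, ?)',
        [
            ("Text1", 10, 100, 60, 120),
            ("Number1", 200, 95, 260, 125),   # taller than Text1
            ("Text2", 10, 140, 60, 160),
            ("Number2", 200, 142, 260, 162),
        ],
    )
    return conn


def words_on_line(conn: sqlite3.Connection, anchor_top: int, anchor_bottom: int):
    """Return the texts of all words that vertically overlap the anchor range."""
    sql = (
        'SELECT id, text FROM word '
        'WHERE "top" <= :anchor_bottom AND "bottom" >= :anchor_top '
        'ORDER BY "left"'
    )
    params = {"anchor_top": anchor_top, "anchor_bottom": anchor_bottom}
    return [text for _id, text in conn.execute(sql, params)]


conn = make_sample_db()
print(words_on_line(conn, 100, 120))  # words overlapping Text1's vertical range
```

This query accepts any overlap at all, even a single pixel, so a percentage threshold would still have to be applied on top of it.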

Example VB6 pseudocode for computing the lines:

'assume words is a collection of WordInfo objects with an Id, Top, 
' Left, Bottom, Right properties filled in, and a LineAnchorWordId 
' property that has not been set yet. 

'get the words in left-to-right order 
wordsLeftToRight = SortLeftToRight(words) 

'also get the words in top-to-bottom order 
wordsTopToBottom = SortTopToBottom(words) 

'pass through identifying a line "anchor", that being the left-most 
' word that starts (and defines) a line 
For Each anchorWord In wordsLeftToRight

    'check whether the word has been mapped to a line yet by checking
    ' whether its anchor property has been set. This assumes 0 is not
    ' a valid id; use -1 instead if needed
    If anchorWord.LineAnchorWordId = 0 Then

        'now locate every word on this line, as bounded by the
        ' anchorWord. Every word determined to be on this line
        ' gets its LineAnchorWordId property set to the Id of the
        ' anchorWord
        For Each lineWord In wordsTopToBottom

            If lineWord.Bottom < anchorWord.Top Then

                'skip it, it is above the line (but keep searching down
                ' because we haven't reached the anchorWord location yet)

            ElseIf lineWord.Top > anchorWord.Bottom Then

                'skip it, it is below the line, and exit the search
                ' early since all the rest will also be below the line
                Exit For

            ElseIf lineWord.LineAnchorWordId = 0 Then

                'only claim words that do not belong to a line yet, so
                ' a later anchor cannot take a word from an earlier line
                If OverlapsWithinTolerance(anchorWord, lineWord) Then
                    lineWord.LineAnchorWordId = anchorWord.Id
                End If

            End If

        Next lineWord

    End If

Next anchorWord

'at this point, every word has been assigned a LineAnchorWordId, 
' and every word on the same line will have a matching LineAnchorWordId 
' value. If stored in a DB you can now group them by LineAnchorWordId 
' and sort them by their Left coord to get your output.
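For readers who want something they can run, here is a rough Python rendering of the pseudocode above. The `WordInfo` dataclass, the 50% overlap rule inside `overlaps_within_tolerance` and the sample data are assumptions; the pseudocode leaves `OverlapsWithinTolerance` undefined. The sketch also skips words that already belong to a line, a guard added here so that a later anchor cannot reassign them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WordInfo:
    id: int                  # must be non-zero; 0 means "no line assigned yet"
    text: str
    left: int
    top: int
    right: int
    bottom: int
    line_anchor_id: int = 0


def overlaps_within_tolerance(anchor: WordInfo, word: WordInfo,
                              min_ratio: float = 0.5) -> bool:
    """Assumed rule: the word covers at least min_ratio of the anchor's height."""
    height = anchor.bottom - anchor.top
    overlap = min(anchor.bottom, word.bottom) - max(anchor.top, word.top)
    return height > 0 and overlap / height >= min_ratio


def assign_lines(words: List[WordInfo]) -> None:
    """Set line_anchor_id on every word, walking anchors from left to right."""
    left_to_right = sorted(words, key=lambda w: w.left)
    top_to_bottom = sorted(words, key=lambda w: w.top)
    for anchor in left_to_right:
        if anchor.line_anchor_id != 0:
            continue  # already part of a line started by an earlier anchor
        for word in top_to_bottom:
            if word.bottom < anchor.top:
                continue  # above the line, keep scanning downward
            if word.top > anchor.bottom:
                break     # below the line, and so is everything after it
            if word.line_anchor_id == 0 and overlaps_within_tolerance(anchor, word):
                word.line_anchor_id = anchor.id


def lines_as_text(words: List[WordInfo]) -> List[str]:
    """Group words by anchor id and join each line in left-to-right order."""
    assign_lines(words)
    lines: Dict[int, List[WordInfo]] = {}
    for w in words:
        lines.setdefault(w.line_anchor_id, []).append(w)
    ordered = sorted(lines.values(), key=lambda ws: min(w.top for w in ws))
    return [" ".join(w.text for w in sorted(ws, key=lambda w: w.left))
            for ws in ordered]


sample = [
    WordInfo(1, "Text1", 10, 100, 60, 120),
    WordInfo(2, "Number1", 200, 105, 260, 125),  # 5 pixels lower than Text1
    WordInfo(3, "Text2", 10, 140, 60, 160),      # no counterpart on its line
    WordInfo(4, "Text3", 10, 180, 60, 200),
    WordInfo(5, "Number3", 200, 178, 260, 198),
]
print(lines_as_text(sample))
```

Because the inner loop exits as soon as a word starts below the anchor, each anchor only scans the words down to its own line rather than comparing every pair of words.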

Source

2014-02-26 13:28:27 tcarvin

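I understand the concept, and I do have all the coordinates needed to project the rectangles, but how do I actually do this in logic? I mean, all I can get is the X and Y of each word. Finding the overlap between words would be too slow, I think. –

You can do this either in code or in the database. I don't know your database, but have a look at the edit above. – tcarvin

Added another code example. – tcarvin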
