2014-02-06 31 views
1

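Fuzzy matching multiple words in a string

I'm trying to use Levenshtein distance to find fuzzy keywords (static text) on an OCR'd page.
To do this, I want to specify an allowed error percentage (say 15%).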

string Keyword = "past due electric service"; 

Since the keyword is 25 characters long, I want to allow 4 errors (25 * 0.15 = 3.75, rounded to 4).
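
To make the arithmetic concrete, here is a minimal sketch of the error-budget calculation. `ErrorBudget` and `AllowedErrors` are hypothetical names that do not appear in the original code, and rounding midpoints away from zero is an assumption.

```csharp
using System;

public static class ErrorBudget
{
    // Hypothetical helper: number of edits tolerated for a keyword
    // at a given error percentage (e.g. 15 for 15%).
    public static int AllowedErrors(string keyword, decimal errorPercent)
    {
        // Divide by 100m so the whole expression stays in decimal arithmetic.
        decimal raw = keyword.Length * (errorPercent / 100m);
        return (int)Math.Round(raw, MidpointRounding.AwayFromZero);
    }
}
```

Calling `ErrorBudget.AllowedErrors("past due electric service", 15m)` evaluates 25 * 0.15 = 3.75 and rounds it to 4.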
I need to be able to compare it to...

string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank " +
          "you! current electric service total balances unpaid 7 " +
          "days after the total due date are subject to a late " +
          "charge of 7.5% of the amount due or $2.00, whichever/5 " +
          "greater. ";

This is how I'm doing it now...

int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202 
int NumberOfErrorsAllowed = 4; 
int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205

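Obviously, Keyword is not found in the OCR text (and it shouldn't be). But with the Levenshtein distance computed this way, the error count (202) is within the allowance (205), so my logic reports it as found.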
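
One way to see why the whole-page comparison gives a false positive: the Levenshtein distance between two strings is never smaller than the difference in their lengths, so adding that difference to the allowance cancels almost all of the distance. Going by the figures quoted in the question, the length gap is 226 - 25 = 201, so a distance of 202 uses only 1 of the 4 allowed errors. The sketch below is only an illustration: `WholePageCheck` and `EditsBeyondLengthGap` are hypothetical names, and the distance function is a standard two-row Levenshtein variant.

```csharp
using System;

public static class WholePageCheck
{
    // Standard Levenshtein distance, keeping only two rows of the table.
    public static int Distance(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[t.Length];
    }

    // Hypothetical helper: edits left over once the unavoidable
    // length difference has been subtracted from the distance.
    public static int EditsBeyondLengthGap(string keyword, string page)
    {
        return Distance(keyword, page) - Math.Abs(page.Length - keyword.Length);
    }
}
```

A small value here only says that the keyword's characters can be found in order, scattered anywhere across the page. It does not say that they appear together as a phrase.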

Does anyone know a better way to do this?

+0

Posted a better question. http://goo.gl/Rb6ejp – Milne

Answers

1

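I answered my own question by using substrings. Posting it in case anyone else runs into the same kind of problem. It's a bit unorthodox, but it works well for me.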

decimal PossibleStringLength = PossibleString.Length; // length of the string to search in
decimal StaticTextLength = StaticText.Length; // length of the text to search for
// Number of errors allowed for the given ErrorAllowance percentage.
// Dividing by 100m keeps the division in decimal arithmetic even if ErrorAllowance is an int.
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);
int TextLengthBuffer = (int)StaticTextLength - 1; // start with windows one character shorter than the target text
int LowestLevenshteinNumber = 999999; // initialize to an insanely high maximum

// Look for the best match with 1 fewer character than it should have, then the correct number of characters,
// and last, with 1 more character. (This is because one letter can be recognized as
// two (W -> VV) and vice versa.)

for (int i = 0; i < 3; i++)
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = PossibleString.Substring(e - TextLengthBuffer, TextLengthBuffer);
        int lAllowance = (int)Math.Round((possibleResult.Length - StaticTextLength) + NumberOfErrorsAllowed, MidpointRounding.AwayFromZero);
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = new StaticTextResult { text = possibleResult, errors = lNumber };
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}




public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // First column: deleting i characters of s.
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }

    // First row: inserting j characters of t.
    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
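
For readers who want something they can paste and run, here is a self-contained sketch of the same sliding-window idea. It is not the code above: `FuzzyFinder`, `FuzzyMatch` and `FindBest` are hypothetical names, and it makes one simplifying assumption, namely that the raw distance of each window is compared against the error budget directly instead of adjusting the allowance by the window length.

```csharp
using System;

public sealed class FuzzyMatch
{
    public int Start;    // index of the window in the page
    public string Text;  // the matched window
    public int Errors;   // Levenshtein distance to the keyword
}

public static class FuzzyFinder
{
    // Standard Levenshtein distance (two-row variant).
    public static int Distance(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }

    // Tries windows of length keyword.Length - 1, keyword.Length and keyword.Length + 1
    // and returns the window with the fewest errors, or null if none is within budget.
    public static FuzzyMatch FindBest(string keyword, string page, decimal errorPercent)
    {
        int allowed = (int)Math.Round(keyword.Length * (errorPercent / 100m), MidpointRounding.AwayFromZero);
        FuzzyMatch best = null;

        for (int len = keyword.Length - 1; len <= keyword.Length + 1; len++)
        {
            if (len < 1 || len > page.Length) continue;
            for (int start = 0; start + len <= page.Length; start++)
            {
                string window = page.Substring(start, len);
                int errors = Distance(keyword, window);
                if (errors <= allowed && (best == null || errors < best.Errors))
                {
                    best = new FuzzyMatch { Start = start, Text = window, Errors = errors };
                }
            }
        }
        return best;
    }
}
```

For example, `FuzzyFinder.FindBest("past due electric service", "your account is past duc electrlc service today", 15m)` should return the window `past duc electrlc service` with 2 errors. A page with no window within the budget yields `null`.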
0

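I think it doesn't work because large chunks of your strings match. So what I would do is try splitting your keyword into separate words.

Then find all the places in your OCR_TEXT that match those words.

Then look at all the places where they matched, and see whether 4 of those places are consecutive and match the original phrase.

I'm not sure whether my explanation is clear?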
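
The per-word idea could be sketched roughly as follows. This is one reading of the suggestion, not the answerer's code: `WordwiseMatcher` and `FindPhrase` are hypothetical names, words are assumed to be separated by single spaces, and each word is assumed to get its own error budget from the same percentage.

```csharp
using System;

public static class WordwiseMatcher
{
    // Standard Levenshtein distance (two-row variant).
    static int Distance(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }

    // Returns the index (counted in words) of the first run of consecutive page words
    // that matches every keyword word within its per-word budget, or -1 if there is none.
    public static int FindPhrase(string keyword, string page, decimal errorPercent)
    {
        char[] sep = { ' ' };
        string[] keyWords = keyword.Split(sep, StringSplitOptions.RemoveEmptyEntries);
        string[] pageWords = page.Split(sep, StringSplitOptions.RemoveEmptyEntries);

        for (int start = 0; start + keyWords.Length <= pageWords.Length; start++)
        {
            bool allMatch = true;
            for (int k = 0; k < keyWords.Length && allMatch; k++)
            {
                int allowed = (int)Math.Round(keyWords[k].Length * (errorPercent / 100m), MidpointRounding.AwayFromZero);
                allMatch = Distance(keyWords[k], pageWords[start + k]) <= allowed;
            }
            if (allMatch) return start;
        }
        return -1;
    }
}
```

Note that with a per-word budget, short words get no tolerance at all: at 15% a one-letter word is allowed 0 errors, so `FindPhrase("I owe", "1 owe", 15m)` returns -1.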

+0

If I understand your answer correctly, I would lose the ability to declare NumberOfErrorsAllowed. No? – Milne

+0

Yes and no; it would be per word. –

+0

Per word doesn't work. A word could be "I", and if it gets recognized as "1" I would lose the result. See the answer I came up with. Thanks – Milne