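String similarity for name matching

I want to determine whether two different restaurant names are similar enough to match them. The names might be misspelled, or parts of the title might be in the wrong order.

In some cases it is a simple match: "Angry Diner" with "Angry Diner Restaurant", or "Burger King" with "Burgor King".

A harder case that I have found is "Mathias Dahlgren Matbaren" and "Restaurant Mathias Dahlgren".

I have looked at several fuzzy string difference algorithms, but have not found one that fits this use case.

Does anyone know of an algorithm and/or library I can use?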

2015-04-06 Heinrisch

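Depending on what you want to do, your question could be seen as a duplicate of the following question: http://stackoverflow.com/questions/29321760/how-to-check-a-partial-similarity-of-two-strings-in-c-sharp/29322466?noredirect=1#comment46836018_29322466 – Codor

I have already looked at and tried Levenshtein distance, but it does not work well when the words have been shuffled. – Heinrisch

Do you mean that, for example, the distance between "Burgor King" and "King Burger" should be smaller than their Levenshtein distance? – Codor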
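
To make the point in these comments concrete, here is a minimal Python sketch of plain Levenshtein distance. The function name `levenshtein` is just this sketch's own choice, not something from the thread. It shows why a single typo is cheap while swapped words are expensive.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]


typo = levenshtein("Burger King", "Burgor King")
shuffled = levenshtein("Burgor King", "King Burger")
print(typo, shuffled)
```

The typo pair differs by one substitution, whereas the shuffled pair needs many edits even though a human would consider the names close.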

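You could try a diff algorithm. Conceptually, it considers all possible subsequences of the two strings and finds the longest one they have in common (the longest common subsequence).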

Well, as mentioned above, the speed of similar_text is O(N^3). I've written a longest common subsequence version that is O(m*n), where m and n are the lengths of str1 and str2. The result is a percentage, and it seems to be exactly the same as the similar_text percentage but with better performance. Here are the 3 functions I'm using:

<?php 
function LCS_Length($s1, $s2) 
{ 
    $m = strlen($s1); 
    $n = strlen($s2); 

    // DP table: $LCS_Length_Table[$i][$j] is the LCS length of the first $i chars of $s1 and the first $j chars of $s2 
    $LCS_Length_Table = array(); 

    // initialise column 0 and row 0 to zero (both bounds are inclusive) 
    for($i=0; $i <= $m; $i++) $LCS_Length_Table[$i][0]=0; 
    for($j=0; $j <= $n; $j++) $LCS_Length_Table[0][$j]=0; 

    for ($i=1; $i <= $m; $i++) { 
        for ($j=1; $j <= $n; $j++) { 
            if ($s1[$i-1]==$s2[$j-1]) 
                $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j-1] + 1; 
            else if ($LCS_Length_Table[$i-1][$j] >= $LCS_Length_Table[$i][$j-1]) 
                $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j]; 
            else 
                $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i][$j-1]; 
        } 
    } 
    return $LCS_Length_Table[$m][$n]; 
} 

function str_lcsfix($s) 
{ 
    $s = str_replace(" ","",$s); 
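    // fold accented Latin letters to ASCII; ereg_replace was removed in PHP 7, so preg_replace is used 
    // the /u modifier treats pattern and subject as UTF-8 
    $s = preg_replace('/[èéêëÈÉÊË]/u', 'e', $s); 
    $s = preg_replace('/[àáâãäåÀÁÂÃÄÅ]/u', 'a', $s); 
    $s = preg_replace('/[ìíîïÌÍÎÏ]/u', 'i', $s); 
    $s = preg_replace('/[òóôõöÒÓÔÕÖ]/u', 'o', $s); 
    $s = preg_replace('/[ùúûüÙÚÛÜ]/u', 'u', $s); 
    $s = preg_replace('/[çÇ]/u', 'c', $s); 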
    return $s; 
} 

function get_lcs($s1, $s2) 
{ 
    //ok, now replace all spaces with nothing 
    $s1 = strtolower(str_lcsfix($s1)); 
    $s2 = strtolower(str_lcsfix($s2)); 

    $lcs = LCS_Length($s1,$s2); //longest common sub sequence 

    $ms = (strlen($s1) + strlen($s2))/2; 

    return (($lcs*100)/$ms); 
} 
?> 

You can skip calling str_lcsfix if you don't care about accented characters and the like, or you can extend it or modify it for better performance; I suspect the regex replacements are not the fastest way. 
Hope this helps. 
Georges

[1] http://php.net/manual/de/function.similar-text.php
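
For readers not working in PHP, the same idea can be sketched in Python. This is only an illustrative port under my own assumptions: the names `fold`, `lcs_length` and `lcs_similarity` are invented here, and accent folding uses Unicode decomposition from the standard `unicodedata` module instead of the hand-written replacement lists above.

```python
import unicodedata


def fold(s: str) -> str:
    """Lowercase, drop spaces and strip accents (rough analogue of str_lcsfix)."""
    decomposed = unicodedata.normalize("NFKD", s.lower().replace(" ", ""))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, O(m*n) time, O(n) memory."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length as a percentage of the mean length of the two folded strings."""
    a, b = fold(a), fold(b)
    mean_len = (len(a) + len(b)) / 2
    if mean_len == 0:
        return 100.0  # assumption of this sketch: two empty names count as identical
    return 100 * lcs_length(a, b) / mean_len


print(lcs_similarity("Burger King", "Burgor King"))
```

Keeping only two rows of the table is a design choice of this sketch; it gives the same length as the full table but uses less memory.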

2015-04-06 18:20:44 Bytemain

I think the best-fitting algorithm would be an optimal local alignment algorithm: the Smith-Waterman algorithm.

penalty("Angry Diner","Angry Diner Restaurant") = 0 
penalty("Burger King", "Burgor King") = 1 
penalty("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren") = 0

It is a variant of the Levenshtein algorithm; the difference is that inserting or deleting characters at the beginning or end is not penalized.
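
A minimal Python sketch of Smith-Waterman follows. Note one difference from the informal `penalty(...)` notation above: the textbook formulation maximizes a score (matches are rewarded, mismatches and gaps are penalized, and the score never drops below zero) instead of minimizing a penalty. The scoring values and the normalization in `local_similarity` are assumptions of this sketch, not something the answer specifies.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Clamping at 0 lets an alignment start anywhere, so unmatched
            # text before or after the aligned region costs nothing.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def local_similarity(a: str, b: str, match: int = 2) -> float:
    """Score scaled to 0..1 by the best score the shorter string could reach."""
    a, b = a.lower(), b.lower()
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman(a, b, match=match) / (match * shorter)


print(local_similarity("Angry Diner", "Angry Diner Restaurant"))
print(local_similarity("Burger King", "Burgor King"))
print(local_similarity("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren"))
```

With these settings "Angry Diner" against "Angry Diner Restaurant" reaches the maximum of 1.0, because the shorter name aligns completely inside the longer one.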

2015-04-08 04:58:14 CoronA

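First: you will get better results if you have more than just the name to match on, such as the address. Then you can use a record linkage engine to weigh the evidence from all attributes. In most cases, matching on the name alone gives poor precision.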
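
As a very rough illustration of combining evidence from several attributes, the sketch below takes a weighted average of per-field similarities. Everything in it is an assumption made for illustration: the field names, the weights and the sample records are hypothetical, and real record linkage engines typically use probabilistic models instead of a fixed weighted average. The only library call used is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical weights; a real engine would estimate these from data.
FIELD_WEIGHTS = {"name": 0.5, "address": 0.5}


def field_similarity(a: str, b: str) -> float:
    """Similarity in 0..1 between two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, weights=FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )
    return score / total


# Hypothetical records: same name, different address.
first = {"name": "Angry Diner", "address": "1 Example Street"}
second = {"name": "Angry Diner", "address": "99 Sample Road"}
print(record_similarity(first, second))
```

Two records with identical names but different addresses score below a perfect match here, which is the effect the answer is after.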

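The first thing you need to consider is whether you are likely to see reordering of substrings, that is, "restaurant angry diner" vs "angry diner restaurant". In that case, q-grams, longest common substring and longest common subsequence are all good candidates. For q-grams, you can choose between various sub-formulas and ways of matching.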
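
Here is a small Python sketch of a q-gram comparison. The choice of q = 2 and of the Dice coefficient over q-gram multisets is just one of the possible sub-formulas, picked for this example; the function names are my own.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of all overlapping substrings of length q (case-insensitive)."""
    s = s.lower()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_dice(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient: 2 * shared q-grams / total q-grams, in 0..1."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * shared / total if total else 0.0


print(qgram_dice("Restaurant Angry Diner", "Angry Diner Restaurant"))
```

Because q-grams only look at short local windows, moving a whole word changes just the few q-grams that straddle the word boundaries, so reordered names still score high.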

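If you want order to matter, affine gap distance would probably work well for this particular case. It is similar to Smith-Waterman, but does not penalize deletions as heavily. Basically, the first deletion is expensive, but subsequent deletions at the same place are cheap.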
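
The affine gap idea can be sketched in Python with three dynamic programming tables (often attributed to Gotoh). This is a global alignment cost, and the cost values `mismatch`, `gap_open` and `gap_extend` are assumptions chosen only to make the effect visible.

```python
import math


def affine_gap_distance(a: str, b: str, mismatch: float = 1.0,
                        gap_open: float = 1.0, gap_extend: float = 0.1) -> float:
    """Global alignment cost where opening a gap is dear and extending it is cheap."""
    m, n = len(a), len(b)
    inf = math.inf
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] deleted; Y: b[j-1] inserted.
    M = [[inf] * (n + 1) for _ in range(m + 1)]
    X = [[inf] * (n + 1) for _ in range(m + 1)]
    Y = [[inf] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = min(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return min(M[m][n], X[m][n], Y[m][n])


print(affine_gap_distance("angry diner", "angry diner restaurant"))
```

With these costs the missing " restaurant" counts as one long gap (one opening plus ten cheap extensions), whereas plain Levenshtein would charge for all eleven characters.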

As others have suggested, removing common words such as "restaurant", "matbaren" etc. before matching is likely to improve precision.
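
A small Python sketch of that preprocessing step is shown below. The stop-word list is hypothetical and would need to be built for the actual data. Sorting the remaining tokens is an extra step added in this sketch (not part of the suggestion above) so that word order stops mattering.

```python
import re

# Hypothetical stop-word list; extend it with words common in your own data.
STOP_WORDS = frozenset({"restaurant", "matbaren"})


def normalize_name(name: str, stop_words=STOP_WORDS) -> str:
    """Lowercase, drop punctuation and stop words, and sort the remaining tokens."""
    tokens = re.findall(r"\w+", name.lower())
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(sorted(kept))


print(normalize_name("Mathias Dahlgren Matbaren"))
print(normalize_name("Restaurant Mathias Dahlgren"))
```

Both names from the question reduce to the same normalized string, so any of the similarity measures discussed here can then be applied to the cleaned names.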

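There are plenty of libraries, but since you did not specify a programming language, it is hard to recommend one. What use is a Java library if you are using PHP? Or vice versa?

But please pay close attention to what I wrote above: the name alone is not going to work very well. Even if the names are identical, they could still be two completely different restaurants.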

2015-04-15 16:12:43 larsga
