2016-09-27

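Getting an incorrect score from fuzzywuzzy partial_ratio

I'm new to Python and I'm trying to use fuzzywuzzy for fuzzy matching. I believe I'm getting an incorrect match score from the partial_ratio function. Here is my exploratory code: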

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50 

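I believe this should return a score of 100, since the second string, 'Barbil', is contained in the first string. When I remove a few characters from the end or the beginning of the first string, I get a match score of 100.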

>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100 
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa') 
100 

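The score seems to switch from 50 to 100 once the length of the first string drops to 199 characters. Does anyone have any insight into what might be happening?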

Answer


This happens because when one of the strings is 200 characters or longer, an automatic junk heuristic gets turned on in Python's SequenceMatcher. This code should work for you:

from difflib import SequenceMatcher


def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    # autojunk=False switches off the "popular element" heuristic that
    # SequenceMatcher applies when its second sequence has 200+ items
    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    # e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    # block = (1, 3, 3)
    # best score === ratio("abcd", "Xbcd")
    scores = []
    for short_start, long_start, _ in blocks:
        # shift the window left by the block's offset in the shorter string
        # so both strings are aligned on the block (clamped at index 0)
        window_start = max(long_start - short_start, 0)
        window_end = window_start + len(shorter)
        long_substr = longer[window_start:window_end]

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        scores.append(r)

    return max(scores) * 100.0
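
To see the heuristic in isolation, here is a minimal sketch that uses only difflib. The helper `longest_match` and the `needle` / haystack strings are made up for illustration and are not part of fuzzywuzzy. As I understand the difflib documentation, once the second sequence has at least 200 items, any item whose duplicates make up more than 1% of that sequence is marked "popular" and ignored when matches are seeded, so a substring built entirely from frequent characters can be missed.

```python
from difflib import SequenceMatcher


def longest_match(needle, haystack, autojunk=True):
    """Return the size of the largest matching block found by SequenceMatcher."""
    m = SequenceMatcher(None, needle, haystack, autojunk=autojunk)
    return max(block.size for block in m.get_matching_blocks())


needle = "spam"
short_haystack = "-" + "spam " * 39  # 196 characters: below the threshold
long_haystack = "-" + "spam " * 40   # 201 characters: heuristic is active

print(longest_match(needle, short_haystack))                 # 4
print(longest_match(needle, long_haystack))                  # 0
print(longest_match(needle, long_haystack, autojunk=False))  # 4
```

In the longer haystack every character of "spam" is frequent enough to count as popular, so the default matcher finds no block at all, while passing autojunk=False recovers the full four-character match. That is the same 199 versus 200 character switch described in the question.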