Defining acronyms when running VB code to calculate similarity

I am using the following VB code in Excel to calculate the degree of similarity between column A and column B. It works very well.

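The next step for me is to define acronyms so that they do not affect the calculated similarity. For example, if I have "ABC LLC" in column A and "ABC limited liability company" in column B, the current VB code reports that the two columns are not very similar. However, I want them to come out as 100% similar by defining that "LLC" and "limited liability company" are in fact the same thing. What can I do, and where in the code can I put it, to accomplish this? Thanks!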

Disclaimer: yes, I know there are add-ins that do this. However, my data set is too large to use them.

Public Function Similarity(ByVal String1 As String, _ 
          ByVal String2 As String, _ 
          Optional ByRef RetMatch As String, _ 
          Optional min_match = 1) As Single 

'Returns the similarity between 2 strings as a ratio from 0 to 1 (ignores case) 

'"RetMatch" returns the characters that match (in order) 
'"min_match" specifies the minimum number of chars in a row to match 


Dim b1() As Byte, b2() As Byte 
Dim lngLen1 As Long, lngLen2 As Long 
Dim lngResult As Long 

    If UCase(String1) = UCase(String2) Then  '..Exactly the same 
        Similarity = 1 

    Else           '..one string is empty 
        lngLen1 = Len(String1) 
        lngLen2 = Len(String2) 
        If (lngLen1 = 0) Or (lngLen2 = 0) Then 
            Similarity = 0 

        Else          '..otherwise find similarity 
            b1() = StrConv(UCase(String1), vbFromUnicode) 
            b2() = StrConv(UCase(String2), vbFromUnicode) 
            lngResult = Similarity_sub(0, lngLen1 - 1, _ 
                    0, lngLen2 - 1, _ 
                    b1, b2, _ 
                    String1, _ 
                    RetMatch, _ 
                    min_match) 
            Erase b1 
            Erase b2 
            If lngLen1 >= lngLen2 Then 
                Similarity = lngResult / lngLen1 
            Else 
                Similarity = lngResult / lngLen2 
            End If 
        End If 
    End If 

End Function 

Private Function Similarity_sub(ByVal start1 As Long, ByVal end1 As Long, _ 
           ByVal start2 As Long, ByVal end2 As Long, _ 
           ByRef b1() As Byte, ByRef b2() As Byte, _ 
           ByVal FirstString As String, _ 
           ByRef RetMatch As String, _ 
           ByVal min_match As Long, _ 
           Optional recur_level As Integer = 0) As Long 
'* CALLED BY: Similarity * (RECURSIVE) 

Dim lngCurr1 As Long, lngCurr2 As Long 
Dim lngMatchAt1 As Long, lngMatchAt2 As Long 
Dim i As Long 
Dim lngLongestMatch As Long, lngLocalLongestMatch As Long 
Dim strRetMatch1 As String, strRetMatch2 As String 

    If (start1 > end1) Or (start1 < 0) Or (end1 - start1 + 1 < min_match) _ 
    Or (start2 > end2) Or (start2 < 0) Or (end2 - start2 + 1 < min_match) Then 
    Exit Function  '(exit if start/end is out of string, or length is too short) 
    End If 

    For lngCurr1 = start1 To end1  '(for each char of first string) 
        For lngCurr2 = start2 To end2  '(for each char of second string) 
            i = 0 
            Do Until b1(lngCurr1 + i) <> b2(lngCurr2 + i) 'as long as chars DO match.. 
                i = i + 1 
                If i > lngLongestMatch Then  '..if longer than previous best, store starts & length 
                    lngMatchAt1 = lngCurr1 
                    lngMatchAt2 = lngCurr2 
                    lngLongestMatch = i 
                End If 
                If (lngCurr1 + i) > end1 Or (lngCurr2 + i) > end2 Then Exit Do 
            Loop 
        Next lngCurr2 
    Next lngCurr1 

    If lngLongestMatch < min_match Then Exit Function 'no matches at all, so no point checking for sub-matches! 

    lngLocalLongestMatch = lngLongestMatch     'call again for BEFORE + AFTER 
    RetMatch = "" 
           'Find longest match BEFORE the current position 
    lngLongestMatch = lngLongestMatch _ 
        + Similarity_sub(start1, lngMatchAt1 - 1, _ 
            start2, lngMatchAt2 - 1, _ 
            b1, b2, _ 
            FirstString, _ 
            strRetMatch1, _ 
            min_match, _ 
            recur_level + 1) 
    If strRetMatch1 <> "" Then 
    RetMatch = RetMatch & strRetMatch1 & "*" 
    Else 
    RetMatch = RetMatch & IIf(recur_level = 0 _ 
           And lngLocalLongestMatch > 0 _ 
           And (lngMatchAt1 > 1 Or lngMatchAt2 > 1) _ 
           , "*", "") 
    End If 

           'add local longest 
    RetMatch = RetMatch & Mid$(FirstString, lngMatchAt1 + 1, lngLocalLongestMatch) 

           'Find longest match AFTER the current position 
    lngLongestMatch = lngLongestMatch _ 
        + Similarity_sub(lngMatchAt1 + lngLocalLongestMatch, end1, _ 
            lngMatchAt2 + lngLocalLongestMatch, end2, _ 
            b1, b2, _ 
            FirstString, _ 
            strRetMatch2, _ 
            min_match, _ 
            recur_level + 1) 

    If strRetMatch2 <> "" Then 
    RetMatch = RetMatch & "*" & strRetMatch2 
    Else 
    RetMatch = RetMatch & IIf(recur_level = 0 _ 
           And lngLocalLongestMatch > 0 _ 
           And ((lngMatchAt1 + lngLocalLongestMatch < end1) _ 
            Or (lngMatchAt2 + lngLocalLongestMatch < end2)) _ 
           , "*", "") 
    End If 
          'Return result 
    Similarity_sub = lngLongestMatch 

End Function

2017-02-15 jonv

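If you can create an array of the acronyms and their definitions (maybe on another worksheet?), you could add a check that tests whether the two values refer to each other through an index/match on that table. This could be part of a Select Case, where the first Case is your usual check, the second Case is this index/match check, and the third Case is "not similar". Just a thought. – Cyril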

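Without getting too deep into the full solution, which is your own responsibility, I can suggest a way to handle those abbreviations. Note, however, that this approach is not guaranteed to succeed 100% of the time, but you are already in the fuzzy world.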

Suppose we have a Dictionary where:

the key is the long phrase
the value is its abbreviation

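Before comparing two strings, we reduce both of them by replacing every occurrence of a long phrase with its abbreviation. Then we can compare them with your Similarity method (or with any other method).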

' Fills an abbreviation dictionary 
Sub InitializeDict(ByRef abbrev As Scripting.Dictionary) 
    abbrev("limited liability company") = "LLC" 
    abbrev("United Kingdom") = "U.K." 
    '... Add all abbreviations into dict 

    ' Instead of hardcoding, you can better load the key/value 
    ' pairs from a dedicated worksheet... 

End Sub 

' Minimizes s by replacing long phrases with their abbreviations. 
' Scripting.Dictionary needs a reference to "Microsoft Scripting Runtime". 
Sub Abbreviate(ByRef s As String) 
    Static abbrev As Scripting.Dictionary ' <-- static, initialized only once 
    If abbrev Is Nothing Then 
        Set abbrev = CreateObject("Scripting.Dictionary") 
        abbrev.CompareMode = vbTextCompare 
        InitializeDict abbrev 
    End If 

    Dim phrase 
    For Each phrase In abbrev.Keys 
        ' Compare is the 6th argument of Replace, so Start and Count are skipped; 
        ' passing vbTextCompare as the 4th argument would set Start instead 
        s = Replace(s, phrase, abbrev(phrase), , , vbTextCompare) 
    Next 
End Sub 

' A small amendment to this function: abbreviate strings before comparing 
Public Function Similarity(ByVal String1 As String, _ 
         ByVal String2 As String, _ 
         Optional ByRef RetMatch As String, _ 
         Optional min_match = 1) As Single 

    Abbreviate String1 
    Abbreviate String2 
    ' ... Rest of the routine 
End Function
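
The comment in InitializeDict suggests loading the pairs from a dedicated worksheet rather than hardcoding them. A minimal sketch of that idea follows. The sheet name "Abbreviations", the column layout (long phrase in column A, abbreviation in column B, no header row) and the procedure name InitializeDictFromSheet are all assumptions made for illustration, not part of the original answer.

```vb
' Sketch: fills the dictionary from a lookup sheet instead of hardcoding.
' Assumes a worksheet named "Abbreviations" (hypothetical) with the long
' phrase in column A and its abbreviation in column B, starting at row 1.
Sub InitializeDictFromSheet(ByVal abbrev As Object)
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim phrase As String

    Set ws = ThisWorkbook.Worksheets("Abbreviations")
    ' Last used row in column A
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    For r = 1 To lastRow
        phrase = Trim$(CStr(ws.Cells(r, "A").Value))
        If Len(phrase) > 0 Then      ' skip blank rows
            abbrev(phrase) = CStr(ws.Cells(r, "B").Value)
        End If
    Next r
End Sub
```

To use it, call InitializeDictFromSheet abbrev in place of InitializeDict abbrev inside Abbreviate. If some phrases contain others (say, a hypothetical "company" entry alongside "limited liability company"), it is probably safest to list the longer phrase first on the sheet, since the replacements run in the order the keys were added.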

2017-02-15 19:40:25

I think I've got it, thanks a lot! – jonv

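@jonv You're welcome. Please keep us updated on whether implementing this idea (which is actually yours; I only suggested a technical implementation) greatly improves your similarity check. I'm interested ;) –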

It may be easier to check whether the strings are Like each other. For example,

If "ABC limited liability company" Like "ABC L*L*C*" Then

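is True, because * matches any 0 or more characters. This assumes Option Compare Text is in effect; under the default binary comparison, Like is case-sensitive and the lowercase "l" would not match "L".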

Option Compare Text ' makes string comparisons case insensitive 

Function areLike(str1 As String, str2 As String) As Single 

    If str1 = str2 Then areLike = 1: Exit Function 

    Dim pattern As String, temp As String 

    If LenB(str1) < LenB(str2) Then 
     pattern = str1 
     temp = str2 
    Else 
     pattern = str2 
     temp = str1 
    End If 

    pattern = StrConv(pattern, vbUnicode)  ' "ABC LLC" to "A␀B␀C␀ ␀L␀L␀C␀" 
    pattern = Replace(pattern, vbNullChar, "*") ' "A*B*C* *L*L*C*" 
    pattern = Replace(pattern, " *", " ")  ' "A*B*C* L*L*C*" 

    If temp Like pattern Then areLike = 1: Exit Function 

    ' else areLike = some other similarity function 

End Function
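
The last comment in areLike leaves the fallback open ("some other similarity function"). One way to fill it in, sketched here under the assumption that both areLike and the Similarity function from the question are available in the same project, is a small wrapper. CombinedSimilarity is a hypothetical name introduced only for this example.

```vb
' Sketch: try the cheap Like-based check first, then fall back to the
' slower character-matching Similarity function from the question.
' CombinedSimilarity is a hypothetical name, not part of the original code.
Public Function CombinedSimilarity(ByVal s1 As String, ByVal s2 As String) As Single
    ' areLike returns 1 on a match and 0 (the default) otherwise
    CombinedSimilarity = areLike(s1, s2)
    If CombinedSimilarity = 0 Then
        CombinedSimilarity = Similarity(s1, s2)
    End If
End Function
```

On a worksheet this could be called as =CombinedSimilarity(A2, B2). Keep in mind that the Like pattern is quite permissive: "ABC LLC" becomes "A*B*C* L*L*C*", which also matches unrelated text such as "A Big Cat Loves Little Cats", and characters like ?, # or [ in the shorter string are treated as Like wildcards. It may be worth spot-checking the pairs that score 1 this way.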

2017-02-15 22:29:06 Slai
