我认为你应该使用Longest Common Subsequence代替Levenshtein距离的长度。这对你的情况来说似乎是一个更好的指标。本质上,它优先插入和删除替换,如我在我的评论中所建议的。 (Inggers) - >“Ingersoll”和“Ing” - >“Boylan”(分数为3和1)处理没有问题的空间(“New Ing” - >“New Ingersoll”得分7其中“New Ing” - >“Boylan”再次得分1),并且如果遇到像“Ingsl”这样的缩写,也会很好地工作。
算法很简单。如果您的两个字符串的长度为m和n,请按字符比较字符串的连续前缀(以空白前缀开头),将分数保存在大小为m + 1,n + 1的矩阵中。如果某一对匹配,则在前两个前缀的分数中添加一个(矩阵中剩下的一行上一列);否则保持这些前缀的两个分数中的最高分(比较上面的条目和立即留下的条目并采取最好的条件)。当你经历了两个字符串时,得分矩阵中的最后一项是LCS的长度。
为 “Ingsll” 和 “的Ingersoll”
例记分矩阵:
0 1 2 3 4 5 6
I n g s l l
---------------
0 | 0 0 0 0 0 0 0
1 I | 0 1 1 1 1 1 1
2 n | 0 1 2 2 2 2 2
3 g | 0 1 2 3 3 3 3
4 e | 0 1 2 3 3 3 3
5 r | 0 1 2 3 3 3 3
6 s | 0 1 2 3 4 4 4
7 o | 0 1 2 3 4 4 4
8 l | 0 1 2 3 4 5 5
9 l | 0 1 2 3 4 5 6
下面是一个ObjC执行长度。这里的大部分复杂性只是由于想要处理组合字符序列 - 例如@"o̶"
- 正确。
#import <Foundation/Foundation.h>
@interface NSString (WSSComposedLength)
- (NSUInteger)WSSComposedLength;
@end
@implementation NSString (WSSComposedLength)
- (NSUInteger)WSSComposedLength
{
__block NSUInteger length = 0;
[self enumerateSubstringsInRange:(NSRange){0, [self length]}
options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired
usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
length++;
}];
return length;
}
@end
@interface NSString (WSSLongestCommonSubsequence)
- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;
@end
@implementation NSString (WSSLongestCommonSubsequence)
- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target
{
NSUInteger * const * scores;
scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];
return scores[[target WSSComposedLength]][[self WSSComposedLength]];
}
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target
{
NSUInteger * const * scores;
scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];
//FIXME: Implement this.
return nil;
}
- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{
NSUInteger selfLength = [self WSSComposedLength];
NSUInteger targetLength = [target WSSComposedLength];
NSMutableData * scoresData = [NSMutableData dataWithLength:(targetLength + 1) * sizeof(NSUInteger *)];
NSUInteger ** scores = [scoresData mutableBytes];
for(NSUInteger i = 0; i <= targetLength; i++){
scores[i] = [[NSMutableData dataWithLength:(selfLength + 1) * sizeof(NSUInteger)] mutableBytes];
}
/* Ranges in the enumeration Block are the same measure as
* -[NSString length] -- i.e., 16-bit code units -- as opposed to
* _composed_ length, which counts code points. Thus:
*
* Enumeration will miss the last character if composed length is used
* as the range and there's a substring range with length greater than one.
*/
NSRange selfFullRange = (NSRange){0, [self length]};
NSRange targetFullRange = (NSRange){0, [target length]};
/* Have to keep track of these indexes by hand, rather than using the
* Block's substringRange.location because, e.g., @"o̶", will have
* range {x, 2}, and the next substring will be {x+2, l}.
*/
__block NSUInteger col = 0;
__block NSUInteger row = 0;
[target enumerateSubstringsInRange:targetFullRange
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString * targetSubstring,
NSRange targetSubstringRange,
NSRange _, BOOL * _0)
{
row++;
col = 0;
[self enumerateSubstringsInRange:selfFullRange
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString * selfSubstring,
NSRange selfSubstringRange,
NSRange _, BOOL * _0)
{
col++;
NSUInteger newScore;
if([selfSubstring isEqualToString:targetSubstring]){
newScore = 1 + scores[row - 1][col - 1];
}
else {
NSUInteger upperScore = scores[row - 1][col];
NSUInteger leftScore = scores[row][col - 1];
newScore = MAX(upperScore, leftScore);
}
scores[row][col] = newScore;
}];
}];
return scoresData;
}
@end
int main(int argc, const char * argv[])
{
@autoreleasepool {
NSArray * testItems = @[@{@"source" : @"Ingso̶ll",
@"targets": @[
@{@"string" : @"Ingersoll",
@"score" : @6,
@"sequence" : @"Ingsll"},
@{@"string" : @"Boylan",
@"score" : @1,
@"sequence" : @"n"},
@{@"string" : @"New Ingersoll",
@"score" : @6,
@"sequence" : @"Ingsll"}]},
@{@"source" : @"Ing",
@"targets": @[
@{@"string" : @"Ingersoll",
@"score" : @3,
@"sequence" : @"Ing"},
@{@"string" : @"Boylan",
@"score" : @1,
@"sequence" : @"n"},
@{@"string" : @"New Ingersoll",
@"score" : @3,
@"sequence" : @"Ing"}]},
@{@"source" : @"New Ing",
@"targets": @[
@{@"string" : @"Ingersoll",
@"score" : @3,
@"sequence" : @"Ing"},
@{@"string" : @"Boylan",
@"score" : @1,
@"sequence" : @"n"},
@{@"string" : @"New Ingersoll",
@"score" : @7,
@"sequence" : @"New Ing"}]}];
for(NSDictionary * item in testItems){
NSString * source = item[@"source"];
for(NSDictionary * target in item[@"targets"]){
NSString * targetString = target[@"string"];
NSCAssert([target[@"score"] integerValue] ==
[source WSSLengthOfLongestCommonSubsequenceWithString:targetString],
@"");
// NSCAssert([target[@"sequence"] isEqualToString:
// [source longestCommonSubsequenceWithString:targetString]],
// @"");
}
}
}
return 0;
}
你可以改变打分到折扣插入?对于这个缩写用例,似乎你应该将替代计算为比插入更大的距离。就我所知,你根本就没有在寻找拼写错误。 – 2014-11-06 20:46:56
对,我在计算大部分情况下用序列替换空格。这实际上是一个很好的观察。但是,你如何计算这些数据呢? – Moshe 2014-11-06 20:50:56
嗯,这取决于你的得分的细节,但我说的是 - 为了你的目的,但不是在经典的莱文斯坦距离 - “我” - >“B”应该比“我” >“Ie”。 – 2014-11-06 20:56:07