2016-07-27 34 views
4

Finding the overlapping and non-overlapping parts of two strings

I have pairs of strings like these:

YHFLSPYVY     # answer
   LSPYVYSPR  # prediction
+++******ooo


  YHFLSPYVS  # answer
VEYHFLSPY    # prediction
oo*******++

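In the two examples above, I want to find the overlapping region (*) and the non-overlapping regions in the answer (+) and in the prediction (o).

How can I do this in Python?

I'm stuck with this: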

import re
# This is example 1
ans = "YHFLSPYVY"
pred = "LSPYVYSPR"
# Looks for the whole of pred inside ans, so a partial overlap is never found
matches = re.finditer(r'(?=(%s))' % re.escape(pred), ans)
print([m.start(1) for m in matches])
# []

The result I hope to get for example 1:

plus_len = 3 
star_len = 6 
ooo_len = 3 
+1

Do you want the first overlap, or the longest overlap?

+0

Do you also want the string made of * + o, or just the values of plus_len and so on?

+0

Looks like [longest common subsequence](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)

Answers

3

This is easy with difflib.SequenceMatcher.find_longest_match:

from difflib import SequenceMatcher 

def f(answer, prediction): 
    sm = SequenceMatcher(a=answer, b=prediction) 
    match = sm.find_longest_match(0, len(answer), 0, len(prediction)) 
    star_len = match.size 
    return (len(answer) - star_len, star_len, len(prediction) - star_len) 

The function returns a triple of integers (plus_len, star_len, ooo_len):

f('YHFLSPYVY', 'LSPYVYSPR') -> (3, 6, 3) 
f('YHFLSPYVS', 'VEYHFLSPY') -> (2, 7, 2) 
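
If the marker line itself is wanted rather than just the three lengths, the match positions returned by find_longest_match are enough to rebuild it. The following is only a sketch under assumptions: the function name alignment_markers is made up for illustration, autojunk=False is passed so that long inputs are not affected by difflib's junk heuristic, and the 'x' marker (a column covered by both strings but outside the longest match) is an invented convention, since the question does not define that case.

```python
from difflib import SequenceMatcher

def alignment_markers(answer, prediction):
    """Return (answer_line, prediction_line, marker_line) for display.

    Hypothetical helper: '+' marks answer-only columns, '*' the longest
    common block, 'o' prediction-only columns.
    """
    sm = SequenceMatcher(a=answer, b=prediction, autojunk=False)
    m = sm.find_longest_match(0, len(answer), 0, len(prediction))
    # Pad whichever string has the match closer to its start,
    # so that the matching block lines up in the same columns.
    off_a = max(0, m.b - m.a)   # left padding for the answer
    off_b = max(0, m.a - m.b)   # left padding for the prediction
    width = max(off_a + len(answer), off_b + len(prediction))
    start = off_a + m.a         # first column of the matching block
    markers = []
    for col in range(width):
        in_a = off_a <= col < off_a + len(answer)
        in_b = off_b <= col < off_b + len(prediction)
        if start <= col < start + m.size:
            markers.append('*')
        elif in_a and in_b:
            markers.append('x')  # assumed marker: covered by both, not matched
        elif in_a:
            markers.append('+')
        else:
            markers.append('o')
    return ' ' * off_a + answer, ' ' * off_b + prediction, ''.join(markers)

for line in alignment_markers('YHFLSPYVY', 'LSPYVYSPR'):
    print(line)
```

For the two pairs in the question this should reproduce the marker lines shown there, +++******ooo and oo*******++.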
+0

SO is like one super-smart brain: the question was answered in less than a second :D!

1

You can use difflib:

import difflib

ans = "YHFLSPYVY"
pred = "LSPYVYSPR"

def get_overlap(s1, s2):
    # Longest contiguous block shared by s1 and s2
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
    return s1[pos_a:pos_a+size]

overlap = get_overlap(ans, pred)
plus = ans.replace(overlap, "")   # answer-only part
oo = pred.replace(overlap, "")    # prediction-only part

print(len(overlap))  # 6
print(len(plus))     # 3
print(len(oo))       # 3
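
One caveat with a str.replace-based approach: replace removes every occurrence of the overlap, so if the shared block happens to appear more than once in a string, the leftover part comes out too short. A possible variant, sketched here under the assumption that only the matched block itself should be cut out, slices by the positions that find_longest_match reports. The name split_around_overlap is hypothetical, and autojunk=False is an added choice to keep difflib's junk heuristic out of the way on long inputs.

```python
from difflib import SequenceMatcher

def split_around_overlap(s1, s2):
    """Return (overlap, rest_of_s1, rest_of_s2), cutting by position."""
    sm = SequenceMatcher(None, s1, s2, autojunk=False)
    pos_a, pos_b, size = sm.find_longest_match(0, len(s1), 0, len(s2))
    overlap = s1[pos_a:pos_a + size]
    # Remove only the matched block, not every repeat of the same text.
    rest1 = s1[:pos_a] + s1[pos_a + size:]
    rest2 = s2[:pos_b] + s2[pos_b + size:]
    return overlap, rest1, rest2

print(split_around_overlap("YHFLSPYVY", "LSPYVYSPR"))
# Repeated block: replace() would leave "" for the first string here,
# while slicing keeps the second "AB".
print(split_around_overlap("ABAB", "AB"))
```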