I wrote my own version for you (Python 2.7):
from __future__ import division  # true division, so 7/17 is not truncated to 0
import time
from fuzzywuzzy import fuzz

one = "different simliar"
two = "similar"

def compare(first, second):
    # The words of the shorter string are looked up in the longer one.
    smaller, bigger = sorted([first, second], key=len)
    s_smaller = smaller.split()
    s_bigger = bigger.split()
    # A word counts as a match when it uses the same set of characters.
    bigger_sets = [set(word) for word in s_bigger]
    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += len(word)
    if counter:
        return counter/len(' '.join(s_bigger))*100 # percentage match
    return counter

start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)

start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)

print
print "<simliar or similar>/<length of bigger>*100%"
print 7/len(one)*100
print
print "Equals?"
print 7/len(one)*100 == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
So I think mine is faster, but is it more accurate for you? You decide.
Edit: it is now not only faster but also more accurate.
Result:
match: 41.1764705882
compare: --- 4.19616699219e-05 seconds ---
match: 50
fuzzy: --- 7.39097595215e-05 seconds ---
<simliar or similar>/<length of bigger>*100%
41.1764705882
Equals?
True
Faster than fuzzy?
True
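The match test in compare() is `set(word) in bigger_sets`, so two words count as equal whenever they use the same characters, regardless of order or repetition. A minimal sketch of what that implies, written for Python 3 (the helper name same_letters is mine, not part of the code above):

```python
def same_letters(a, b):
    # True when both words use exactly the same set of characters,
    # ignoring order and how often each character occurs.
    return set(a) == set(b)

print(same_letters("simliar", "similar"))  # transposed letters still match
print(same_letters("aab", "ab"))           # repeated letters are ignored
print(same_letters("similar", "simular"))  # a different letter breaks the match
```

This is why the transposition typo "simliar" matches "similar", but it also means unrelated anagrams would be treated as matches.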
Of course, if you want to check by words, like fuzzywuzzy, then here you go:
from __future__ import division  # true division, so 1/2 is not truncated to 0
import time
from fuzzywuzzy import fuzz

one = "different simliar"
two = "similar"

def compare(first, second):
    # The words of the shorter string are looked up in the longer one.
    smaller, bigger = sorted([first, second], key=len)
    s_smaller = smaller.split()
    s_bigger = bigger.split()
    # A word counts as a match when it uses the same set of characters.
    bigger_sets = [set(word) for word in s_bigger]
    counter = 0
    for word in s_smaller:
        if set(word) in bigger_sets:
            counter += 1  # count matched words, not matched characters
    if counter:
        return counter/len(s_bigger)*100 # percentage match
    return counter

start_time = time.time()
print "match: ", compare(one, two)
compare_time = time.time() - start_time
print "compare: --- %s seconds ---" % (compare_time)

start_time = time.time()
print "match: ", fuzz.ratio(one, two)
fuzz_time = time.time() - start_time
print "fuzzy: --- %s seconds ---" % (fuzz_time)

print
print "Equals?"
print fuzz.ratio(one, two) == compare(one, two)
print
print "Faster than fuzzy?"
print compare_time < fuzz_time
Result:
match: 50.0
compare: --- 7.20024108887e-05 seconds ---
match: 50
fuzzy: --- 0.000125169754028 seconds ---
Equals?
True
Faster than fuzzy?
True
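The timings above come from a single time.time() measurement that also includes the print call, so they can vary a lot between runs. A hedged sketch of a steadier measurement with the standard timeit module, using a Python 3 adaptation of the word-level compare() (the helper best_time and its defaults are my own assumptions; fuzz.ratio could be timed the same way if fuzzywuzzy is installed):

```python
import timeit

def compare(first, second):
    """Python 3 adaptation of the word-level compare() above."""
    smaller, bigger = sorted([first, second], key=len)
    bigger_sets = [set(word) for word in bigger.split()]
    if not bigger_sets:
        return 0
    matches = sum(1 for word in smaller.split() if set(word) in bigger_sets)
    return matches / len(bigger_sets) * 100  # percentage of matched words

def best_time(func, number=10000, repeat=5):
    """Best per-call time in seconds over several repeated runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

one = "different simliar"
two = "similar"
print("match:", compare(one, two))
print("compare: %.3g seconds per call" % best_time(lambda: compare(one, two)))
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement.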
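If you really do need to compare every element against every other element, there is no way around the expensive O(n^2) double loop you are worried about. However, if you give us more information about the problem you are trying to solve, the type of elements involved, and why you feel you must compare every element, we may be able to help you optimize. –
The idea is to count how many times each of these 1500 statements appears in a list of tweets (which contains a few thousand entries). – VnC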
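If exact matching is acceptable for that counting task, one possible way to avoid an explicit statements-by-tweets double loop is to scan each tweet once with a combined regular expression. This is only a sketch in Python 3 under my own assumptions: matching is case-insensitive on plain substrings, there is no fuzzy matching, and the function name count_phrases is hypothetical.

```python
import re
from collections import Counter

def count_phrases(phrases, tweets):
    """Count exact, case-insensitive occurrences of each phrase in the tweets.

    Matches are non-overlapping: text consumed by one match is not
    counted again for another phrase.
    """
    # Longest phrases first, so the alternation prefers the longest match.
    ordered = sorted({p.lower() for p in phrases if p}, key=len, reverse=True)
    counts = Counter({p: 0 for p in ordered})
    if not ordered:
        return counts
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    for tweet in tweets:
        counts.update(pattern.findall(tweet.lower()))
    return counts

phrases = ["good morning", "python"]
tweets = ["Good morning from Python land", "python python", "nothing here"]
print(count_phrases(phrases, tweets))
```

Typo-tolerant matching, as in the compare() function above, would need a different approach, for example indexing tweets by word before comparing.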