2016-02-29 90 views
2

Fuzzy search in Python

I have a large sample of text, for example:

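“The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is.”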

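I am trying to detect whether "engage the prognosis for survival" occurs in the text, but in a fuzzy way. For example, "has engage the progronosis of survival" must also return a positive answer.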

I looked at fuzzywuzzy, NLTK and the new fuzzy-matching features of the regex module, but I could not find a way to do this:

if [anything similar (>90%) to "that sentence"] in mybigtext: 
    print True 

I'm new here, but I think this should solve your problem: http://stackoverflow.com/questions/30449452/python-fuzzy-text-search?rq=1


Take a look at [gensim](https://radimrehurek.com/gensim/index.html), especially the [similarity section](https://radimrehurek.com/gensim/tut3.html). – Jan

Answers

0

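Below is a function that reports a match whenever a word of the phrase is contained in the text. You can adapt it to check for complete phrases in the text.

This is the function I came up with: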

def FuzzySearch(text, phrase):
    """Report each word of phrase that is contained in text."""
    words = phrase.split(" ")

    for word in words:
        # Plain substring test: only exact occurrences count, nothing fuzzy yet.
        if word in text:
            print("Match! Found " + word + " in text")

Yeah, that was my first guess, but there is no way to make it fuzzy at the sentence level...
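One possible way to make the word check above fuzzy at the phrase level is to slide a window of words across the text and score each window against the phrase. The sketch below is only an illustration under assumptions: it uses the standard library's `difflib.SequenceMatcher` instead of any of the libraries named in the question, and the function name `fuzzy_phrase_search` and the `threshold` default of 0.8 are hypothetical choices, not something from this thread.

```python
from difflib import SequenceMatcher


def fuzzy_phrase_search(text, phrase, threshold=0.8):
    """Return (ratio, window) for the window of words most similar to
    `phrase`, or None if no window reaches `threshold`.

    Hypothetical helper: character-level similarity via difflib.
    """
    words = text.split()
    size = len(phrase.split())
    target = phrase.lower()
    best_ratio, best_window = 0.0, None
    # Slide a window with as many words as the phrase across the text.
    for start in range(max(len(words) - size + 1, 1)):
        window = " ".join(words[start:start + size])
        ratio = SequenceMatcher(None, target, window.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_window = ratio, window
    if best_ratio >= threshold:
        return best_ratio, best_window
    return None


if __name__ == "__main__":
    sample = ("The arterial high blood pressure may engage the prognosis "
              "for survival of the patient as a result of complications.")
    print(fuzzy_phrase_search(sample, "has engage the progronosis of survival"))
```

The threshold trades recall for precision and would need tuning on real data. Comparing every window is also quadratic in spirit, so this is better suited to short texts or to a single sentence at a time.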

1

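The following is not ideal, but it should get you started. It first uses nltk to split the text into words, then builds a set containing the stems of all the words, filtering out any stop words. It does this for both your example text and your example query.

If the intersection of the two sets contains all the words in the query, it is considered a match.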

from __future__ import print_function  # makes print() behave the same on Python 2 and 3

# The tokenizer and stop-word data must be downloaded once with nltk.download().
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def get_word_set(text):
    """Return the set of stemmed tokens in text, skipping stop words."""
    # The stop-word test is case-sensitive, so capitalised words such as "The" are kept.
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

query = "engage the prognosis for survival"

set_query = get_word_set(query)
for text in [text1, text2]:
    set_text = get_word_set(text)
    intersection = set_query & set_text

    print("Query:", set_query)
    print("Test:", set_text)
    print("Intersection:", intersection)
    # A match requires every stemmed query word to appear in the text.
    print("Match:", len(intersection) == len(set_query))
    print()

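The script is given two texts, one that passes and one that does not. It produces the following output (as printed under Python 2) to show you what it is doing: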

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'prognosi', u'engag', u'surviv']) 
Match: True 

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'engag', u'surviv']) 
Match: False 
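The all-or-nothing intersection test can be relaxed so that a single missing word does not rule out a match. The following is a minimal sketch under stated assumptions, not the answer's method: it skips stemming, uses a tiny hand-picked stop-word set in place of NLTK's, and the names `word_set`, `coverage`, `is_match` and the `min_coverage` default are hypothetical.

```python
import re

# Assumption: a tiny hand-picked stop-word list standing in for NLTK's.
STOP_WORDS = {"the", "for", "of", "a", "as", "may", "is", "has"}


def word_set(text):
    """Lowercased words of `text`, minus stop words (no stemming here)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def coverage(query, text):
    """Fraction of the query's words that also occur in the text."""
    query_words = word_set(query)
    if not query_words:
        return 0.0
    return len(query_words & word_set(text)) / len(query_words)


def is_match(query, text, min_coverage=0.6):
    """Accept the text when enough of the query's words are present."""
    return coverage(query, text) >= min_coverage


if __name__ == "__main__":
    query = "engage the prognosis for survival"
    print(coverage(query, "may engage the prognosis for survival of the patient"))
    print(coverage(query, "may engage the for survival of the patient"))
```

Setting `min_coverage=1.0` reproduces the strict behaviour of the intersection test, while lower values tolerate dropped words. Like the set-based approach it ignores word order, so it cannot tell a phrase from the same words scattered across the text.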

Yes, I had thought of that possibility! If I really can't find any other solution, I'll use that one! Thanks!

1

Using the regex module, first split the text into sentences, then test whether the fuzzy pattern occurs in each sentence:

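import regex  # third-party module (pip install regex), not the standard-library re

tgt="The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

# {e<N} allows fewer than N errors (substitutions, insertions and deletions combined).
pat=r'(?e)((?:has engage the progronosis of survival){e<%i})'
# Note: len(pat) is the length of the whole pattern string, not just the phrase, so N is 10 here.
pat=pat % int(len(pat)/5)

# Split into sentences: after ., ?, ! or ; followed by whitespace and an uppercase letter.
for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    m=regex.search(pat, sentence)
    if m:
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat,m.group(1), *m.fuzzy_counts))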

Prints:

'(?e)((?:has engage the progronosis of survival){e<10})' 
    fuzzy matches 
'may engage the prognosis for survival' 
    with 
3 substitutions, 1 insertions, 2 deletions 
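The same idea can be wrapped in a small helper whose error budget depends on the phrase itself. This is only a sketch under assumptions: the name `fuzzy_find`, the `max_errors` parameter and the "one error per five characters" default are hypothetical, and it relies on the third-party `regex` module's `(?b)` best-match flag and `fuzzy_counts` attribute.

```python
import regex  # third-party: pip install regex


def fuzzy_find(phrase, text, max_errors=None):
    """Search `text` for `phrase`, allowing up to `max_errors` edits.

    Returns (matched_text, (substitutions, insertions, deletions)),
    or None when nothing is close enough. Hypothetical helper.
    """
    if max_errors is None:
        # Assumption: roughly one edit per five characters of the phrase.
        max_errors = len(phrase) // 5
    # (?b) asks for the best match rather than the first acceptable one.
    pattern = r"(?b)(?:%s){e<=%d}" % (regex.escape(phrase), max_errors)
    m = regex.search(pattern, text)
    if m is None:
        return None
    return m.group(0), m.fuzzy_counts


if __name__ == "__main__":
    sentence = ("The arterial high blood pressure may engage the prognosis "
                "for survival of the patient as a result of complications.")
    print(fuzzy_find("has engage the progronosis of survival", sentence))
```

The regex module also accepts separate limits per error type, for example `{s<=2,i<=1,d<=1}`, which gives finer control than a single combined budget. How tight the limits need to be to separate near-identical phrases with opposite meanings would have to be checked on real examples.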

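So by playing with the fuzziness numbers, for example by limiting them, I could tell the difference between 'has engage the prognosis' and 'does not engage the prognosis'. That seems perfect, thanks! If that is the case, I'll try to solve my problem with it.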