Fuzzy search in Python

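I have a large sample of text, for example:

"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is."

I am trying to detect whether "engage the prognosis for survival" is in the text, but in a fuzzy way. For example, "has engage the progronosis of survival" must also return a positive answer.

I looked at fuzzywuzzy, NLTK and the new fuzzy-matching features of the regex module, but I did not find a way to do: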

if [anything similar (>90%) to "that sentence"] in mybigtext: 
    print True

2016-02-29 Mickael_Paris

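I'm new here, but I think this should solve your problem: http://stackoverflow.com/questions/30449452/python-fuzzy-text-search?rq=1

Have a look at [gensim](https://radimrehurek.com/gensim/index.html), especially the [similarity section](https://radimrehurek.com/gensim/tut3.html). – Jan

Below is a function that reports a match whenever a word from the phrase is contained in the text. You can improvise on it to check for complete phrases in the text.

Here is the function I came up with: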

def FuzzySearch(text, phrase):
    """Print a message for each word of phrase that occurs in text."""
    for word in phrase.split(" "):
        # Plain substring test: exact matches only, no tolerance for typos
        if word in text:
            print("Match! Found " + word + " in text")
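
One way to improvise toward whole phrases, as suggested above, is to slide a window of words across the text and score each window against the phrase. The sketch below is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the standard library rather than any library named in this thread, and the function name `fuzzy_contains`, the default `threshold` of 0.9 and the choice of window sizes are all hypothetical.

```python
from difflib import SequenceMatcher

def fuzzy_contains(text, phrase, threshold=0.9):
    """Return (matched, best_ratio, best_window) for phrase against word windows of text."""
    words = text.split()
    n = len(phrase.split())
    best_ratio, best_window = 0.0, ""
    # Try windows one word shorter than, equal to, and one word longer than the phrase.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            # ratio() is a similarity score in [0, 1]; 1.0 means identical strings.
            ratio = SequenceMatcher(None, phrase.lower(), window.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_window = ratio, window
    return best_ratio >= threshold, best_ratio, best_window

text = ("The arterial high blood pressure may engage the prognosis "
        "for survival of the patient as a result of complications.")
matched, ratio, window = fuzzy_contains(text, "engage the prognosis for survival")
print(matched, round(ratio, 2), window)
```

A misspelled query such as "has engage the progronosis of survival" scores lower than an exact match, so the threshold probably needs tuning on real data. The scan is also quadratic in the worst case, which may matter for a very large text.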

2016-02-29 17:52:16

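Yeah, that was my first guess, but there is no way to make it fuzzy at the sentence level...

The following is not ideal, but it should get you started. It first uses nltk to split the text into words, then builds a set containing the stems of all the words, filtering out any stop words. It does this for both your sample text and your sample query.

If the intersection of the two sets contains all the words in the query, it is considered a match.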

import nltk

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires the NLTK tokenizer and stop-word data to be downloaded beforehand
stop_words = stopwords.words('english')
ps = PorterStemmer()

def get_word_set(text):
    """Return the set of stems of all tokens in text that are not stop words."""
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

query = "engage the prognosis for survival"

set_query = get_word_set(query)
for text in [text1, text2]:
    set_text = get_word_set(text)
    intersection = set_query & set_text

    # Single-argument print calls behave the same under Python 2 and Python 3
    print("Query: {}".format(set_query))
    print("Test: {}".format(set_text))
    print("Intersection: {}".format(intersection))
    # A match means every stem of the query was found in the text
    print("Match: {}".format(len(intersection) == len(set_query)))
    print("")

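The script is given two texts, one that passes and one that does not, and it produces the following output (shown as printed by Python 2) to show you what it is doing: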

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'prognosi', u'engag', u'surviv']) 
Match: True 

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'engag', u'surviv']) 
Match: False
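
The set comparison above needs every query stem to appear exactly, so a misspelled query word such as "progronosis" would not line up with "prognosis". A hedged variation is to accept a close match for each query word and report the fraction of query words found. The sketch below is an assumption-laden illustration, not the answerer's method: it swaps NLTK for the standard library (`re` and `difflib.get_close_matches`), skips stemming, and the names `STOP_WORDS`, `content_words` and `fraction_matched`, the tiny stop-word list and the `cutoff` of 0.8 are all hypothetical.

```python
import re
from difflib import get_close_matches

# Tiny illustrative stop-word list (an assumption; the answer above uses NLTK's list).
STOP_WORDS = {"the", "for", "of", "a", "an", "as", "is", "has", "may"}

def content_words(text):
    """Lowercase the text and return its set of alphabetic tokens minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def fraction_matched(query, text, cutoff=0.8):
    """Return the fraction of query words that have a close match in the text."""
    text_words = list(content_words(text))
    query_words = content_words(query)
    if not query_words:
        return 0.0
    # get_close_matches returns a non-empty list when some text word is similar enough.
    hits = sum(1 for w in query_words
               if get_close_matches(w, text_words, n=1, cutoff=cutoff))
    return hits / float(len(query_words))

with_prognosis = "The arterial high blood pressure may engage the prognosis for survival of the patient"
without_prognosis = "The arterial high blood pressure may engage the for survival of the patient"
query = "has engage the progronosis of survival"

print(fraction_matched(query, with_prognosis))
print(fraction_matched(query, without_prognosis))
```

A caller could then treat any fraction above some chosen level, for example 0.9, as a match. Word order is ignored here, just as in the set intersection above.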

2016-02-29 20:32:08

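Yes, I thought about that possibility! If I really can't find any other solution, I'll use that one! Thanks!

Using the regex module, first split the text into sentences, then test whether the fuzzy pattern is found in each sentence: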

import regex

tgt="The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

# Split after sentence-ending punctuation that is followed by an uppercase letter
for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    # {e<N} allows fewer than N errors of any kind; N is derived from the pattern's length
    pat=r'(?e)((?:has engage the progronosis of survival){e<%i})'
    pat=pat % int(len(pat)/5)
    m=regex.search(pat, sentence)
    if m:
        # fuzzy_counts holds the (substitutions, insertions, deletions) of the match
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat,m.group(1), *m.fuzzy_counts))

Prints:

'(?e)((?:has engage the progronosis of survival){e<10})' 
    fuzzy matches 
'may engage the prognosis for survival' 
    with 
3 substitutions, 1 insertions, 2 deletions

2016-02-29 21:41:19 dawg

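So, by playing with the fuzzy error counts, for instance by restricting them, I could tell the difference between 'has engage the prognosis' and 'not engage the prognosis'. This seems perfect, thanks! If that is the case, I will try it to solve my problem.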
