Can a Python regex negate a list of words?

I have to match all the alphanumeric words in a text.

>>> import re 
>>> text = "hello world!! how are you?" 
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text) 
>>> final_list 
['hello', 'world', 'how', 'are', 'you'] 
>>>

This works fine, but I also have a few words to negate, i.e. words that should not end up in my final list.

>>> negate_words = ['world', 'other', 'words']

A bad way to do it:

>>> negate_str = '|'.join(negate_words) 
>>> list(filter(lambda x: not re.match(negate_str, x), final_list))
['hello', 'how', 'are', 'you']

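But I could save a loop if my very first regex pattern could be changed to take the negation of those words into account. I found how to negate characters, but I have whole words to negate. I also found regex approaches in other questions, but those did not help either.

Can this be done using Python's re?

Update

My text can span a few hundred lines. The negate_words list can be lengthy too.

Considering this, is using a regex for such a task the right choice in the first place? Any suggestions?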

Source

2011-11-30 simplyharsh

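Are there a lot of 'negate_words'? –

@bitsMiz Yes, there can be a lot of negate words. The text can also span quite a few lines. – simplyharsh

I don't think there is a clean way to do this with a regular expression. The closest I could find is a bit ugly and not exactly what you wanted: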

>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text) 
['hello', '', 'how', 'are', 'you']
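For comparison, a negative lookahead can keep the excluded words out of the result entirely, without the empty strings. This is only a sketch, assuming that whole-word exclusion is what is wanted and that each word should be treated literally (hence re.escape). The names alternatives and pattern are made up for illustration. With a long negate_words list the alternation grows large, and whether this is any faster than filtering afterwards has not been measured here.

```python
import re

text = "hello world!! how are you?"
negate_words = ['world', 'other', 'words']

# Escape each word so that regex metacharacters in it are matched literally.
alternatives = "|".join(re.escape(w) for w in negate_words)

# (?<![a-zA-Z0-9])            we are at the start of an alphanumeric run
# (?!(?:...)(?![a-zA-Z0-9]))  the run is not exactly one of the negated words
# [a-zA-Z0-9]+                the word itself
pattern = re.compile(
    r"(?<![a-zA-Z0-9])(?!(?:%s)(?![a-zA-Z0-9]))[a-zA-Z0-9]+" % alternatives
)

final_list = pattern.findall(text)
print(final_list)  # ['hello', 'how', 'are', 'you']
```

Because the inner lookahead requires the negated word to end where the alphanumeric run ends, a longer word such as "worldly" is still kept.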

Why not use Python's sets instead? They are very fast:

>>> list(set(final_list) - set(negate_words)) 
['hello', 'how', 'are', 'you']

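If order matters, see the reply from @glglgl below. His list comprehension version is very readable. Here is a fast but less readable equivalent using itertools: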

>>> import itertools
>>> negate_words_set = set(negate_words)
>>> list(itertools.filterfalse(negate_words_set.__contains__, final_list))
['hello', 'how', 'are', 'you']

Another alternative is to build up the word list in a single pass using re.finditer:

>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)
...
>>> result
['hello', 'how', 'are', 'you']
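When order and repeated words matter, the set can be used purely for membership tests while iterating over the original list. The following is a minimal sketch of that list-comprehension approach. The sample text with a repeated word is an assumption added here to make the difference from a plain set subtraction visible.

```python
import re

# Sample text (hypothetical) containing a repeated word.
text = "hello world!! how are you? hello again"
negate_words = ['world', 'other', 'words']

final_list = re.findall(r"[a-zA-Z0-9]+", text)

# Build the set once; each "in" test is then a constant-time lookup on average.
negate_words_set = set(negate_words)

# Iterating over the list keeps the original order and any duplicates.
kept = [w for w in final_list if w not in negate_words_set]
print(kept)  # ['hello', 'how', 'are', 'you', 'hello', 'again']
```

A plain set(final_list) - set(negate_words) would return each remaining word once and in no guaranteed order.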

Source

2011-11-30 09:09:09

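It is worth mentioning that the word order will be lost. – DrTyrsa

[i for i in final_list if i not in negate_words_set] – glglgl

@raymond, ah! Are you sure? Either way, I can definitely replace my filter function with the set you mentioned. – simplyharsh

Maybe it is worth trying pyparsing: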

>>> from pyparsing import * 

>>> negate_words = ['world', 'other', 'words'] 
>>> parser = OneOrMore(Suppress(oneOf(negate_words))^Word(alphanums)).ignore(CharsNotIn(alphanums)) 
>>> parser.parseString('hello world!! how are you?').asList() 
['hello', 'how', 'are', 'you']

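Note that oneOf(negate_words) must come before Word(alphanums) to make sure that it is matched first.

Edit: Just for fun, I repeated the exercise using lepl (also an interesting parsing library):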

>>> from lepl import * 

>>> negate_words = ['world', 'other', 'words'] 
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any()) 
>>> parser.parse('hello world!! how are you?') 
['hello', 'how', 'are', 'you']

Source

2011-11-30 09:46:44 jcollado

Don't needlessly ask too much of a regex.
Instead, think of generators.

import re 

unwanted = ('world', 'other', 'words') 

text = "hello world!! how are you?" 

gen = (m.group() for m in re.finditer("[a-zA-Z0-9]+",text)) 
li = [ w for w in gen if w not in unwanted ]

A generator could also be created instead of the list li.
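A sketch of that fully lazy variant follows, assuming the words are consumed one at a time. The function name iter_wanted_words and the conversion to a frozenset are additions for illustration and are not part of the snippet above.

```python
import re

WORD_RE = re.compile(r"[a-zA-Z0-9]+")

def iter_wanted_words(text, unwanted):
    """Lazily yield alphanumeric words from text that are not in unwanted."""
    # A frozenset keeps membership tests fast even for a long unwanted list.
    unwanted = frozenset(unwanted)
    for m in WORD_RE.finditer(text):
        word = m.group()
        if word not in unwanted:
            yield word

words = iter_wanted_words("hello world!! how are you?",
                          ('world', 'other', 'words'))
print(next(words))  # hello
print(list(words))  # ['how', 'are', 'you']
```

Nothing is scanned until a word is requested, so a text of a few hundred lines is processed only as far as the consumer reads.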

Source

2011-11-30 14:05:34 eyquem
