Python regex: string to list of words (including hyphenated words)

I want to parse a string to get a list of all its words, keeping hyphenated words intact. My current code:

import re
s = '-this is. A - sentence;one-word'
re.compile(r"\W+", re.UNICODE).split(s)

returns:

['', 'this', 'is', 'A', 'sentence', 'one', 'word']

but I want it to return:

['', 'this', 'is', 'A', 'sentence', 'one-word']
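One way to get exactly this list, leading empty string included, is to treat a hyphen as a separator only when it is not sitting between two word characters. The following is a minimal sketch, not something proposed in the thread; the pattern name `separator` is made up for illustration.

```python
import re

s = '-this is. A - sentence;one-word'

# A separator is a run of characters, each of which is either
#   - neither a word character nor a hyphen, or
#   - a hyphen that is not followed by a word character, or
#   - a hyphen that is not preceded by a word character.
# A hyphen with word characters on both sides is therefore kept.
separator = re.compile(r"(?:[^\w-]|-(?!\w)|(?<!\w)-)+")

words = separator.split(s)
print(words)
```

Because the string starts with a separator, `split` yields an empty first element, which matches the output requested above.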

Source

2010-08-04 Antonio

Why do you want the ''? – 2010-08-04 19:39:19

You can use "[^\w-]+" instead.

Source

2010-08-04 14:58:19 Jens

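This would return '-this', but I don't know of a better solution. I think there is no way around going over the result a second time to remove the unwanted hyphens. – 2010-08-04 15:01:00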
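The two-pass idea from the comment above can be sketched roughly like this, assuming that stray hyphens should simply be stripped from the ends of each piece and that empty pieces should be dropped. The variable names `raw` and `words` are illustrative only.

```python
import re

s = '-this is. A - sentence;one-word'

# First pass: split on runs of characters that are neither
# word characters nor hyphens.
raw = re.split(r"[^\w-]+", s)
# raw still contains '-this' and a lone '-'.

# Second pass: strip hyphens from both ends of every piece,
# then drop the pieces that became empty.
words = [piece.strip('-') for piece in raw]
words = [piece for piece in words if piece]
print(words)
```

Note that this variant does not keep the leading empty string that the question asks for.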

You can try it with the NLTK library:

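>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> # Non-capturing groups: current NLTK asks that the pattern
>>> # contain no capturing parentheses.
>>> hyphen = r'(?:\w+-\s?\w+)'
>>> wordr = r'(?:\w+)'
>>> # Try the hyphenated alternative first, then plain words.
>>> r = "|".join([hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s, r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']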

I found this here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html. Hope it helps.

Source

2010-08-04 15:02:35 fasouto

If you don't need the leading empty string, you can match with the pattern \w(?:[-\w]*\w)? instead:

>>> import re 
>>> s = '-this is. A - sentence;one-word' 
>>> rx = re.compile(r'\w(?:[-\w]*\w)?') 
>>> rx.findall(s) 
['this', 'is', 'A', 'sentence', 'one-word']

Note that it will not match words that contain an apostrophe.
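To make the apostrophe limitation concrete, here is a small sketch. The second pattern, which adds `'` to the inner character class, is an assumption made for illustration and is not part of the answer above; the sample string is borrowed from elsewhere in this thread.

```python
import re

s = "-this is. A - sentence;one-word what's"

# Pattern from the answer: a word character, optionally followed by
# word characters or hyphens, ending in a word character.
rx = re.compile(r"\w(?:[-\w]*\w)?")
split_on_apostrophe = rx.findall(s)
# "what's" comes back as two tokens, 'what' and 's'.

# Hypothetical variant: also allow apostrophes inside a word.
rx_apos = re.compile(r"\w(?:[-'\w]*\w)?")
kept_together = rx_apos.findall(s)

print(split_on_apostrophe)
print(kept_together)
```

Both patterns still require a token to start and end with a word character, so leading or trailing hyphens and apostrophes are dropped.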

Source

2010-08-04 15:15:00 kennytm

Thanks, it works – Sibish 2015-04-06 09:58:11

import re
s = "-this is. A - sentence;one-word what's"
re.findall(r"\w+-\w+|[\w']+", s)

Result: ['this', 'is', 'A', 'sentence', 'one-word', "what's"]

Note that the order of the alternatives matters: the hyphenated-word pattern has to come first!
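The effect of the ordering can be checked by swapping the two alternatives. This is a minimal sketch using the same sample string; the names `hyphen_first` and `hyphen_last` are illustrative.

```python
import re

s = "-this is. A - sentence;one-word what's"

# Hyphenated alternative first: 'one-word' is matched as a whole.
hyphen_first = re.findall(r"\w+-\w+|[\w']+", s)

# Plain-word alternative first: it matches 'one' before the
# hyphenated alternative is ever tried at that position, so the
# word is split at the hyphen.
hyphen_last = re.findall(r"[\w']+|\w+-\w+", s)

print(hyphen_first)
print(hyphen_last)
```

Python's `re` tries alternatives from left to right and takes the first one that succeeds, which is why the more specific pattern goes first.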

Source

2010-08-04 16:50:39 pyInTheSky

Here is my traditional "why use the regex language when you can use Python" alternative:

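import string
s = "-this is. A - sentence;one-word what's"
# Force a break after ';', split on whitespace, strip punctuation from
# both ends of each piece, then drop the pieces that ended up empty.
# list() is needed on Python 3, where filter() returns an iterator.
s = list(filter(None, [word.strip(string.punctuation)
       for word in s.replace(';', '; ').split()
       ]))
print(s)
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""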

Source

2010-08-04 19:33:49
