Matching Unicode characters in a Python regular expression

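I have read through the other questions on Stack Overflow, but I am still no closer to a solution. Sorry if this has already been answered; none of the suggestions I found worked for me.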

>>> import re 
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg') 
>>> print m.groupdict() 
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try it with something containing Norwegian characters (or anything else more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg') 
>>> print m.groupdict() 
Traceback (most recent call last): 
File "<interactive input>", line 1, in <module> 
AttributeError: 'NoneType' object has no attribute 'groupdict'

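How can I match typical Unicode characters such as øæå? I would like to be able to match these characters in both the tag group and the filename group above.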

Source

2011-02-17 Weholt

Make sure you [normalize](https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize) your strings, because different sequences of code points can produce the same visual appearance. – janbrohl 2016-08-26 17:25:40
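To illustrate the normalization point, here is a minimal sketch (written for Python 3; the variable names are only illustrative). It assumes the input may arrive in decomposed form, where "å" is stored as "a" plus a combining ring. `\w` does not match the combining mark, so the match fails until the string is normalized to NFC.

```python
import re
import unicodedata

# Same visible text, two different code point sequences.
composed = u"p\u00e5ske"      # 'å' as one code point (U+00E5)
decomposed = u"pa\u030aske"   # 'a' followed by COMBINING RING ABOVE (U+030A)

pattern = re.compile(r"^\w+$", re.UNICODE)

print(pattern.match(composed) is not None)    # True
print(pattern.match(decomposed) is not None)  # False: the combining mark is not a word character

# NFC composes 'a' + U+030A back into the single code point U+00E5.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)                 # True
print(pattern.match(normalized) is not None)  # True
```

Normalizing both the stored names and the incoming path to the same form (NFC here, chosen as an assumption) before matching keeps the two spellings from being treated differently.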

You need to specify the re.UNICODE flag, and enter your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict() 
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

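This is in Python 2. In Python 3 you can leave off the u prefix, because all strings are Unicode (the prefix was a syntax error in Python 3.0 to 3.2 and has been accepted again since 3.3), and re.UNICODE is already the default for str patterns.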
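For comparison, a minimal sketch of the same match under Python 3, assuming a Python 3 interpreter and a UTF-8 source file. The constant name `PATTERN` is introduced here only for readability and is not part of the original answer.

```python
import re

# Same pattern as in the question; no u prefix and no re.UNICODE flag needed.
PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'

m = re.match(PATTERN, '/by_tag/påske/øyfjell.jpg')
print(m.groupdict())  # {'tag': 'påske', 'filename': 'øyfjell.jpg'}
```

Only the named groups appear in `groupdict()`; the unnamed inner group used for the alternation is left out.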

Source

2011-02-17 12:18:18 Thomas

+1 for "and enter your string as a Unicode string by using the u prefix" – Tamm 2013-12-18 15:56:21

You need the UNICODE flag:

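# Python 2: the subject must also be a unicode string (u prefix), otherwise re.UNICODE has no useful effect
m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE)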

Source

2011-02-17 12:12:47

Is it needed in Python 3 as well? – Kevin 2016-10-04 07:36:05
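One way to check this yourself, sketched for Python 3 (the variable names are illustrative): a str pattern compiled without any flags already reports re.UNICODE, and you have to pass re.ASCII to get the old ASCII-only `\w`.

```python
import re

# A str pattern compiled with no flags already has re.UNICODE set in Python 3.
str_pattern = re.compile(r"\w+")
print(bool(str_pattern.flags & re.UNICODE))      # True
print(str_pattern.findall("påske øyfjell"))      # ['påske', 'øyfjell']

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Norwegian letters split the words.
ascii_pattern = re.compile(r"\w+", re.ASCII)
print(ascii_pattern.findall("påske øyfjell"))    # ['p', 'ske', 'yfjell']
```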

In Python 2, you need the re.UNICODE flag and the unicode string constructor:

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE) 
u',./___-=+' 
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE) 
u',./___-=+' 
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE) 
u',./___-=+' 
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE) 
u',./___-=+' 
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE) 
u',./___\uff0c___-=+' 
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE) 
,./___，___-=+

(In the last case, the comma is a Chinese full-width comma, U+FF0C, which \w does not match.)
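A rough Python 3 counterpart of the session above, as a sketch only: it assumes the inputs are already str objects, so neither the `unicode()` constructor nor the re.UNICODE flag is needed. The helper name `mask_words` is hypothetical and is not part of the original answer.

```python
import re

def mask_words(text):
    """Replace every run of word characters with three underscores."""
    return re.sub(r"\w+", "___", text)

samples = [
    ",./hello-=+",
    ",./cześć-=+",
    ",./привет-=+",
    ",./你好-=+",
    ",./你好，世界-=+",   # contains a full-width comma (U+FF0C)
]

for sample in samples:
    print(mask_words(sample))
```

The first four samples should come out as `,./___-=+`, and the last one keeps its full-width comma between two masked words, matching the Python 2 output shown above.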

Source

2012-10-25 05:46:29 18446744073709551615
