Python Unicode regular expressions

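I am using Python 2.4 and am having some problems with Unicode regular expressions. I have tried to put together a clear and concise example of my problem. It looks as though either Python is not recognizing the different character encodings the way I expect, or my understanding is wrong. Thank you very much for taking a look!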

#!/usr/bin/python 
# 
# This is a simple python program designed to show my problems with regular expressions and character encoding in python 
# Written by Brian J. Stinar 
# Thanks for the help! 

import urllib # To get files off the Internet 
import chardet # To identify character encodings 
import re # Python Regular Expressions 
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which REs I'm using 

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read() 
print (chardet.detect(rawdata)) 
#print (rawdata) 

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2') # Let's grab this as text 
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8') # and encode the text as UTF-8 
print(chardet.detect(UTF_8_encoded)) # Looks good 

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML 
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE) 
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8") 
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data") 

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE) 
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data") # However, this works?!? 
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8") 

''' 
# In addition, I tried this regular expression library, with much the same unsatisfactory result 
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*") 
if new_re.match(UTF_8_encoded) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8") 
else: 
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8") 

if new_re.match(rawdata) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data") 
else: 
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data") 

new_re = ponyguruma.Regexp(".*Adobe.*") 
if new_re.match(UTF_8_encoded) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8") 
else: 
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8") 

new_re = ponyguruma.Regexp(".*Adobe.*") 
if new_re.match(rawdata) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data") 
else: 
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data") 
'''

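I am working on a substitution project and am having difficulty with files in non-ASCII encodings. This problem is part of a larger project: eventually I would like to replace the matched text with other text (I have this working for ASCII, but I cannot yet identify the occurrences in other encodings). Thanks again.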

http://brian-stinar.blogspot.com

-Brian J. Stinar-


2009-07-22 Brian Stinar

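What is completely missing from your description is the way in which your code fails. You write *"# This totally doesn't work"* in your code, but you give no hint as to how it doesn't work. Are the printed strings empty? Do you get an error message or a stack trace? – ThomasH 2009-07-23 12:15:31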

You probably want to either enable the DOTALL flag or use the search method instead of the match method. That is:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

or:

# search will find matches even if they aren't at the start of the string 
... re_UNSUB_amsterdam.search(foo) ...

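These will give you different results, but both should give you a match. (Look at what each one returns to see which kind of match you actually want.)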
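The difference between the two suggestions can be seen on a small string. This is only a sketch: the sample text is made up to stand in for the downloaded page, and it assumes a current Python 3 interpreter rather than the Python 2.4 from the question (the `u''` prefixes are kept so the literals read the same in both).

```python
import re

# Made-up stand-in for the page: the word we want is NOT on the first line.
text = u"first line mentions Adobe\nsecond line says UNSUBSCRIBE\n"

plain = re.compile(u".*UNSUBSCRIBE.*")
dotall = re.compile(u".*UNSUBSCRIBE.*", re.DOTALL)

# match() anchors at position 0 and '.' stops at the newline, so this fails.
no_match = plain.match(text)

# With DOTALL, '.' also matches newlines, so the match swallows the whole text.
whole = dotall.match(text)

# search() may start anywhere, so it finds the line containing the word.
found = plain.search(text)

print(no_match)
print(repr(whole.group(0)))
print(repr(found.group(0)))
```

Both fixes produce a match object, but the matched text differs: everything with DOTALL, a single line with search().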

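As an aside: you seem to be confusing encoded text (which is bytes) with decoded text (which is characters). That is not unusual, especially in Python before 3.x. In particular, this is very suspicious: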

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You are de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding, and a decoded string no longer has an encoding.)

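The rest of your code tries to match against rawdata and UTF_8_encoded (both encoded strings), when it should probably be using the decoded Unicode string instead.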
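The decode-first workflow might look roughly like this. The byte string below is a hypothetical stand-in for the downloaded page, the replacement text is invented, and the sketch assumes a current Python 3 interpreter (under Python 2 the decoded value would be a `unicode` object instead of `str`).

```python
import re

# Hypothetical ISO-8859-2 bytes standing in for the downloaded page.
raw_bytes = b"Pozn\xe1mka: \xe8esk\xfd text\nUNSUBSCRIBE here\n"

# Decode once, at the boundary, and keep working with text from here on.
decoded = raw_bytes.decode("iso-8859-2")

# Search and substitute on the decoded text, never on the raw bytes.
match = re.search(u"UNSUBSCRIBE", decoded)
replaced = decoded.replace(u"UNSUBSCRIBE", u"opt out")

# Encode only when the result has to leave the program again.
output_bytes = replaced.encode("utf-8")

print(match.span())
print(replaced)
```

Naming the variables after what they hold (`raw_bytes`, `decoded`, `output_bytes`) makes it harder to mix the two kinds of string up.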


2009-07-23 00:14:40

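Thanks a lot. After adding the re.DOTALL flag, it behaves exactly as I expected. It looked to me as though .* behaved differently on ASCII, where it seemed to match newlines, than on decoded non-ASCII text, where it did not, but I was probably just unclear on this. Thank you for clarifying the difference between encoded and decoded text. This is my first project dealing with different encodings, and I appreciate the clarification. – 2009-07-24 14:37:48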

This might help: http://www.daa.com.au/pipermail/pygtk/2009-July/017299.html


2009-07-23 00:02:25 b3rx

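With the default flag settings, .* does not match newlines. UNSUBSCRIBE appears only once, after the first newline. Adobe occurs before the first newline. You can fix this by using re.DOTALL.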

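However, you did not inspect what you got from the Adobe match: it is 1478 bytes wide! Turn on re.DOTALL and it (and the corresponding UNSUBSCRIBE pattern) will match the whole text!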

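You definitely need to lose the trailing .* because you are not interested in what it matches and it will slow the match down. You should also lose the leading .* and use search() instead of match().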
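A small sketch of why the surrounding .* matters, using an invented two-line string in place of the real page and assuming a current Python 3 interpreter:

```python
import re

# Invented sample: "Adobe" sits on a long first line, as in the real page.
text = (u"Copyright notice from Adobe Systems Incorporated, all rights reserved\n"
        u"UNSUBSCRIBE\n")

# match() with .* on both sides succeeds, but spans the entire first line.
greedy = re.match(u".*Adobe.*", text)

# search() with the bare pattern reports exactly where the word is.
bare = re.search(u"Adobe", text)

# The bare pattern also finds UNSUBSCRIBE without needing re.DOTALL.
unsub = re.search(u"UNSUBSCRIBE", text)

print(greedy.span())
print(bare.span())
print(unsub.span())
```

Checking `span()` (or `group(0)`) on the match object is a quick way to notice that a pattern matched far more than intended.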

The re.UNICODE flag is of no use to you in this case. Read the manual and see what it actually does.
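For reference, re.UNICODE only changes what classes such as \w, \d, \s and \b consider a letter, digit, space or word boundary; it has no bearing on a literal pattern or on how . treats newlines. A brief sketch with made-up sample strings (on Python 3 the flag is already the default for text patterns, so passing it is redundant there):

```python
import re

word = u"Dvo\u0159\u00e1k"  # "Dvořák", written with escapes to stay source-encoding neutral

# With re.UNICODE, \w accepts the non-ASCII letters as word characters.
with_flag = re.compile(u"\\w+", re.UNICODE).match(word).group(0)

text = u"line one\nplease UNSUBSCRIBE\n"

# A literal pattern matches the same span with or without the flag.
span_without = re.compile(u"UNSUBSCRIBE").search(text).span()
span_with = re.compile(u"UNSUBSCRIBE", re.UNICODE).search(text).span()

# The flag does not make '.' cross newlines, so this still fails.
still_none = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE).match(text)

print(with_flag == word, span_without == span_with, still_none)
```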

Why are you transcoding your data to UTF-8 and then searching that? Just stay in Unicode.

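Others have pointed out that, in general, you need to decode your data before doing any serious work on it, but nobody mentioned the numeric and named character references (the things that render as Ӓ, «, and so on) that your data is peppered with :-)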
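One way to deal with such character references, once the text has been decoded, is sketched below. It assumes Python 3.4 or later, where the standard library provides `html.unescape`; that function is not available in the Python 2.4 used in the question, and the sample snippet is invented.

```python
import html

# Invented snippet mixing a named reference, a numeric reference and &amp;.
snippet = u"&laquo;UNSUBSCRIBE&raquo; &#1234; &amp; more"

# Turn the references into the characters they stand for.
clean = html.unescape(snippet)

print(clean)
```

After this step, searches and replacements operate on the characters a reader of the page would actually see.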


2009-07-23 02:11:17


Your question is about regular expressions, but your problem can be solved without them. Use the standard string replace method instead.

import urllib 
raw = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read() 
decoded = raw.decode('iso-8859-2') 
type(decoded) # decoded is now <type 'unicode'> 
substituted = decoded.replace(u'UNSUBSCRIBE', u'whatever you prefer')

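If nothing else, the above shows how to deal with the encoding: simply decode into a Unicode string and work with that. Note, however, that this approach only works well when you have one or very few substitutions (and those substitutions are not pattern-based), because replace() handles only one substitution at a time.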

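For both plain-string and pattern-based substitutions, you can do something like this to perform multiple substitutions in a single pass: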

import re 
REPLACEMENTS = ((u'[aA]dobe', u'!twiddle!'), 
                (u'UNS.*IBE', u'@[email protected]'), 
                (u'Dublin', u'Sydney')) 

def replacer(m): 
    return REPLACEMENTS[list(m.groups()).index(m.group(0))][1] 

r = re.compile('|'.join('(%s)' % t[0] for t in REPLACEMENTS)) 
substituted = r.sub(replacer, decoded)
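A quick way to try the multi-pattern idea without fetching the page is sketched here. It is a variation, not the code above: it looks up the replacement through `m.lastindex` instead of searching `m.groups()`, and the sample sentence and the `<removed>` replacement are invented for illustration. It assumes that none of the patterns contains capturing groups of its own.

```python
import re

# (pattern, replacement) pairs; the replacements are illustrative only.
REPLACEMENTS = (
    (u'[aA]dobe', u'!twiddle!'),
    (u'UNS.*?IBE', u'<removed>'),
    (u'Dublin', u'Sydney'),
)

# One alternation with one capturing group per pattern.
pattern = re.compile(u'|'.join(u'(%s)' % p for p, _ in REPLACEMENTS))

def replacer(m):
    # m.lastindex is the 1-based number of the group that matched,
    # which maps directly onto the position in REPLACEMENTS.
    return REPLACEMENTS[m.lastindex - 1][1]

sample = u'Adobe Systems, Dublin. Click UNSUBSCRIBE to stop.'
result = pattern.sub(replacer, sample)

print(result)
```

Because all patterns are tried in a single pass, text produced by one replacement is never rescanned by another, which is not the case when replace() calls are chained.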


2009-07-23 04:01:02 mhawke
