Handling Unicode characters with Python regular expressions

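I am writing a simple application in which I want to replace certain words with other words. I am running into problems with words that use a single quote, for example aren't, ain't, isn't.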

I have a text file with the following:

aren’t=ain’t 
hello=hey

I parse the text file and build a dictionary from it:

u'aren\u2019t' = u'ain\u2019t' 
u'hello' = u'hey'

Then I try to replace all matching words in a given text:

text = u"aren't" 

def replace_all(text, dict): 
    for i, k in dict.iteritems(): 
     #replace all whole words of I with K in lower cased text, regex = \bSTRING\b 
     text = re.sub(r"\b" + i + r"\b", k , text.lower()) 
    return text

The problem is that re.sub() does not match u'aren\u2019t' against u"aren't".

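What can I do so that my replace_all() function matches both "hello" and "aren’t" and replaces them with the appropriate text? Is there something I can do in Python so that my dictionary does not contain Unicode? Can I convert the text to use the Unicode characters, or can I modify the regex so that it matches the Unicode characters as well as all the other text?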

2011-02-23 Pim

What output do you want to get? – Asterisk 2011-02-23 22:52:02

The expected result is that the text "aren’t" is replaced with "ain’t". – Pim 2011-02-24 15:21:53

I guess your problem is that you have:

text = u"aren't"

instead of (note the different apostrophe):

text = u"aren’t"

Here is your code, modified so that it works:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import re 

d = { 
    u'aren’t': u'ain’t', 
    u'hello': u'hey' 
    } 
#text = u"aren't" 
text = u"aren’t" 


def replace_all(text, d): 
    for i, k in d.iteritems(): 
     #replace all whole words of I with K in lower cased text, regex = \bSTRING\b 
     text = re.sub(r"\b" + i + r"\b", k , text.lower()) 
    return text 

if __name__ == '__main__': 
    newtext = replace_all(text, d) 
    print newtext

Output:

ain’t

2011-02-23 22:59:53 Mikel

Is it possible to handle text that arrives with different types of apostrophes? – Pim 2011-02-24 15:24:24
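
One way to cope with input that mixes apostrophe styles is to fold them to a single form in both the dictionary keys and the text before matching. The following is only a minimal sketch, assuming Python 3 (where every str is Unicode). The names APOSTROPHES and normalize_apostrophes are illustrative, and the set of characters folded together is an assumption to adjust to your data.

```python
import re

# Assumption: these are the apostrophe-like characters worth folding together.
APOSTROPHES = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
}
_TABLE = str.maketrans(APOSTROPHES)


def normalize_apostrophes(s):
    """Return s with typographic apostrophes replaced by the ASCII one."""
    return s.translate(_TABLE)


def replace_all(text, replacements):
    """Replace whole words in the lower-cased text, ignoring apostrophe style."""
    text = normalize_apostrophes(text.lower())
    for old, new in replacements.items():
        old = normalize_apostrophes(old.lower())
        new = normalize_apostrophes(new)
        # re.escape keeps any regex metacharacters in the key literal.
        pattern = r"\b" + re.escape(old) + r"\b"
        # A function replacement avoids backslash processing in `new`.
        text = re.sub(pattern, lambda _m, r=new: r, text)
    return text


replacements = {"aren\u2019t": "ain\u2019t", "hello": "hey"}
print(replace_all("Hello, they aren't here", replacements))
print(replace_all("Hello, they aren\u2019t here", replacements))
```

Both calls produce the same result, because the straight and the curly apostrophe are folded together first. As a side effect the output always uses the ASCII apostrophe, which may or may not be what you want.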

This works fine for me in Python 2.6.4:

>>> re.sub(ur'\baren\u2019t\b', 'rep', u'aren\u2019t') 
u'rep'

Make sure your pattern string is a Unicode string, otherwise it may not work as expected.
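
If the text should be left untouched, the pattern itself can be made to accept either apostrophe. This is a rough sketch assuming Python 3, where str patterns are always Unicode and the ur'' prefix no longer exists. The helper word_pattern is a hypothetical name, and treating only ' and \u2019 as interchangeable is an assumption.

```python
import re

# Character class matching either the ASCII or the curly apostrophe.
APOSTROPHE_CLASS = "['\u2019]"


def word_pattern(word):
    """Compile a whole-word, case-insensitive pattern for `word` that
    accepts either apostrophe style wherever the word has an apostrophe."""
    parts = re.split(APOSTROPHE_CLASS, word)
    body = APOSTROPHE_CLASS.join(re.escape(part) for part in parts)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


pattern = word_pattern("aren\u2019t")
print(pattern.sub("ain't", "They aren't and they AREN\u2019T"))
```

Compared with normalizing the text, this keeps every other character of the input exactly as it was and only changes the words that are replaced.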

2011-02-23 22:52:44

Try saving the file with UTF-8 encoding.

2011-02-23 22:53:18 eos87

u"aren\u2019t" == u"aren't"

False

u"aren\u2019t" == u"aren’t"

True
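
To see why the two strings compare unequal, it can help to print the code point and Unicode name of each apostrophe. A small sketch, assuming Python 3 and the standard unicodedata module; describe is just an illustrative helper name.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


for ch in ("'", "\u2019"):
    print(describe(ch))
# U+0027 APOSTROPHE
# U+2019 RIGHT SINGLE QUOTATION MARK
```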

2011-02-23 23:37:58 intrepion
