Removing non-ASCII characters from any given string type in Python

>>> teststring = 'aõ' 
>>> type(teststring) 
<type 'str'> 
>>> teststring 
'a\xf5' 
>>> print teststring 
aõ 
>>> teststring.decode("ascii", "ignore") 
u'a' 
>>> teststring.decode("ascii", "ignore").encode("ascii") 
'a'

This is what I really want it to be stored as internally, since I am removing non-ASCII characters. Why did decode("ascii") give out a unicode string?

>>> teststringUni = u'aõ' 
>>> type(teststringUni) 
<type 'unicode'> 
>>> print teststringUni 
aõ 
>>> teststringUni.decode("ascii" , "ignore") 

Traceback (most recent call last): 
    File "<pyshell#79>", line 1, in <module> 
    teststringUni.decode("ascii" , "ignore") 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128) 
>>> teststringUni.decode("utf-8" , "ignore") 

Traceback (most recent call last): 
    File "<pyshell#81>", line 1, in <module> 
    teststringUni.decode("utf-8" , "ignore") 
    File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128) 
>>> teststringUni.encode("ascii" , "ignore") 
'a'

This is again what I wanted. I don't understand this behavior. Can someone explain to me what is happening here?

Edit: I thought understanding this would help me solve my real program problem, which I describe here: Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

2010-09-08 fullmooninu

It's easy: .encode converts Unicode objects into strings, and .decode converts strings into Unicode.
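A minimal sketch of that direction rule, written for Python 3 as an assumption (the sessions above are Python 2, where `str` is the byte string and `unicode` is the text type; in Python 3 those roles are played by `bytes` and `str`). The variable names are illustrative only.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
raw = b"a\xf5"  # the byte string from the question: 'a' plus byte 0xF5

# decode: bytes -> text. errors="ignore" drops bytes outside the ASCII range.
text = raw.decode("ascii", "ignore")

# encode: text -> bytes.
back = text.encode("ascii")

print(type(text).__name__, repr(text))
print(type(back).__name__, repr(back))
```

So getting a text (Unicode) object back from decode is the expected result; a second encode step is needed to return to a byte string.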

2010-09-08 13:25:39

Looking at it from this angle actually solved it =), thanks – fullmooninu 2010-09-08 15:57:56

If that doesn't work, also try using BeautifulSoup(html).encode for HTML, or the regex module – 2014-09-03 14:57:27

Why did decode("ascii") give out a unicode string?

Because that's what decode is for: it decodes byte strings, like your ASCII one, into Unicode.

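In your second example, you're trying to "decode" a string that is already unicode, which has no effect. To print it to the terminal, though, Python must encode it in the default encoding, which is ASCII - but because you didn't perform that step explicitly, and therefore didn't specify the "ignore" parameter, it raises an error saying it can't encode the non-ASCII characters.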

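The trick to all of this is remembering that decode takes an encoded byte string and converts it to Unicode, and encode does the reverse. It may be easier once you understand that Unicode is not an encoding.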
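Applied to the original goal of stripping non-ASCII characters from either string type, that rule suggests picking the call order by input type. The helper below is a hypothetical sketch in Python 3 (`strip_non_ascii` is not from the question), with `bytes` standing in for Python 2's `str` and `str` for `unicode`.

```python
def strip_non_ascii(value):
    """Return value with non-ASCII bytes or characters removed, keeping its type."""
    if isinstance(value, bytes):
        # Byte string: decode to text (dropping bytes >= 0x80), then re-encode.
        return value.decode("ascii", "ignore").encode("ascii")
    # Text string: encode to ASCII bytes (dropping other characters), then decode back.
    return value.encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii(b"a\xf5"))
print(strip_non_ascii("a\u00f5"))
```

The "ignore" handler is always passed to the step that can actually meet a non-ASCII value: decode for byte input, encode for text input.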

2010-09-08 13:25:03

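Well, you are right except for some details. Since he can print 'a\xf5' correctly, his terminal encoding is not ASCII but something else. Console encoding is a very common problem, but it is not the issue this time. Also, 'teststringUni.decode("ascii", "ignore")' does not fail when you try to print the result. It tells Python that teststringUni is an ASCII-encoded string (it is obviously unicode, but Python trusts the user) and tries to decode it - which cannot work. – 2010-09-08 14:06:49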
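The tracebacks in the question are consistent with this: both report an 'ascii' encode error, even for decode("utf-8", "ignore"). A rough emulation of what Python 2 appears to do when decode is called on a unicode object, written in Python 3 as an assumption; `py2_style_decode` is a hypothetical name, not a real API.

```python
def py2_style_decode(text, encoding, errors="strict"):
    """Mimic Python 2's unicode.decode(): an implicit encode, then the decode."""
    # Step 1 (implicit in Python 2): unicode -> bytes with the default codec,
    # ASCII, in strict mode. The caller's `errors` argument is NOT applied here.
    raw = text.encode("ascii")
    # Step 2: the decode the caller actually asked for.
    return raw.decode(encoding, errors)


try:
    py2_style_decode("a\u00f5", "utf-8", "ignore")
except UnicodeEncodeError as exc:
    print("failed in the implicit encode step:", exc)

print(py2_style_decode("abc", "ascii", "ignore"))
```

Because the failure happens in the hidden encode step, the "ignore" argument never gets a chance to take effect.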

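Yes, I think this is the question: what is my terminal encoding? Just because the object type is string doesn't mean the encoding is ASCII, I understand that. My question now is how to convert something of type unicode into the string type for the terminal while keeping all the information. – fullmooninu 2010-09-08 15:03:14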
