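I am using Python 3.6.0b2 and getting: 'utf-8' codec can't encode character '\udcc2': surrogates not allowed
I parse a lot of emails. This particular email is a problem because I cannot print the display name of its email address. Trying to print the display name gives: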
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 30: surrogates not allowed
Here is a small test case that reproduces the problem:
(venv3.6) [email protected]:/opt/mailripper$ cat test.py
from email import policy
from email.headerregistry import Address
from email.parser import BytesHeaderParser, BytesParser
email_bytes = b'From: =?utf-8?Q?John_Smith=2C_Prince2=C2=AE=2CPMP=C2=AE=2C_CSM=C2?=\r\n =?utf-8?Q?=AE=2C_ITIL=C2=AE=2C_ISTQB=C2=AE?= <[email protected]>\r\n'
msg = BytesHeaderParser(policy=policy.default).parsebytes(email_bytes)
print(msg['from'])
print(msg['from'].addresses[0].display_name)
Here is the error produced by the code above:
(venv3.6) [email protected]:/opt/mailripper$ python test.py
"John Smith, Prince2®,PMP®, CSM� �, ITIL®, ISTQB®" <[email protected]>
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print(msg['from'].addresses[0].display_name)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 30: surrogates not allowed
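For context, here is a minimal sketch of where the lone surrogate in the traceback appears to come from. It assumes each encoded word is decoded separately with the surrogateescape error handler. The header splits the two-byte UTF-8 sequence C2 AE (the ® sign) across the two encoded words, so the first word ends in a lone =C2 and the second starts with a lone =AE. The variable names are only for illustration, and quopri stands in for whatever the email package does internally.

```python
import quopri

# The two RFC 2047 encoded-word payloads copied from the From header above.
first = b"John_Smith=2C_Prince2=C2=AE=2CPMP=C2=AE=2C_CSM=C2"
second = b"=AE=2C_ITIL=C2=AE=2C_ISTQB=C2=AE"

# header=True applies the Q-encoding rule that "_" stands for a space.
first_raw = quopri.decodestring(first, header=True)
second_raw = quopri.decodestring(second, header=True)

# Decoding each word on its own leaves a stray 0xC2 and a stray 0xAE byte;
# surrogateescape turns them into the lone surrogates U+DCC2 and U+DCAE.
piecewise = (first_raw.decode("utf-8", "surrogateescape")
             + second_raw.decode("utf-8", "surrogateescape"))

# Joining the raw bytes before decoding keeps the C2 AE pair intact.
joined = (first_raw + second_raw).decode("utf-8")

# Assumption: if the two surrogates are still adjacent, a round trip through
# surrogateescape restores the original bytes, which then decode cleanly.
repaired = piecewise.encode("utf-8", "surrogateescape").decode("utf-8", "replace")

print(ascii(piecewise))  # ascii() avoids the UnicodeEncodeError when printing
print(joined)
print(repaired == joined)
```

Whether the same round trip repairs the display_name returned by the email package depends on whether the parser keeps the two halves adjacent. The output above shows a gap between the two replacement characters, so on this Python version it may only yield replacement characters instead of ®.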
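Here is the display name as shown by the OS X email client, which seems able to parse it without trouble (this is a screenshot, cropped small):
My goal is to be able to process any email without Unicode errors, and without writing custom Unicode error handling code. Is this possible?
Can anyone suggest what I can do to avoid Unicode errors when displaying the display name of an email address?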