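I'll start from the question: "Is there some way I can use a different parser that might be less strict and allow UTF-8 characters?"
All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.
An XML document may start with a declaration like this:
`<?xml version="1.0" encoding="UTF-8"?>`
or like this: `<?xml version="1.0"?>`
or have no declaration at all. In each case the parser will decode the document as UTF-8.
However, your data is NOT encoded in UTF-8. It is probably Windows-1252, also known as cp1252.
If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one), or the recipient can transcode the data to UTF-8. The following shows what works and what doesn't: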
>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio
>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration
>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8
>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again
>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works
>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception
>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8
>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
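The session above is Python 2 (`StringIO`, `str.decode`). For readers on Python 3, here is a rough sketch of the same two fixes, assuming the input really is cp1252. The helper names `parse_with_declaration` and `parse_after_transcoding` are made up for illustration and are not part of any library.

```python
import xml.etree.ElementTree as ET

# cp1252 bytes with no XML declaration; 0x92 is a curly apostrophe in cp1252
raw_bytes = b"<root>can\x92t</root>"


def parse_with_declaration(data, encoding="cp1252"):
    """Prepend an XML declaration naming the assumed encoding, then parse."""
    declaration = ('<?xml version="1.0" encoding="%s"?>' % encoding).encode("ascii")
    return ET.fromstring(declaration + data)


def parse_after_transcoding(data, encoding="cp1252"):
    """Decode from the assumed encoding, re-encode as UTF-8, then parse."""
    utf8_bytes = data.decode(encoding).encode("utf-8")
    return ET.fromstring(utf8_bytes)


if __name__ == "__main__":
    try:
        # No declaration, so the parser assumes UTF-8 and rejects byte 0x92
        ET.fromstring(raw_bytes)
    except ET.ParseError as exc:
        print("as-is:", exc)
    print("declared:", repr(parse_with_declaration(raw_bytes).text))
    print("transcoded:", repr(parse_after_transcoding(raw_bytes).text))
```

Both helpers only work if the guess about the source encoding is right; a wrong guess can parse without error and silently produce the wrong characters.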
Not European, we are definitely in the US. I didn't do this, I promise :) – Kekoa 2009-07-16 21:37:35