I found a solution. First, I created a new codec error handler, and then I monkey patched ElementTree._get_writer() so that it uses the new error handler. It looks like this:
from xml.etree import ElementTree
import io
import contextlib
import codecs
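
def lower_first(s):
    # Lowercase only the first character, e.g. 'X1F600' -> 'x1F600'
    return s[:1].lower() + s[1:] if s else ''

def html_replace(exc):
    # Codec error handler: emit hexadecimal character references (&#x...;)
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in exc.object[exc.start:exc.end]:
            s.append('&#%s;' % lower_first(hex(ord(c))[1:].upper()))
        return ''.join(s), exc.end
    else:
        # exc is an instance, so the class name has to come from its type
        raise TypeError("can't handle %s" % type(exc).__name__)

codecs.register_error('html_replace', html_replace)
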
# monkey patch this Python function to prevent it from using xmlcharrefreplace
@contextlib.contextmanager
def _get_writer(file_or_filename, encoding):
    # returns text write method and release all resources after using
    try:
        write = file_or_filename.write
    except AttributeError:
        # file_or_filename is a file name
        if encoding == "unicode":
            file = open(file_or_filename, "w")
        else:
            file = open(file_or_filename, "w", encoding=encoding,
                        errors="html_replace")
        with file:
            yield file.write
    else:
        # file_or_filename is a file-like object
        # encoding determines if it is a text or binary writer
        if encoding == "unicode":
            # use a text writer as is
            yield write
        else:
            # wrap a binary writer with TextIOWrapper
            with contextlib.ExitStack() as stack:
                if isinstance(file_or_filename, io.BufferedIOBase):
                    file = file_or_filename
                elif isinstance(file_or_filename, io.RawIOBase):
                    file = io.BufferedWriter(file_or_filename)
                    # Keep the original file open when the BufferedWriter is
                    # destroyed
                    stack.callback(file.detach)
                else:
                    # This is to handle passed objects that aren't in the
                    # IOBase hierarchy, but just have a write method
                    file = io.BufferedIOBase()
                    file.writable = lambda: True
                    file.write = write
                    try:
                        # TextIOWrapper uses these methods to determine
                        # if BOM (for UTF-16, etc) should be added
                        file.seekable = file_or_filename.seekable
                        file.tell = file_or_filename.tell
                    except AttributeError:
                        pass
                file = io.TextIOWrapper(file,
                                        encoding=encoding,
                                        errors='html_replace',
                                        newline="\n")
                # Keep the original file open when the TextIOWrapper is
                # destroyed
                stack.callback(file.detach)
                yield file.write

ElementTree._get_writer = _get_writer
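
To see what the error handler does on its own, without involving the ElementTree patch, here is a minimal sketch that calls str.encode() directly. The handler name 'hex_charref_demo', the function hex_charref and the sample text are made up for this illustration; the function is a simplified equivalent of html_replace above, not the exact same code.

```python
import codecs

def hex_charref(exc):
    """Replace unencodable characters with hexadecimal character references."""
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        bad = exc.object[exc.start:exc.end]
        # '%X' gives uppercase hex digits; the 'x' prefix stays lowercase
        return ''.join('&#x%X;' % ord(c) for c in bad), exc.end
    raise TypeError("can't handle %s" % type(exc).__name__)

# Hypothetical handler name, chosen so it does not clash with 'html_replace'
codecs.register_error('hex_charref_demo', hex_charref)

text = 'Beyoncé ☃'

# Custom handler: hexadecimal references
hex_bytes = text.encode('ascii', errors='hex_charref_demo')
print(hex_bytes)   # b'Beyonc&#xE9; &#x2603;'

# Built-in handler that ElementTree uses by default: decimal references
dec_bytes = text.encode('ascii', errors='xmlcharrefreplace')
print(dec_bytes)   # b'Beyonc&#233; &#9731;'
```

The only difference between the two outputs is the notation of the references, which is why swapping the errors argument inside _get_writer() is enough to change what ElementTree writes.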
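That is rather annoying. Sorry, I don't know ElementTree well enough to answer your question. (FWIW, my e-reader copes better with decimal than with hex, so I have the opposite problem.) If you don't find a way to force hex, it's easy to convert the decimal entities to hex with a regex. OTOH, in this day and age most devices have good UTF-8 support, so you could convert those entities to Unicode and encode the output file as UTF-8. –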
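I don't want to change the format of the database file by using a different encoding or different code points. I want it to stay fully compatible with Rhythmbox's format. – moorepants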
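That makes sense. OTOH, I'd be surprised if Rhythmbox didn't use UTF-8 for its XML files. And of course ASCII is a subset of UTF-8, so even if Rhythmbox supports UTF-8, it's fine to keep your XML strictly ASCII. –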
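
For reference, the regex approach suggested in the comments could look roughly like this. It is only a sketch: the function name decimal_refs_to_hex and the sample string are invented for the example, and it post-processes text that has already been serialized rather than changing how ElementTree writes it.

```python
import re

# Matches decimal character references such as &#233; but not &#xE9; or &amp;
_DECIMAL_REF = re.compile(r'&#([0-9]+);')

def decimal_refs_to_hex(xml_text):
    """Rewrite decimal character references as hexadecimal ones."""
    return _DECIMAL_REF.sub(lambda m: '&#x%X;' % int(m.group(1)), xml_text)

converted = decimal_refs_to_hex('Beyonc&#233; &#9731; &#xE9; &amp;')
print(converted)   # Beyonc&#xE9; &#x2603; &#xE9; &amp;
```

References that are already hexadecimal and named entities are left untouched, because the pattern only accepts digits after "&#".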