2017-10-21 90 views

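How do I preserve hexadecimal character references when writing an ElementTree in Python?

I have loaded an XML file (Rhythmbox's database file) into Python 3 with the ElementTree parser. After modifying the tree and writing it to disk with ElementTree.write() using the ascii encoding, every hexadecimal character reference has been converted to a decimal character reference. For example, here is a diff showing a copyright symbol: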

<  <copyright>&#xA9; WNYC</copyright> 
--- 
>  <copyright>&#169; WNYC</copyright> 

Is there any way to tell Python/ElementTree not to do this? I want all hexadecimal character references to stay hexadecimal.
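The behaviour described in the question can be reproduced without a file. This is a minimal sketch, assuming the standard-library serializer is used as-is; the `source` string is just the copyright line from the diff, and `us-ascii` is used instead of `ascii` only so that no XML declaration is expected in the output.

```python
from xml.etree import ElementTree

# One element taken from the diff in the question.
source = '<copyright>&#xA9; WNYC</copyright>'

# Parsing resolves the reference: the tree stores the character itself,
# not the way it was spelled in the file.
element = ElementTree.fromstring(source)

# Serializing to an ASCII encoding has to escape the character again,
# and the serializer emits a decimal reference when it does so.
written = ElementTree.tostring(element, encoding='us-ascii')
print(written)
```

Because the parsed tree only holds the character, the original hexadecimal spelling is already gone before anything is written; any fix has to act at serialization time.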

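How annoying. Sorry, I don't know ElementTree well enough to answer your question. (FWIW, my e-reader handles decimal better than hex, so I have the opposite problem.) If you can't find a way to force hex, it's easy to convert decimal entities to hex with a regular expression. OTOH, most devices these days have good UTF-8 support, so you could convert those entities to Unicode and encode the output file as UTF-8.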
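The regular-expression fallback mentioned in this comment might look like the sketch below. The function name `decimal_refs_to_hex` is hypothetical, and the approach assumes every decimal reference in the output should become hexadecimal, which would also rewrite references that were decimal in the original file.

```python
import re

# Matches decimal character references such as '&#169;'.
_DECIMAL_REF = re.compile(r'&#([0-9]+);')


def decimal_refs_to_hex(text):
    """Rewrite every decimal character reference as an upper-case hex one."""
    return _DECIMAL_REF.sub(lambda m: '&#x%X;' % int(m.group(1)), text)


converted = decimal_refs_to_hex('<copyright>&#169; WNYC</copyright>')
print(converted)  # <copyright>&#xA9; WNYC</copyright>
```

This would be run over the serialized text after ElementTree has written it, so it is a post-processing step rather than a change to the writer.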

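I don't want to change the format of the database file by using a different encoding or a different style of character reference. I want it to stay exactly compatible with Rhythmbox's format. – moorepants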

That makes sense. OTOH, I'd be surprised if Rhythmbox didn't use UTF-8 for its XML files. Of course, ASCII is a subset of UTF-8, so even if Rhythmbox supports UTF-8, you can keep your XML strictly ASCII.
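To illustrate the two points in this comment, here is a small sketch using only the standard library. It shows that a UTF-8 write needs no character references at all, and that strictly ASCII output is still valid UTF-8 that parses back to the same text. Whether Rhythmbox accepts either form is not verified here.

```python
from xml.etree import ElementTree

element = ElementTree.fromstring('<copyright>&#xA9; WNYC</copyright>')

# Written as UTF-8, the copyright sign is stored as raw bytes.
utf8_bytes = ElementTree.tostring(element, encoding='utf-8')

# Written as ASCII, the same character is escaped as a reference.
ascii_bytes = ElementTree.tostring(element, encoding='us-ascii')

# ASCII is a subset of UTF-8, so the ASCII output decodes as UTF-8
# and round-trips to the same text.
roundtrip = ElementTree.fromstring(ascii_bytes.decode('utf-8'))
print(roundtrip.text == element.text)
```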

Answers

I found a solution. First, I created a new codec error handler, and then I monkey-patched ElementTree._get_writer() to use the new error handler. It looks like this:

from xml.etree import ElementTree 
import io 
import contextlib 
import codecs 


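def lower_first(s):
    # Lower-case only the first character, e.g. 'XA9' -> 'xA9'
    return s[:1].lower() + s[1:] if s else ''


def html_replace(exc):
    # Codec error handler: replace unencodable characters with hexadecimal
    # character references (e.g. '&#xA9;') instead of decimal ones
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in exc.object[exc.start:exc.end]:
            # hex(169) == '0xa9' -> 'xa9' -> 'XA9' -> 'xA9'
            s.append('&#%s;' % lower_first(hex(ord(c))[1:].upper()))
        return ''.join(s), exc.end
    else:
        # exc is an instance, so the class name lives on type(exc)
        raise TypeError("can't handle %s" % type(exc).__name__)

codecs.register_error('html_replace', html_replace)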


# monkey patch this python function to prevent it from using xmlcharrefreplace
@contextlib.contextmanager
def _get_writer(file_or_filename, encoding):
    # returns text write method and release all resources after using
    try:
        write = file_or_filename.write
    except AttributeError:
        # file_or_filename is a file name
        if encoding == "unicode":
            file = open(file_or_filename, "w")
        else:
            file = open(file_or_filename, "w", encoding=encoding,
                        errors="html_replace")
        with file:
            yield file.write
    else:
        # file_or_filename is a file-like object
        # encoding determines if it is a text or binary writer
        if encoding == "unicode":
            # use a text writer as is
            yield write
        else:
            # wrap a binary writer with TextIOWrapper
            with contextlib.ExitStack() as stack:
                if isinstance(file_or_filename, io.BufferedIOBase):
                    file = file_or_filename
                elif isinstance(file_or_filename, io.RawIOBase):
                    file = io.BufferedWriter(file_or_filename)
                    # Keep the original file open when the BufferedWriter is
                    # destroyed
                    stack.callback(file.detach)
                else:
                    # This is to handle passed objects that aren't in the
                    # IOBase hierarchy, but just have a write method
                    file = io.BufferedIOBase()
                    file.writable = lambda: True
                    file.write = write
                    try:
                        # TextIOWrapper uses these methods to determine
                        # if BOM (for UTF-16, etc) should be added
                        file.seekable = file_or_filename.seekable
                        file.tell = file_or_filename.tell
                    except AttributeError:
                        pass
                file = io.TextIOWrapper(file,
                                        encoding=encoding,
                                        errors='html_replace',
                                        newline="\n")
                # Keep the original file open when the TextIOWrapper is
                # destroyed
                stack.callback(file.detach)
                yield file.write

ElementTree._get_writer = _get_writer 

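I haven't studied your code closely (I'd need to know more about ElementTree to fully understand it), but you could simplify the core of 'html_replace' to: 's.append('&#x%X;' % ord(c))', which is both more compact and faster.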
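The simplified handler suggested in this comment can be tried on its own, without patching ElementTree. A minimal sketch follows; the handler name `hex_charref_demo` is hypothetical and was chosen so it does not clash with the `html_replace` registration in the answer.

```python
import codecs


def hex_charref_replace(exc):
    """Replace unencodable characters with upper-case hex references."""
    if not isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        raise TypeError("can't handle %s" % type(exc).__name__)
    # '%X' formats the code point as upper-case hex, e.g. 169 -> 'A9'.
    refs = ['&#x%X;' % ord(c) for c in exc.object[exc.start:exc.end]]
    return ''.join(refs), exc.end


# Register under a demo name (an assumption of this sketch).
codecs.register_error('hex_charref_demo', hex_charref_replace)

text = '\xa9 WNYC'  # the copyright sign followed by ' WNYC'
print(text.encode('ascii', errors='hex_charref_demo'))   # b'&#xA9; WNYC'
print(text.encode('ascii', errors='xmlcharrefreplace'))  # b'&#169; WNYC'
```

The second call uses the built-in `xmlcharrefreplace` handler, which is the one the monkey patch in the answer is meant to bypass, so the two lines show the decimal and hexadecimal results side by side.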
