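Replacing XML content with ElementTree in Python

I would like to parse and compare two XML files with Python's ElementTree parser, as follows.

I have two XML files loaded with data. One is in English (the source file), the other is the corresponding French translation (the target file). For example:

Source file: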

<AB> 
    <CD/> 
    <EF> 

    <GH> 
     <id>123</id> 
     <IJ>xyz</IJ> 
     <KL>DOG</KL> 
     <MN>dogs/dog</MN> 
     some more tags and info on same level 
     <metadata> 
     <entry> 
      <cl>Translation</cl> 
      <cl>English:dog/dogs</cl> 
     </entry> 
     <entry> 
      <string>blabla</string> 
      <string>blabla</string> 
     </entry> 
      some more strings and entries 
     </metadata> 
    </GH> 

    </EF> 
    <stuff/> 
    <morestuff/> 
    <otherstuff/> 
    <stuffstuff/> 
    <blubb/> 
    <bla/> 
    <blubbbla>8</blubbbla>
</AB>

The target file looks exactly the same, but has no text in some places:

<MN>chiens/chien</MN> 
some more tags and info on same level 
<metadata> 
    <entry> 
    <cl>Translation</cl> 
    <cl></cl> 
    </entry>

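The French target file has an empty cross-language reference. Whenever two macros have the same ID, I would like to fill it with the information from the English source file. I have already written some code in which I replaced the string tag name with a unique tag name so that the cross-language reference can be identified. Now I want to compare the two files and, if two macros have the same ID, replace the empty reference in the French file with the information from the English file. I tried the minidom parser before but got stuck, and would now like to try ElementTree. I have almost no programming knowledge and find this very hard. Here is the code I have so far: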

macros = ElementTree.parse(english)

for tag in macros.getchildren('macro'):
    id_ = tag.find('id')
    data = tag.find('cl')
    id_dict[id_.text] = data.text

macros = ElementTree.parse(french)

for tag in macros.getchildren('macro'):
    id_ = tag.find('id')
    target = tag.find('cl')
    if target.text.strip() == '':
        target.text = id_dict[id_.text]

print (ElementTree.tostring(macros))

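I am rather helpless with this, and reading other posts has confused me even more. I would be very grateful if someone could enlighten me :-)

Source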

2012-07-16 Kaly

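It would be better to attach a more complex sample, so that the solution can be more accurate. – pepr 2012-07-17 08:04:13

There may be more details to clarify. Here is an example with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you only want to go one level below the root: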

import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

# Get the root elements, as they support iteration
# through their children (direct descendants).
english_root = english_tree.getroot()
french_root = french_tree.getroot()

# Iterate through the direct descendants of the root
# elements in both trees in parallel.
for en, fr in zip(english_root, french_root):
    assert en.tag == fr.tag  # check for the same structure
    if en.tag == 'id':
        assert en.text == fr.text  # check for the same id
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display what was replaced

etree.dump(french_tree)

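For a more complex file structure, the loop over the direct children of a node can be replaced by an iteration over all elements of the tree. If the files have exactly the same structure, the following code will work: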

import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for en, fr in zip(english_tree.iter(), french_tree.iter()):
    assert en.tag == fr.tag  # check if the structure is the same
    if en.tag == 'id':
        assert en.text == fr.text  # identification must be the same
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display the inserted text

# Write the result to the output file. tostring() returns bytes
# in Python 3, so decode before writing to a text-mode file.
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
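
Because `zip` stops at the end of the shorter sequence, extra trailing elements in one of the files would go unnoticed by the loop above. A minimal sketch of a check that could be run first is shown here; the helper name `same_structure` and the tiny in-memory documents are made up for illustration. It only compares the sequence of tags in document order, not the nesting.

```python
import xml.etree.ElementTree as etree


def same_structure(root_a, root_b):
    """Return True when both trees list the same tags in the same document order.

    Note: nesting depth is not compared, only the flat tag sequence,
    which is what the zip-based loop relies on.
    """
    tags_a = [elem.tag for elem in root_a.iter()]
    tags_b = [elem.tag for elem in root_b.iter()]
    return tags_a == tags_b


# Hypothetical sample data, not taken from the real files.
en_root = etree.fromstring('<AB><GH><id>1</id><string>dog</string></GH></AB>')
fr_root = etree.fromstring('<AB><GH><id>1</id><string/></GH></AB>')
fr_short = etree.fromstring('<AB><GH><id>1</id></GH></AB>')

print(same_structure(en_root, fr_root))   # structures match
print(same_structure(en_root, fr_short))  # the <string> element is missing
```

If the check fails, the zip-based approach is probably not safe to use on those files.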

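However, this works only when both files have exactly the same structure. Let's instead follow the algorithm one would use when doing the task manually. First, we need to find an empty French translation. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used to search for the elements: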

import xml.etree.ElementTree as etree

def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned implicitly.
    for gh in tree.iter('GH'):
        id_elem = gh.find('./id')
        if id_ == id_elem.text:
            # The related GH element found.
            # Find metadata entry, extract the translation.
            # Warning! This is a simplification for the fixed position
            # of the Translation entry.
            me = gh.find('./metadata/entry')
            assert len(me) == 2  # metadata/entry has two elements
            cl1 = me[0]
            assert cl1.text == 'Translation'
            cl2 = me[1]

            return cl2.text


# Body of the program. --------------------------------------------------

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for gh in french_tree.iter('GH'):  # iterate through the GH elements only
    # Get the identification of the GH section.
    id_elem = gh.find('./id')
    id_ = id_elem.text

    # Find and check the metadata entry, extract the French translation.
    # Warning! This is a simplification for the fixed position of the
    # Translation entry.
    me = gh.find('./metadata/entry')
    assert len(me) == 2  # metadata/entry has two elements
    cl1 = me[0]
    assert cl1.text == 'Translation'
    cl2 = me[1]
    fr_translation = cl2.text

    # If the French translation is empty, put there the English translation
    # from the related element.
    if fr_translation is None:
        cl2.text = find_translation(english_tree, id_)


with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
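
The code above searches the English tree again for every empty French entry. One possible variation, closer to the dictionary idea in the question, is to build an id-to-translation mapping once and look the values up afterwards. The sketch below makes the same simplifying assumption about the fixed position of the Translation entry; the helper names `translation_cl` and `build_index` and the in-memory sample documents are hypothetical.

```python
import xml.etree.ElementTree as etree


def translation_cl(gh):
    """Return the second <cl> of the first metadata/entry, or None.

    Assumption (same as above): the Translation entry is the first
    <entry> under <metadata> and holds exactly two children.
    """
    entry = gh.find('./metadata/entry')
    if entry is None or len(entry) != 2 or entry[0].text != 'Translation':
        return None
    return entry[1]


def build_index(root):
    """Map the id of each GH element to the text of its translation."""
    index = {}
    for gh in root.iter('GH'):
        id_text = gh.findtext('./id')
        cl = translation_cl(gh)
        if id_text is not None and cl is not None:
            index[id_text] = cl.text
    return index


# Hypothetical sample data modelled on the files in the question.
EN = '''<AB><EF><GH><id>123</id><metadata>
<entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry>
</metadata></GH></EF></AB>'''
FR = '''<AB><EF><GH><id>123</id><metadata>
<entry><cl>Translation</cl><cl></cl></entry>
</metadata></GH></EF></AB>'''

en_root = etree.fromstring(EN)
fr_root = etree.fromstring(FR)
index = build_index(en_root)

for gh in fr_root.iter('GH'):
    cl = translation_cl(gh)
    # Treat missing text and whitespace-only text as empty.
    if cl is not None and not (cl.text or '').strip():
        cl.text = index.get(gh.findtext('./id'))

print(etree.tostring(fr_root, encoding='unicode'))
```

With real files, `etree.parse('en.xml').getroot()` would take the place of `etree.fromstring(EN)`.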

Source

2012-07-17 07:58:55 pepr

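Now it is time for XPath (the standard 'xml.etree.ElementTree' supports only some of its features, but they are powerful enough for this case). Try the updated answer (the last part). Fix the names of the input/output files. After that, I suggest cleaning up the comments here to make the page easier to read and more useful. – pepr 2012-07-17 14:26:12

Right.... If the translation entry is not at a fixed position, could I rename the "entry" tag around the translation to something unique and find it that way, or is that not recommended? (I tried this approach and it did not work, but I would like to know whether it is the right direction.) – Kaly 2012-07-17 15:18:41

Renaming tags should probably not be done in general. It would be better if the tag/element had its own special name. In this sense, '' is not a good example. But I understand that a user may decide to insert such a column interactively, and the underlying software cannot guess what the user wants. – pepr 2012-07-17 16:04:53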
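
On the question of a Translation entry that is not at a fixed position: instead of renaming tags, the entry could be located by its content. The following is only a sketch under the assumption that the wanted entry is the one whose first `cl` child contains the text "Translation"; the helper name `find_translation_cl` and the sample fragment are invented for illustration.

```python
import xml.etree.ElementTree as etree


def find_translation_cl(gh):
    """Return the <cl> that follows a <cl>Translation</cl> label, or None."""
    for entry in gh.iterfind('./metadata/entry'):
        cls = entry.findall('cl')
        if len(cls) == 2 and cls[0].text == 'Translation':
            return cls[1]
    return None


# Hypothetical fragment in which the Translation entry comes second.
SAMPLE = '''<GH><id>123</id><metadata>
<entry><string>blabla</string><string>blabla</string></entry>
<entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry>
</metadata></GH>'''

gh = etree.fromstring(SAMPLE)
cl = find_translation_cl(gh)
print(cl.text)

# The XPath subset in ElementTree also offers a [tag='text'] predicate,
# which selects the entry that has a <cl> child with exactly that text.
entry = gh.find("./metadata/entry[cl='Translation']")
print(entry[1].text)
```

Both lookups leave the tag names in the files untouched.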
