2012-07-16 110 views

Python: replacing XML content with Etree

I want to parse and compare 2 XML files with the Python Etree parser, as follows:

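I have 2 XML files that contain a lot of data. One is in English (the source file), the other is the corresponding French translation (the target file). For example:

Source file: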

<AB> 
    <CD/> 
    <EF> 

    <GH> 
     <id>123</id> 
     <IJ>xyz</IJ> 
     <KL>DOG</KL> 
     <MN>dogs/dog</MN> 
     some more tags and info on same level 
     <metadata> 
     <entry> 
      <cl>Translation</cl> 
      <cl>English:dog/dogs</cl> 
     </entry> 
     <entry> 
      <string>blabla</string> 
      <string>blabla</string> 
     </entry> 
      some more strings and entries 
     </metadata> 
    </GH> 

    </EF> 
    <stuff/> 
    <morestuff/> 
    <otherstuff/> 
    <stuffstuff/> 
    <blubb/> 
    <bla/> 
    <blubbla>8</blubbla> 
</AB> 

The target file looks exactly the same, but has no text in some places:

<MN>chiens/chien</MN> 
some more tags and info on same level 
<metadata> 
    <entry> 
    <cl>Translation</cl> 
    <cl></cl> 
    </entry> 

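The French target file has an empty cross-language reference, and I want to fill it with the information from the English source file whenever the 2 macros have the same ID. I have already written some code in which I replaced the string tag name with a unique tag name, so that the cross-language reference can be identified. Now I want to compare the two files and, if two macros have the same ID, replace the empty reference in the French file with the information from the English file. I tried the minidom parser before but got stuck, and would now like to try Etree. I have almost no programming knowledge and find this very hard. Here is the code I have so far: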

macros = ElementTree.parse(english)

for tag in macros.getchildren('macro'):
    id_ = tag.find('id')
    data = tag.find('cl')
    id_dict[id_.text] = data.text

macros = ElementTree.parse(french)

for tag in macros.getchildren('macro'):
    id_ = tag.find('id')
    target = tag.find('cl')
    if target.text.strip() == '':
        target.text = id_dict[id_.text]

print (ElementTree.tostring(macros))

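I am more than clueless here, and reading other posts has confused me even more. I would appreciate it very much if someone could enlighten me :-)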

It would be better to attach a more complex sample, so that the solution can be more accurate. – pepr 2012-07-17 08:04:13

Answers


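There are probably more details that need to be clarified. Here is a sample with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you only want to go one level below the root: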

import xml.etree.ElementTree as etree 

english_tree = etree.parse('en.xml') 
french_tree = etree.parse('fr.xml') 

# Get the root elements, as they support iteration 
# through their children (direct descendants) 
english_root = english_tree.getroot() 
french_root = french_tree.getroot() 

# Iterate through the direct descendants of the root 
# elements in both trees in parallel. 
for en, fr in zip(english_root, french_root):
    assert en.tag == fr.tag  # check for the same structure
    if en.tag == 'id':
        assert en.text == fr.text  # check for the same id
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display what was replaced

etree.dump(french_tree) 
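
A side note on the `fr.text is None` test: ElementTree stores the text of an element such as `<cl></cl>` or `<cl/>` as `None`, not as an empty string, so calling `.strip()` on it directly (as the code in the question does) raises an `AttributeError`. Here is a minimal sketch of a check that also treats whitespace-only text as empty. The helper name `is_blank` is made up for illustration and is not part of ElementTree:

```python
import xml.etree.ElementTree as etree


def is_blank(elem):
    """Return True if the element has no text or only whitespace."""
    return elem.text is None or elem.text.strip() == ''


# Three ways an "empty" element can show up, followed by a filled one.
root = etree.fromstring('<entry><cl></cl><cl/><cl>  </cl><cl>dog</cl></entry>')
blank_flags = [is_blank(cl) for cl in root]
print(blank_flags)
```

Whether whitespace-only text should count as empty depends on how the files were produced, so treat that part as an assumption.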

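For a more complex file structure, the loop through the direct children of a node can be replaced by iteration through all elements of the tree. If the structure of the files is exactly the same, the following code will work: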

import xml.etree.ElementTree as etree 

english_tree = etree.parse('en.xml') 
french_tree = etree.parse('fr.xml') 

for en, fr in zip(english_tree.iter(), french_tree.iter()):
    assert en.tag == fr.tag  # check if the structure is the same
    if en.tag == 'id':
        assert en.text == fr.text  # identification must be the same
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display the inserted text

# Write the result to the output file. encoding='unicode' makes
# tostring() return str, which a text-mode file requires in Python 3.
with open('fr2.xml', 'w', encoding='utf-8') as fout:
    fout.write(etree.tostring(french_tree.getroot(), encoding='unicode'))
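
One caveat with the parallel iteration: `zip()` stops silently at the end of the shorter sequence, so the `assert` on the tags cannot notice elements that are missing at the end of one file. A rough sketch of a check that could be run first is shown below. The function name `same_structure` and the inline XML strings are assumptions for illustration only:

```python
import xml.etree.ElementTree as etree


def same_structure(root_a, root_b):
    """Compare the tag sequences of two trees in document order."""
    tags_a = [elem.tag for elem in root_a.iter()]
    tags_b = [elem.tag for elem in root_b.iter()]
    return tags_a == tags_b


en = etree.fromstring('<AB><GH><id>1</id><KL>DOG</KL></GH></AB>')
fr_ok = etree.fromstring('<AB><GH><id>1</id><KL>CHIEN</KL></GH></AB>')
fr_short = etree.fromstring('<AB><GH><id>1</id></GH></AB>')  # KL is missing

print(same_structure(en, fr_ok))
print(same_structure(en, fr_short))
```

Comparing flat tag lists does not verify the nesting itself, but it does catch a missing or extra element, which the `zip()` loop on its own would not report.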

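However, this works only when both files have exactly the same structure. Let's follow the algorithm one would use when doing the task by hand. First, we need to find a French translation that is empty. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used to search for the elements: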

import xml.etree.ElementTree as etree 

def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned implicitly.
    for gh in tree.iter('GH'):
        id_elem = gh.find('./id')
        if id_ == id_elem.text:
            # The related GH element found.
            # Find metadata entry, extract the translation.
            # Warning! This is a simplification for the fixed position
            # of the Translation entry.
            me = gh.find('./metadata/entry')
            assert len(me) == 2  # metadata/entry has two elements
            cl1 = me[0]
            assert cl1.text == 'Translation'
            cl2 = me[1]

            return cl2.text


# Body of the program. -------------------------------------------------- 

english_tree = etree.parse('en.xml') 
french_tree = etree.parse('fr.xml') 

for gh in french_tree.iter('GH'): # iterate through the GH elements only 
    # Get the identification of the GH section 
    id_elem = gh.find('./id')  
    id_ = id_elem.text 

    # Find and check the metadata entry, extract the French translation. 
    # Warning! This is simplification for the fixed position of the Translation 
    # entry. 
    me = gh.find('./metadata/entry') 
    assert len(me) == 2  # metadata/entry has two elements 
    cl1 = me[0] 
    assert cl1.text == 'Translation' 
    cl2 = me[1] 
    fr_translation = cl2.text

    # If the French translation is empty, put there the English translation
    # from the related element.
    if fr_translation is None:
        cl2.text = find_translation(english_tree, id_)


with open('fr2.xml', 'w') as fout: 
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8')) 
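
The code above rescans the English tree for every empty French entry. An alternative, closer to the `id_dict` idea in the question, is to collect the English translations into a dictionary once and then look them up by id. The following is only a sketch under the same fixed-position assumption. The helper names (`get_translation_cl`, `build_translation_map`, `fill_missing`) and the shortened inline XML are made up for illustration:

```python
import xml.etree.ElementTree as etree


def get_translation_cl(gh):
    """Return the second child of the first metadata/entry, or None."""
    entry = gh.find('./metadata/entry')
    if entry is None or len(entry) < 2:
        return None
    return entry[1]


def build_translation_map(root):
    """Map each GH id to the text of its translation element."""
    result = {}
    for gh in root.iter('GH'):
        cl = get_translation_cl(gh)
        if cl is not None:
            result[gh.findtext('./id')] = cl.text
    return result


def fill_missing(fr_root, en_map):
    """Copy the English text into empty French translation elements."""
    filled = 0
    for gh in fr_root.iter('GH'):
        cl = get_translation_cl(gh)
        key = gh.findtext('./id')
        if cl is not None and cl.text is None and key in en_map:
            cl.text = en_map[key]
            filled += 1
    return filled


# Shortened stand-ins for en.xml and fr.xml (assumed layout).
EN = ('<AB><EF><GH><id>123</id><metadata><entry>'
      '<cl>Translation</cl><cl>English:dog/dogs</cl>'
      '</entry></metadata></GH></EF></AB>')
FR = ('<AB><EF><GH><id>123</id><metadata><entry>'
      '<cl>Translation</cl><cl></cl>'
      '</entry></metadata></GH></EF></AB>')

en_root = etree.fromstring(EN)
fr_root = etree.fromstring(FR)
count = fill_missing(fr_root, build_translation_map(en_root))
print(count, fr_root.find('.//entry')[1].text)
```

With real files, `etree.parse('en.xml').getroot()` would take the place of `etree.fromstring(EN)`.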
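
Now it is time for XPath (the standard 'xml.etree.ElementTree' supports only some of its features, but they are powerful enough for this case). Try the modified answer (the last part). Fix the names of the input/output files. After that, I suggest cleaning up the comments here to make the thread easier to read and more useful. – pepr 2012-07-17 14:26:12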

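Right.... If the Translation entry is not at a fixed position, could I rename the "entry" tag around the translation to something unique and find it that way, or is that not advisable? (I tried this approach and it did not work, but I would like to know whether it is the right direction.) – Kaly 2012-07-17 15:18:41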

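Renaming tags should probably not be done in the general case. It would be better if the tag/element had its own special name. In that sense, '' is not a good example. But I understand that a user may decide to insert that column interactively, and the underlying software cannot guess what the user wants. – pepr 2012-07-17 16:04:53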
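
On the case where the Translation entry is not at a fixed position: instead of renaming tags, one option is to walk through all `metadata/entry` elements and pick the one whose first child has the text 'Translation'. This is a hedged sketch and not something from the thread itself. The function name `find_translation_entry` and the inline sample are assumptions for illustration:

```python
import xml.etree.ElementTree as etree


def find_translation_entry(gh):
    """Return the first metadata entry whose first child reads 'Translation'."""
    for entry in gh.iterfind('./metadata/entry'):
        if len(entry) >= 2 and entry[0].text == 'Translation':
            return entry
    return None


# Here the Translation entry comes second, not first.
GH = etree.fromstring(
    '<GH><id>123</id><metadata>'
    '<entry><string>blabla</string><string>blabla</string></entry>'
    '<entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry>'
    '</metadata></GH>')

entry = find_translation_entry(GH)
print(entry[1].text)
```

The function returns `None` when no such entry exists, so the caller should check for that before indexing.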