2010-09-21 111 views
0

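Creating an XML tree from a text file in Python

I need to avoid creating duplicate branches in an XML tree when parsing a text file. Say the text file looks like this (the order of the lines is random):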

BRANCH1:branch11:message11
BRANCH1:branch12:message12
BRANCH2:branch21:message21
BRANCH2:branch22:message22

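So the resulting XML tree should have a root with two branches, and each of those branches should have two sub-branches. The Python code I use to parse this text file is as follows: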

import string 
fh = open ('xmlbasic.txt', 'r') 
allLines = fh.readlines() 
fh.close() 
import xml.etree.ElementTree as ET 
root = ET.Element('root') 

for line in allLines: 
    tempv = line.split(':') 
    branch1 = ET.SubElement(root, tempv[0]) 
    branch2 = ET.SubElement(branch1, tempv[1]) 
    branch2.text = tempv[2] 

tree = ET.ElementTree(root) 
tree.write('xmlbasictree.xml') 

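The problem with this code is that a new branch is created in the XML tree for every line of the text file.

Any suggestions on how to avoid creating another branch in the XML tree if a branch with that name already exists?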

Answers

1
import xml.etree.ElementTree as ET

with open("xmlbasic.txt") as lines_file:
    # splitlines() gives a list of lines without trailing newlines
    lines = lines_file.read().splitlines()

root = ET.Element('root')

for line in lines:
    if not line.strip():
        continue  # skip blank lines
    head, subhead, tail = line.strip().split(":", 2)

    # Compare with None: an element without children tests as false
    head_branch = root.find(head)
    if head_branch is None:
        head_branch = ET.SubElement(root, head)

    subhead_branch = head_branch.find(subhead)
    if subhead_branch is None:
        subhead_branch = ET.SubElement(head_branch, subhead)

    subhead_branch.text = tail

tree = ET.ElementTree(root)
ET.dump(tree)

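The logic is simple, and you already described it in your question! Before creating a branch, you just check whether it already exists in the tree.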

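Note that this can be inefficient, because for every line find() does a linear scan over the children of the element being searched. That is because ElementTree is not designed around unique tags.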
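One detail worth keeping in mind when checking the result of find(): as far as I know, an element that has no children is treated as false in a boolean test, so `if not element` cannot tell "missing" apart from "present but empty". A minimal sketch of the check-then-create step with an explicit None comparison follows; `get_or_create` is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag):
    """Return the direct child of `parent` named `tag`, creating it if absent."""
    child = parent.find(tag)       # find() returns None when nothing matches
    if child is None:              # explicit None check, not a truthiness test
        child = ET.SubElement(parent, tag)
    return child


root = ET.Element("root")
first = get_or_create(root, "BRANCH1")
second = get_or_create(root, "BRANCH1")   # reuses the element created above
```

Calling the helper twice with the same tag should hand back the same element, so the root ends up with a single BRANCH1 child.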


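If you need the speed (you probably don't, especially for small trees!), a more efficient approach is to store the tree structure in a defaultdict before converting it to an ElementTree.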

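import collections
import xml.etree.ElementTree as ET

with open("xmlbasic.txt") as lines_file:
    # splitlines() gives a list of lines without trailing newlines
    lines = lines_file.read().splitlines()

# Nested mapping: head -> {subhead: message}
root_dict = collections.defaultdict(dict)
for line in lines:
    if not line.strip():
        continue  # skip blank lines
    head, subhead, tail = line.strip().split(":", 2)
    root_dict[head][subhead] = tail

# Build the element tree once, from the de-duplicated mapping
root = ET.Element('root')
for head, branch in root_dict.items():
    head_element = ET.SubElement(root, head)
    for subhead, tail in branch.items():
        ET.SubElement(head_element, subhead).text = tail

tree = ET.ElementTree(root)
ET.dump(tree)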

Thanks, this and the other answers are all good, but I'll stick with defaultdict, because the actual text and XML files are quite large. – bitman 2010-09-21 11:54:26
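For large inputs, one option is to iterate over the file object directly instead of reading everything into a list first. The following is only a sketch under the assumption that every non-blank line has the `head:subhead:message` shape from the question; `build_root` is a hypothetical helper name, and an in-memory io.StringIO stands in for the real file.

```python
import collections
import io
import xml.etree.ElementTree as ET


def build_root(line_iter):
    """Group lines by head and subhead, then build the element tree once."""
    grouped = collections.defaultdict(dict)
    for raw in line_iter:              # consumes one line at a time
        line = raw.strip()
        if not line:
            continue                   # ignore blank lines
        head, subhead, tail = line.split(":", 2)
        grouped[head][subhead] = tail  # a repeated pair keeps the last message

    root = ET.Element("root")
    for head, branch in grouped.items():
        head_element = ET.SubElement(root, head)
        for subhead, tail in branch.items():
            ET.SubElement(head_element, subhead).text = tail
    return root


# Stand-in for open("xmlbasic.txt"); a real file object is iterated the same way.
sample = io.StringIO(
    "BRANCH2:branch21:message21\n"
    "BRANCH1:branch11:message11\n"
    "BRANCH2:branch22:message22\n"
    "BRANCH1:branch12:message12\n"
)
root = build_root(sample)
```

The intermediate dictionary still holds every message in memory, so this only avoids keeping a second copy of the raw lines.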

0

Something along these lines? You keep the first-level branches in a dictionary so they can be reused.

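b1map = {}  # first-level tag -> element already created for it

# Reuses root, ET and allLines from the question's code
for line in allLines:
    tempv = line.strip().split(':')
    branch1 = b1map.get(tempv[0])
    if branch1 is None:
        branch1 = b1map[tempv[0]] = ET.SubElement(root, tempv[0])
    branch2 = ET.SubElement(branch1, tempv[1])
    branch2.text = tempv[2]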
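The snippet above only reuses first-level branches. If second-level names can repeat as well, the same dictionary idea could be extended to both levels. This is a sketch, not the answerer's code: `build_tree` and `cache` are hypothetical names, and the input is assumed to be `head:subhead:message` lines.

```python
import xml.etree.ElementTree as ET


def build_tree(lines):
    """Build <root> from 'head:subhead:message' lines, reusing existing branches."""
    root = ET.Element("root")
    cache = {}                      # maps a tuple of tags (the path) to its element
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                # skip blank lines
        head, subhead, message = line.split(":", 2)

        branch1 = cache.get((head,))
        if branch1 is None:
            branch1 = cache[(head,)] = ET.SubElement(root, head)

        branch2 = cache.get((head, subhead))
        if branch2 is None:
            branch2 = cache[(head, subhead)] = ET.SubElement(branch1, subhead)

        branch2.text = message      # a repeated pair overwrites the earlier text
    return root


root = build_tree([
    "BRANCH1:branch11:message11\n",
    "BRANCH2:branch21:message21\n",
    "BRANCH1:branch12:message12\n",
    "BRANCH1:branch11:message11 again\n",
])
xml_text = ET.tostring(root, encoding="unicode")
```

Each lookup is a dictionary access rather than a scan of child elements, at the cost of keeping one cache entry per branch.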