2012-08-13
0

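XML parsing with ElementTree in Python

I am using Python and ElementTree to parse an XML file. I want to build a list of dictionaries holding the information for every CD, so that I can later query it, for example to print the titles of the CDs from the USA. The code below works, but it breaks easily if the YEAR tag is not the last tag inside a CD. How can I rewrite it so that the tags can appear in any order?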

from xml.etree.ElementTree import ElementTree 

f = open("cd_catalog.xml") 
tree = ElementTree() 
tree.parse(f) 

catalog = [] 
cd = {} 
for node in tree.iter():
    if node.tag != "CD" and node.tag != "CATALOG":
        tagtext = (node.tag, node.text),
        cd.update(tagtext)
    if node.tag == "YEAR":
        catalog.append(cd)
        cd = {}

for cd in catalog:
    if cd["COUNTRY"] == "USA":
        print("The cd named {0} is from USA".format(cd["TITLE"]))

The XML file, with 2 entries:

<CATALOG> 
    <CD> 
     <TITLE>Empire Burlesque</TITLE> 
     <ARTIST>Bob Dylan</ARTIST> 
     <COUNTRY>USA</COUNTRY> 
     <COMPANY>Columbia</COMPANY> 
     <PRICE>10.90</PRICE> 
     <YEAR>1985</YEAR> 
    </CD> 
    <CD> 
     <TITLE>Hide your heart</TITLE> 
     <ARTIST>Bonnie Tyler</ARTIST> 
     <COUNTRY>UK</COUNTRY> 
     <COMPANY>CBS Records</COMPANY> 
     <PRICE>9.90</PRICE> 
     <YEAR>1988</YEAR> 
    </CD> 
</CATALOG> 

Answers

2

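One way to rewrite your XML parsing code is shown below. In this example I define a generator that loops over all CD elements under the root element (I don't check that the root is a CATALOG element, although you could add that check). The generator yields the child elements of each CD element as a dictionary.

Using a generator is more efficient than building a list of dictionaries for all CD elements up front, especially if your XML file is very large, because only one CD dictionary is held in memory at a time. Note that the parsed tree itself is still loaded in full.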

import xml.etree.ElementTree as etree

def get_cd(element):
    """Yield one dictionary for every CD element below element."""
    try:
        for el in element.iter(tag='CD'):
            yield get_cd_info(el)
    except AttributeError:
        # Python < 2.7: Element.iter() is not available
        for el in element.getiterator(tag='CD'):
            yield get_cd_info(el)

def get_cd_info(element):
    """Look up each field by tag name, so the order of the tags does not matter."""
    return {'title': element.findtext('TITLE'),
            'artist': element.findtext('ARTIST'),
            'country': element.findtext('COUNTRY'),
            'company': element.findtext('COMPANY'),
            'price': element.findtext('PRICE'),
            'year': element.findtext('YEAR')}

Here is the approach above in action:

s = '''<CATALOG> 
    <CD> 
     <TITLE>Empire Burlesque</TITLE> 
     <ARTIST>Bob Dylan</ARTIST> 
     <COUNTRY>USA</COUNTRY> 
     <COMPANY>Columbia</COMPANY> 
     <PRICE>10.90</PRICE> 
     <YEAR>1985</YEAR> 
    </CD> 
    <CD> 
     <TITLE>Hide your heart</TITLE> 
     <ARTIST>Bonnie Tyler</ARTIST> 
     <COUNTRY>UK</COUNTRY> 
     <COMPANY>CBS Records</COMPANY> 
     <PRICE>9.90</PRICE> 
     <YEAR>1988</YEAR> 
    </CD> 
</CATALOG> 
''' 

e = etree.fromstring(s) 

for cd in get_cd(e): 
    if cd['country'] == 'USA': 
     print('The cd "{0}" is from the USA.'.format(cd['title'])) 

# prints 'The cd "Empire Burlesque" is from the USA.' 
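The memory argument made earlier for generators only covers the dictionaries, since `fromstring` still parses the whole document. If the file really is very large, one option is the standard library's `iterparse`, which is a different technique from the one used in this answer. The following is a minimal sketch, not a tested drop-in: `iter_cds` and the in-memory `sample` are names made up for illustration, and the sample deliberately puts YEAR first in one entry.

```python
import io
import xml.etree.ElementTree as etree

def iter_cds(source):
    """Yield one {tag: text} dict per CD element, parsing incrementally."""
    # An "end" event fires once an element and all its children are parsed.
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "CD":
            yield {child.tag: child.text for child in elem}
            elem.clear()  # discard the children of a CD we are done with

# Hypothetical sample; a filename such as "cd_catalog.xml" can be passed instead.
sample = io.BytesIO(b"""<CATALOG>
  <CD><YEAR>1985</YEAR><TITLE>Empire Burlesque</TITLE><COUNTRY>USA</COUNTRY></CD>
  <CD><TITLE>Hide your heart</TITLE><COUNTRY>UK</COUNTRY><YEAR>1988</YEAR></CD>
</CATALOG>""")

usa_titles = [cd["TITLE"] for cd in iter_cds(sample) if cd["COUNTRY"] == "USA"]
print(usa_titles)  # prints ['Empire Burlesque']
```

Because each dictionary is built from whatever children the CD element has, the order of the tags does not matter here either.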
1

Untested:

.... 
for cd in tree.findall('CD'):  # tag names are case-sensitive
    for node in cd:  # iterating an element yields its direct children
        print(node.tag)  # TITLE, ARTIST, PRICE etc.

.....
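To turn the idea of iterating over each CD's children into the list of dictionaries the question asks for, a sketch along the following lines might work. It assumes every CD is a direct child of the root and that no tag repeats within a CD; `load_catalog` is a hypothetical helper name, and the sample data is shortened with YEAR moved to the front of the first entry to show that the order is irrelevant.

```python
import xml.etree.ElementTree as etree

def load_catalog(root):
    """Return a list with one {tag: text} dict per CD child of root."""
    return [{child.tag: child.text for child in cd}
            for cd in root.findall("CD")]

# Shortened sample; YEAR is deliberately not the last tag in the first CD.
root = etree.fromstring("""<CATALOG>
  <CD><YEAR>1985</YEAR><TITLE>Empire Burlesque</TITLE><COUNTRY>USA</COUNTRY></CD>
  <CD><TITLE>Hide your heart</TITLE><COUNTRY>UK</COUNTRY><YEAR>1988</YEAR></CD>
</CATALOG>""")

catalog = load_catalog(root)
for cd in catalog:
    # .get() avoids a KeyError if a CD has no COUNTRY tag
    if cd.get("COUNTRY") == "USA":
        print("The cd named {0} is from USA".format(cd["TITLE"]))
```

Unlike a fixed set of keys, this keeps the original tag names (TITLE, COUNTRY, ...) and picks up any extra tags a CD might carry.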