2013-12-21 23 views
1

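I have inherited some XML that I need to process in Python. I am using xml.etree.cElementTree, and I am having trouble associating the text that follows an empty element with that empty element's tag. The real XML is much more complex than what I have pasted below, but I have simplified it to make the problem clearer (I hope!). How can I associate XML text with the preceding empty element in Python?

The result I would like is a dictionary like this:

Desired result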

{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'} 

The tuples could also contain strings (for example, ('9', '1')). I really don't mind either way at this early stage.

Here is the XML:

test1.xml

<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> <!-- The empty element --> 
     As they say, A student has usually three maladies: <!-- Here lies the trouble --> 
    <section num="2"/> <!-- Another empty element --> 
     poverty, itch, and pride. 
    </p> 
</div1> 

Here is what I have tried.

Attempt 1

>>> import xml.etree.cElementTree as ET 
>>> tree = ET.parse('test1.xml') 
>>> root = tree.getroot() 
>>> chapter = root.attrib['num'] 
>>> d = dict() 
>>> for p in root: 
    for section in p: 
     d[(int(chapter), int(section.attrib['num']))] = section.text 


>>> d 
{(9, 2): None, (9, 1): None} # This of course makes sense, since the elements are empty 

Attempt 2

>>> for p in root: 
    for section, text in zip(p, p.itertext()): # unfortunately, p and p.itertext() are two different lengths, which also makes sense 
     d[(int(chapter), int(section.attrib['num']))] = text.strip() 


>>> d 
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''} 

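As you can see in the second attempt, p and p.itertext() have different lengths. The value stored under (9, 2) is the one I am trying to associate with the key (9, 1), and the value I want under (9, 2) does not appear in d at all (because zip stops at the shorter input and drops the extra item from p.itertext()).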
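The length mismatch described above can be checked directly. The following is a minimal sketch, assuming Python 3 and `xml.etree.ElementTree` (which exposes the same API as `cElementTree`), with the XML inlined as a string and its comments left out for brevity:

```python
import xml.etree.ElementTree as ET

XML = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""

root = ET.fromstring(XML)
p = root[0]

children = list(p)          # the two empty <section/> elements
texts = list(p.itertext())  # every non-empty text chunk inside <p>

print(len(children), len(texts))

# The first chunk is only the whitespace between <p> and the first
# <section/>, so zip() pairs each section with the text *before* it.
for chunk in texts:
    print(repr(chunk.strip()))
```

Because the leading whitespace counts as a text chunk of its own, every pairing produced by zip is shifted by one position.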

Any help would be appreciated. Thanks in advance.

Answers

1

Have you tried using .tail?

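# cElementTree was removed in Python 3.9; ElementTree has the same API
import xml.etree.ElementTree as ET

txt = """<div1 type="chapter" num="9">
     <p>
      <section num="1"/> <!-- The empty element -->
      As they say, A student has usually three maladies: <!-- Here lies the trouble -->
      <section num="2"/> <!-- Another empty element -->
      poverty, itch, and pride.
     </p>
     </div1>"""
root = ET.fromstring(txt)
for p in root:      # each <p> inside the chapter
    for s in p:     # each empty <section/> inside the <p>
        # .tail is the text that follows the element's end tag
        print(s.attrib['num'], s.tail)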
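Building on the `.tail` idea, the dictionary from the question could be assembled roughly as follows. This is a minimal sketch, assuming Python 3 and `xml.etree.ElementTree`; the helper name `sections_to_dict` is hypothetical, and the XML is inlined without its comments for brevity:

```python
import xml.etree.ElementTree as ET

XML = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""


def sections_to_dict(root):
    """Map (chapter, section) to the text following each empty <section/>.

    Hypothetical helper, not part of ElementTree.
    """
    chapter = int(root.attrib['num'])
    result = {}
    for section in root.iter('section'):
        # .tail is None when no text follows the element, hence `or ''`.
        text = (section.tail or '').strip()
        result[(chapter, int(section.attrib['num']))] = text
    return result


d = sections_to_dict(ET.fromstring(XML))
print(d)
```

Using `root.iter('section')` rather than two nested loops is a design choice for the sketch: it finds the sections at any depth, which may or may not suit the more complex real document.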

Brilliant. Works like a charm. Thanks. – user3079064

0

I would use BeautifulSoup for this:

from bs4 import BeautifulSoup 

html_doc = """<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> 
     As they say, A student has usually three maladies: 
    <section num="2"/> 
     poverty, itch, and pride. 
    </p> 
</div1>""" 

soup = BeautifulSoup(html_doc) 

result = {} 
for chapter in soup.find_all(type='chapter'): 
    for section in chapter.find_all('section'): 
     result[(chapter['num'], section['num'])] = section.next_sibling.strip() 

import pprint 
pprint.pprint(result) 

This prints:

{(u'9', u'1'): u'As they say, A student has usually three maladies:', 
(u'9', u'2'): u'poverty, itch, and pride.'} 
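The keys in this output are strings, whereas the desired result in the question uses integers. If integer keys are preferred, one possible conversion is sketched below in plain Python 3; the `result` literal simply restates the printed output, and `int_keyed` is a name chosen for illustration:

```python
# Restates the dictionary printed above (string keys).
result = {
    ('9', '1'): 'As they say, A student has usually three maladies:',
    ('9', '2'): 'poverty, itch, and pride.',
}

# Convert both parts of each key to int, leaving the text untouched.
int_keyed = {
    (int(chapter), int(section)): text
    for (chapter, section), text in result.items()
}

print(int_keyed)
```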