How do I associate XML text with the preceding empty element in Python?

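I've inherited some XML that I need to process in Python. I'm using xml.etree.cElementTree, and I'm having trouble associating the text that follows an empty element with that empty element's tag. The real XML is much more complicated than what I've pasted below, but I've simplified it to make the problem clearer (I hope!).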

I would like to end up with a dictionary like this:

Desired result

{(9, 1): 'As they say, A student has usually three maladies:', (9, 2): 'poverty, itch, and pride.'}

The tuples could also contain strings (e.g., ('9', '1')). I really don't care at this early stage.

Here is the XML:

test1.xml

<div1 type="chapter" num="9"> 
    <p> 
    <section num="1"/> <!-- The empty element --> 
     As they say, A student has usually three maladies: <!-- Here lies the trouble --> 
    <section num="2"/> <!-- Another empty element --> 
     poverty, itch, and pride. 
    </p> 
</div1>

What I have tried

Attempt 1

>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('test1.xml')
>>> root = tree.getroot()
>>> chapter = root.attrib['num']
>>> d = dict()
>>> for p in root:
...     for section in p:
...         d[(int(chapter), int(section.attrib['num']))] = section.text
...
>>> d
{(9, 2): None, (9, 1): None} # This of course makes sense, since the elements are empty

Attempt 2

>>> for p in root:
...     for section, text in zip(p, p.itertext()): # unfortunately, p and p.itertext() are two different lengths, which also makes sense
...         d[(int(chapter), int(section.attrib['num']))] = text.strip()
...
>>> d
{(9, 2): 'As they say, A student has usually three maladies:', (9, 1): ''}

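As you can see in the second attempt, p and p.itertext() have different lengths. The value stored under (9, 2) is the one I'm trying to associate with the key (9, 1), and the value I want under (9, 2) doesn't appear in d at all (because zip stops at the end of the shorter p and drops the last item of the longer p.itertext()).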
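To make the mismatch concrete, here is a small sketch that lists what each iterable yields for the simplified XML above. It assumes Python 3 and `xml.etree.ElementTree` (rather than `cElementTree`); the names `xml_text`, `children`, and `chunks` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified XML from the question, without the explanatory comments.
xml_text = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""

root = ET.fromstring(xml_text)
p = root.find('p')

# Iterating over p yields only its child elements (the two <section/> tags).
children = list(p)

# itertext() yields p's own leading text first (whitespace only here),
# then the text trailing each child, so it has one more item than p.
chunks = [t.strip() for t in p.itertext()]

print(len(children), len(chunks))
print(chunks)
```

Because the first chunk is the whitespace before the first `<section/>`, zipping the two sequences shifts every piece of text one section too late.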

Any help would be appreciated. Thanks in advance.

Source

2013-12-21 user3079064

Have you tried using .tail?

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

txt = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/> <!-- The empty element -->
     As they say, A student has usually three maladies: <!-- Here lies the trouble -->
    <section num="2"/> <!-- Another empty element -->
     poverty, itch, and pride.
    </p>
</div1>"""
root = ET.fromstring(txt)
for p in root:
    for s in p:
        # .tail is the text that follows the element's end tag;
        # it still includes the surrounding whitespace here
        print(s.attrib['num'], s.tail)
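As a minimal sketch of how the `.tail` idea could be turned into the dictionary the question asks for (assuming Python 3, a single chapter as the root element, and integer keys), something like the following might work. The helper name `sections_by_tail` and the variable `xml_text` are hypothetical and do not come from the original posts.

```python
import xml.etree.ElementTree as ET

# Simplified XML from the question, without the explanatory comments.
xml_text = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""


def sections_by_tail(text):
    """Map (chapter, section) to the text following each empty <section/>."""
    root = ET.fromstring(text)
    chapter = int(root.attrib['num'])  # assumes the root element is the chapter
    result = {}
    for section in root.iter('section'):
        # .tail is None when nothing follows the element, so fall back to ''
        tail = (section.tail or '').strip()
        result[(chapter, int(section.attrib['num']))] = tail
    return result


result = sections_by_tail(xml_text)
print(result)
```

Using `root.iter('section')` avoids hard-coding the nesting depth, which may help if the real XML is more deeply nested than the simplified sample.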

Source

2013-12-21 21:48:45 ChrisP

Brilliant. Works like a charm. Thanks. – user3079064

I would use BeautifulSoup for this:

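import pprint

from bs4 import BeautifulSoup

html_doc = """<div1 type="chapter" num="9">
    <p>
    <section num="1"/>
     As they say, A student has usually three maladies:
    <section num="2"/>
     poverty, itch, and pride.
    </p>
</div1>"""

# Name the parser explicitly to avoid a warning from newer bs4 versions;
# html.parser ships with Python and treats <section .../> as an empty tag.
soup = BeautifulSoup(html_doc, 'html.parser')

result = {}
for chapter in soup.find_all(type='chapter'):
    for section in chapter.find_all('section'):
        # next_sibling is the text node that follows the empty element
        result[(chapter['num'], section['num'])] = section.next_sibling.strip()

pprint.pprint(result)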

This prints:

{(u'9', u'1'): u'As they say, A student has usually three maladies:', 
(u'9', u'2'): u'poverty, itch, and pride.'}

Source

2013-12-21 21:59:39 jterrace
