Python lxml iterparse with nested elements

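I want to retrieve the content of a specific element inside an XML file. However, that element contains other XML elements, and they prevent the content of the parent tag from being extracted correctly. An example: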

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)

This results in:

a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and; 
None

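However, the text 'A protective uniform for use ..', for example, is missed. It seems that every 'claim-text' element that contains other inner elements is skipped. How should I change the parsing of the XML to get all the claims?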
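For context, a minimal sketch of where the missing text ends up. lxml stores text that follows a child element in that child's .tail, not in the parent's .text, so the outer claim-text, which starts directly with <b>, has .text set to None. The shortened XML below is a hypothetical stand-in for the claim above.

```python
from lxml import etree

# Shortened, hypothetical stand-in for the claim XML in the question
root = etree.fromstring(
    "<claim-text><b>2</b>. A protective uniform: "
    "<claim-text>a. an upper body garment</claim-text></claim-text>"
)

bold = root[0]  # the <b> child of the outer <claim-text>

print(repr(root.text))  # None: no text sits between <claim-text> and <b>
print(repr(bold.text))  # '2'
print(repr(bold.tail))  # '. A protective uniform: '
```

Printing only element.text therefore shows the two inner items and None for the outer claim.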

Thanks

I have just solved the problem with a 'plain' SAX parser approach:

class SimpleXMLHandler(object):
    """SAX-style handler that buffers the text found inside <claim-text>."""

    def __init__(self):
        self.buffer = ''
        self.claim = 0  # 1 while inside a <claim-text> element

    def start(self, tag, attributes):
        if tag == 'claim-text':
            if self.claim == 0:
                self.buffer = ''
            self.claim = 1

    def data(self, data):
        if self.claim == 1:
            self.buffer += data

    def end(self, tag):
        if tag == 'claim-text':
            print(self.buffer)
            self.claim = 0

    def close(self):
        pass
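The handler above resets its flag at every closing claim-text, including nested ones. A possible variant, sketched under the assumption that the handler is meant to be used as an lxml parser target through etree.XMLParser(target=...), counts the nesting depth so that each outermost claim yields one string. The class name ClaimCollector and the shortened XML are hypothetical.

```python
from lxml import etree


class ClaimCollector(object):
    """Hypothetical parser target: one string per outermost <claim-text>."""

    def __init__(self):
        self.depth = 0    # number of currently open <claim-text> elements
        self.parts = []   # text chunks of the claim being read
        self.claims = []  # finished claims

    def start(self, tag, attrib):
        if tag == 'claim-text':
            self.depth += 1

    def data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                # The outermost claim just closed: store its full text
                self.claims.append(''.join(self.parts))
                self.parts = []

    def close(self):
        # Whatever close() returns becomes the result of the parse call
        return self.claims


# Shortened, hypothetical stand-in for the claim XML in the question
xml = ("<test><claim-text><b>2</b>. A uniform comprising: "
       "<claim-text>a. an upper garment</claim-text> "
       "<claim-text>b. panels</claim-text></claim-text></test>")

parser = etree.XMLParser(target=ClaimCollector())
claims = etree.XML(xml, parser)
print(claims)
```

No element tree is built in this mode, so memory use stays small for large files.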


2011-04-20 labrassbandito

You could use XPath to find and concatenate all the text nodes directly under each <claim-text> node, like this:

from io import BytesIO
from lxml import etree
xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Note: on 'start' events lxml does not guarantee that an element's text and
# children have been parsed yet; it works for this small input.
context = etree.iterparse(BytesIO(xml), events=('start',), tag='claim-text')
for event, element in context:
    # text() selects only the text nodes that are direct children of the element
    print(''.join(element.xpath('text()')))

which outputs:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: 
a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
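Note that the leading "2" is absent from the output because it sits inside <b> and is not a direct text node of the claim. A hedged variant of the same idea, listening for 'end' events (the only point at which lxml guarantees an element is fully parsed) and comparing text() with itertext(), which also descends into child elements. The shortened XML and the list names are hypothetical.

```python
from io import BytesIO
from lxml import etree

# Shortened, hypothetical stand-in for the claim XML above
xml = (b"<test><claim-text><b>2</b>. A uniform comprising: "
       b"<claim-text>a. an upper garment</claim-text> "
       b"<claim-text>b. panels</claim-text></claim-text></test>")

direct_texts = []  # only text nodes directly under each <claim-text>
full_texts = []    # all text, including text inside child elements

context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    direct_texts.append(''.join(element.xpath('text()')).strip())
    full_texts.append(''.join(element.itertext()))

print(direct_texts)
print(full_texts)
```

With 'end' events the inner claims are reported before the outer one, so the order differs from the output above.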


2011-04-21 00:52:46 jsw
