2011-04-20

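Python lxml iterparse with nested elements

I want to retrieve the content of specific elements in an XML file. However, those elements contain other XML elements, and these prevent the text inside the parent tag from being extracted correctly. An example: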

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)

This prints:

a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and; 
None 

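However, the text 'A protective uniform for use ..', for example, is missed. It seems that every 'claim-text' element that contains other inner elements is ignored. How should I change the parsing of the XML to get all the claims?

Thanks

I just solved the problem with an "ordinary" SAX-style parser approach: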

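class SimpleXMLHandler(object):

    def __init__(self):
        self.buffer = ''   # text collected for the current claim
        self.claim = 0     # 1 while text is being buffered, else 0

    def start(self, tag, attributes):
        # Start a fresh buffer when a claim-text opens and none is being read
        if tag == 'claim-text':
            if self.claim == 0:
                self.buffer = ''
            self.claim = 1

    def data(self, data):
        # Collect character data while buffering is switched on
        if self.claim == 1:
            self.buffer += data

    def end(self, tag):
        # Print the buffered text whenever a claim-text closes
        if tag == 'claim-text':
            print(self.buffer)
            self.claim = 0

    def close(self):
        pass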
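
The handler above uses a 0/1 flag, so the first closing claim-text tag switches buffering off even if an outer claim is still open. One possible variant, sketched here under the assumption that each top-level claim should be collected once together with its nested text, counts the nesting depth instead. The class name `ClaimCollector` and the shortened sample XML are made up for illustration. The sketch uses the standard library's `xml.etree.ElementTree.XMLParser`; lxml's `etree.XMLParser` accepts a target object with the same methods.

```python
from xml.etree.ElementTree import XMLParser

# Shortened, hypothetical sample in the same shape as the question's XML
XML = b"""<test><claim-text><b>2</b>. A protective uniform, comprising: <claim-text>a. an upper body garment</claim-text> <claim-text>b. a lower body garment</claim-text></claim-text></test>"""


class ClaimCollector(object):
    """Parser target that gathers the full text of each top-level claim-text."""

    def __init__(self):
        self.depth = 0     # number of claim-text elements currently open
        self.parts = []    # text chunks of the claim being read
        self.claims = []   # finished claims

    def start(self, tag, attrib):
        if tag == 'claim-text':
            if self.depth == 0:
                self.parts = []   # a new top-level claim begins
            self.depth += 1

    def data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:   # the top-level claim is complete
                self.claims.append(''.join(self.parts))

    def close(self):
        return self.claims


parser = XMLParser(target=ClaimCollector())
parser.feed(XML)
claims = parser.close()   # close() hands back whatever the target's close() returns

for claim in claims:
    print(claim)
```

With this variant the nested sub-claims end up inside the text of their parent claim rather than being printed separately, which may or may not be what is wanted.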

Answers

You can use XPath to find and concatenate all the text nodes that are direct children of each <claim-text> node, like this:

from io import BytesIO
from lxml import etree

xml = b'''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Note: on 'start' events lxml does not guarantee that an element's text and
# children have been parsed yet; it works here because the document is small.
context = etree.iterparse(BytesIO(xml), events=('start',), tag='claim-text')
for event, element in context:
    # text() selects only the text nodes directly under the element
    print(''.join(element.xpath('text()')))

which outputs:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: 
a. an upper body garment and a separate lower body garment 
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
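
To illustrate why `element.text` came back as `None` for the outer claim, here is a small sketch with a shortened, made-up sample: in lxml the text that follows a child element is stored as that child's `.tail`, not in the parent's `.text`. The sketch also uses 'end' events, which assumes you want each element fully parsed before reading it, so the outer claim is reported last rather than first.

```python
from io import BytesIO
from lxml import etree

# Shortened, hypothetical sample in the same shape as the question's XML
XML = b"""<test><claim-text><b>2</b>. A uniform comprising: <claim-text>a. an upper garment</claim-text> <claim-text>b. a lower garment</claim-text></claim-text></test>"""

# Where the "missing" sentence lives in the tree
root = etree.fromstring(XML)
outer = root[0]                # the outer <claim-text>
print(repr(outer.text))        # nothing precedes <b>, so .text is None
print(repr(outer[0].tail))     # the sentence is the tail of <b>

# Collect each element's own text nodes once the element is complete
own_text = []
for event, element in etree.iterparse(BytesIO(XML), events=('end',), tag='claim-text'):
    # text() returns the text nodes that are direct children of the element
    own_text.append(''.join(element.xpath('text()')).strip())

for line in own_text:
    print(line)
```

Note that the claim number "2" sits inside <b> and is therefore not a direct text node of the claim; `''.join(element.itertext())` would include it, along with the text of the nested claims.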