2016-07-14 87 views
-1

I want to parse this simple document received from EPO-OPS. Why is XML parsing so difficult?

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?> 
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink"> 
    <ops:meta name="elapsed-time" value="2"/> 
    <exchange-documents> 
     <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1"> 
      <abstract lang="en"> 
       <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p> 
      </abstract> 
     </exchange-document> 
    </exchange-documents> 
</ops:world-patent-data> 

This is what I do:

import xml.etree.ElementTree as ET
root = ET.parse('pyre.xml').getroot()
# Walk down four levels: exchange-documents > exchange-document > abstract > p
for child in root:
    for kid in child:
        for abst in kid:
            for p in abst:
                print(p.text)

Is there a simple way to do this, similar to JSON, such as:

print (root.exchange-documents.exchange-document.abstract.p.text) 

Answers

2

It is much easier with BeautifulSoup. Try this:

from bs4 import BeautifulSoup 

xml = """<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?> 
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink"> 
    <ops:meta name="elapsed-time" value="2"/> 
    <exchange-documents> 
     <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1"> 
      <abstract lang="en"> 
       <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p> 
      </abstract> 
     </exchange-document> 
    </exchange-documents> 
</ops:world-patent-data>""" 

The "long" solution:

# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(xml, 'html.parser')
for sub_cell_tag in soup.find_all('abstract'):
    print(sub_cell_tag.text)

If you are into one-liners:

print('\n'.join(i.text for i in BeautifulSoup(xml, 'html.parser').find_all('abstract')))
+0

Is this the BeautifulSoup you mean: https://pypi.python.org/pypi/beautifulsoup4? – Rahul

+0

Yes, that's the one. – poke

+0

Yes, you can find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ –

2

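You can use XPath expressions with ElementTree. Note that because the document declares a default XML namespace via xmlns, you need to specify its URL in the query: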

from xml.etree import ElementTree

tree = ElementTree.parse(…)

namespaces = { 'ns': 'http://www.epo.org/exchange' } 
paragraphs = tree.findall('.//ns:abstract/ns:p', namespaces) 
for paragraph in paragraphs: 
    print(paragraph.text) 
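
For reference, here is a self-contained sketch of the same idea that runs without a file on disk. The `SAMPLE` string is a shortened stand-in for the OPS response (the abstract text is abbreviated), and the names `SAMPLE`, `NS`, `first_abstract` and `all_paragraphs` are made up for this illustration. The first query spells out the full path from the root, which is about as close as ElementTree gets to the dotted access asked for in the question:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the OPS response; the abstract text is abbreviated.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <ops:meta name="elapsed-time" value="2"/>
  <exchange-documents>
    <exchange-document country="EP" doc-number="1000000" kind="A1">
      <abstract lang="en">
        <p>The invention relates to an apparatus (1) for manufacturing green bricks.</p>
      </abstract>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>"""

# Map a prefix of our own choosing to the document's default namespace URL.
NS = {"ns": "http://www.epo.org/exchange"}

# Parse from bytes so the encoding declaration in the prolog is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))

# Full path from the root element, one step per nesting level.
p = root.find("ns:exchange-documents/ns:exchange-document/ns:abstract/ns:p", NS)
first_abstract = p.text if p is not None else None
print(first_abstract)

# Or collect every abstract paragraph anywhere in the tree.
all_paragraphs = [el.text for el in root.findall(".//ns:abstract/ns:p", NS)]
```

`find` returns `None` when nothing matches, so the check before reading `.text` avoids an `AttributeError` on responses that carry no abstract.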
+0

Can't we get rid of the namespaces by using getroot()? – Rahul

+0

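No, ElementTree has namespaces built into its core and will (correctly) respect them. You can strip the namespaces after parsing, [as discussed in this answer](http://stackoverflow.com/a/25920989/216074), but there is no built-in solution for ignoring them. – poke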
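
A minimal sketch of stripping namespaces after parsing, in the spirit of the comment above. It is not necessarily the code from the linked answer; `strip_namespaces` and `SAMPLE` are hypothetical names, and the sample document is heavily abbreviated. Since this comment was written, Python 3.8 added a `{*}` wildcard to ElementTree path expressions, shown first here as an alternative that leaves the tree untouched:

```python
import xml.etree.ElementTree as ET

# Heavily abbreviated stand-in for the OPS response.
SAMPLE = b"""<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <ops:meta name="elapsed-time" value="2"/>
  <exchange-documents>
    <exchange-document country="EP">
      <abstract lang="en"><p>Abbreviated abstract text.</p></abstract>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>"""


def strip_namespaces(root):
    """Remove the '{uri}' prefix from every element tag, in place."""
    for element in root.iter():
        # Parsed tags look like '{http://www.epo.org/exchange}abstract'.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = ET.fromstring(SAMPLE)

# Python 3.8+ only: '{*}' matches the tag in any namespace.
wildcard_texts = [p.text for p in root.findall(".//{*}abstract/{*}p")]

# After stripping, plain tag names work in path expressions.
strip_namespaces(root)
plain_texts = [p.text for p in root.findall(".//abstract/p")]
print(plain_texts)
```

Stripping discards information: elements from different namespaces that share a local name become indistinguishable, and attribute names are left as they were. For a document like this one that is probably acceptable, but it is a trade-off rather than a free simplification.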