Parsing a large XML file with Python - etree.parse error

I am trying to parse the following XML file using the lxml.etree.iterparse function.

“sampleoutput.xml”

<item> 
    <title>Item 1</title> 
    <desc>Description 1</desc> 
</item> 
<item> 
    <title>Item 2</title> 
    <desc>Description 2</desc> 
</item>

I am trying to use the code from Parsing Large XML file with Python lxml and Iterparse.

Before the etree.iterparse(MYFILE) call in that code, I did MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml", "r")

But it produces the following error:

Traceback (most recent call last): 
    File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in <module> 
    for event, elem in context : 
    File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:98565) 
    File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:99086) 
    File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712) 
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 5, column 1

Any ideas? Thanks!

2012-07-09 ejang

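Could it be that your XML file is not well-formed? It contains neither an '<?xml' declaration nor a root element. – C0deH4cker 2012-07-09 04:33:36

I don't know lxml, but your example is not valid XML. An XML document must have a root element, and yours does not. – 2012-07-09 04:35:06

You need a root element, not just child nodes. – pinkdawn 2012-07-09 05:39:11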

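The problem is that XML is not well-formed unless it has exactly one top-level tag. You can fix your sample by wrapping the entire document in <items></items> tags. You also need to rename the <desc/> tags to <description/> so that they match the query you are using (description).

The following file produces correct results with your existing code: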

<items>
    <item>
        <title>Item 1</title>
        <description>Description 1</description>
    </item>
    <item>
        <title>Item 2</title>
        <description>Description 2</description>
    </item>
</items>
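
For illustration, here is a minimal sketch of iterating over the wrapped file. It is not the code from the linked question: it assumes the standard library's xml.etree.ElementTree.iterparse (lxml.etree.iterparse is called in a similar way), and the helper name read_items and the in-memory WRAPPED sample are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# The wrapped sample from above, held in memory so the sketch is self-contained.
WRAPPED = b"""<items>
    <item>
        <title>Item 1</title>
        <description>Description 1</description>
    </item>
    <item>
        <title>Item 2</title>
        <description>Description 2</description>
    </item>
</items>"""


def read_items(source):
    """Yield (title, description) pairs, one per <item> (hypothetical helper)."""
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title"), elem.findtext("description")
            elem.clear()  # drop the children that are no longer needed


items = list(read_items(io.BytesIO(WRAPPED)))
print(items)
```

With a real file, pass the filename or an open binary file object instead of the BytesIO wrapper.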

2012-07-09 05:01:29 sblom

What if the file is so large that I don't want to load it into memory, which is why I'm parsing it with iterparse? – 2017-01-18 20:05:53
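
One possible way to handle this without rewriting the file or loading it whole is to feed the parser a synthetic root tag, then the file in chunks, then the closing tag. The sketch below is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree.XMLPullParser rather than lxml, the function name iter_rootless_items and the wrapper tag "items" are invented here, and it assumes the file has no <?xml ...?> declaration (one appearing after the synthetic root would be a syntax error).

```python
import io
import xml.etree.ElementTree as ET


def iter_rootless_items(stream, chunk_size=64 * 1024, wrapper="items"):
    """Yield (title, desc) pairs from a binary stream of sibling <item> elements.

    Hypothetical helper: the stream is read chunk by chunk and wrapped in a
    synthetic root element, so the whole file is never held in memory at once.
    """
    parser = ET.XMLPullParser(events=("end",))
    parser.feed(f"<{wrapper}>".encode())  # synthetic opening root tag

    def drain():
        # Hand out every element completed by the data fed so far.
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                yield elem.findtext("title"), elem.findtext("desc")
                elem.clear()  # free the children of a finished item

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        yield from drain()

    parser.feed(f"</{wrapper}>".encode())  # synthetic closing root tag
    yield from drain()
    parser.close()


sample = io.BytesIO(
    b"<item>\n    <title>Item 1</title>\n    <desc>Description 1</desc>\n</item>\n"
    b"<item>\n    <title>Item 2</title>\n    <desc>Description 2</desc>\n</item>\n"
)
pairs = list(iter_rootless_items(sample, chunk_size=16))
print(pairs)
```

The emptied <item> elements stay attached to the synthetic root, so for a very large file it may still be worth clearing the root periodically or fixing the file itself.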

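As far as I know, xml.etree.ElementTree normally expects an XML file to contain a single "root" element, i.e. one XML tag that encloses the complete document structure. From the error message you posted, I would assume that this is the problem here too:

'line 5' refers to the second <item> tag, so I guess Python is complaining that there is more data after the assumed root element (i.e. the first <item> tag) has been closed.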
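
A small sketch to check this reading, assuming the standard library parser behaves like lxml here (the exact error wording differs between the two). It parses the rootless sample from the question and reports where the parser gives up; the name ROOTLESS is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample from the question: two sibling <item> elements and no root.
ROOTLESS = """<item>
    <title>Item 1</title>
    <desc>Description 1</desc>
</item>
<item>
    <title>Item 2</title>
    <desc>Description 2</desc>
</item>"""

try:
    ET.fromstring(ROOTLESS)
    line = None  # not reached for this input
except ET.ParseError as exc:
    # exc.position is a (line, column) tuple for the point of failure.
    line, column = exc.position
    print(f"parse failed at line {line}: {exc}")
```

The reported line is the one holding the second <item> tag, which is where the content after the first closed element begins.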

2012-07-09 04:39:49
