2015-07-13 78 views

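I need to parse a very large XML file in Scrapy. With input like the sample below I get a SAXParseException, "not well-formed (invalid token)", and I can't resolve it.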

<Result> 
    <Node> 
     <browseNodeId>306533011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">temperature-controllers</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Temperature Controllers</browseNodeName> 
     <browseNodeStoreContextName>Temperature Controllers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,306533011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Controllers</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>TEMPERATURE_CONTROLLER</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>9931457011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">industrial-and-scientific-temperature-indicators</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Temperature Indicators</browseNodeName> 
     <browseNodeStoreContextName>Temperature Indicators</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,9931457011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Indicators</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>5006547011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">industrial-temperature-sensors</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Temperature Probes & Sensors</browseNodeName> 
     <browseNodeStoreContextName>Temperature Probes & Sensors</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,5006547011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Probes & Sensors</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>9931455011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">thermal-imagers</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Thermal Imagers</browseNodeName> 
     <browseNodeStoreContextName>Thermal Imagers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,9931455011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermal Imagers</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>393280011</browseNodeId> 
     <browseNodeAttributes count="0"/> 
     <browseNodeName>Thermometers</browseNodeName> 
     <browseNodeStoreContextName>Thermometers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,393280011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers</browsePathByName> 
     <hasChildren>true</hasChildren> 
     <childNodes count="4"> 
      <id>393282011</id> 
      <id>393284011</id> 
      <id>393283011</id> 
      <id>9931459011</id> 
     </childNodes> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>393282011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">industrial-and-scientific-dial-thermometers</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Dial Thermometers</browseNodeName> 
     <browseNodeStoreContextName>Dial Thermometers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,393280011,393282011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Dial Thermometers</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>393284011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">science-lab-digital-thermometers</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Digital Thermometers</browseNodeName> 
     <browseNodeStoreContextName>Lab Digital Thermometers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,393280011,393284011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Digital Thermometers</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>LAB_SUPPLY</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>393283011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">industrial-and-scientific-glass-thermometers</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Glass Thermometers</browseNodeName> 
     <browseNodeStoreContextName>Glass Thermometers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,393280011,393283011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Glass Thermometers</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
    <Node> 
     <browseNodeId>9931459011</browseNodeId> 
     <browseNodeAttributes count="1"> 
      <attribute name="item_type_keyword">infrared-thermometers</attribute> 
     </browseNodeAttributes> 
     <browseNodeName>Infrared Thermometers</browseNodeName> 
     <browseNodeStoreContextName>Infrared Thermometers</browseNodeStoreContextName> 
     <browsePathById>16310091,16310161,256409011,5006566011,393280011,9931459011</browsePathById> 
     <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Infrared Thermometers</browsePathByName> 
     <hasChildren>false</hasChildren> 
     <childNodes count="0"/> 
     <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> 
     <refinementsInformation count="0"/> 
    </Node> 
</Result> 

This gives me the error xml.sax._exceptions.SAXParseException: nodes.xml:11:38: not well-formed (invalid token). Because the XML file is very large, replacing every & by hand is not an option.

I haven't implemented this with Scrapy yet; a simple reference class is given below instead. How can I get past this problem without replacing every & symbol?

import xml.sax


class ABContentHandler(xml.sax.ContentHandler):
    """Prints every SAX event it receives."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)

    def startElement(self, name, attrs):
        print("startElement '" + name + "'")
        if name == "address":
            print("\tattribute type='" + attrs.getValue("type") + "'")

    def endElement(self, name):
        print("endElement '" + name + "'")

    def characters(self, content):
        print("characters '" + content + "'")


def main(sourceFileName):
    # Binary mode lets the parser detect the encoding itself;
    # the with-block closes the file even if parsing raises.
    with open(sourceFileName, "rb") as source:
        xml.sax.parse(source, ABContentHandler())


if __name__ == "__main__":
    main("nodes.xml")

Output

startElement 'Result' 
characters ' 
' 
characters ' ' 
startElement 'Node' 
characters ' 
' 
characters '  ' 
startElement 'browseNodeId' 
characters '306533011' 
endElement 'browseNodeId' 
characters ' 
' 
characters '  ' 
startElement 'browseNodeAttributes' 
characters ' 
' 
characters '   ' 
startElement 'attribute' 
characters 'temperature-controllers' 
endElement 'attribute' 
characters ' 
' 
characters '  ' 
endElement 'browseNodeAttributes' 
characters ' 
' 
characters '  ' 
startElement 'browseNodeName' 
characters 'Temperature Controllers' 
endElement 'browseNodeName' 
characters ' 
' 
characters '  ' 
startElement 'browseNodeStoreContextName' 
characters 'Temperature Controllers' 
endElement 'browseNodeStoreContextName' 
characters ' 
' 
characters '  ' 
Traceback (most recent call last): 
    File "/home/gtac/sax/parser.py", line 26, in <module> 
    main("nodes.xml") 
    File "/home/gtac/sax/parser.py", line 23, in main 
    xml.sax.parse(source, ABContentHandler()) 
    File "/usr/lib/python2.7/xml/sax/__init__.py", line 33, in parse 
    parser.parse(source) 
    File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse 
    xmlreader.IncrementalParser.parse(self, source) 
    File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse 
    self.feed(buffer) 
    File "/usr/lib/python2.7/xml/sax/expatreader.py", line 214, in feed 
    self._err_handler.fatalError(exc) 
    File "/usr/lib/python2.7/xml/sax/handler.py", line 38, in fatalError 
    raise exception 
xml.sax._exceptions.SAXParseException: nodes.xml:11:38: not well-formed (invalid token) 
startElement 'browsePathById' 
characters '16310091,16310161,256409011,5006566011,306533011' 
endElement 'browsePathById' 
characters ' 
' 
characters '  ' 
startElement 'browsePathByName' 
characters 'Industrial ' 

Process finished with exit code 1 

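Why do you call it an XML file when it clearly isn't one, given the bare & characters? You can't expect an XML parser to parse something that is not well-formed XML. –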

Answer


The error message gives the line and column where the problem is. It is the bare & in

<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Controllers</browsePathByName> 

This is not valid XML because the & stands on its own: an & starts an entity reference.
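To confirm where the parser gives up without reading a full traceback, the position can be read from the exception itself. The following is a minimal sketch: find_first_error and the two-line sample are made up for illustration, while SAXParseException.getLineNumber(), getColumnNumber() and getMessage() are part of the standard xml.sax module.

```python
import io
import xml.sax


def find_first_error(xml_bytes):
    """Return (line, column, message) of the first fatal error, or None."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here.
        xml.sax.parse(io.BytesIO(xml_bytes), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


# Hypothetical sample with a bare ampersand on its second line.
sample = b"<Result>\n  <name>Industrial & Scientific</name>\n</Result>"
print(find_first_error(sample))
```

The same call returns None for input in which the ampersand is written as &amp;.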

From the W3C XML Recommendation, section 2.4 Character Data and Markup:

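The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.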
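On the producing side, the rule quoted above is what xml.sax.saxutils.escape implements for character data. The sketch below is illustrative only: text_element is a hypothetical helper, not something from the question, and it assumes the tag name itself is already a valid XML name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_element(tag, text):
    """Hypothetical helper: wrap text in <tag>...</tag> with &, < and > escaped."""
    return "<{0}>{1}</{0}>".format(tag, escape(text))


element = text_element("browseNodeName", "Temperature Probes & Sensors")
print(element)
# <browseNodeName>Temperature Probes &amp; Sensors</browseNodeName>

# A conforming parser turns the reference back into a plain ampersand.
print(ET.fromstring(element).text)
# Temperature Probes & Sensors
```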

The correct fix is to tell the authors of the XML that their output is invalid and that they must fix it.

Otherwise, you have to preprocess the text first and replace every bare & with &amp;.
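For a large file, that replacement does not have to be done by hand or in memory. One possible sketch reads the file line by line, escapes any & that does not already start an entity or character reference, and feeds the result to an incremental SAX parser. The names BARE_AMP, escape_bare_ampersands, parse_lenient and TextCollector are made up for this example. It assumes no reference is split across a line break and that the file has no CDATA sections or comments containing a literal &, which would be escaped as well. Text such as "AT&T;" would be mistaken for an entity reference and left alone.

```python
import io
import re
import xml.sax

# An "&" that is NOT followed by "name;", "#123;" or "#x1F;".
BARE_AMP = re.compile(br"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(line):
    """Replace stray ampersands in one line of bytes with &amp;."""
    return BARE_AMP.sub(b"&amp;", line)


def parse_lenient(stream, handler):
    """Feed a binary stream to a SAX parser one repaired line at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in stream:
        parser.feed(escape_bare_ampersands(line))
    parser.close()


class TextCollector(xml.sax.ContentHandler):
    """Collects character data so the result can be inspected."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


sample = io.BytesIO(
    b"<Result>\n"
    b"<browseNodeName>Temperature Probes & Sensors</browseNodeName>\n"
    b"</Result>\n"
)
collector = TextCollector()
parse_lenient(sample, collector)
print("".join(collector.chunks).strip())
```

For a real file the stream would come from open("nodes.xml", "rb"), with the question's ABContentHandler as the handler. Only one line is held in memory at a time, so the 53 MB size should not matter. This works around broken input rather than fixing it, so reporting the problem to the producer of the file is still the proper remedy.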


But is there any other way to do this? I'm dealing with a 53 MB file. – pnv


No. Any XML tool **must** fail on this input, otherwise it can't be trusted. – Mark