LXML XPath fails when parsing with utf-16

I am parsing the following page with lxml's etree: http://www.amazon.de/product-reviews/B004K1K172

The content variable holds the entire page content. Code:

myparser = etree.HTMLParser(encoding="utf-16") #As characters are beyond utf-8 
tree = etree.HTML(content,parser = myparser) 
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")

This returns an empty list.

But when I change the code to:

myparser = etree.HTMLParser(encoding="utf-8") #Neglecting some reviews having ascii character above utf-8 
tree = etree.HTML(content,parser = myparser) 
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")

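Now the same XPath returns correct data, but most of the reviews are rejected. So is this a problem with lxml's XPath or with my XPath implementation?

How can I parse the above page with utf-16 encoding?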

2013-03-05 Kratos85

I think you should use tree.xpath(".//*[@id='productReviews']/tr/td[1]/div/text()"). Also, http://www.amazon.de/product-reviews/B004K1K172 is encoded in ISO-8859-15, not utf-16. – nymk 2013-03-05 11:58:42

The XPath only selects the first review. The code loops over the reviews by changing the last div[n] value. I will check the lxml XPath with ISO-8859-15 encoding. – Kratos85 2013-03-05 14:15:27

@nymk Thanks for the suggestion. I am now able to parse the page successfully using ISO-8859-15 encoding. – Kratos85 2013-03-06 08:40:49

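Following nymk's suggestion:

Parse the page using ISO-8859-15 encoding by changing the following line in the code.

myparser = etree.HTMLParser(encoding="ISO-8859-15")

However, changes also had to be made on the SQL side so that it accepts encodings other than utf-8.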
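
To illustrate why the choice of encoding matters here, the following is a minimal sketch that uses only the standard library, not lxml itself. The sample markup and variable names are made up for illustration, and it assumes, as nymk's comment states, that the page bytes are ISO-8859-15. It suggests one plausible reason the utf-16 parser finds nothing: the tags themselves no longer survive decoding.

```python
# Hypothetical sample: a fragment of markup stored as ISO-8859-15 bytes.
raw = '<div id="productReviews">Schön, nur 5 €</div>'.encode("iso-8859-15")

# Correct codec: every byte maps back to the intended character.
as_latin9 = raw.decode("iso-8859-15")

# UTF-16 consumes two bytes per character, so single-byte markup such as
# "<div" is no longer present in the decoded text.
as_utf16 = raw.decode("utf-16-le", errors="replace")

# Strict UTF-8 rejects the bytes that ISO-8859-15 uses for "ö" and "€".
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin9)
print("<div" in as_utf16, utf8_ok)
```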

2013-03-06 08:45:01 Kratos85

To take the character encoding from the HTTP headers automatically:

import cgi 
import urllib2 

from lxml import html 

response = urllib2.urlopen("http://www.amazon.de/product-reviews/B004K1K172") 

# extract encoding from Content-Type 
_, params = cgi.parse_header(response.headers.get('Content-Type', '')) 
html_text = response.read().decode(params['charset']) 

root = html.fromstring(html_text) 
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
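
The code above targets Python 2. On Python 3, urllib2 no longer exists and the cgi module was removed in 3.13, so the same idea might be sketched as below using only the standard library. The helper names charset_from_content_type and decode_body are hypothetical, and falling back to utf-8 when the header names no charset is an assumption, not part of the original answer.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw response bytes using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type))


# Example with a made-up header and body; no network access is needed.
header = "text/html; charset=ISO-8859-15"
text = decode_body("5 €".encode("iso-8859-15"), header)
print(charset_from_content_type(header), text)
```

With urllib.request.urlopen, the header value would come from response.headers.get('Content-Type', '') and the body from response.read(); the decoded text can then be passed to html.fromstring as in the answer's code.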

2013-03-06 09:45:51 jfs
