Extracting text from a span with lxml?

Consider:

import urllib2 
from lxml import etree 

url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000" 
response = urllib2.urlopen(url) 
htmlparser = etree.HTMLParser() 
tree = etree.parse(response, htmlparser)

where the URL is a standard eBay search results page with some filters applied.

I'm looking to extract the product prices, e.g. $40.00, $34.95, and so on.

There are a few possible XPaths (as provided by Firebug, the XPath Checker Firefox plugin, and manual inspection of the source):

/html/body/div[5]/div[2]/div[3]/div/div[1]/div/div[3]/div/div[1]/div/w-root/div/div/ul/li[1]/ul[1]/li[1]/span 
id('item3d00cf865e')/x:ul[1]/x:li[1]/x:span 
//span[@class ='bold bidsold']

Choosing the last one:

xpathselector="//span[@class ='bold bidsold']"

tree.xpath(xpathselector) then returns a list of Element objects, as expected. When I read their .text attribute, I expect to get the prices. But what I get is:

In [17]: tree.xpath(xpathselector) 
Out[17]: 
['\n\t\t\t\t\t', 
u' 1\xc2\xa0103.78', 
'\n\t\t\t\t\t', 
u' 1\xc2\xa0048.28', 
'\n\t\t\t\t\t', 
' 964.43', 
'\n\t\t\t\t\t', 
' 922.43', 
'\n\t\t\t\t\t', 
' 922.43', 
'\n\t\t\t\t\t', 
' 275.67', 
'\n\t\t\t\t\t',

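Each value contains what looks like a price, but (i) the prices are significantly higher than those shown on the web page, and (ii) I don't know what all the newlines and tabs are doing there. Am I doing something fundamentally wrong in trying to extract the prices?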

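I usually use WebDriver for this kind of thing, finding elements by CSS selector, XPath, and class. But in this case I don't need browser interaction, which is why I'm using urllib2 and lxml for the first time.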

Source

2015-10-05 Pyderman

I see two possibilities:

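It looks like eBay checks the locale and converts prices to the currency of your country. When you open the page in a browser, it probably reads some browser settings; when you run the code, it may read those settings from somewhere else.
The prices may be adjusted by eBay with JavaScript (client-side), in which case your parser cannot capture them.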

I would suggest the following checks:

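Check which currency you get when you run the code.
Check the page source and confirm that the prices there are exactly the same as those you see in the browser.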
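While inspecting the source as suggested above, it can also help to look at how the text inside the span is split around child elements, since that would account for the whitespace-only strings in the question's output. The following is a minimal sketch under an assumption: the markup is hypothetical (a span wrapping a child tag that holds the currency code) and is not copied from eBay. It uses the standard library's xml.etree.ElementTree so that it runs without lxml or a network connection; lxml elements expose the same .text, .tail and itertext() members.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real eBay span may be structured differently.
snippet = (
    '<span class="bold bidsold">\n\t\t\t\t\t'
    '<b>EUR</b> 1\u00a0103.78</span>'
)
span = ET.fromstring(snippet)

# .text holds only the text that precedes the first child element,
# which here is just a newline and some tabs.
print(repr(span.text))

# The number itself is stored as the .tail of the <b> child.
print(repr(span[0].tail))

# itertext() walks every text node in document order; splitting and
# re-joining collapses runs of whitespace (including the non-breaking space).
collapsed = " ".join("".join(span.itertext()).split())
print(collapsed)
```

If the real page is structured like this, reading .text alone would return only the leading whitespace, and joining itertext() would be one way to recover the whole visible string.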

Source

2015-10-05 21:09:53

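Thanks. Yes, on closer inspection I see it is both: (i) the currency is being converted, and (ii) the page is being returned to urllib2 in Spanish. Does urllib2 have a way to spoof the location? – Pyderman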

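@Pyderman Try inspecting what your request looks like. Find a tool that shows it, and you may discover some information about the locale settings. –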
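One way to inspect and control what the request looks like is to build it explicitly and print its headers before sending it. The sketch below is only an illustration: build_request is a hypothetical helper, the header values are placeholders, the URL is a shortened form of the one in the question, and whether eBay honors Accept-Language (rather than, say, the IP address) when picking language and currency is an assumption that would need to be tested.

```python
try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent

def build_request(url, language="en-US,en;q=0.9"):
    """Build a request that states a preferred language explicitly.

    Hypothetical helper: the header values are illustrative placeholders.
    """
    headers = {
        "Accept-Language": language,
        "User-Agent": "Mozilla/5.0 (compatible; price-check-sketch)",
    }
    return urllib2.Request(url, headers=headers)

req = build_request(
    "http://www.ebay.com/sch/i.html?_nkw=Under+Armour+Dauntless+Backpack"
)

# Show the headers that would be sent, without touching the network.
print(req.header_items())

# Sending the request is left commented out because it needs network access:
# response = urllib2.urlopen(req)
```

Comparing these headers with the ones the browser sends (visible in its network inspector) should show which locale hints differ between the two.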

I wrote two examples in Python.

Example 1:

import urllib2
from lxml import etree

if __name__ == '__main__':
    url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    xpathselector = "//span[@class ='bold bidsold']"
    for i in tree.xpath(xpathselector):
        # Keep only characters below '@' (digits, punctuation, spaces);
        # this drops letters and the non-ASCII separator characters.
        # "i.text or ''" guards against spans that have no leading text.
        print "".join(filter(lambda x: ord(x) < 64, i.text or "")).strip()

Example 2:

import urllib2
from lxml import etree

if __name__ == '__main__':
    url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    # Same as Example 1, but also match spans with class 'sboffer'.
    xpathselector = "//span[@class ='bold bidsold']|//span[@class='sboffer']"
    for i in tree.xpath(xpathselector):
        # Keep only characters below '@' (digits, punctuation, spaces).
        print "".join(filter(lambda x: ord(x) < 64, i.text or "")).strip()
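The character filter in the examples above yields cleaned strings. If numbers are needed, the same idea can be taken one step further. The following sketch makes two assumptions: parse_price is a hypothetical helper, and the dot is the decimal separator (as in the output quoted in the question), so everything other than ASCII digits and the dot is discarded. It runs on sample strings taken from the question's output, so it needs neither network access nor lxml.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Return the number in `text` as a Decimal, or None if there is none.

    Assumes '.' is the decimal separator; every other non-digit character
    (whitespace, non-breaking spaces, currency symbols) is dropped.
    """
    cleaned = re.sub(r"[^0-9.]", "", text or "")
    # Reject empty strings and anything that is not a plain decimal number.
    if not re.match(r"^\d+(\.\d+)?$", cleaned):
        return None
    return Decimal(cleaned)

# Sample values copied from the output shown in the question.
raw = ['\n\t\t\t\t\t', u' 1\xc2\xa0103.78', '\n\t\t\t\t\t', ' 964.43']

# Whitespace-only entries parse to None and are skipped.
prices = [p for p in map(parse_price, raw) if p is not None]
print(prices)
```

Decimal is used instead of float so that amounts such as 964.43 are kept exactly as they appear on the page.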

Source

2015-10-05 21:10:54 Randomazer
