2014-09-21 61 views

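lxml etree and xpath return an encoded image instead of the src URL

I want the src URL of an image while processing some HTML, but I get back an encoded image instead. What am I doing wrong if I want the URL?

Consider a URL like: "http://www.amazon.com/Cheese-Plate-multi-purpose-mounting-plate/dp/B00CI06DWE/"

and a desktop user agent: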

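from io import StringIO

from lxml import etree
import requests

url = 'http://www.amazon.com/Cheese-Plate-multi-purpose-mounting-plate/dp/B00CI06DWE/'
# Placeholder: the original desktop user-agent string was not shown
agent = {'User-Agent': 'Mozilla/5.0'}

page = requests.get(url, headers=agent)
page_txt = page.text

html_parser = etree.HTMLParser()
tree = etree.parse(StringIO(page_txt), html_parser)

path = '//img[@id="landingImage"]'
img = tree.xpath(path)
img_src = img[0].get('src')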

With that code, I get back:

'\ndata:image/jpeg;base64,/9j/4AAQSkZJR' (truncated)

when I want:

http://ecx.images-amazon.com/images/I/41SNmVfXvhL.SY355.jpg

Answer


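The src attribute contains a base64-encoded image, not a link. You can get the actual URL from the data-a-dynamic-image attribute, which holds a JSON string whose keys are the image URLs: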

import json

# The attribute value is a JSON object keyed by image URL; take the first key
path = '//img[@id="landingImage"]/@data-a-dynamic-image'
print(next(iter(json.loads(tree.xpath(path)[0]))))

Prints:

http://ecx.images-amazon.com/images/I/41SNmVfXvhL._SX466_.jpg 
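
Taking the first key returns whichever URL the JSON object happens to list first. If a particular size is wanted, one option is to pick the key explicitly. The sketch below is only an illustration: it assumes each key maps to a [width, height] pair, which the answer does not state, and both the sample attribute value and the helper name largest_image_url are hypothetical.

```python
import json

# Hypothetical attribute value; the {url: [width, height]} shape is an assumption
SAMPLE_ATTR = (
    '{"http://example.com/a._SX355_.jpg": [355, 355], '
    '"http://example.com/a._SX466_.jpg": [466, 466]}'
)


def largest_image_url(attr_value):
    """Return the URL with the largest pixel area, or None if there are none."""
    sizes = json.loads(attr_value)
    if not sizes:
        return None
    # Compare candidates by width * height
    return max(sizes, key=lambda url: sizes[url][0] * sizes[url][1])


print(largest_image_url(SAMPLE_ATTR))
```

In the scraping code above, the argument would be tree.xpath(path)[0] rather than the sample string.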

Thanks Alecxe! I appreciate your help. – dolphinkickme 2014-09-22 18:09:21
