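How to scrape product details from an Amazon web page using BeautifulSoup

For the web page http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG how can I scrape the product details in Python and output them as a dict? For the page above, the dictionary output I would like to have is: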

Age Range: 9 - 12 years 
Grade Level: 4 - 7 
... 
...

I am new to BeautifulSoup and have not found good examples of how to achieve this. I would appreciate some examples.

Source

2014-10-31 so3

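Have you made any attempts? – 2014-10-31 20:28:44

What have you tried so far? – Hackaholic 2014-10-31 20:30:57

Take a look at 'mechanize' and 'BeautifulSoup', and see this answer for an example: http://stackoverflow.com/a/19284156/2327821 In general, you should do more legwork before asking such an open-ended question. – Michael 2014-10-31 20:35:41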

The idea is to iterate over all the Product Details items with the help of the table#productDetailsTable div.content ul li CSS selector, and then use the bold text as the key and its next sibling as the value:

from pprint import pprint

from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/dp/0439136369'
# Send a browser-like User-Agent header along with the request.
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'})

soup = BeautifulSoup(response.content)
tags = {}
for li in soup.select('table#productDetailsTable div.content ul li'):
    try:
        title = li.b  # the bold label, e.g. "Age Range:"
        key = title.text.strip().rstrip(':')
        value = title.next_sibling.strip()  # the text right after the label

        tags[key] = value
    except AttributeError:
        # li.b is None, so this item has no bold label: stop here.
        break

pprint(tags)

Prints:

{ 
    u'Age Range': u'9 - 12 years', 
    u'Amazon Best Sellers Rank': u'#1,440 in Books (', 
    u'Average Customer Review': u'', 
    u'Grade Level': u'4 - 7', 
    u'ISBN-10': u'0439136369', 
    u'ISBN-13': u'978-0439136365', 
    u'Language': u'English', 
    u'Lexile Measure': u'880L', 
    u'Mass Market Paperback': u'448 pages', 
    u'Product Dimensions': u'1.2 x 5.2 x 7.8 inches', 
    u'Publisher': u'Scholastic Paperbacks (September 11, 2001)', 
    u'Series': u'Harry Potter (Book 3)', 
    u'Shipping Weight': u'11.2 ounces (' 
}

Note that we break out of the loop as soon as we hit an AttributeError. This happens when an li element has no bold text inside it.
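The approach above can be tried without contacting Amazon. The following is a minimal sketch, not the answer's own code: SAMPLE_HTML is a made-up fragment that only imitates the structure the selector expects, parse_details is a hypothetical helper name, and the "html.parser" backend is an assumption chosen so that nothing beyond bs4 is needed. It also skips items without a bold label instead of stopping at the first one, which is a deliberate change from the loop above.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup that imitates the structure targeted by the selector.
SAMPLE_HTML = """
<table id="productDetailsTable"><tr><td>
<div class="content"><ul>
<li><b>Age Range:</b> 9 - 12 years</li>
<li><b>Grade Level:</b> 4 - 7</li>
<li>No bold label here</li>
<li><b>Language:</b> English</li>
</ul></div>
</td></tr></table>
"""


def parse_details(html):
    """Return {bold label: following text} for each product detail item."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for li in soup.select("table#productDetailsTable div.content ul li"):
        label = li.find("b")
        if label is None:
            # No bold label in this item: skip it rather than stop.
            continue
        key = label.get_text(strip=True).rstrip(":")
        sibling = label.next_sibling
        # Only plain text directly after the label counts as the value.
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        details[key] = value
    return details


print(parse_details(SAMPLE_HTML))
```

On the sample fragment this yields the three labelled items and ignores the unlabelled one. Real Amazon markup may differ from the fragment, so the selector would need checking against the live page.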

Source

2014-10-31 23:03:09 alecxe

Thanks for your answer. But why did you put the header information in requests.get? – so3 2014-11-02 17:47:53

@so3 It's just that I'm very used to doing it this way :) – alecxe 2014-11-02 19:03:55

@alecxe Do you know why I only get {'Age Range': '9 - 12 years', 'Grade Level': '4 - 7'} when I pass the "html.parser" argument, as in soup = BeautifulSoup(response.content, "html.parser")? – multigoodverse 2015-12-20 09:50:52

from bs4 import BeautifulSoup
import urllib2

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
url = 'http://www.amazon.com/dp/0439136369'
# Pass the User-agent as a request header. URL-encoding it into the
# second (data) argument would send it as a POST body instead.
req = urllib2.Request(url, headers=headers)
soup = BeautifulSoup(urllib2.urlopen(req).read())
for x in soup.find_all('table', id='productDetailsTable'):
    for tag in x.find_all('li'):
        # Print the text of each product detail item.
        print(tag.get_text())

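With the code above you can extract the text in the table. I have not formatted the output or put it into a dictionary, since you said you only needed a little help. Here is what the code does. I needed to change the user-agent, because Amazon does not allow the Python user-agent. Using find_all, I find the table with id='productDetailsTable'. Then I loop over all the li tags, since all the information is stored in those tags.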
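As a rough illustration of the two steps described above (setting the user-agent and collecting the li text), here is a sketch that runs without any network access. It assumes Python 3, where the Request class lives in urllib.request rather than urllib2. The names build_request and details_from_html, the shortened USER_AGENT string, and the SAMPLE_HTML fragment are all made up for illustration, and splitting each item on its first colon is just one possible way to turn the text into a dictionary.

```python
from urllib.request import Request

from bs4 import BeautifulSoup

# Shortened, illustrative browser-like user-agent string.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64)"

# Made-up markup imitating the product details table.
SAMPLE_HTML = """
<table id="productDetailsTable"><tr><td><ul>
<li><b>ISBN-10:</b> 0439136369</li>
<li><b>Language:</b> English</li>
<li>Free shipping available</li>
</ul></td></tr></table>
"""


def build_request(url):
    """Build (but do not send) a GET request carrying a custom User-agent."""
    return Request(url, headers={"User-agent": USER_AGENT})


def details_from_html(html):
    """Split the text of each li on its first colon into key and value."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for table in soup.find_all("table", id="productDetailsTable"):
        for li in table.find_all("li"):
            text = li.get_text(" ", strip=True)
            key, sep, value = text.partition(":")
            if sep:  # ignore items that contain no colon at all
                details[key.strip()] = value.strip()
    return details


req = build_request("http://www.amazon.com/dp/0439136369")
print(req.get_method(), req.get_header("User-agent"))
print(details_from_html(SAMPLE_HTML))
```

Because no data argument is given, the request stays a GET and the user-agent travels as a header. To fetch the page for real, the request object would be passed to urllib.request.urlopen.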

Source

2014-10-31 21:22:38 Hackaholic
