2017-11-25 23 views
2

I want to scrape some data from a website with BeautifulSoup, but it is not working. Below is the HTML. I want to extract the text "No description for 632930413867".

HTML code:

<div class="col-xs-6 col-sm-6 col-md-6 col-lg-6"> 
    <table class="table product_info_table"> 
    <tbody> 
     <tr> 
     <td>GS1 Address</td> 
     <td>R.R. 1, Box 2, Malmo, NE 68040</td> 
     </tr> 
     <tr> 
     <td>Description</td> 
     <td> 
      <div id="read_desc"> 
      No description for 632930413867 
      </div> 
     </td> 
     </tr> 
    </tbody> 
    </table> 
</div> 

I also want the image src from the same site:

<div class="centered_image header_image"> 
<img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg" title="UPC 632930413867" alt="UPC 632930413867"> 

So I used this code:

Baseurl = "https://www.buycott.com/upc/632930413867" 
uClient = '' 
while uClient == '': 
    try: 
     uClient = requests.get(Baseurl) 
     print("Relax we are getting the data...") 

    except: 
     print("Connection refused by the server..") 
     print("Let me sleep for 7 seconds") 
     time.sleep(7) 
     print("Was a nice sleep, now let me continue...") 
     continue 


page_html = uClient.content 

uClient.close() 
page_soup = soup(page_html, "html.parser") 

Productcontainer = page_soup.find_all("div", {"class": "row"}) 
link = page_soup.find(itemprop="image") 

print(Productcontainer) 

for item in Productcontainer: 
    print(link) 
    productdescription = Productcontainer.find("div", {"class": "product_info_table"}) 
    print(productdescription) 

When I run this code, no data is displayed. How can I get the description and the img src?

Answers

3

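There is only one instance of each element (the image and the product description) on the page, so you can go to them directly with find(); there is no need to use find_all() in this case: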

import time

import requests
from bs4 import BeautifulSoup as soup

Baseurl = "https://www.buycott.com/upc/632930413867"
uClient = ''
while uClient == '':
    try:
        uClient = requests.get(Baseurl)
        print("Relax we are getting the data...")
    except requests.exceptions.ConnectionError:
        # Wait, then retry if the connection fails
        print("Connection refused by the server..")
        print("Let me sleep for 7 seconds")
        time.sleep(7)
        print("Was a nice sleep, now let me continue...")
        continue

page_html = uClient.content
uClient.close()

page_soup = soup(page_html, "html.parser")
# Each element occurs only once on the page, so find() is enough
productdescription = page_soup.find("div", {"id": "read_desc"}).text
link = page_soup.find("div", {"class": "centered_image header_image"}).find("img")['src']
print(productdescription)
print(link)

Output:

Relax we are getting the data... 

No description for 632930413867 

https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg 
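
To illustrate why the code in the question prints nothing useful, here is a minimal offline sketch. It parses a shortened copy of the HTML from the question instead of fetching the live page. Wrapping the fragment in a div with class "row" is an assumption made only so that the question's find_all() call has something to match. The sketch shows that find_all() returns a list-like ResultSet with no find() method, and that product_info_table is a class on the table element, not on a div.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question. The outer <div class="row">
# is an assumption, added so the question's find_all() call matches something.
html = """
<div class="row">
  <div class="centered_image header_image">
    <img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg" alt="UPC 632930413867">
  </div>
  <table class="table product_info_table">
    <tr>
      <td>Description</td>
      <td><div id="read_desc"> No description for 632930413867 </div></td>
    </tr>
  </table>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list of tags), so calling find() on the
# result itself, as the question does inside its loop, raises AttributeError.
rows = page_soup.find_all("div", {"class": "row"})
try:
    rows.find("div", {"class": "product_info_table"})
    raised = False
except AttributeError:
    raised = True

# Even on a single tag, this lookup finds nothing: the class belongs to <table>.
missing = page_soup.find("div", {"class": "product_info_table"})

# Going straight to the unique elements with find() works.
description = page_soup.find("div", {"id": "read_desc"}).get_text(strip=True)
img_src = page_soup.find("div", {"class": "centered_image"}).find("img")["src"]

print(raised, missing)
print(description)
print(img_src)
```
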
2

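You just need to inspect the HTML and identify the tags that hold the data you want to scrape.
In this case, the image is at div.centered_image.header_image img, and div#read_desc holds the description.
An example using bs4 CSS selectors: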

import requests 
from bs4 import BeautifulSoup 

baseurl = "https://www.buycott.com/upc/632930413867" 
page_html = requests.get(baseurl).content 
soup = BeautifulSoup(page_html, "html.parser") 
image = soup.select_one('div.centered_image.header_image img')['src'] 
description = soup.select_one('div#read_desc').text.strip() 

print(image) 
print(description) 

https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
No description for 632930413867
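
One caveat with select_one() is that it returns None when nothing matches, so indexing or calling .text on the result fails if the page layout changes. The following is a rough sketch of a guarded lookup. The helper names select_text and select_attr are hypothetical, not part of bs4, and the HTML is a shortened copy of the snippet from the question so the example runs without network access.

```python
from bs4 import BeautifulSoup


def select_text(soup, css, default=None):
    """Return the stripped text of the first match for css, or default."""
    node = soup.select_one(css)  # None when the selector matches nothing
    return node.get_text(strip=True) if node is not None else default


def select_attr(soup, css, attr, default=None):
    """Return attribute attr of the first match for css, or default."""
    node = soup.select_one(css)
    return node.get(attr, default) if node is not None else default


# Shortened copy of the HTML shown in the question.
html = """
<div class="centered_image header_image">
  <img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg">
</div>
<div id="read_desc"> No description for 632930413867 </div>
"""

soup = BeautifulSoup(html, "html.parser")
description = select_text(soup, "div#read_desc")
image = select_attr(soup, "div.centered_image.header_image img", "src")
absent = select_text(soup, "div#no_such_id", default="")  # falls back to ""

print(description)
print(image)
```
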

0

This can also be done as follows:

import requests 
from bs4 import BeautifulSoup 

soup = BeautifulSoup(requests.get("https://www.buycott.com/upc/632930413867").text, "lxml") 
desc = soup.select("#read_desc")[0].text.strip() 
link = soup.select(".centered_image img")[0]['src'].strip() 
print("{}\n{}".format(desc,link)) 

Output:

No description for 632930413867 
https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg 
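
A side note on the retry loop in the question and the first answer: it retries forever if the server never responds. Below is a minimal sketch of a bounded retry, using only the standard library. The helper name retry and its parameters are hypothetical, and the failing fetch is simulated so the example runs offline.

```python
import time


def retry(fetch, attempts=3, delay=7.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    Re-raises the last error once the final attempt has failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the error
            sleep(delay)  # wait before the next attempt


# Simulated fetch that fails twice and then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated connection failure")
    return "<html>ok</html>"


naps = []  # records the requested delays instead of actually sleeping
page = retry(flaky_fetch, attempts=5, delay=7.0,
             exceptions=(ConnectionError,), sleep=naps.append)
print(page, calls["n"], naps)
```

With requests, the same helper could be called as retry(lambda: requests.get(url, timeout=10), exceptions=(requests.exceptions.RequestException,)), where url is whatever page is being scraped.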