2017-08-04

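I want to scrape the hrefs of all the listings. I am fairly new to BeautifulSoup; I have done some scraping before, but I can't for the life of me extract them. See my code below. When I run this script, the length of container is zero. The BeautifulSoup parser does not seem to be able to access the HTML elements.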

I also tried selecting the price (soup.findAll("span", {"class": "amount"})), but that does not return anything either :)

import urllib.request 
import urllib.parse 
from bs4 import BeautifulSoup 

url = 'https://www.takealot.com/computers/laptops-10130' 
headers = {} 
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17" 
req = urllib.request.Request(url, headers=headers) 
resp = urllib.request.urlopen(req) 

respData = resp.read().decode('utf-8')  # decode the bytes (assumes UTF-8); str() would give the "b'...'" repr

soup = BeautifulSoup(respData, 'html.parser') 

container = soup.find_all("div", {"class": "p-data left"}) 

Answers

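To search for multiple classes, try passing a list of classes (a list matches elements that have any of the listed classes):

container = soup.find_all("div", {"class": ["p-data", "left"]})

Or use select:

soup.select('div.rap.main')

I have not found the combined classes in the HTML source.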
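
The two forms are not equivalent, which the following sketch tries to illustrate on a made-up HTML fragment (the markup is an assumption for illustration, not taken from the takealot page). With a list, find_all accepts a div carrying either class, whereas a selector with chained classes such as div.p-data.left requires both.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration.
html = """
<div class="p-data left">both classes</div>
<div class="p-data">only p-data</div>
<div class="left">only left</div>
<div class="other">neither</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A list matches elements that have ANY of the listed classes.
any_match = soup.find_all("div", {"class": ["p-data", "left"]})

# A CSS selector with chained classes matches elements that have ALL of them.
all_match = soup.select("div.p-data.left")

# A single string containing a space is compared with the exact value
# of the class attribute.
exact_match = soup.find_all("div", {"class": "p-data left"})

print(len(any_match), len(all_match), len(exact_match))  # 3 1 1
```

If the goal is "both classes, in any order", the select form is probably the safer choice, since the exact-string form depends on the order in which the classes appear in the attribute.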

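Thanks. I tried it, and it seems to give me the whole soup object. I just ran soup.prettify and went through all of it: I did not find any reference to the listings anywhere in the output, and it looks like the page is using JavaScript. This confuses me. Shouldn't the soup contain everything that is on the page? –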

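If the site retrieves the listings with asynchronous JS and populates the page with the results, your crawler probably does not know that it has to wait for that to finish. Take a look at Protractor and Jasmine for these kinds of sites. – BoboDarph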

The page is rendered with JavaScript. There are several ways to render and scrape it.
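
To see why a plain download plus BeautifulSoup comes back empty on such a page, here is a minimal sketch. The HTML below is a hypothetical stand-in (not the real takealot source) for a page whose listing is only filled in once a browser executes the script.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the listing is empty in the HTML the server sends and is
# only populated by a script when a browser runs it.
static_html = """
<html><body>
<div id="listing"></div>
<script>
  document.getElementById("listing").innerHTML =
      '<div class="p-data left">Some laptop</div>';
</script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# The parser keeps the script body as plain text and never executes it, so
# the div the script would create does not become an element in the soup.
containers = soup.find_all("div", {"class": "p-data left"})
script_text = soup.find("script").string

print(len(containers))               # 0
print("p-data left" in script_text)  # True
```

The class name is present in the downloaded text, but only inside the script, which is why a tool that actually runs the JavaScript is needed.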

I was able to scrape it with Selenium. First install Selenium:

sudo pip3 install selenium 

Then download a driver (https://sites.google.com/a/chromium.org/chromedriver/downloads). If you are on Windows or Mac, you can use the headless version of Chrome, "Chrome Canary".

from bs4 import BeautifulSoup 
from selenium import webdriver 

browser = webdriver.Chrome() 
url = ('https://www.takealot.com/computers/laptops-10130') 
browser.get(url) 
respData = browser.page_source 
browser.quit() 
soup = BeautifulSoup(respData, 'html.parser') 
containers = soup.find_all("div", {"class": "p-data left"}) 
for container in containers: 
    print(container.text) 
    print(container.find("span", {"class": "amount"}).text) 
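
The loop above prints the text and the price, while the question asked for the hrefs. A possible sketch for that, assuming each container wraps at least one anchor tag (the sample markup, the paths and the helper name extract_hrefs are made up for illustration; the real page structure may differ):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.takealot.com/"


def extract_hrefs(html, base_url=BASE_URL):
    """Return absolute URLs of the first link in each listing container."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for container in soup.select("div.p-data.left"):
        # href=True skips anchors that have no href attribute.
        link = container.find("a", href=True)
        if link is not None:  # a container without a link is skipped
            hrefs.append(urljoin(base_url, link["href"]))
    return hrefs


# Hypothetical rendered markup, standing in for browser.page_source.
sample = """
<div class="p-data left"><a href="/dell-inspiron/PLID1">Dell</a><span class="amount">3,999</span></div>
<div class="p-data left"><span class="amount">4,499</span></div>
<div class="p-data left"><a href="https://www.takealot.com/hp-250/PLID2">HP</a></div>
"""

print(extract_hrefs(sample))
```

Checking for None before reading the href avoids an AttributeError or KeyError when a container happens to have no link; the same guard would be sensible around the "amount" lookup.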

Or use PyQt5:

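# QtWebKit was dropped from the official Qt releases in 5.6, so this needs a
# PyQt5 build that still includes it.
import sys

from bs4 import BeautifulSoup
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication


class Render(QWebPage):
    """Load a URL in a QWebPage and block until loading has finished."""

    def __init__(self, url):
        # A QApplication must exist before any QWebPage is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # run the event loop until _loadFinished quits it

    def _loadFinished(self, result):
        # Keep the frame so the rendered HTML can be read after the loop exits.
        self.frame = self.mainFrame()
        self.app.quit()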

url = 'https://www.takealot.com/computers/laptops-10130' 
r = Render(url) 
respData = r.frame.toHtml() 
soup = BeautifulSoup(respData, 'html.parser') 
containers = soup.find_all("div", {"class": "p-data left"}) 
for container in containers: 
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

Or use dryscrape:

from bs4 import BeautifulSoup 
import dryscrape 

url = 'https://www.takealot.com/computers/laptops-10130' 
session = dryscrape.Session() 
session.visit(url) 
respData = session.body() 
soup = BeautifulSoup(respData, 'html.parser') 
containers = soup.find_all("div", {"class": "p-data left"}) 
for container in containers: 
    print(container.text) 
    print(container.find("span", {"class": "amount"}).text) 

Output in all cases:

Dell Inspiron 3162 Intel Celeron 11.6" Wifi Notebook (Various Colours)11.6 Inch Display; Wifi Only (Red; White & Blue Available)R 3,999R 4,999i20% OffeB 39,990Discovery Miles 39,990On Credit: R 372/monthi 
3,999 
HP 250 G5 Celeron N3060 Notebook - Dark ash silverNBHPW4M70EAR 4,499R 4,999ieB 44,990Discovery Miles 44,990On Credit: R 419/monthiIn StockShippingThis item is in stock in our CPT warehouse and can be shipped from there. You can also collect it yourself from our warehouse during the week or over weekends.CPT | ShippingThis item is in stock in our JHB warehouse and can be shipped from there. No collection facilities available, sorry!JHBWhen do I get it? 
4,499 
Asus Vivobook ... 

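However, when testing with your URL, I observed that the results are not reproducible every time: occasionally "containers" is empty even after the page has been rendered.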
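
One possible cause (an assumption, not something verified here) is that the page source was read before the scripts had finished filling in the listings. A minimal sketch of a polling helper for that case follows; wait_for_marker and its parameters are hypothetical names, and the marker is just a substring expected in the rendered HTML.

```python
import time


def wait_for_marker(get_html, marker, timeout=10.0, interval=0.5):
    """Poll get_html() until marker appears in the result or timeout elapses.

    Returns the HTML that contains the marker, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a browser whose page only fills in after a couple of polls.
pages = iter(["<div></div>", "<div></div>", '<div class="p-data left">x</div>'])
html = wait_for_marker(lambda: next(pages), "p-data", timeout=2.0, interval=0.01)
print(html)
```

With the Selenium version, the callable could be lambda: browser.page_source, called before browser.quit(). Selenium also provides explicit waits (WebDriverWait) intended for this situation, which may be the better option there.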

Wow, excellent! Thanks for this! –

If it works for you, feel free to accept the answer and/or upvote it (you can use the buttons to the left of the answer). –