2017-05-28

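Using PyQt4 and Beautiful Soup to crawl web pages

I am trying to extract some data from multiple pages of a website whose content is generated with JavaScript, so I am using PyQt4 and Beautiful Soup to loop over the pages and pull out a few data fields.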

import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage


class Client(QWebPage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

products_titles = []
urls = ['url1', 'url2', 'url3']

for url in urls:
    print "Parsing URL: " + url + '\n'
    client_response = Client(url)
    source = client_response.mainFrame().toHtml()
    soup = BeautifulSoup(source, "html.parser")
    print get_product_category(soup)

But when I run it, it crashes with this error:

QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration) 
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration) 
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration) 
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool) 
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted() 
[1] 14809 segmentation fault python products.py 

I don't know what I am doing wrong. If you know what is going on, please help.

Answers

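I would pass in a list of URLs and let a single QApplication instance load them one after another, rather than instantiating and destroying a bunch of QApplications.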

In other words, try something more like this...

import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl, pyqtSignal
from PyQt4.QtWebKit import QWebPage


class Client(QWebPage):

    # Emitted with the next URL to load.
    new_url = pyqtSignal(['QString'], name='new_url')

    def __init__(self, urls):
        # A single QApplication is created once and reused for every URL.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.urls = urls
        self.pages = dict()  # maps each URL to its rendered HTML
        self.new_url.connect(self.load_url)
        self.loadFinished.connect(self.on_page_load)
        if self.urls:
            # list.pop() takes from the end, so URLs load in reverse order.
            self.new_url.emit(self.urls.pop())
        # Blocks until on_page_load() calls quit() after the last page.
        self.app.exec_()

    def load_url(self, url):
        self.current_url = url
        print "Loading: {0}".format(url)
        self.mainFrame().load(QUrl(url))

    def on_page_load(self):
        print "Retrieved: {0}".format(self.current_url)
        self.pages[self.current_url] = unicode(self.mainFrame().toHtml())
        if self.urls:
            # More URLs left: start the next load.
            self.new_url.emit(self.urls.pop())
        else:
            # Nothing left: stop the event loop so __init__ returns.
            self.app.quit()

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

client = Client(urls)
for (url, page) in client.pages.items():
    soup = BeautifulSoup(page, "html.parser")
    print "{0}\t{1}".format(url, soup.title.text)

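Re-instantiating a bunch of QApplications seems like a really bad idea, and I can see how it would lead to a segmentation fault here. The network errors printed before the segfault do seem a bit odd to me, though. Give this a try and see if you have better luck. It works fine for me.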
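One side note on the code above: it takes URLs with list.pop(), which removes items from the end, so pages load in reverse order and the caller's list ends up empty. If order matters, a copy held in a collections.deque might be a cleaner fit. Below is a minimal Qt-free sketch of that idea. SequentialLoader and fake_fetch are hypothetical stand-ins for the Client class and the real page load, not part of PyQt4.

```python
from collections import deque


class SequentialLoader(object):
    """Qt-free stand-in for Client: one loader, one URL at a time."""

    def __init__(self, urls, fetch):
        self.pending = deque(urls)  # a copy, so the caller's list is untouched
        self.fetch = fetch          # hypothetical callable: url -> HTML string
        self.pages = {}
        self.order = []             # records the order pages were loaded in

    def run(self):
        # Plays the role of app.exec_(): keep going until the queue is empty.
        while self.pending:
            url = self.pending.popleft()  # FIFO: first URL in, first loaded
            self.pages[url] = self.fetch(url)
            self.order.append(url)
        return self.pages


def fake_fetch(url):
    # Placeholder for the real page load; returns made-up HTML.
    return "<html><title>{0}</title></html>".format(url)


urls = ['url1', 'url2', 'url3']
loader = SequentialLoader(urls, fake_fetch)
pages = loader.run()
```

In the Qt version, the equivalent change would presumably be storing deque(urls) in self.urls and calling popleft() where pop() is used now. I have not tested that variant.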

Thanks, it works fine, and it's faster than my solution! – melhirech