2012-04-24 124 views
-1

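Sorry to take up your time, but I'm really stuck! Python: threads, urlopen (urllib2), and parsing

I'm a n00b in Python, but I'm working hard to learn, and I'm trying to get this script to run. It works without threads, but in order to learn and improve my Python skills, I'd like to understand what is wrong here!

Problems:
- The script never finishes.
- It doesn't parse anything; the urlopen part doesn't seem to work properly.

Thanks a lot for your help, I'm still working on it :-)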

import Queue
import socket
import threading
import time
import urllib2
from urllib2 import urlopen

import xlwt
from bs4 import BeautifulSoup

# Give up on any socket operation that takes longer than 20 seconds
socket.setdefaulttimeout(20.0)


class Retry(object):
    default_exceptions = (Exception,)

    def __init__(self, tries, exceptions=None, delay=0):
        """
        Decorator for retrying a function if an exception occurs

        tries -- number of tries
        exceptions -- exceptions to catch
        delay -- wait between retries
        """
        self.tries = tries
        if exceptions is None:
            exceptions = Retry.default_exceptions
        self.exceptions = exceptions
        self.delay = delay

    def __call__(self, f):
        def fn(*args, **kwargs):
            exception = None
            for _ in range(self.tries):
                try:
                    return f(*args, **kwargs)
                except self.exceptions as e:
                    print "Retry, exception: " + str(e)
                    time.sleep(self.delay)
                    exception = e
            # if no success after all tries, raise the last exception
            raise exception
        return fn

@Retry(5) 
def open_url(source): 
    print("OPENING %s" % source) 
    print("Retrying to open and read the page") 
    resp = urlopen(source) 
    resp = resp.read() 
    return resp 



queue = Queue.Queue() 
out_queue = Queue.Queue() 

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            host = self.queue.get()

            # fetches the page for this host
            chunk = open_url(host)

            # place chunk into out queue
            self.out_queue.put(chunk)

            # signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parser"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        global x
        while True:
            # grabs a downloaded chunk from the out queue
            chunk = self.out_queue.get()

            # parse the chunk
            soup = BeautifulSoup(chunk)
            tableau = soup.findAll('table')
            rows = tableau[1].findAll('tr')
            print("DONE")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    print texte_bu
                    ws.write(x, y, td.text)
                    y = y + 1
            wb.save("IA.xls")

            # signals to queue job is done
            self.out_queue.task_done()
            break

start = time.time() 
def main():

    # spawn a pool of threads, and pass them queue instance
    for i in range(13):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(1):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()


global x 
x = 0 

wb = xlwt.Workbook(encoding='utf-8') 
ws = wb.add_sheet("BULATS_IA_PARSED") 

Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam'] 
hosts = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List] 

main() 

print "Elapsed Time: %s" % (time.time() - start) 

PS: Also, do you think urllib3 (keep-alive connections) could be useful in this case, and could you explain how to implement it?

Answers

1

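I must admit I haven't reviewed all of the code you posted, but "thread" and "urllib2" together are enough to raise an alarm.

Don't try to use urllib2 for anything other than single-threaded, synchronous connections! Not because there is anything wrong with urllib2, but simply because this problem has already been solved, and the solution is in Twisted, an exceptionally good asynchronous networking library for Python.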

+1

Hmm, I'm going to take a look at Twisted! Thanks! But the script works for some countries... and not for others... and sometimes it does... weird! – 2012-04-24 03:58:38
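
One possible cause of the "works for some countries, not for others" symptom, not confirmed anywhere in this thread, is that the question's code interpolates country names such as 'Bosnia and Herzegovina' or 'Myanmar (Burma)' into the URL without escaping the spaces and parentheses. A minimal sketch of building the query string with percent-encoding is below. It is written for Python 3 (`urllib.parse.urlencode`); in Python 2 the equivalent function is `urllib.urlencode`. The names `BASE_URL` and `build_url` are hypothetical and only used for illustration.

```python
from urllib.parse import urlencode

# Base address taken from the question; the query string is built separately.
BASE_URL = "http://www.cambridgeesol.org/institutions/results.php"


def build_url(country):
    """Return the results URL with the country name safely encoded."""
    # urlencode escapes spaces, parentheses and other reserved characters.
    query = urlencode([("region", country), ("type", ""), ("BULATS", "on")])
    return BASE_URL + "?" + query


if __name__ == "__main__":
    for name in ["France", "Bosnia and Herzegovina", "Myanmar (Burma)"]:
        print(build_url(name))
```

Names without special characters come out unchanged, so this only affects the entries that contain spaces or punctuation. Whether the server then returns the expected table for those countries would still need to be checked.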

+1

EDIT: From the little research I've done... Twisted seems too hard for me. Can my code really not be made to work? (Also, could urllib3 be used here?) – 2012-04-24 04:04:24

+0

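Twisted is certainly easier than trying to implement threading yourself. I haven't tried urllib3, but from its description it sounds like an improvement over using urllib2 by itself, though not as good as a full-featured library like Twisted. – 2012-04-24 04:11:01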

1

The script doesn't end because the run methods contain infinite loops, and nothing lets them break out of the loop:

while True: 
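
One common way to give such a loop an exit condition is a sentinel value: each worker stops when it pulls the sentinel off the queue, and `task_done()` sits in a `finally` block so that `join()` cannot hang when an item fails. In the question as posted, the parser thread also appears to `break` after its first chunk, which would leave `out_queue.join()` waiting on everything that was never acknowledged. The sketch below is a generic illustration in Python 3 (`queue` instead of `Queue`), not a drop-in fix for the original script; `STOP`, `worker`, and `run_pool` are hypothetical names.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel that tells a worker to exit


def worker(in_q, out_q, handle):
    """Process items until the sentinel arrives, acknowledging every item."""
    while True:
        item = in_q.get()
        try:
            if item is STOP:
                break  # leave the loop so the thread can finish
            out_q.put(handle(item))
        except Exception as exc:
            # Report the failure instead of letting the thread die silently.
            out_q.put(exc)
        finally:
            # Runs on success, on failure and on break, so join() can return.
            in_q.task_done()


def run_pool(items, handle, n_workers=4):
    """Run handle(item) for every item on n_workers threads; collect results."""
    in_q, out_q = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(in_q, out_q, handle))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        in_q.put(item)
    for _ in threads:
        in_q.put(STOP)  # one sentinel per worker
    in_q.join()         # returns once every item has been acknowledged
    for t in threads:
        t.join()        # all workers have left their loops
    results = []
    while not out_q.empty():
        results.append(out_q.get())
    return results


if __name__ == "__main__":
    print(sorted(run_pool(range(10), lambda n: n * n)))
```

Because the workers exit on their own, they do not need to be daemon threads. Results arrive in completion order, not submission order, so the caller should sort or tag them if order matters.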
+0

Haha, you're right! I'm fixing it! Thanks! – 2012-04-24 09:33:10

+0

I guess it's an indentation problem – marbdq 2012-04-24 09:50:33