
python urllib2 multiple downloads

How can I make the script below download multiple links at once, rather than one at a time, with urllib2?

Python:

from BeautifulSoup import BeautifulSoup 
import lxml.html as html 
import urlparse 
import os, sys 
import urllib2 
import re 

print ("downloading and parsing Bibles...") 
root = html.parse(open('links.html')) 
for link in root.findall('//a'): 
    url = link.get('href') 
    name = urlparse.urlparse(url).path.split('/')[-1] 
    dirname = urlparse.urlparse(url).path.split('.')[-1] 
    f = urllib2.urlopen(url) 
    s = f.read() 
    if not os.path.isdir(dirname): 
        os.mkdir(dirname) 
    soup = BeautifulSoup(s) 
    articleTag = soup.html.body.article 
    converted = str(articleTag) 
    full_path = os.path.join(dirname, name) 
    open(full_path, 'w').write(converted) 
    print(name) 
print("DOWNLOADS COMPLETE!") 

links.html

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.5.nmv-fas">http://www.youversion.com/bible/gen.5.nmv-fas</a> 

<a href="http://www.youversion.com/bible/gen.6.nmv-fas">http://www.youversion.com/bible/gen.6.nmv-fas</a> 

What have you tried? [Here's a starting point](http://docs.python.org/library/threading.html#thread-objects). And [a similar question](http://stackoverflow.com/questions/4131069/need-some-assistance-with-python-threading-queue). – AdamKG 2012-04-26 16:56:14


I realize you asked about urllib, but you might want to take a look at scrapy. It's very mature and asynchronous, and it lets you make multiple requests with very little effort. – dm03514 2012-04-26 17:27:34

Answer


Blainer, try threading.

Here is a good practical example:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/

Then refer to the Python standard library as well:

http://docs.python.org/library/threading.html

If you look at the practical example, it actually has a sample of a threaded version of concurrent downloads with urllib2. I've gone ahead and taken you a few steps further into the process; you will still have some work to do, namely solving the part that further parses your HTML once it's downloaded.

#!/usr/bin/env python 

import Queue 
import threading 
import urllib2 
import time 
import htmllib, formatter 

class LinksExtractor(htmllib.HTMLParser): 
    # derive a new HTML parser 

    def __init__(self, formatter): 
        # call the base class constructor 
        htmllib.HTMLParser.__init__(self, formatter) 
        # create an empty list for storing hyperlinks 
        self.links = [] 

    def start_a(self, attrs): 
        # override the handler for <A ...>...</A> tags 
        # and process the attributes 
        if len(attrs) > 0: 
            for attr in attrs: 
                if attr[0] == "href": 
                    # ignore all non-HREF attributes 
                    self.links.append(attr[1])  # save the link info in the list 

    def get_links(self): 
        # return the list of extracted links 
        return self.links 

format = formatter.NullFormatter() 
htmlparser = LinksExtractor(format) 

data = open("links.html") 
htmlparser.feed(data.read()) 
htmlparser.close() 

hosts = htmlparser.links 

queue = Queue.Queue() 

class ThreadUrl(threading.Thread): 
    """Threaded Url Grab""" 
    def __init__(self, queue): 
        threading.Thread.__init__(self) 
        self.queue = queue 

    def run(self): 
        while True: 
            # grab a host from the queue 
            host = self.queue.get() 

            #################################### 
            ############FIX THIS PART########### 
            #VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV# 

            url = urllib2.urlopen(host) 
            morehtml = url.read()  # you're on your own with this part 

            # signal to the queue that the job is done 
            self.queue.task_done() 

start = time.time() 
def main(): 
    # spawn a pool of threads, and pass them the queue instance 
    for i in range(5): 
        t = ThreadUrl(queue) 
        t.setDaemon(True) 
        t.start() 

    # populate the queue with data 
    for host in hosts: 
        queue.put(host) 

    # wait on the queue until everything has been processed 
    queue.join() 

main() 
print "Elapsed Time: %s" % (time.time() - start) 
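As a side note, on modern Python 3 (where urllib2 became urllib.request) the same fan-out pattern needs far less bookkeeping with concurrent.futures. This is only a sketch of the pattern: the `fetch` function here is a stand-in so it runs without a network; for real use you would replace its body with `urllib.request.urlopen(url).read()`.

```python
from concurrent.futures import ThreadPoolExecutor

hosts = [
    "http://www.youversion.com/bible/gen.%d.nmv-fas" % i
    for i in range(1, 7)
]

def fetch(url):
    # Stand-in for the real download, so the sketch runs offline;
    # swap in urllib.request.urlopen(url).read() for real use.
    return "fetched %s" % url

# The executor replaces the manual Queue/Thread/task_done bookkeeping:
# five workers run fetch() concurrently over the host list.
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() preserves input order even though downloads overlap
    results = list(pool.map(fetch, hosts))

for r in results:
    print(r)
```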

I took a look at it, but my script still only grabs one url at a time from links.html... how can I make the variable "url" grab all of the links at once? – Blainer 2012-04-26 17:03:32


Updated the answer here – dc5553 2012-04-26 17:12:06


Parse the html higher up in the script and create a list, then unleash the threads when it's download time. (I was going to say unleash hell, but you're downloading Bibles! haha) – dc5553 2012-04-26 17:18:39
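That advice can be sketched end to end: parse every href out of links.html into a list first, then let a pool of worker threads drain a queue of those links. This is a hedged Python 3 sketch (htmllib is gone in Python 3, so it uses html.parser); the inline `PAGE` string and the stand-in "download" are assumptions so it runs without links.html or a network.

```python
from html.parser import HTMLParser
from queue import Queue
import threading

# Inline stand-in for the contents of links.html
PAGE = ('<a href="http://www.youversion.com/bible/gen.1.nmv-fas">x</a>'
        '<a href="http://www.youversion.com/bible/gen.2.nmv-fas">x</a>')

class LinkExtractor(HTMLParser):
    """Collect every href attribute from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

# Step 1: parse ALL the links up front, into a plain list
parser = LinkExtractor()
parser.feed(PAGE)

# Step 2: hand the finished list to a pool of threads via a queue
queue = Queue()
downloaded = []
lock = threading.Lock()

def worker():
    while True:
        url = queue.get()
        body = "contents of " + url   # stand-in for the real fetch
        with lock:                    # list.append under a lock, to be safe
            downloaded.append(body)
        queue.task_done()

for _ in range(5):
    threading.Thread(target=worker, daemon=True).start()

for url in parser.links:
    queue.put(url)
queue.join()                          # block until every link is handled
```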