
scrapy: how to test the delay between each request

I set a delay between each request and expected to see its effect, but nothing seems to happen. My settings are:

DOWNLOAD_DELAY=5
CONCURRENT_REQUESTS=1
CONCURRENT_REQUESTS_PER_IP=1
RANDOM_DOWNLOAD_DELAY=False

I thought that if this worked, I would see a 5-second delay between each request, but it did not happen.

Here is the spider:

import os
import linecache
from random import randint

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.conf import settings

class Useragent(BaseSpider):

    name = 'useragent'

    settings.overrides['DOWNLOAD_DELAY'] = 5
    settings.overrides['CONCURRENT_REQUESTS'] = 1
    settings.overrides['CONCURRENT_REQUESTS_PER_DOMAIN'] = 1
    settings.overrides['RANDOM_DOWNLOAD_DELAY'] = False

    fn_useragents = "utils/useragents.txt"
    fp_useragents = open(fn_useragents, 'rb')
    total_lines = len(fp_useragents.readlines())
    fp_useragents.close()

    if not os.path.isdir("data"):
        os.mkdir("data")
    fn_log = "data/log.txt"
    fp_log = open(fn_log, "ab+")

    def start_requests(self):
        urls = [
            'http://www.dangdang.com',
            'http://www.360buy.com',
            'http://www.amazon.com.cn',
            'http://www.taobao.com'
        ]

        for url in urls:
            # pick a random User-Agent line from the file
            ua = linecache.getline(Useragent.fn_useragents,
                                   randint(1, Useragent.total_lines)).strip()
            url_headers = {'User-Agent': ua}
            yield Request(url, callback=self.parse_origin, headers=url_headers)

    def parse_origin(self, response):
        current_url = response.url
        headers = response.request.headers

        data_log = current_url
        for k, v in headers.items():
            header = "%s\t%s" % (k, v)
            data_log = "\n".join((data_log, header))
        Useragent.fp_log.write("%s\n" % data_log)

UPDATE

I wrote another spider to observe the effect of the DOWNLOAD_DELAY setting; here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
import sys, os, time

reload(sys)
sys.setdefaultencoding('utf-8')

class TestCrawl(CrawlSpider):

    name = 'crawldelay'
    start_urls = [
        'http://www.dangdang.com',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('.+',)), callback="parse_origin"),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        if not os.path.isdir("data"):
            os.mkdir("data")
        self.fn_log = "data/log.txt"
        self.fp_log = open(self.fn_log, 'ab+')

        settings.overrides['DOWNLOAD_DELAY'] = 60
        settings.overrides['RANDOM_DOWNLOAD_DELAY'] = False
        settings.overrides['CONCURRENT_REQUESTS'] = 1
        settings.overrides['CONCURRENT_REQUESTS_PER_IP'] = 1

    def parse_origin(self, response):
        # log each crawled URL together with a timestamp
        current_url = response.url
        data_log = "%s\n%s\n\n" % (current_url, time.asctime())
        self.fp_log.write(data_log)

Here is part of the log file I used to check the effect of DOWNLOAD_DELAY:

http://living.dangdang.com/furniture 
Mon Aug 27 10:49:50 2012 

http://static.dangdang.com/topic/744/200778.shtml 
Mon Aug 27 10:49:50 2012 

http://survey.dangdang.com/html/2389.html 
Mon Aug 27 10:49:50 2012 

http://fashion.dangdang.com/watch 
Mon Aug 27 10:49:50 2012 

https://login.dangdang.com/signin.aspx?returnurl=http://customer.dangdang.com/wishlist/ 
Mon Aug 27 10:49:50 2012 

http://www.hd315.gov.cn/beian/view.asp?bianhao=010202001051000098 
Mon Aug 27 10:49:51 2012 

https://ss.cnnic.cn/verifyseal.dll?pa=2940051&sn=2010091900100002234 
Mon Aug 27 10:49:51 2012 

But DOWNLOAD_DELAY does not seem to have any noticeable effect.
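To quantify the gaps instead of eyeballing them, one can parse the log back and print the interval between consecutive entries. A minimal sketch, assuming the URL/timestamp pairs that parse_origin writes above:

import time

def print_intervals(path="data/log.txt"):
    # Read the URL/timestamp pairs written by parse_origin and print
    # the gap in seconds between consecutive requests.
    lines = [line.strip() for line in open(path) if line.strip()]
    last = None
    for url, stamp in zip(lines[::2], lines[1::2]):
        # time.strptime's default format matches time.asctime() output,
        # e.g. 'Mon Aug 27 10:49:50 2012'
        t = time.mktime(time.strptime(stamp))
        if last is not None:
            print "%5.1fs before %s" % (t - last, url)
        last = t

print_intervals()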


Why are you setting download_delay through overrides? Put it directly in the class body: 'class Useragent(BaseSpider):\n    name = 'useragent'\n    download_delay = 60' –
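For reference, the class-attribute approach suggested here would look roughly like this (a minimal sketch; old Scrapy versions read the per-spider download_delay attribute in place of the DOWNLOAD_DELAY setting):

from scrapy.spider import BaseSpider

class Useragent(BaseSpider):
    name = 'useragent'
    # per-spider equivalent of the DOWNLOAD_DELAY setting
    download_delay = 60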

Answers


You can only put attribute assignments and method definitions directly in a class body. If you need code that initializes the object, you have to override __init__():

import os
import linecache
from random import randint

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.conf import settings

class UseragentSpider(BaseSpider):

    name = 'useragent'
    fn_log = "data/log.txt"
    fn_useragents = "utils/useragents.txt"
    DOWNLOAD_DELAY = 5

    def __init__(self, name=None, **kwargs):
        settings.overrides['CONCURRENT_REQUESTS'] = 1
        settings.overrides['CONCURRENT_REQUESTS_PER_DOMAIN'] = 1
        settings.overrides['RANDOM_DOWNLOAD_DELAY'] = False

        fp_useragents = open(self.fn_useragents, 'rb')
        self.total_lines = len(fp_useragents.readlines())
        fp_useragents.close()

        if not os.path.isdir("data"):
            os.mkdir("data")
        self.fp_log = open(self.fn_log, "ab+")

        # remember to call BaseSpider __init__() since we're overriding it
        super(UseragentSpider, self).__init__(name, **kwargs)

    def start_requests(self):
        urls = ['http://www.dangdang.com',
                'http://www.360buy.com',
                'http://www.amazon.com.cn',
                'http://www.taobao.com',
                ]

        for url in urls:
            ua = linecache.getline(self.fn_useragents,
                                   randint(1, self.total_lines)).strip()
            url_headers = {'User-Agent': ua}
            yield Request(url, callback=self.parse_origin, headers=url_headers)

    def parse_origin(self, response):
        headers = response.request.headers
        data_log = response.url

        for k, v in headers.items():
            header = "%s\t%s" % (k, v)
            data_log = "\n".join((data_log, header))

        self.fp_log.write("%s\n" % data_log)

I rebuilt my source file accordingly, but it still crawls the pages quickly and I cannot see any delay between requests. Maybe my idea is wrong; I just want to see the download delay take effect. – flyer


Maybe increase download_delay to 60 and include the log output so we can see it. –


Sorry for the late reply. I updated my question and pasted the results below the code. – flyer


This is caused by the dnscache implementation (it is deferred). CONCURRENT_REQUESTS_PER_IP only takes effect from the second request to the same domain onward. You can override the LocalCache get() method to make it return a constant value; this makes scrapy treat every request as a request to the same IP:


from scrapy.utils.datatypes import LocalCache

# Make every DNS cache lookup return the same fake value, so scrapy
# treats all requests as requests to a single IP.
LocalCache.get = lambda *args: 'fake-dummy-domain'

Then test your spider with:

scrapy crawl crawldelay -s CONCURRENT_REQUESTS_PER_IP=1 -s DOWNLOAD_DELAY=1 
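With the LocalCache patch in place, every request should share the same fake IP slot, so the per-IP limit and the delay apply to all requests; the -s flags simply override the corresponding settings from the command line for this run.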

Tested; works fine for me. –