2016-08-24

I keep getting redirected away from www.caribbeanjobs.com. I have written my spider to ignore robots.txt, disabled cookies, and tried meta={'dont_redirect': True}. What else can I do? The site forces Scrapy to redirect.

Here is my spider:

import scrapy

from tutorial.items import CaribbeanJobsItem

class CaribbeanJobsSpider(scrapy.Spider):
    name = "caribbeanjobs"
    allowed_domains = ["caribbeanjobs.com"]
    start_urls = [
        "http://www.caribbeanjobs.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'dont_redirect': True})

    def parse(self, response):
        if ".com" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

These are my settings:

BOT_NAME = 'tutorial' 

SPIDER_MODULES = ['tutorial.spiders'] 
NEWSPIDER_MODULE = 'tutorial.spiders' 


# Crawl responsibly by identifying yourself (and your website) on the user-agent 
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)' 

# Obey robots.txt rules 
ROBOTSTXT_OBEY = False 

# Configure maximum concurrent requests performed by Scrapy (default: 16) 
#CONCURRENT_REQUESTS = 32 

# Configure a delay for requests for the same website (default: 0) 
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 
# See also autothrottle settings and docs 
DOWNLOAD_DELAY = 3 
# The download delay setting will honor only one of: 
#CONCURRENT_REQUESTS_PER_DOMAIN = 16 
#CONCURRENT_REQUESTS_PER_IP = 16 

# Disable cookies (enabled by default) 
COOKIES_ENABLED = False 

# Disable Telnet Console (enabled by default) 
#TELNETCONSOLE_ENABLED = False 

# Override the default request headers: 
#DEFAULT_REQUEST_HEADERS = { 
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
# 'Accept-Language': 'en', 
#} 

# Enable or disable spider middlewares 
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 
#SPIDER_MIDDLEWARES = { 
# 'tutorial.middlewares.MyCustomSpiderMiddleware': 543, 
#} 

# Enable or disable downloader middlewares 
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 
#DOWNLOADER_MIDDLEWARES = { 
# 'tutorial.middlewares.MyCustomDownloaderMiddleware': 543, 
#} 

# Enable or disable extensions 
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 
#EXTENSIONS = { 
# 'scrapy.extensions.telnet.TelnetConsole': None, 
#} 

# Configure item pipelines 
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 
#ITEM_PIPELINES = { 
# 'tutorial.pipelines.SomePipeline': 300, 
#} 

# Enable and configure the AutoThrottle extension (disabled by default) 
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html 
#AUTOTHROTTLE_ENABLED = True 
# The initial download delay 
#AUTOTHROTTLE_START_DELAY = 5 
# The maximum download delay to be set in case of high latencies 
#AUTOTHROTTLE_MAX_DELAY = 60 
# The average number of requests Scrapy should be sending in parallel to 
# each remote server 
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 
# Enable showing throttling stats for every response received: 
#AUTOTHROTTLE_DEBUG = False 

# Enable and configure HTTP caching (disabled by default) 
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 
#HTTPCACHE_ENABLED = True 
#HTTPCACHE_EXPIRATION_SECS = 0 
#HTTPCACHE_DIR = 'httpcache' 
#HTTPCACHE_IGNORE_HTTP_CODES = [] 
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 
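If the goal is to stop Scrapy from following redirects at all, the middleware can also be switched off globally in settings.py. A sketch using Scrapy's documented REDIRECT_ENABLED and REDIRECT_MAX_TIMES settings; note that with redirects disabled, 3xx responses are still filtered by HttpErrorMiddleware unless those status codes are whitelisted in the spider:

```python
# settings.py -- disable the RedirectMiddleware so 3xx responses are not
# followed; combine with handle_httpstatus_list in the spider so the
# 301/303 responses actually reach the parse callback.
REDIRECT_ENABLED = False

# Alternatively, keep redirects enabled but cap how many hops are followed:
# REDIRECT_MAX_TIMES = 2
```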

The site is probably checking your user agent and will always redirect you if your bot has not been respecting their site. – TankorSmash

Answers


Have you tried setting an explicit USER_AGENT in your settings?

http://doc.scrapy.org/en/latest/topics/settings.html#user-agent

Something like this might work as a starting point:

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36"
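The same browser-style User-Agent can also be attached to individual requests instead of globally, via the headers argument of scrapy.Request. A minimal sketch of the header shape (the scrapy.Request call is shown as a comment, since it only runs inside a crawl):

```python
# A browser-like User-Agent header for a single request. The string is the
# same Chrome UA suggested above; only the delivery mechanism differs.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/32.0.1700.102 Safari/537.36"
    )
}

# Inside a spider:
# yield scrapy.Request(url, headers=headers, callback=self.parse)
```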

This worked, thank you very much! I guess it's because my user agent wasn't a browser, so they redirected me straight away. Is that correct? – Jimbo


They may be looking for non-standard user agents, or Scrapy's default may be something they explicitly watch for. It is essentially an arms race between content miners and content publishers. –


You can specify handle_httpstatus_list in your spider. You can initialize this list right after start_urls:

handle_httpstatus_list = [301, 303]
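For context, this is roughly where the attribute would sit in the spider from the question (shown as a plain class here so the snippet is self-contained; in the real project it subclasses scrapy.Spider):

```python
class CaribbeanJobsSpider:  # scrapy.Spider subclass in the actual project
    name = "caribbeanjobs"
    allowed_domains = ["caribbeanjobs.com"]
    start_urls = ["http://www.caribbeanjobs.com/"]

    # Whitelist the redirect statuses so Scrapy hands the raw 301/303
    # response to parse() instead of silently dropping it.
    handle_httpstatus_list = [301, 303]
```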

You can also do this per request, as a request meta key, i.e. `Request(url, meta={'handle_httpstatus_list': [301, 303]})` – Granitosaurus
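A sketch of the per-request variant from this comment, with the meta key spelled correctly (handle_httpstatus_list is Scrapy's documented meta key; shown as plain data since the Request itself only runs inside a crawl):

```python
# Per-request: only this request lets 301/303 responses through to the
# callback; other requests keep the default redirect handling.
meta = {
    'dont_redirect': True,                 # stop RedirectMiddleware here only
    'handle_httpstatus_list': [301, 303],  # let these statuses reach parse()
}

# Inside a spider:
# yield scrapy.Request(url, meta=meta, callback=self.parse)
```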


Yes, you're right, we can also do it that way, but the one I mentioned has a spider-wide effect. –