2016-08-24

I keep getting redirected away from www.caribbeanjobs.com. I have written my spider to ignore robots.txt, disabled cookies, and tried meta={'dont_redirect': True}. What else can I do? The site forces Scrapy to redirect.

Here is my spider:

import scrapy

from tutorial.items import CaribbeanJobsItem

class CaribbeanJobsSpider(scrapy.Spider):
    name = "caribbeanjobs"
    allowed_domains = ["caribbeanjobs.com"]
    start_urls = [
        "http://www.caribbeanjobs.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'dont_redirect': True})

    def parse(self, response):
        if ".com" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

These are my settings:

BOT_NAME = 'tutorial' 

SPIDER_MODULES = ['tutorial.spiders'] 
NEWSPIDER_MODULE = 'tutorial.spiders' 


# Crawl responsibly by identifying yourself (and your website) on the user-agent 
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)' 

# Obey robots.txt rules 
ROBOTSTXT_OBEY = False 

# Configure maximum concurrent requests performed by Scrapy (default: 16) 
#CONCURRENT_REQUESTS = 32 

# Configure a delay for requests for the same website (default: 0) 
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 
# See also autothrottle settings and docs 
DOWNLOAD_DELAY = 3 
# The download delay setting will honor only one of: 
#CONCURRENT_REQUESTS_PER_DOMAIN = 16 
#CONCURRENT_REQUESTS_PER_IP = 16 

# Disable cookies (enabled by default) 
COOKIES_ENABLED = False 

# Disable Telnet Console (enabled by default) 
#TELNETCONSOLE_ENABLED = False 

# Override the default request headers: 
#DEFAULT_REQUEST_HEADERS = { 
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
# 'Accept-Language': 'en', 
#} 

# Enable or disable spider middlewares 
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 
#SPIDER_MIDDLEWARES = { 
# 'tutorial.middlewares.MyCustomSpiderMiddleware': 543, 
#} 

# Enable or disable downloader middlewares 
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 
#DOWNLOADER_MIDDLEWARES = { 
# 'tutorial.middlewares.MyCustomDownloaderMiddleware': 543, 
#} 

# Enable or disable extensions 
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 
#EXTENSIONS = { 
# 'scrapy.extensions.telnet.TelnetConsole': None, 
#} 

# Configure item pipelines 
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 
#ITEM_PIPELINES = { 
# 'tutorial.pipelines.SomePipeline': 300, 
#} 

# Enable and configure the AutoThrottle extension (disabled by default) 
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html 
#AUTOTHROTTLE_ENABLED = True 
# The initial download delay 
#AUTOTHROTTLE_START_DELAY = 5 
# The maximum download delay to be set in case of high latencies 
#AUTOTHROTTLE_MAX_DELAY = 60 
# The average number of requests Scrapy should be sending in parallel to 
# each remote server 
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 
# Enable showing throttling stats for every response received: 
#AUTOTHROTTLE_DEBUG = False 

# Enable and configure HTTP caching (disabled by default) 
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 
#HTTPCACHE_ENABLED = True 
#HTTPCACHE_EXPIRATION_SECS = 0 
#HTTPCACHE_DIR = 'httpcache' 
#HTTPCACHE_IGNORE_HTTP_CODES = [] 
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 
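If the goal is to stop Scrapy from following redirects at all, the middleware can also be switched off globally in settings.py. A sketch using Scrapy's documented REDIRECT_ENABLED and REDIRECT_MAX_TIMES settings; note that with redirects disabled, 3xx responses are still filtered by HttpErrorMiddleware unless those status codes are whitelisted in the spider:

```python
# settings.py -- disable the RedirectMiddleware so 3xx responses are not
# followed; combine with handle_httpstatus_list in the spider so the
# 301/303 responses actually reach the parse callback.
REDIRECT_ENABLED = False

# Alternatively, keep redirects enabled but cap how many hops are followed:
# REDIRECT_MAX_TIMES = 2
```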

The site is probably checking your user agent and will always redirect you if your bot has not been respecting their site. – TankorSmash

Answers


Have you tried setting an explicit USER_AGENT in your settings?

http://doc.scrapy.org/en/latest/topics/settings.html#user-agent

Something like this might work as a starting point:

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36"
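The same browser-style User-Agent can also be attached to individual requests instead of globally, via the headers argument of scrapy.Request. A minimal sketch of the header shape (the scrapy.Request call is shown as a comment, since it only runs inside a crawl):

```python
# A browser-like User-Agent header for a single request. The string is the
# same Chrome UA suggested above; only the delivery mechanism differs.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/32.0.1700.102 Safari/537.36"
    )
}

# Inside a spider:
# yield scrapy.Request(url, headers=headers, callback=self.parse)
```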

This worked, thank you very much! I guess it's because my user agent wasn't a browser, so they redirected me straight away. Is that correct? – Jimbo


They may be looking for non-standard user agents, or Scrapy's default may be something they explicitly watch for. It is essentially an arms race between content miners and content publishers. –


You can specify handle_httpstatus_list in your spider. You can initialize this list right after start_urls:

handle_httpstatus_list = [301, 303]
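For context, this is roughly where the attribute would sit in the spider from the question (shown as a plain class here so the snippet is self-contained; in the real project it subclasses scrapy.Spider):

```python
class CaribbeanJobsSpider:  # scrapy.Spider subclass in the actual project
    name = "caribbeanjobs"
    allowed_domains = ["caribbeanjobs.com"]
    start_urls = ["http://www.caribbeanjobs.com/"]

    # Whitelist the redirect statuses so Scrapy hands the raw 301/303
    # response to parse() instead of silently dropping it.
    handle_httpstatus_list = [301, 303]
```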

You can also do this per request, as a request meta key, i.e. `Request(url, meta={'handle_httpstatus_list': [301, 303]})` – Granitosaurus
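A sketch of the per-request variant from this comment, with the meta key spelled correctly (handle_httpstatus_list is Scrapy's documented meta key; shown as plain data since the Request itself only runs inside a crawl):

```python
# Per-request: only this request lets 301/303 responses through to the
# callback; other requests keep the default redirect handling.
meta = {
    'dont_redirect': True,                 # stop RedirectMiddleware here only
    'handle_httpstatus_list': [301, 303],  # let these statuses reach parse()
}

# Inside a spider:
# yield scrapy.Request(url, meta=meta, callback=self.parse)
```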


Yes, you're right, we can also do it that way, but the one I mentioned has a spider-wide effect. –