2016-01-23 116 views

Scrapy newbie here. I'm trying to scrape some basic data from a bridge website, but for some reason I keep getting redirected back to localhost: Scrapy redirects to 127.0.0.1.

This doesn't happen for most other sites (for example the dmoz example from the tutorial). My hunch is that I haven't set something up to handle the relevant domain. My spider (almost identical to the one in the tutorial, except with the URL changed):

import scrapy

class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/vugraph/schedule.php"
    ]

    # rules for parsing main response
    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

The error I'm getting (relevant part):

2016-01-23 14:21:50 [scrapy] INFO: Scrapy 1.0.4 started (bot: bbo) 
2016-01-23 14:21:50 [scrapy] INFO: Optional features available: ssl, http11 
2016-01-23 14:21:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bbo.spiders', 'SPIDER_MODULES': ['bbo.spiders'], 'BOT_NAME': 'bbo'} 
2016-01-23 14:21:50 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-01-23 14:21:50 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-01-23 14:21:50 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-01-23 14:21:50 [scrapy] INFO: Enabled item pipelines: 
2016-01-23 14:21:50 [scrapy] INFO: Spider opened 
2016-01-23 14:21:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-01-23 14:21:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-01-23 14:21:54 [scrapy] DEBUG: Redirecting (302) to <GET http://127.0.0.1> from <GET http://www.bridgebase.com/vugraph/schedule.php> 
2016-01-23 14:21:54 [scrapy] DEBUG: Retrying <GET http://127.0.0.1> (failed 1 times): Connection was refused by other side: 111: Connection refused. 

This is probably a super basic question, but I've been having a lot of trouble even getting off the ground. Does anyone have a hunch about where to start?
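As a side note, one way to check whether a server treats requests differently based on the User-Agent header, independently of Scrapy, is to build two requests with Python's standard library and compare the responses. Actually sending them needs network access, so this sketch only constructs the requests (the URL and User-Agent string are taken from the question):

```python
import urllib.request

URL = "http://www.bridgebase.com/vugraph/schedule.php"

# Default request: urllib identifies itself as "Python-urllib/3.x",
# which some sites treat the same way they treat bot user agents.
plain_req = urllib.request.Request(URL)

# Browser-style request: same URL, but with an explicit User-Agent header.
browser_req = urllib.request.Request(URL, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/47.0.2526.111 Safari/537.36",
})

# urllib normalizes header keys to "Capitalized" form internally.
print(browser_req.get_header("User-agent"))

# Fetching both with urllib.request.urlopen(...) and comparing the final
# URLs (response.geturl()) would show whether the 302 depends on the UA.
```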

Answer


You have to provide a User-Agent header to pretend to be a real browser.

You can do this directly in the spider by providing a headers dictionary on the scrapy.Request returned from start_requests():

import scrapy

class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]

    def start_requests(self):
        yield scrapy.Request(
            "http://www.bridgebase.com/vugraph/schedule.php",
            headers={
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
            })

    # rules for parsing main response
    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Alternatively, you could simply set the USER_AGENT project setting.
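A minimal sketch of that alternative, assuming the standard project layout generated by `scrapy startproject` (the exact User-Agent string is just an example):

```python
# settings.py — project-wide default User-Agent applied to every request,
# so the spider can keep using start_urls instead of overriding start_requests().
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/47.0.2526.111 Safari/537.36")
```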


Perfect. Thanks! – gogurt