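Scrapy newbie here. I'm trying to scrape some basic data from a bridge website, and for some reason I keep getting redirected back to localhost (127.0.0.1).
This doesn't happen with most other sites (for example, the dmoz example from the tutorial). My hunch is that I haven't set something up to handle the domain in question. My spider (almost identical to the one in the tutorial, except for the changed URL):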
    import scrapy


    class BboSpider(scrapy.Spider):
        name = "bbo"
        allowed_domains = ["bridgebase.com"]
        start_urls = [
            "http://www.bridgebase.com/vugraph/schedule.php"
        ]

        # rules for parsing main response
        def parse(self, response):
            filename = 'test.html'
            with open(filename, 'wb') as f:
                f.write(response.body)
The error I get is (relevant part):
2016-01-23 14:21:50 [scrapy] INFO: Scrapy 1.0.4 started (bot: bbo)
2016-01-23 14:21:50 [scrapy] INFO: Optional features available: ssl, http11
2016-01-23 14:21:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bbo.spiders', 'SPIDER_MODULES': ['bbo.spiders'], 'BOT_NAME': 'bbo'}
2016-01-23 14:21:50 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-23 14:21:50 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-23 14:21:50 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-23 14:21:50 [scrapy] INFO: Enabled item pipelines:
2016-01-23 14:21:50 [scrapy] INFO: Spider opened
2016-01-23 14:21:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-23 14:21:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-23 14:21:54 [scrapy] DEBUG: Redirecting (302) to <GET http://127.0.0.1> from <GET http://www.bridgebase.com/vugraph/schedule.php>
2016-01-23 14:21:54 [scrapy] DEBUG: Retrying <GET http://127.0.0.1> (failed 1 times): Connection was refused by other side: 111: Connection refused.
This is probably a super basic question, but I'm having a lot of trouble even getting started. Does anyone have a hunch about where to begin?
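One thing that may be worth checking, going only by the log above: the 302 to http://127.0.0.1 is sent by the server, and the spider follows it because RedirectMiddleware is enabled. Some sites respond this way to clients that identify themselves as a crawler. The sketch below assumes that is what happens here (the log does not prove it) and overrides Scrapy's `USER_AGENT` setting in the project's settings.py with a browser-like string. The exact string is only an illustration.

```python
# settings.py (sketch): send a browser-like User-Agent instead of Scrapy's default.
# Assumption: the site redirects requests whose User-Agent identifies Scrapy
# to 127.0.0.1. This is a guess based on the 302 in the log, not a confirmed cause.
# The browser string below is an arbitrary example value.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
)
```

To try it without editing settings.py, the same setting can be passed on the command line with `scrapy crawl bbo -s USER_AGENT="..."`, or set for this spider only through a `custom_settings = {"USER_AGENT": "..."}` class attribute. If the redirect persists with a different user agent, the cause is something else.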
Perfect. Thanks! – gogurt