Python Scrapy only crawls start_urls and then stops. How do I go deeper?

Why does Scrapy only crawl the start_urls and then stop? Is there a way to make Scrapy crawl every page in a site's directory tree, e.g. http://www.example.com/directory? Or is there a way to make Scrapy go deeper and follow all of the links found on the start_urls pages? Here is my spider:
class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=('',)), follow=True),
        Rule(SgmlLinkExtractor(allow=('',)), callback='parse_item')
    ]

    def parse_item(self, response):
        print response.url

    def parse(self, response):
        print response.url
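What I expect follow=True to give me is a breadth-first traversal of every link reachable from start_urls, not just the start pages themselves. A minimal stdlib-only sketch of that traversal, assuming a hypothetical in-memory site map (page_links) standing in for real HTTP responses:

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
page_links = {
    "http://www.example.com/directory": [
        "http://www.example.com/directory/a",
        "http://www.example.com/directory/b",
    ],
    "http://www.example.com/directory/a": ["http://www.example.com/directory/a/1"],
    "http://www.example.com/directory/b": [],
    "http://www.example.com/directory/a/1": [],
}

def crawl(start_urls):
    """Breadth-first traversal of every page reachable from start_urls."""
    seen = set(start_urls)
    queue = deque(start_urls)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)           # this is where a page would be "parsed"
        for link in page_links.get(url, []):
            if link not in seen:      # deduplicate, as a real scheduler does
                seen.add(link)
                queue.append(link)
    return visited

print(crawl(["http://www.example.com/directory"]))
```

With my spider, only the start_urls pages are printed; the equivalent of the two deeper pages above never gets visited.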
Here is the code from my main.py file:
dmozSpider = DmozSpider()
spider = dmozSpider
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()