I have created a spider extending CrawlSpider, following the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html, but it won't follow links.
The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links it contains.
So I defined a rule: rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_items', follow=True)], but nothing happened.
Then I tried defining a set of rules like: rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]. Now the problem is that the spider parses everything.
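As far as I understand, the allow patterns are regular expressions searched against the absolute URL, which would explain why the catch-all '/' pattern matches every link. A quick sketch to illustrate (the HTML body here is made up for illustration):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import HtmlResponse

# A fake response containing one pagination link and one ordinary link
body = '<html><body><a href="/page/2">next</a> <a href="/about">about</a></body></html>'
response = HtmlResponse(url='http://techcrunch.com', body=body)

# Matches only http://techcrunch.com/page/2
print SgmlLinkExtractor(allow=[r'/page/\d+']).extract_links(response)

# Matches both links, because '/' occurs somewhere in every absolute URL
print SgmlLinkExtractor(allow=['/']).extract_links(response)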
How can I tell the spider to parse the start_urls as well as only some of the links they contain?
UPDATE:
I tried overriding the parse_start_url method, so now I am able to get the data from the start page, but it still doesn't follow the links defined with the Rule:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import Article  # my Item class, see the definition below


class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']

    # Follow pagination links such as /page/2, /page/3, ...
    rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)]

    def parse_start_url(self, response):
        # CrawlSpider calls this for each response from start_urls
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        # Collect one Article item per headline anchor on the page
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles
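For reference, Article is an ordinary Item with the two fields used above; its definition is roughly:

from scrapy.item import Item, Field

class Article(Item):
    title = Field()
    link = Field()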
Could you post some of your code here as well, so we can identify the problem? – 2012-07-10 09:36:28