Scrapy start_urls

The script below, from this tutorial, contains two start_urls.
    from scrapy.spider import Spider
    from scrapy.selector import Selector

    from dirbot.items import Website


    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            """
            The lines below is a spider contract. For more info see:
            http://doc.scrapy.org/en/latest/topics/contracts.html

            @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
            @scrapes name
            """
            sel = Selector(response)
            sites = sel.xpath('//ul[@class="directory-url"]/li')
            items = []

            for site in sites:
                item = Website()
                item['name'] = site.xpath('a/text()').extract()
                item['url'] = site.xpath('a/@href').extract()
                item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
                items.append(item)

            return items
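But why does it scrape only these 2 pages? I see allowed_domains = ["dmoz.org"], but these two pages also contain links to other pages within the dmoz.org domain! Why doesn't it scrape those as well?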
http://doc.scrapy.org/en/latest/topics/spiders.html But rules apply to CrawlSpider! I'm inheriting from BaseSpider! – DrStrangeLove 2012-01-18 01:00:22
BaseSpider will only crawl the start URLs it is given, so I guess my original answer was a bit misleading. See http://doc.scrapy.org/en/latest/topics/spiders.html#basespider – Glenn 2012-01-18 01:10:26
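But the description of start_urls there says: "The subsequent URLs will be generated successively from data contained in the start URLs." So why doesn't it scrape those (subsequent) URLs? (Provided, of course, that they are within the dmoz.org domain.) – DrStrangeLove 2012-01-18 01:41:29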
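
To illustrate the point made in the comments: as far as I understand it, a plain spider only downloads what is in start_urls plus whatever its callback explicitly asks for, and allowed_domains only filters those follow-up requests; it does not generate any by itself. In real Scrapy you would return Request objects from parse() (or use CrawlSpider with rules) to follow links. The sketch below is not Scrapy code, just a toy model of that behaviour using the standard library. FAKE_PAGES, crawl, is_allowed and both callbacks are hypothetical names made up for this example, and the URLs are shortened placeholders.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_PAGES = {
    "http://www.dmoz.org/Books/": [
        "http://www.dmoz.org/Books/More/",
        "http://example.com/ad",
    ],
    "http://www.dmoz.org/Resources/": ["http://www.dmoz.org/Resources/More/"],
    "http://www.dmoz.org/Books/More/": [],
    "http://www.dmoz.org/Resources/More/": [],
    "http://example.com/ad": [],
}


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def parse_items_only(url, links):
    """Like parse() in the question: extracts data but requests no further pages."""
    return []


def parse_and_follow(url, links):
    """Asks the crawler to visit every link found on the page."""
    return list(links)


def crawl(start_urls, allowed_domains, callback):
    """Visit the start URLs, then only the URLs the callback asks for.

    The domain check is applied to follow-up URLs only; it never adds URLs.
    """
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for next_url in callback(url, FAKE_PAGES.get(url, [])):
            if next_url not in seen and is_allowed(next_url, allowed_domains):
                seen.add(next_url)
                queue.append(next_url)
    return visited


if __name__ == "__main__":
    start = ["http://www.dmoz.org/Books/", "http://www.dmoz.org/Resources/"]
    print(crawl(start, ["dmoz.org"], parse_items_only))
    print(crawl(start, ["dmoz.org"], parse_and_follow))
```

With parse_items_only the crawl stops after the two start URLs, which mirrors what the spider in the question does. With parse_and_follow the in-domain links are visited too, while the example.com link is dropped by the domain check.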