6

The script below, from this tutorial, contains two start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html
        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items

But why does it scrape only these 2 pages? I see allowed_domains = ["dmoz.org"], but these two pages also contain links to other pages within the dmoz.org domain! Why doesn't it scrape those as well?

Answers

2

The class doesn't have a rules attribute. Take a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.

+0

http://doc.scrapy.org/en/latest/topics/spiders.html But rules apply to CrawlSpider! I'm inheriting from BaseSpider! – DrStrangeLove 2012-01-18 01:00:22

+0

BaseSpider will only crawl the start URLs it is given, so I guess my original answer was a bit misleading. See http://doc.scrapy.org/en/latest/topics/spiders.html#basespider – Glenn 2012-01-18 01:10:26

+0

But the docs describe start_urls like this: "subsequent URLs will be generated successively from data contained in the start URLs." Why doesn't it scrape those (subsequent) URLs? (Provided, of course, they are within the dmoz.org domain.) – DrStrangeLove 2012-01-18 01:41:29

2

If you use BaseSpider, you have to extract the URLs you want yourself inside the callback and return the corresponding Request objects.

If you use CrawlSpider, link extraction is taken care of by the rules and the SgmlLinkExtractor associated with them, as the sketch below illustrates.
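For reference, here is a minimal sketch of that CrawlSpider approach. The spider name, start URL, and callback body are illustrative only and not taken from the original answer:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DmozCrawlSpider(CrawlSpider):
    # Illustrative names; adjust for your own project.
    name = 'dmoz_crawl'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/']

    # Follow every in-domain link and hand each fetched page to parse_item.
    rules = [
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        self.log('Visited %s' % response.url)
        # ... extract item fields here ...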

15

The start_urls class attribute contains the start URLs - nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback, each with [another] callback:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = [
        'http://www.domain.com/'
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse the main page and extract links to the categories.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse a category page and extract links to the items.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...

If you still want to customize how the start requests are created, override the BaseSpider.start_requests() method.
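For example, a minimal sketch of overriding start_requests() could look like this (the paging URL pattern below is purely hypothetical):

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'my_spider'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Build the initial requests by hand instead of relying on start_urls.
        # The paging pattern is hypothetical, for illustration only.
        for page in range(1, 4):
            url = 'http://www.domain.com/list?page=%d' % page
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.log('Got %s' % response.url)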

1

If you use rules to follow links (this is already implemented in Scrapy), the spider will scrape them too. I hope that helps...

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class Spider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']
    rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]

    ...
0

You haven't written functions to handle the URLs you want to get. So there are two ways to resolve this: 1. use rules (CrawlSpider); 2. write a function to handle the new URLs and pass it as the callback.