6

The script below, from this tutorial, contains two start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html
        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items

But why does it scrape only these 2 pages? I see allowed_domains = ["dmoz.org"], but these two pages also contain links to other pages within the dmoz.org domain! Why doesn't it scrape those as well?

Answers

2

The class doesn't have a rules attribute. Take a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.

+0

http://doc.scrapy.org/en/latest/topics/spiders.html But rules apply to CrawlSpider! I'm inheriting from BaseSpider! – DrStrangeLove 2012-01-18 01:00:22

+0

BaseSpider will only crawl the start URLs it is given, so I guess my original answer was a bit misleading. See http://doc.scrapy.org/en/latest/topics/spiders.html#basespider – Glenn 2012-01-18 01:10:26

+0

But the docs describe start_urls like this: "subsequent URLs will be generated successively from data contained in the start URLs." Why doesn't it scrape those (subsequent) URLs? (Provided, of course, they are within the dmoz.org domain.) – DrStrangeLove 2012-01-18 01:41:29

2

If you use BaseSpider, you have to extract the URLs you want yourself inside the callback and return the corresponding Request objects.

If you use CrawlSpider, link extraction is taken care of by the rules and the SgmlLinkExtractor associated with them, as the sketch below illustrates.
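For reference, here is a minimal sketch of that CrawlSpider approach. The spider name, start URL, and callback body are illustrative only and not taken from the original answer:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DmozCrawlSpider(CrawlSpider):
    # Illustrative names; adjust for your own project.
    name = 'dmoz_crawl'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/']

    # Follow every in-domain link and hand each fetched page to parse_item.
    rules = [
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        self.log('Visited %s' % response.url)
        # ... extract item fields here ...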

15

The start_urls class attribute contains the start URLs - nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback, each with [another] callback:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = [
        'http://www.domain.com/'
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse the main page and extract links to the categories.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse a category page and extract links to the items.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...

If you still want to customize how the start requests are created, override the BaseSpider.start_requests() method.
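For example, a minimal sketch of overriding start_requests() could look like this (the paging URL pattern below is purely hypothetical):

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'my_spider'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Build the initial requests by hand instead of relying on start_urls.
        # The paging pattern is hypothetical, for illustration only.
        for page in range(1, 4):
            url = 'http://www.domain.com/list?page=%d' % page
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.log('Got %s' % response.url)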

1

If you use rules to follow links (this is already implemented in Scrapy), the spider will scrape them too. I hope that helps...

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class Spider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']
    rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]

    ...
0

You haven't written functions to handle the URLs you want to get. So there are two ways to resolve this: 1. use rules (CrawlSpider); 2. write a function to handle the new URLs and pass it as the callback.