2016-03-01

Scrapy - cannot follow link because of encoding

I am trying to extract some data from allabolag.se. I want to follow links such as http://www.allabolag.se/5565794400/befattningar, but Scrapy does not fetch the links correctly: it drops the "52" after the "%2" in the URL.

For example, I want to go to: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

But Scrapy ends up at the link below:

http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

I read on this site that it has something to do with double encoding: https://www.owasp.org/index.php/Double_Encoding. How do I fix this?
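(For anyone unfamiliar with the term: double encoding means the percent sign itself has been percent-encoded. A quick stdlib illustration, not part of the original question:)

```python
from urllib.parse import quote, unquote

# Encoding a comma once gives %2C; encoding that result again
# encodes the '%' itself, producing %252C (double encoding).
once = quote(",")    # '%2C'
twice = quote(once)  # '%252C'

# A single decode of the double-encoded form yields '%2C', not ','.
print(once, twice, unquote(twice))
```

So a crawler that percent-decodes `%252C` once and keeps `%2C` has effectively rewritten a deliberately double-encoded URL.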

My code is below:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from allabolag.items import AllabolagItem
from scrapy.loader.processors import Join


class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]')),
             callback='parse_link'),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item

Do you get any errors while crawling? – Rahul

Answer


You can configure the link extractor not to canonicalize URLs by passing canonicalize=False.

Scrapy shell session:

$ scrapy shell http://www.allabolag.se/5565794400/befattningar 
>>> from scrapy.linkextractors import LinkExtractor 
>>> le = LinkExtractor() 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:48:07 [scrapy] DEBUG: Crawled (404) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 
>>> 

>>> le = LinkExtractor(canonicalize=False) 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:47:42 [scrapy] DEBUG: Crawled (200) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 

So you should be good with:

class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link'),
    )
    ...
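For context, the canonicalization that mangled the link roughly amounts to percent-decoding the path and then re-encoding only what needs escaping, so an intentionally double-encoded %252C collapses to %2C. A simplified stdlib sketch of that effect (the real logic lives in w3lib's canonicalize_url and handles many more cases):

```python
from urllib.parse import quote, unquote

def canonicalize_path(path):
    # Simplified sketch: percent-decode, then re-encode, keeping '%'
    # in the safe set so already-encoded sequences are not escaped
    # again. A double-encoded %252C therefore collapses to %2C.
    return quote(unquote(path), safe="/%")

print(canonicalize_path("/de_Sauvage-Nolting%252C_Henri"))
# -> /de_Sauvage-Nolting%2C_Henri
```

This is why canonicalize=False is the right knob here: it tells the link extractor to pass the href through untouched instead of normalizing it.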

Thanks a lot, that solved the problem! By the way, note that you forgot the comma after the line "restrict_xpaths=('//*[@id="printContent"]//a[1]')". – brrrglund


Oh, right! Thanks, fixed. –