2016-03-01

Scrapy - cannot follow link because of encoding

I am trying to extract some data from allabolag.se. I want to follow links such as http://www.allabolag.se/5565794400/befattningar, but Scrapy does not fetch the links correctly: it drops the "52" after the "%2" in the URL.

For example, I want to go to: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

But Scrapy ends up at the link below:

http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

I read on this site that it has something to do with double encoding: https://www.owasp.org/index.php/Double_Encoding. How do I fix this?
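(For anyone unfamiliar with the term: double encoding means the percent sign itself has been percent-encoded. A quick stdlib illustration, not part of the original question:)

```python
from urllib.parse import quote, unquote

# Encoding a comma once gives %2C; encoding that result again
# encodes the '%' itself, producing %252C (double encoding).
once = quote(",")    # '%2C'
twice = quote(once)  # '%252C'

# A single decode of the double-encoded form yields '%2C', not ','.
print(once, twice, unquote(twice))
```

So a crawler that percent-decodes `%252C` once and keeps `%2C` has effectively rewritten a deliberately double-encoded URL.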

My code is below:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from allabolag.items import AllabolagItem
from scrapy.loader.processors import Join


class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]')),
             callback='parse_link'),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item

Do you get any errors while crawling? – Rahul

Answer


You can configure the link extractor not to canonicalize URLs by passing canonicalize=False.

Scrapy shell session:

$ scrapy shell http://www.allabolag.se/5565794400/befattningar 
>>> from scrapy.linkextractors import LinkExtractor 
>>> le = LinkExtractor() 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:48:07 [scrapy] DEBUG: Crawled (404) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 
>>> 

>>> le = LinkExtractor(canonicalize=False) 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:47:42 [scrapy] DEBUG: Crawled (200) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 

So you should be good with:

class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link'),
    )
    ...
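For context, the canonicalization that mangled the link roughly amounts to percent-decoding the path and then re-encoding only what needs escaping, so an intentionally double-encoded %252C collapses to %2C. A simplified stdlib sketch of that effect (the real logic lives in w3lib's canonicalize_url and handles many more cases):

```python
from urllib.parse import quote, unquote

def canonicalize_path(path):
    # Simplified sketch: percent-decode, then re-encode, keeping '%'
    # in the safe set so already-encoded sequences are not escaped
    # again. A double-encoded %252C therefore collapses to %2C.
    return quote(unquote(path), safe="/%")

print(canonicalize_path("/de_Sauvage-Nolting%252C_Henri"))
# -> /de_Sauvage-Nolting%2C_Henri
```

This is why canonicalize=False is the right knob here: it tells the link extractor to pass the href through untouched instead of normalizing it.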

Thanks a lot, that solved the problem! By the way, note that you forgot the comma after the line "restrict_xpaths=('//*[@id="printContent"]//a[1]')". – brrrglund


Oh, right! Thanks, fixed. –