
I have a site I am crawling, and links whose URLs have whitespace before and after them are not being parsed correctly:

<a href=" /c/96894 ">Test</a> 

Instead of crawling this:

http://www.stores.com/c/96894/ 

it crawls this:

http://www.store.com/c/%0A%0A/c/96894%0A%0A 

It also causes an infinite loop of links that keep repeating the same path, like this:

http://www.store.com/cp/%0A%0A/cp/96894%0A%0A/cp/96894%0A%0A 

The whitespace before and after the URL (\r, \n, \t and spaces) is something all browsers ignore. How can I strip the whitespace from the URLs that get crawled?

Here is my code.

from scrapy.selector import HtmlXPathSelector 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

from wallspider.items import Website 

class StoreSpider(CrawlSpider): 
    name = "cpages" 
    allowed_domains = ["www.store.com"] 
    start_urls = ["http://www.store.com"] 

    rules = (
        Rule(SgmlLinkExtractor(allow=('/c/',), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page=')), 
             callback="parse_items", follow=True, 
             process_links=lambda links: [link for link in links if not link.nofollow]), 
        Rule(SgmlLinkExtractor(allow=(), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='))), 
    ) 

    def parse_items(self, response): 
        hxs = HtmlXPathSelector(response) 
        sites = hxs.select('//html') 
        items = [] 

        for site in sites: 
            item = Website() 
            item['url'] = response.url 
            item['referer'] = response.request.headers.get('Referer') 
            item['anchor'] = response.meta.get('link_text') 
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract() 
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract() 
            items.append(item) 

        return items 
+0

What code are you using? – hyades 2014-09-23 19:14:42

+0

Show some of your code. You could use 're'/string replace to do the job. – 2014-09-23 19:22:20

Answers

1

I used process_value=cleanurl in my LinkExtractor, for example:

def cleanurl(link_text): 
    return link_text.strip("\t\r\n ") 
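As a quick sanity check (a hypothetical interactive session, using the raw href value from the question), the function strips the surrounding newlines and spaces and leaves only the path:

>>> cleanurl("\n\n /c/96894 \n\n") 
'/c/96894' 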

Here is the code in case anyone runs into the same problem:

from scrapy.selector import HtmlXPathSelector 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

from wallspider.items import Website 


class storeSpider(CrawlSpider): 
    name = "cppages" 
    allowed_domains = ["www.store.com"] 
    start_urls = ["http://www.store.com"] 

    # Strips stray whitespace (and quotes) from each URL the link extractor finds. 
    def cleanurl(link_text): 
        return link_text.strip("\t\r\n '\"") 

    rules = (
        Rule(SgmlLinkExtractor(allow=('/cp/',), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='), process_value=cleanurl), 
             callback="parse_items", follow=True, 
             process_links=lambda links: [link for link in links if not link.nofollow]), 
        Rule(SgmlLinkExtractor(allow=('/cp/', '/browse/',), deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='), process_value=cleanurl)), 
    ) 

    def parse_items(self, response): 
        hxs = HtmlXPathSelector(response) 
        sites = hxs.select('//html') 
        items = [] 

        for site in sites: 
            item = Website() 
            item['url'] = response.url 
            item['referer'] = response.request.headers.get('Referer') 
            item['anchor'] = response.meta.get('link_text') 
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract() 
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract() 
            items.append(item) 

        return items 
0

You could replace the whitespace with '', like this:

url = response.url 
item['url'] = url.replace(' ', '') 

Or, using a regular expression:

import re 
url = response.url 
item['url'] = re.sub(r'\s', '', url) 
+0

Sorry, I should have clarified. Parsing the information is not the problem; it is the way Scrapy crawls the relative URLs that have the whitespace around them. – 2014-09-23 20:39:09
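If the goal is to clean the URLs before Scrapy schedules the requests (rather than fixing them in the parse callback), a minimal sketch using the Rule's process_links hook is shown below; it rewrites the url attribute of each extracted Link object, and clean_links is a name made up for illustration:

def clean_links(links): 
    # Strip surrounding whitespace from each extracted link before it is requested. 
    for link in links: 
        link.url = link.url.strip("\t\r\n ") 
    return links 

# e.g. Rule(SgmlLinkExtractor(allow=('/c/',)), process_links=clean_links, follow=True) 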