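How do I get rid of the exceptions.TypeError error?

I'm writing a scraper with Scrapy. One of the things I want it to do is compare the root domain of the current web page with the root domain of each link inside it. If the domains differ, it should go ahead and extract the data. This is my current code: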

class MySpider(Spider): 
    name = 'smm' 
    allowed_domains = ['*'] 
    start_urls = ['http://en.wikipedia.org/wiki/Social_media'] 
    def parse(self, response): 
     items = [] 
     for link in response.xpath("//a"): 
      #Extract the root domain for the main website from the canonical URL 
      hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract() 
      hostname1 = urlparse(hostname1).hostname 
      #Extract the root domain for thelink 
      hostname2 = link.xpath('@href').extract() 
      hostname2 = urlparse(hostname2).hostname 
      #Compare if the root domain of the website and the root domain of the link are different. 
      #If so, extract the items & build the dictionary 
      if hostname1 != hostname2: 
       item = SocialMediaItem() 
       item['SourceTitle'] = link.xpath('/html/head/title').extract() 
       item['TargetTitle'] = link.xpath('text()').extract() 
       item['link'] = link.xpath('@href').extract() 
       items.append(item) 
     return items

However, when I run it, I get this error:

Traceback (most recent call last): 
    File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop 
    self.runUntilCurrent() 
    File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent 
    call.func(*call.args, **call.kw) 
    File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback 
    self._startRunCallbacks(result) 
    File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks 
    self._runCallbacks() 
--- <exception caught here> --- 
    File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse 
    hostname1 = urlparse(hostname1).hostname 
    File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse 
    tuple = urlsplit(url, scheme, allow_fragments) 
    File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit 
    cached = _parse_cache.get(key, None) 
exceptions.TypeError: unhashable type: 'list'

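Can anyone help me get rid of this error? I think it has something to do with list keys, but I don't know how to fix it. Thank you very much!

Dani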


2014-12-01 Dani Valverde

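There are a few things wrong here:

There is no need to compute hostname1 inside the loop. It always selects the same link element, even when called on a sub-selector, because the XPath expression is absolute rather than relative (which is what you need here).
The XPath expression for hostname1 is malformed and matches nothing, so extract() returns an empty list. That is why taking the first element, as Kevin suggested, raises an error. The expression contains two consecutive single quotes where it needs either one escaped single quote or a double-quoted string.
You are selecting the link element itself when you should be selecting its @href attribute. The XPath expression should be changed to reflect this.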
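
The quoting problem described above can be checked without Scrapy. A minimal sketch, assuming only standard Python string rules: adjacent string literals are concatenated, so the doubled single quotes vanish and the predicate ends up comparing @rel against a child element named canonical instead of the string 'canonical'. The variable names broken and fixed are illustrative only.

```python
# The expression as written in the question: '' ends one literal and starts
# the next, so Python silently joins three literals and drops the quotes.
broken = '/html/head/link[@rel=''canonical'']'

# What was intended: a quoted string inside the predicate, selecting @href.
fixed = "/html/head/link[@rel='canonical']/@href"

print(broken)  # /html/head/link[@rel=canonical]
print(fixed)   # /html/head/link[@rel='canonical']/@href
```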

With these problems fixed, the code might look something like this (untested):

def parse(self, response):
    items = []
    hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
    hostname1 = urlparse(hostname1).hostname

    for link in response.xpath("//a"):
        hostname2 = (link.xpath('@href').extract() or [''])[0]
        hostname2 = urlparse(hostname2).hostname
        # Compare and extract
        if hostname1 != hostname2:
            ...
    return items
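
One caveat worth noting: for a relative href such as /wiki/Blog, urlparse(...).hostname is None, so the link would compare as belonging to a different domain. A possible refinement is sketched below. It assumes Python 3's urllib.parse (the question itself runs on Python 2, where the module is called urlparse), and is_external is a hypothetical helper name, not part of Scrapy. It resolves each href against the page URL before comparing hostnames.

```python
from urllib.parse import urljoin, urlparse


def is_external(page_url, href):
    """Return True if href points to a different hostname than page_url.

    Hypothetical helper, not part of Scrapy. Relative links are resolved
    against page_url first, so they count as internal.
    """
    page_host = urlparse(page_url).hostname
    link_host = urlparse(urljoin(page_url, href)).hostname
    return link_host is not None and link_host != page_host


page = 'http://en.wikipedia.org/wiki/Social_media'
print(is_external(page, '/wiki/Blog'))               # False
print(is_external(page, 'http://www.example.com/'))  # True
print(is_external(page, '#cite_note-1'))             # False
```

Inside parse, the page URL could be response.url or the canonical URL extracted from the head.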


2014-12-05 09:52:33 bosnjak

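Thank you, Lawrence. I'll try it as soon as I can. – 2014-12-05 16:40:52

@DaniValverde: Did it help? – bosnjak 2014-12-09 10:44:41

Hi Lawrence, I haven't tried it yet because I'm abroad. I'll try it when I get back. – 2014-12-09 10:46:28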

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract() 
hostname1 = urlparse(hostname1).hostname

extract returns a list of strings, but urlparse accepts only a single string. Perhaps you should discard all but the first hostname found.

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0] 
hostname1 = urlparse(hostname1).hostname

The same goes for the other hostname.

hostname2 = link.xpath('@href').extract()[0] 
hostname2 = urlparse(hostname2).hostname

If you can't be sure the document even has a hostname, it may be useful to look before you leap.

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract() 
if not hostname1: continue 
hostname1 = urlparse(hostname1[0]).hostname 

hostname2 = link.xpath('@href').extract() 
if not hostname2: continue 
hostname2 = urlparse(hostname2[0]).hostname
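
The same check can be wrapped in a small function so that both hostnames are handled alike. This is a minimal sketch, assuming Python 3's urllib.parse (on Python 2 the import would come from the urlparse module instead). first_hostname is a hypothetical name, and values stands in for the list returned by extract().

```python
from urllib.parse import urlparse


def first_hostname(values):
    """Return the hostname of the first URL in values, or None if empty.

    values stands in for the list of strings returned by extract().
    """
    if not values:
        return None
    return urlparse(values[0]).hostname


print(first_hostname([]))                                  # None
print(first_hostname(['http://en.wikipedia.org/wiki/X']))  # en.wikipedia.org
print(first_hostname(['/wiki/Blog']))                      # None
```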


2014-12-01 15:46:18 Kevin

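I tried your suggestion, but I get this error: exceptions.IndexError: list index out of range – 2014-12-01 15:49:53

In other words, the list is empty when you try to apply the [0] subscript. – 2014-12-01 15:53:00

I hadn't thought of that. However, it returns nothing. That's odd, because the source page has a canonical address (hence hostname1), and there are plenty of links whose URLs can be parsed. – 2014-12-01 16:05:23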
