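How to ignore HTTP errors when crawling URLs with Python 2.7

I am crawling several URLs to look for specific keywords in their source code. However, about halfway through the list, my spider suddenly stops because of HTTP errors such as 404 or 503.

My crawler: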

import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()

            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')

f.close()

What code do I need to add so that the crawler ignores the bad URLs that return HTTP errors and continues crawling?

Source

2017-02-20 jakeT888

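You can call getcode() on the response object and use a conditional to check for a 200 status. – tobassist

@tobassist Could you tell me exactly which lines of code I need? – jakeT888

You can use a try-except block, as demonstrated here. This lets you apply your logic to valid URLs and apply different logic to URLs that raise an HTTP error.

Applying the solution from the link to your code: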

import urllib2

keyword = ['viewport']

# The with statement closes the file, so no explicit f.close() is needed.
with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain)
            try:
                response = urllib2.urlopen(req)
                html_content = response.read()

                for searchstring in keyword:
                    if searchstring.lower() in str(html_content).lower():
                        print (strdomain, keyword, 'found')

            except urllib2.HTTPError as err:
                # Report the failed URL and move on to the next one.
                print (strdomain, 'HTTP error', err.code)

This is the correct solution for the code you provided. However, eLRuLL makes a good point: you should consider using scrapy for your web crawling needs.

Source
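As a rough sketch of the same try-except idea, the loop could be wrapped in a function that also skips URLs that never produce an HTTP response (urllib2.URLError, for example a DNS failure). The function name find_keywords, the opener parameter, and the Python 3 import fallback are assumptions made for illustration and are not part of the original code.

```python
try:
    import urllib2  # Python 2.7, as in the question
except ImportError:
    # Assumption: fall back to the Python 3 module so the sketch runs there too.
    import urllib.request as urllib2


def find_keywords(urls, keywords, opener=urllib2.urlopen):
    """Return (hits, failures) for the given URLs.

    hits is a list of (url, keyword) pairs; failures is a list of
    (url, reason) pairs for URLs that could not be fetched.
    The opener parameter is hypothetical and only exists so the
    function can be exercised without network access.
    """
    hits, failures = [], []
    for url in urls:
        url = url.strip()
        if not url:
            continue
        try:
            html = opener(url).read()
        except urllib2.HTTPError as err:
            # The server answered with an error status such as 404 or 503.
            # HTTPError is a subclass of URLError, so it must be caught first.
            failures.append((url, 'HTTP %d' % err.code))
            continue
        except urllib2.URLError as err:
            # No HTTP response at all (DNS failure, refused connection, ...).
            failures.append((url, str(err.reason)))
            continue
        text = html.decode('utf-8', 'replace').lower()
        for keyword in keywords:
            if keyword.lower() in text:
                hits.append((url, keyword))
    return hits, failures
```

With the file from the question, this might be called as `hits, failures = find_keywords(open('listofURLs.csv'), ['viewport'])`. Collecting the failures instead of printing them makes it easy to review or retry the bad URLs afterwards. Note that a line without a scheme such as http:// makes urlopen raise ValueError, which this sketch does not catch.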

2017-02-21 20:55:34 tobassist

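Thanks! Why is scrapy so much better than my code? – jakeT888

@jakeT888 scrapy includes the tools and mechanisms needed to handle most web-crawling problems. In your case, it already handles bad response statuses without breaking your crawler. – eLRuLL

I would recommend using the Scrapy framework for crawling.

Source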

2017-02-20 23:34:06 eLRuLL
