
RobotParser raises an SSL certificate verification failure

I am writing a simple web crawler in Python 2.7, and I get an SSL certificate verification failure exception when trying to retrieve the robots.txt file from an HTTPS website.

Here is the relevant code:

def getHTMLpage(pagelink, currenttime):
    "Downloads HTML page from server"
    #init
    #parse URL and get domain name
    o = urlparse.urlparse(pagelink, "http")
    if o.netloc == "":
        netloc = re.search(r"[^/]+\.[^/]+\.[^/]+", o.path)
        if netloc:
            domainname = "http://" + netloc.group(0) + "/"
    else:
        domainname = o.scheme + "://" + o.netloc + "/"
    if o.netloc != "" and o.netloc is not None and o.scheme != "mailto": #if netloc isn't empty and it's not a mailto link
        link = domainname + o.path[1:] + o.params + "?" + o.query + "#" + o.fragment
        if not robotfiledictionary.get(domainname): #if robots.txt for domainname was not downloaded yet
            robotfiledictionary[domainname] = robotparser.RobotFileParser() #initialize robots.txt parser
            robotfiledictionary[domainname].set_url(domainname + "robots.txt") #set url for robots.txt
            print " Robots.txt for %s initial download" % str(domainname)
            robotfiledictionary[domainname].read() #download/read robots.txt
        else: #robots.txt for domainname was already downloaded
            if (currenttime - robotfiledictionary[domainname].mtime()) > 3600: #if robots.txt is older than 1 hour
                robotfiledictionary[domainname].read() #download/read robots.txt again
                print " Robots.txt for %s downloaded" % str(domainname)
                robotfiledictionary[domainname].modified() #update time
        if robotfiledictionary[domainname].can_fetch("WebCrawlerUserAgent", link): #if access is allowed...
            #fetch page
            print link
            page = requests.get(link, verify=False)
            return page.text #Response.text is a property, not a method
        else: #otherwise, report
            print " URL disallowed due to robots.txt from %s" % str(domainname)
            return "URL disallowed due to robots.txt"
    else: #if netloc was empty, URL wasn't parsed. report
        print "URL not parsed: %s" % str(pagelink)
        return "URL not parsed"

And here is the exception I get:

 Robots.txt for https://ehi-siegel.de/ initial download
Traceback (most recent call last):
  File "C:\webcrawler.py", line 561, in <module>
    HTMLpage = getHTMLpage(link, loopstarttime)
  File "C:\webcrawler.py", line 122, in getHTMLpage
    robotfiledictionary[domainname].read() #download/read robots.txt
  File "C:\Python27\lib\robotparser.py", line 58, in read
    f = opener.open(self.url)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 1053, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 897, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 859, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1278, in connect
    server_hostname=server_hostname)
  File "C:\Python27\lib\ssl.py", line 353, in wrap_socket
    _context=self)
  File "C:\Python27\lib\ssl.py", line 601, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 830, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

As you can see, I already changed the code at the end to ignore SSL certificates when retrieving the page (I know this is frowned upon in production, but I wanted to test it), but now it seems that the robotparser read() function itself is failing SSL verification. I have seen that I could download the certificate manually and point the program at it, but ideally I would like my program to work "out of the box", since I will not be the one using it. Does anyone know what to do?

EDIT: I went into robotparser.py and added

import requests 

and changed line 58 to

f = requests.get(self.url, verify=False) 

and that seems to have fixed it. It is still not ideal, though, so I am still open to suggestions on how to do this properly.
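
One less invasive alternative is to subclass RobotFileParser instead of editing the standard library in place. The following is only a sketch (the class name and timeout are illustrative, not from the original post); it keeps certificate verification switched on by relying on the CA bundle that requests ships with:

import requests
import robotparser

class RequestsRobotFileParser(robotparser.RobotFileParser):
    """RobotFileParser that fetches robots.txt via requests, with TLS verification."""
    def read(self):
        try:
            # requests verifies certificates against certifi's CA bundle by default
            response = requests.get(self.url, timeout=10)
        except requests.exceptions.RequestException:
            self.allow_all = True  # treat an unreachable robots.txt as "allow everything"
            return
        if response.status_code in (401, 403):
            self.disallow_all = True
        elif 400 <= response.status_code < 500:
            self.allow_all = True
        elif response.status_code == 200:
            self.parse(response.text.splitlines())

The crawler could then store RequestsRobotFileParser objects in robotfiledictionary instead of robotparser.RobotFileParser, leaving the standard library untouched.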

Answer


I found the solution myself. Using urllib3's request functionality, I was able to verify the certificates of all sites and keep accessing them.

I still had to edit the robotparser.py file. This is what I added at the top:

import urllib3 
import urllib3.contrib.pyopenssl 
import certifi 
urllib3.contrib.pyopenssl.inject_into_urllib3() 
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where()) 
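
As a quick sanity check (not part of the original answer), the pool manager can be exercised directly against the robots.txt from the traceback above; the URL is only an example:

# Fetch a robots.txt through the verified pool manager; a certificate problem
# now surfaces as an urllib3 SSLError/MaxRetryError instead of being ignored.
resp = http.request('GET', 'https://ehi-siegel.de/robots.txt')
print resp.status                   # 200 if the fetch and TLS verification succeeded
print resp.data.splitlines()[:5]    # first few lines of the robots.txt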

And this is the definition of read(self):

def read(self):
    """Reads the robots.txt URL and feeds it to the parser."""
    f = http.request('GET', self.url)                        # verified request via the PoolManager above
    lines = [line.strip() for line in f.data.splitlines()]   # split the body into lines, not characters
    self.errcode = f.status                                  # status code comes from the urllib3 response
    if self.errcode in (401, 403):
        self.disallow_all = True
    elif self.errcode >= 400 and self.errcode < 500:
        self.allow_all = True
    elif self.errcode == 200 and lines:
        self.parse(lines)
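
A short sketch of how the crawler's robots.txt cache drives this method, assuming the robotfiledictionary from the question (the example URL comes from the traceback):

import robotparser

robotfiledictionary = {}
domainname = u"https://ehi-siegel.de/"

parser = robotfiledictionary.setdefault(domainname, robotparser.RobotFileParser())
parser.set_url(domainname + u"robots.txt")
parser.read()       # now goes through the urllib3 PoolManager with certificate verification
parser.modified()   # record the download time so the one-hour refresh check has something to compare against
print parser.can_fetch("WebCrawlerUserAgent", (domainname + u"index.html").encode('utf-8'))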

I used the same approach in the function of my program that requests the actual pages:

def getHTMLpage(pagelink, currenttime):
    "Downloads HTML page from server"
    #init
    #parse URL and get domain name
    o = urlparse.urlparse(pagelink, u"http")
    if o.netloc == u"":
        netloc = re.search(ur"[^/]+\.[^/]+\.[^/]+", o.path)
        if netloc:
            domainname = u"http://" + netloc.group(0) + u"/"
    else:
        domainname = o.scheme + u"://" + o.netloc + u"/"
    if o.netloc != u"" and o.netloc is not None and o.scheme != u"mailto": #if netloc isn't empty and it's not a mailto link
        link = domainname + o.path[1:] + o.params + u"?" + o.query + u"#" + o.fragment
        if not robotfiledictionary.get(domainname): #if robots.txt for domainname was not downloaded yet
            robotfiledictionary[domainname] = robotparser.RobotFileParser() #initialize robots.txt parser
            robotfiledictionary[domainname].set_url(domainname + u"robots.txt") #set url for robots.txt
            print u" Robots.txt for %s initial download" % str(domainname)
            robotfiledictionary[domainname].read() #download/read robots.txt
        else: #robots.txt for domainname was already downloaded
            if (currenttime - robotfiledictionary[domainname].mtime()) > 3600: #if robots.txt is older than 1 hour
                robotfiledictionary[domainname].read() #download/read robots.txt again
                print u" Robots.txt for %s downloaded" % str(domainname)
                robotfiledictionary[domainname].modified() #update time
        if robotfiledictionary[domainname].can_fetch("WebCrawlerUserAgent", link.encode('utf-8')): #if access is allowed...
            #fetch page
            if domainname == u"https://www.otto.de/" or domainname == u"http://www.otto.de":
                driver.get(link.encode('utf-8'))
                time.sleep(5)
                page = driver.page_source
                return page
            else:
                page = http.request('GET', link.encode('utf-8'))
                return page.data.decode('UTF-8', 'ignore')
        else: #otherwise, report
            print u" URL disallowed due to robots.txt from %s" % str(domainname)
            return u"URL disallowed due to robots.txt"
    else: #if netloc was empty, URL wasn't parsed. report
        print u"URL not parsed: %s" % str(pagelink)
        return u"URL not parsed"

You will also notice that I changed my program to use strict UTF-8 throughout, but that is unrelated.
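
The driver used in the otto.de branch and the robotfiledictionary cache are created elsewhere in the crawler; a minimal sketch of that setup, with the browser choice and loop variables assumed rather than taken from the original post:

import time
from selenium import webdriver

robotfiledictionary = {}        # per-domain cache of RobotFileParser objects
driver = webdriver.Firefox()    # any Selenium-backed browser would work for the otto.de branch

loopstarttime = time.time()
HTMLpage = getHTMLpage(u"https://ehi-siegel.de/", loopstarttime)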