2015-11-06 81 views
1

Python web-crawling code throws a connection error

I have the following Python code for web crawling. When I try to run it, it throws the error shown below.

import lxml.html
import requests
from bs4 import BeautifulSoup

url1 = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;'
url2 = 'page='
url3 = 'size=200;template=results;type=batting'
url5 = ['http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;size=200;template=results;type=batting']
for i in range(2, 3854):
    url4 = url1 + url2 + str(i) + ';' + url3
    url5.append(url4)
for page in url5:
    source_code = requests.get(page, verify=False)
    # just get the code, no headers or anything
    plain_text = source_code.text
    # BeautifulSoup objects can be sorted through easily
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': 'data-link'}):
        href = "https://www.espncricinfo.com" + link.get('href')
        title = link.string  # just the text, not the HTML
        source_code = requests.get(href)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "lxml")
        # if you want to gather information from that page
        for item_name in soup.findAll('span', {'class': 'ciPlayerinformationtxt'}):
            print(item_name.string)

Error:

Traceback (most recent call last): 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 559, in urlopen 
    body=body, headers=headers) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 345, in _make_request 
    self._validate_conn(conn) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 782, in _validate_conn 
    conn.connect() 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connection.py", line 266, in connect 
    match_hostname(cert, self.assert_hostname or hostname) 
    File "C:\Python34\lib\ssl.py", line 285, in match_hostname 
    % (hostname, ', '.join(map(repr, dnsnames)))) 
ssl.CertificateError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net' 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 369, in send
    timeout=timeout
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 588, in urlopen
    raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): 
    File "C:/Python34/intplayername.py", line 23, in <module> 
    source_code = requests.get(href) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 69, in get 
    return request('get', url, params=params, **kwargs) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 50, in request 
    response = session.request(method=method, url=url, **kwargs) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 471, in request 
    resp = self.send(prep, **send_kwargs) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 579, in send 
    r = adapter.send(request, **kwargs) 
    File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 430, in send 
    raise SSLError(e, request=request) 
requests.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net' 

Answers

4

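This is caused by a misconfigured HTTPS certificate on the site you are trying to crawl. As a workaround, you can turn off certificate verification in requests: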

requests.get(href, verify=False) 

Be advised that turning off this check is not a recommended practice when you are working with sensitive information.
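To illustrate the mismatch reported in the traceback, here is a minimal sketch of wildcard hostname matching. It is a simplification, not the actual logic in `ssl.match_hostname`: the function name `hostname_matches` is hypothetical, and it assumes a `*` may stand only for the whole left-most label. The certificate names are copied from the error message.

```python
def hostname_matches(hostname, pattern):
    """Simplified check: '*' may replace exactly one left-most label."""
    host_labels = hostname.lower().split(".")
    pattern_labels = pattern.lower().split(".")
    # A wildcard never spans several labels, so the label counts must agree.
    if len(host_labels) != len(pattern_labels):
        return False
    # The left-most label must be equal or be the wildcard.
    if pattern_labels[0] != "*" and pattern_labels[0] != host_labels[0]:
        return False
    # All remaining labels must match exactly.
    return pattern_labels[1:] == host_labels[1:]


# Names listed in the certificate, as quoted in the error message.
cert_names = [
    "a248.e.akamai.net",
    "*.akamaihd.net",
    "*.akamaihd-staging.net",
    "*.akamaized.net",
    "*.akamaized-staging.net",
]

matches = [name for name in cert_names
           if hostname_matches("www.espncricinfo.com", name)]
print(matches)
```

None of the certificate names covers `www.espncricinfo.com`, so the list is empty and verification fails before any page content is requested.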

+0

Thanks. I made the change you suggested and now get the following warning. Warning (from warnings module): File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 789 InsecureRequestWarning) InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html

+0

Further down that page you can find how to disable the warning: https://urllib3.readthedocs.org/en/latest/security.html#禁用的警告 – kosii