2012-09-06 59 views
8

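urllib2 returns 404 for a website that displays fine in a browser

I cannot open one particular URL with urllib2. The same approach works for other sites, such as "http://www.google.com", but not for this one, even though the site displays fine in a browser.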

My simple code:

from BeautifulSoup import BeautifulSoup 
import urllib2 

url="http://www.experts.scival.com/einstein/" 
response=urllib2.urlopen(url) 
html=response.read() 
soup=BeautifulSoup(html) 
print soup 

Can anyone help me get this working?

Here is the error I get:

Traceback (most recent call last): 
    File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module> 
    response=urllib2.urlopen(url); 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen 
    return _opener.open(url, data, timeout) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open 
    response = meth(req, response) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error 
    result = self._call_chain(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain 
    result = func(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302 
    return self.parent.open(new, timeout=req.timeout) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open 
    response = meth(req, response) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error 
    return self._call_chain(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain 
    result = func(*args) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default 
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) 
urllib2.HTTPError: HTTP Error 404: Not Found 

Thanks

+1

What is your error? –

+3

Stop adding semicolons at the end of lines. This is Python. – FogleBird

+0

My error was about GET parameters, but I don't think that is your problem –

Answers

8

I just tried this and got a 404 status code together with a page body.

My guess is that the site does user-agent detection and, whether by accident or on purpose, does not serve content to Python's urllib.

To clarify: with urllib, urlopen returns a response object carrying the 404 code and the HTML content. With urllib2.urlopen, a urllib2.HTTPError exception is raised instead.
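To illustrate the difference described above, here is a minimal sketch that returns the status code and the body whether or not the server answers with an error. It relies on the fact that HTTPError can itself be read like a response. The helper name `fetch_status_and_body` and its `opener` parameter are my own, not part of urllib2, and the fallback import is only there so the same sketch also runs on Python 3.

```python
try:  # Python 2, as in the question
    from urllib2 import urlopen, HTTPError
except ImportError:  # Python 3 moved these names
    from urllib.request import urlopen
    from urllib.error import HTTPError


def fetch_status_and_body(url, opener=urlopen):
    """Return (status_code, body), even for error responses such as 404.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be exercised without touching the network.
    """
    try:
        response = opener(url)
    except HTTPError as err:
        # HTTPError doubles as a response object, so the error page
        # the server sent along with the 404 can still be read.
        return err.code, err.read()
    return response.getcode(), response.read()
```

Calling `fetch_status_and_body("http://www.experts.scival.com/einstein/")` should then give back the 404 and whatever page came with it, rather than a traceback.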

I suggest you try setting your user agent to something that looks like a browser. There is a question about exactly this: Changing user agent on urllib2.urlopen
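A rough sketch of that suggestion, assuming the 404 really is caused by user-agent detection (which is only a guess). The User-Agent string and the helper names `build_browser_request` and `fetch` are illustrative, not anything the site or urllib2 prescribes. The fallback import lets the same code run on Python 3.

```python
try:  # Python 2, as in the question
    from urllib2 import Request, urlopen
except ImportError:  # Python 3 renamed the module
    from urllib.request import Request, urlopen

# Hypothetical browser-like value; any real browser's string should do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a request that announces a browser instead of Python-urllib."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Open the URL with the browser-like User-Agent and return the body."""
    return urlopen(build_browser_request(url)).read()
```

With this in place, `html = fetch("http://www.experts.scival.com/einstein/")` can replace the plain `urllib2.urlopen(url)` call from the question. If the server still answers 404, the cause is something other than the user agent.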

+0

That was my guess too, you beat me to it. – FogleBird

0

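Hm... are you sure this URL is valid? Try "http://www.google.com". I have similar code and have no problems with urllib. Alternatively, you can use a try/except statement to see the details of the error. Of course, MattH's answer is most likely the right one :)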

3

You can use try/except to catch the error:

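import urllib2

req = urllib2.Request("http://www.experts.scival.com/einstein/")
try:
    u = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Show the HTTP status code and reason instead of a traceback
    print e.code
    print e.msg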