2013-07-22 15 views

I'm trying to get through a proxy server with Python in order to scrape some information from a website. So far I have the code below, but it doesn't seem to get me past the proxy server.

import requests 
import BeautifulSoup 



URL = 'http://proxy.library.upenn.edu/login?url=http://clients1.ibisworld.com/' 

session = requests.session() 


# This is the form data that the page sends when logging in 
login_data = { 
    'pennkey': "****", 
    'password': "****", 
    'submit': 'login', 
} 

# Authenticate 
r = session.post(URL, data=login_data) 


doc = BeautifulSoup.BeautifulSoup(r.content) 


print doc 

EDIT: this is what it prints:

Gorkems-MacBook-Pro:desktop gorkemyurtseven$ python extract.py 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html> 
<head> 
<meta name="HandheldFriendly" content="True" /> 
<meta name="viewport" content="width=device-width, height=device-height, user-scalable=yes, minimum-scale=.5" /> 
<title>University of Pennsylvania Libraries Proxy Service - Login</title> 
<link href="/public/proxysm.css" media="print, screen" rel="stylesheet" type="text/css" /> 
<script language="javascript"> 
    function validate(){ 
     var isgoldcard = document.authenticate.pass.value; 
     var isgoldcardRegxp = /00000/;  
     if (isgoldcardRegxp.test(isgoldcard) == true) 
     alert("Authentication is by PennKey only."); 
    } 
</script> 
<script type="text/javascript"> 
    var _gaq = _gaq || []; 
    _gaq.push(['_setAccount', 'UA-982196-4']); 
    _gaq.push(['_trackPageview']); 

    (function() { 
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; 
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; 
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); 
    })(); 

</script> 
<!--[if IE]&gt; 
&lt;style&gt; 
table, form .limitwidth {width: 252px;} 
.holdsubmit {width: 143px;} 
&lt;/style&gt; 
&lt;![endif]--> 
</head> 
<body onload="document.authenticate.user.focus();"> 
<div id="logostripe"> 
<div><a href="http://www.library.upenn.edu/"><img src="/public/librarieslogologin.gif" border="0" alt="Penn Libraries Home" /></a></div> 
</div> 
<h1>Libraries Proxy Service</h1> 
<div id="holder"> 
<form name="authenticate" action="https://proxy.library.upenn.edu/login" method="post" autocomplete="off"> 
<div class="limitwidth"> 
<input type="hidden" name="url" value="http://clients1.ibisworld.com/" /> 
<script type="text/javascript"> 
      var t = location.search; 
      t = t.substr(t.indexOf('proxySessionID')+15,t.indexOf('&amp;')-16); 
      document.cookie="proxySessionID="+escape(t)+"; path=/; domain=.library.upenn.edu"; 
     </script> 
<table align="center" cellspacing="0" cellpadding="2" border="0"> 
<tr> 
<td class="holdlabels"><label for="user">PennKey:</label></td> 
<td><input type="text" name="user" /></td> 
</tr> 
<tr> 
<td><label for="password">Password:</label></td> 
<td><input type="password" name="pass" onblur="validate(); return false;" /></td> 
</tr> 
<tr> 
<td></td> 
<td class="holdsubmit"> 
<div><input type="submit" value="Login" /></div> 
</td> 
</tr> 
</table> 
</div> 
</form> 
<ul class="moreinfo"> 
<li><a class="menuitem" href="http://www.upenn.edu/computing/pennkey">PennKey information</a></li> 
</ul> 
<div class="notices"> 
    The Library Proxy Service allows you to use 
domain-restricted resources &amp; services by authenticating yourself as Penn Faculty, 
Student, or Staff. 
</div> 
<div class="alert"> 

Please note limitations on the use of restricted online resources. 
<br /><br /> 
PennKey holders must be current faculty, student, or staff, have valid University PennCommunity credentials and abide by stated <a href="http://www.library.upenn.edu/policies/appropriate-use-policy.html">Restrictions On Use</a>. 
<br /><br /> 
In addition, users agree to the <a href="http://www.upenn.edu/computing/policy/aup.html">University's Appropriate Use Policy</a>. 
</div> 
</div><!-- close holder --> 
</body> 
</html> 
What is the error? – aestrivex

Just edited my question.. –

Answer

Here is a solution that works for me (also using Penn's proxy server):

import requests 
from bs4 import BeautifulSoup 

# Authenticate against the proxy server itself, not the login form
proxies = {'https': 'https://proxy.library.upenn.edu'} 
auth = requests.auth.HTTPProxyAuth('[username]', '[password]') 
r = requests.get('http://www.example.com/', proxies=proxies, auth=auth) 
print(BeautifulSoup(r.content, 'html.parser'))

The first key point is that the proxy is https, not http (this took me far too long to figure out). Next, you need to use requests.auth.HTTPProxyAuth to authenticate with the server. Once you've set those two values, you should be able to navigate anywhere.
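For what it's worth, you can also attach the same two settings to a requests.Session so they are applied to every request automatically instead of being passed on each call. A minimal sketch (the target URL is just an example, and the bracketed credentials are placeholders for your own PennKey):

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Both settings from the answer above, set once on the session
session.proxies = {'https': 'https://proxy.library.upenn.edu'}
session.auth = requests.auth.HTTPProxyAuth('[username]', '[password]')

# Every request made through the session now goes via the proxy:
# r = session.get('http://www.example.com/')
# print(BeautifulSoup(r.content, 'html.parser').title)
```

This way a multi-page scrape doesn't have to repeat the proxies= and auth= arguments, and any cookies the proxy sets are carried along as well.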