2012-08-15 93 views

I want to log in to my Yahoo account from a script running on an Ubuntu server. I tried using Python with mechanize, but there is a flaw in my plan. How can I log in programmatically from an Ubuntu server?

Here is my current code:

 import cookielib 
 import mechanize 

 loginurl = "https://login.yahoo.com/config/login" 
 br = mechanize.Browser() 
 cj = cookielib.LWPCookieJar() 
 br.set_cookiejar(cj) 
 br.set_handle_equiv(True) 
 br.set_handle_gzip(True) 
 br.set_handle_redirect(True) 
 br.set_handle_referer(True) 
 br.set_handle_robots(False) 
 br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) 
 br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
 r = br.open(loginurl) 
 html = r.read() 
 br.select_form(nr=0) 
 br.form['login'] = '[mylogin]' 
 br.form['passwd'] = '[mypassword]' 
 br.submit() 

 print br.response().read() 

The response I get back is a Yahoo login page with bold red text reading "JavaScript must be enabled in your browser", or something like that. There is a section in the mechanize documentation that mentions pages which create cookies with JS, but the help page returns an HTTP 400 (just my luck).

Figuring out what the JavaScript does and then doing it manually sounds like a very difficult task. I would happily switch to any tool/language, as long as it can run on an Ubuntu server, even if that means using a different tool for the login and then passing the login cookies back to my Python script. Any help/advice is appreciated.
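One way to do that hand-off is through a cookie file: whatever tool performs the login saves its cookies to disk, and the Python script loads them into a fresh jar. A minimal sketch of the Python side, assuming the other tool can write an LWP-format cookie file (the file name here is a placeholder):

```python
# Sketch: persist login cookies to a file and reload them in another run.
# Assumes an LWP-format cookie file; "yahoo_cookies.txt" is a placeholder name.
try:
    import cookielib                         # Python 2
except ImportError:
    import http.cookiejar as cookielib       # Python 3

def save_cookies(jar, path):
    # Keep session cookies too, since the login cookie may be one.
    jar.save(path, ignore_discard=True, ignore_expires=True)

def load_cookies(path):
    jar = cookielib.LWPCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar
```

A jar loaded this way can then be handed to mechanize via `br.set_cookiejar(jar)` and reused across script runs.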

Update:

  • I do not want to use the Yahoo API

  • I also tried it with Scrapy, but I think the same problem occurs

My Scrapy script:

from scrapy.spider import BaseSpider 
from scrapy.http import FormRequest, Request 
from scrapy.selector import HtmlXPathSelector 
from scrapy import log 

class YahooSpider(BaseSpider): 
    name = "yahoo" 
    start_urls = [ 
        "https://login.yahoo.com/config/login?.intl=us&.lang=en-US&.partner=&.last=&.src=&.pd=_ver%3D0%26c%3D%26ivt%3D%26sg%3D&pkg=&stepid=&.done=http%3a//my.yahoo.com" 
    ] 

    def parse(self, response): 
        x = HtmlXPathSelector(response) 
        print x.select("//input/@value").extract() 
        return [FormRequest.from_response(response, 
            formdata={'login': '[my username]', 'passwd': '[mypassword]'}, 
            callback=self.after_login)] 

    def after_login(self, response): 
        # check login succeeded before going on 
        if response.url == 'http://my.yahoo.com': 
            return Request("[where i want to go next]", 
                callback=self.next_page, errback=self.error, dont_filter=True) 
        else: 
            print response.url 
            self.log("Login failed.", level=log.CRITICAL) 

    def next_page(self, response): 
        x = HtmlXPathSelector(response) 
        print x.select("//title/text()").extract() 

The Scrapy script only outputs "https://login.yahoo.com/config/login"... boo.


Isn't there a Yahoo API for that sort of thing? – 2012-08-15 18:05:38


Yes, but unfortunately its functionality is limited – DrLazer 2012-08-15 19:11:03


I didn't have any problems using your script. – xbb 2012-08-17 19:02:16

Answers


I'm surprised that this works:

Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> from BeautifulSoup import BeautifulSoup as BS 
>>> import requests 
>>> r = requests.get('https://login.yahoo.com/') 
>>> soup = BS(r.text) 
>>> login_form = soup.find('form', attrs={'name':'login_form'}) 
>>> hiddens = login_form.findAll('input', attrs={'type':'hidden'}) 
>>> payload = {} 
>>> for h in hiddens: 
...  payload[str(h.get('name'))] = str(h.get('value')) 
... 
>>> payload['login'] = '[email protected]' 
>>> payload['passwd'] = '********' 
>>> post_url = str(login_form.get('action')) 
>>> r2 = requests.post(post_url, cookies=r.cookies, data=payload) 
>>> r3 = requests.get('http://my.yahoo.com', cookies=r2.cookies) 
>>> page = r3.text 
>>> pos = page.find('testtest481') 
>>> print page[ pos - 50 : pos + 300 ] 
    You are signed in as: <span class="yuhead-yid">testtest481</span>  </li> </ul></li><li id="yuhead-me-signout" class="yuhead-me"><a href=" 
http://login.yahoo.com/config/login?logout=1&.direct=2&.done=http://www.yahoo.com&amp;.src=my&amp;.intl=us&amp;.lang=en-US" target="_top" rel="nofoll 
ow">   Sign Out  </a><img width='0' h 
>>> 

Please give this a try:

"""
ylogin.py - how-to-login-to-yahoo-programatically-from-an-ubuntu-server

http://stackoverflow.com/questions/11974478/
Test my.yahoo.com login using requests and BeautifulSoup.
"""

from BeautifulSoup import BeautifulSoup as BS
import requests

CREDS = {'login': 'CHANGE ME',
         'passwd': 'CHANGE ME'}
URLS = {'login': 'https://login.yahoo.com/',
        'post': 'https://login.yahoo.com/config/login?',
        'home': 'http://my.yahoo.com/'}

def test():
    cookies = get_logged_in_cookies()
    req_with_logged_in_cookies = requests.get(URLS['home'], cookies=cookies)
    assert 'You are signed in' in req_with_logged_in_cookies.text
    print "If you can see this message you must be logged in."

def get_logged_in_cookies():
    req = requests.get(URLS['login'])
    hidden_inputs = BS(req.text).find('form', attrs={'name':'login_form'})\
                                .findAll('input', attrs={'type':'hidden'})
    data = dict(CREDS.items() + dict((h.get('name'), h.get('value')) \
                                     for h in hidden_inputs).items())
    post_req = requests.post(URLS['post'], cookies=req.cookies, data=data)
    return post_req.cookies

test()

Add error handling as needed.
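For what it's worth, the hidden-input harvesting step in the script above can also be done with only the standard library, in case BeautifulSoup is not available. This is just a sketch of the same idea, not the answer's code:

```python
# Sketch: collect hidden <input> name/value pairs from a login form
# using only the standard library, as a fallback for BeautifulSoup.
try:
    from HTMLParser import HTMLParser          # Python 2
except ImportError:
    from html.parser import HTMLParser         # Python 3

class HiddenInputs(HTMLParser):
    """Accumulate name -> value for every <input type="hidden">."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            a = dict(attrs)
            if a.get('type') == 'hidden' and a.get('name'):
                self.fields[a['name']] = a.get('value', '')

def hidden_fields(html):
    parser = HiddenInputs()
    parser.feed(html)
    return parser.fields
```

The resulting dict can be merged with the login/passwd credentials and POSTed exactly as the script does with its `data` dict.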


I'm surprised, too, that it works. I copied your script to the letter, only replacing your account with mine. My result is no output... because "pos" is -1, and I still appear to be on the login page. The only differences I can think of are that my Python version is 2.7.2+ with [GCC 4.6.1] on linux2 – DrLazer 2012-08-22 17:30:14


I've added a bounty to move things along a bit. – DrLazer 2012-08-22 17:36:36


Sorry for the delay; I was away for the bank holiday weekend. OK, I've now tried your modified script (thanks for posting). – DrLazer 2012-08-28 15:24:05


Your Scrapy script works for me:

from scrapy.spider import BaseSpider 
from scrapy.http import FormRequest 
from scrapy.selector import HtmlXPathSelector 

class YahooSpider(BaseSpider): 
    name = "yahoo" 
    start_urls = [ 
        "https://login.yahoo.com/config/login?.intl=us&.lang=en-US&.partner=&.last=&.src=&.pd=_ver%3D0%26c%3D%26ivt%3D%26sg%3D&pkg=&stepid=&.done=http%3a//my.yahoo.com" 
    ] 

    def parse(self, response): 
        x = HtmlXPathSelector(response) 
        print x.select("//input/@value").extract() 
        return [FormRequest.from_response(response, 
            formdata={'login': '<username>', 'passwd': '<password>'}, 
            callback=self.after_login)] 

    def after_login(self, response): 
        self.log('Login successful: %s' % response.url) 

Output:

[email protected]:myproj$ scrapy crawl yahoo 
2012-08-22 20:55:31-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: drzyahoo) 
2012-08-22 20:55:31-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2012-08-22 20:55:31-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2012-08-22 20:55:31-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2012-08-22 20:55:31-0500 [scrapy] DEBUG: Enabled item pipelines: 
2012-08-22 20:55:31-0500 [yahoo] INFO: Spider opened 
2012-08-22 20:55:31-0500 [yahoo] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2012-08-22 20:55:31-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2012-08-22 20:55:31-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2012-08-22 20:55:32-0500 [yahoo] DEBUG: Crawled (200) <GET https://login.yahoo.com/config/login?.intl=us&.lang=en-US&.partner=&.last=&.src=&.pd=_ver%3D0%26c%3D%26ivt%3D%26sg%3D&pkg=&stepid=&.done=http%3a//my.yahoo.com> (referer: None) 
[u'1', u'', u'', u'', u'', u'', u'', u'us', u'en-US', u'', u'', u'93s42g583b3cg', u'0', u'L0iOlEQ1EbZ24TfLRpA43s5offgQ', u'', u'', u'', u'', u'', u'0', u'Y', u'http://my.yahoo.com', u'_ver=0&c=&ivt=&sg=', u'0', u'0', u'0', u'5', u'5', u'', u'y'] 
2012-08-22 20:55:32-0500 [yahoo] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <POST https://login.yahoo.com/config/login> 
2012-08-22 20:55:33-0500 [yahoo] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: https://login.yahoo.com/config/login?.intl=us&.lang=en-US&.partner=&.last=&.src=&.pd=_ver%3D0%26c%3D%26ivt%3D%26sg%3D&pkg=&stepid=&.done=http%3a//my.yahoo.com) 
2012-08-22 20:55:33-0500 [yahoo] DEBUG: Login successful: http://my.yahoo.com 
2012-08-22 20:55:33-0500 [yahoo] INFO: Closing spider (finished) 
2012-08-22 20:55:33-0500 [yahoo] INFO: Dumping spider stats: 
    {'downloader/request_bytes': 2447, 
    'downloader/request_count': 3, 
    'downloader/request_method_count/GET': 2, 
    'downloader/request_method_count/POST': 1, 
    'downloader/response_bytes': 77766, 
    'downloader/response_count': 3, 
    'downloader/response_status_count/200': 3, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2012, 8, 23, 1, 55, 33, 837619), 
    'request_depth_max': 1, 
    'scheduler/memory_enqueued': 3, 
    'start_time': datetime.datetime(2012, 8, 23, 1, 55, 31, 271262)} 

Environment:

[email protected]:myproj$ scrapy version -v 
Scrapy : 0.15.1 
lxml : 2.3.2.0 
libxml2 : 2.7.8 
Twisted : 11.1.0 
Python : 2.7.3 (default, Aug 1 2012, 05:14:39) - [GCC 4.6.3] 
Platform: Linux-3.2.0-29-generic-x86_64-with-Ubuntu-12.04-precise 

When JS is required and no display is available, phantomjs is a good solution, though it's js, not python :$


If the page is using JavaScript, you might consider using something like ghost.py instead of requests or mechanize. ghost.py hosts a WebKit client and should be able to handle these tricky situations with minimal effort.


Oh that's nice, I didn't know there was a python equivalent of phantomjs :) – Tshirtman 2012-08-29 21:18:55


You could try PhantomJS, a headless WebKit with a JavaScript API http://phantomjs.org/ It supports programmatically driven, JavaScript-enabled browsing.


Why not use FancyURLopener? It handles standard HTTP errors and has a prompt_user_passwd() function. From the link:

When performing basic authentication, a FancyURLopener instance calls its prompt_user_passwd() method. The default implementation asks the user for the required information on the controlling terminal. A subclass may override this method to support more appropriate behavior if needed.
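A minimal sketch of that override, with placeholder credentials (note this only covers HTTP basic authentication, so it would not by itself handle Yahoo's form-based login; FancyURLopener is also deprecated in modern Python):

```python
# Sketch: subclass FancyURLopener so basic-auth credentials come from
# the script instead of a prompt on the controlling terminal.
# "myuser"/"mypassword" are placeholders.
try:
    from urllib import FancyURLopener          # Python 2
except ImportError:
    from urllib.request import FancyURLopener  # Python 3 (deprecated)

class ScriptedOpener(FancyURLopener):
    def prompt_user_passwd(self, host, realm):
        # Called automatically on a 401 response; return stored
        # credentials rather than asking interactively.
        return ('myuser', 'mypassword')
```

An instance would then be used like any opener, e.g. `ScriptedOpener().open(url)`, retrying the request with the returned credentials whenever the server demands basic auth.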