2010-06-08

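Using Mechanize with Google Docs

I'm trying to use Mechanize to log in to Google Docs so that I can scrape a few things (which can't be obtained from the API), but I always seem to get a 404 when trying to follow the meta redirect: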

require 'rubygems' 
require 'mechanize' 

USERNAME = "..." 
PASSWORD = "..." 

LOGIN_URL = "https://www.google.com/accounts/Login?hl=en&continue=http://docs.google.com/" 

agent = Mechanize.new 
login_page = agent.get(LOGIN_URL) 
login_form = login_page.forms.first 
login_form.Email = USERNAME 
login_form.Passwd = PASSWORD 
login_response_page = agent.submit(login_form) 

redirect = login_response_page.meta[0].uri.to_s 

puts "redirect: #{redirect}" 

followed_page = agent.get(redirect) # throws a HTTPNotFound exception 

pp followed_page 

Can anyone see why this isn't working?

Answers

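Andy, you're awesome! Your code helped me get my script working and logged in to a Google account. After a few hours I found your error. It's about HTML escaping. As far as I can tell, Mechanize automatically escapes the URI passed as the argument to the "get" method. So my solution is: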

require 'rubygems'
require 'mechanize'
require 'logger'
require 'pp'

EMAIL = ".."
PASSWD = ".."
LOGIN_URL = "https://www.google.com/accounts/Login?hl=en"

# Log all requests to mech.log to make redirects easier to debug.
agent = Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.user_agent_alias = 'Linux Mozilla'
agent.open_timeout = 3
agent.read_timeout = 4
agent.keep_alive = true
agent.redirect_ok = true

login_page = agent.get(LOGIN_URL)
login_form = login_page.forms.first
login_form.Email = EMAIL
login_form.Passwd = PASSWD
login_response_page = agent.submit(login_form)

redirect = login_response_page.meta[0].uri.to_s

# Drop the last query parameter (the already-escaped continue value taken
# from the meta tag) and append a fresh, unescaped continue target.
target = redirect.split('&')[0..-2].join('&') + "&continue=https://www.google.com/adplanner"
puts target
followed_page = agent.get(target)
pp followed_page

This works fine for me. I replaced the continue parameter from the meta tag (which is already escaped) with a new one.
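
To illustrate what I think is going on, here is a minimal sketch using only Ruby's standard uri library (no Mechanize, no network). It assumes the problem is that an already percent-encoded continue value gets encoded a second time. The redirect URL below is made up for illustration, and replace_continue is a hypothetical helper of mine, not part of Mechanize. It swaps the parameter by name instead of relying on it being the last one, as split('&')[0..-2] does.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize): replace the "continue"
# query parameter of a URL with a new target, wherever it appears.
def replace_continue(url, new_target)
  uri = URI.parse(url)
  # Decode the query string into [key, value] pairs.
  params = URI.decode_www_form(uri.query || '')
  # Remove any existing continue parameter, then add the new one.
  params.reject! { |key, _value| key == 'continue' }
  params << ['continue', new_target]
  # Re-encode exactly once.
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Made-up redirect URL whose continue value is already percent-encoded.
redirect = 'https://www.google.com/accounts/CheckCookie' \
           '?chtml=LoginDoneHtml&continue=http%3A%2F%2Fdocs.google.com%2F'

# Encoding an already-encoded value turns every "%" into "%25",
# so the server no longer sees the intended URL.
double_escaped = URI.encode_www_form_component('http%3A%2F%2Fdocs.google.com%2F')
puts double_escaped

# Replacing the parameter from scratch avoids the double encoding.
puts replace_continue(redirect, 'https://www.google.com/')
```

If the real redirect carries its continue parameter somewhere other than the end of the query string, a name-based replacement like this should be less fragile than cutting off the last parameter.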