2017-09-14

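Headless JavaScript download with Selenium

I am trying to download a file from http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html in a headless environment. I have an account (they are free), but the site does not make it easy: it apparently uses a chain of JavaScript forms and redirects. In Firefox I can use the element inspector, copy the file's URL as cURL when the download starts, and run that command on the headless machine to download the file. So far, however, every attempt to get the file using the headless machine alone has failed.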

I have managed to get as far as logging in with:

#!/usr/bin/env python3

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

username = "<my username>"
password = "<my password>"

# Copy the default capabilities instead of mutating the shared class-level dict,
# and pass them to the driver explicitly so the user agent is clearly applied.
caps = DesiredCapabilities.PHANTOMJS.copy()
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0"
driver = webdriver.PhantomJS("/usr/local/bin/phantomjs", desired_capabilities=caps)
driver.set_window_size(1120, 550)

# Load the download page, accept the license and follow the download link
driver.get("http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html")
print("loaded")
driver.find_element_by_name("agreement").click()
print("clicked agreement")
driver.find_element_by_partial_link_text("RPM installer").click()
print("clicked link")

# Fill in and submit the single sign-on form
driver.find_element_by_id("sso_username").send_keys(username)
driver.find_element_by_id("ssopassword").send_keys(password)
driver.find_element_by_xpath("//input[contains(@title,'Please click here to sign in')]").click()
print("submitted")

# Inspect the state after submitting the login form
print(driver.get_cookies())
print(driver.current_url)
print(driver.page_source)
driver.quit()

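I suspect the login works, because the cookies contain some data associated with my username. In Firefox, however, submitting the form starts the download after 3-4 redirects, whereas here I get nothing, and both page_source and current_url still belong to the login page.

Maybe the site is actively blocking this kind of use, or maybe I am doing something wrong. Any ideas on how to actually download the file?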

See this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=696481. I don't think this feature is available in chromedriver yet.

@TarunLalwani Does Selenium + PhantomJS use Chromium under the hood? – Jellby

No, but PhantomJS is no longer maintained either, so use it with great care. If it works, good; if not, think about something else.

Answer

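Thanks to TheChetan's comment I got it working. I did not use the javascript-blob route, though, but the requests approach suggested by Tarun Lalwani in https://stackoverflow.com/a/46027215. It took me a while to realize that I also had to change the user agent in the request. In the end, this works for me: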

#!/usr/bin/env python3 

from selenium import webdriver 
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities 
from requests import Session 
from urllib.parse import urlparse 
from os.path import basename 
from hashlib import sha256 
import sys 

index_url = "http://www.oracle.com/technetwork/server-storage/developerstudio/downloads/index.html" 
link_text = "RPM installer" 
username="<my username>" 
password="<my password>" 
user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0" 

# set up browser 
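# copy the defaults rather than mutating the shared dict, and pass them explicitly
caps = DesiredCapabilities.PHANTOMJS.copy()
caps["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS("/usr/local/bin/phantomjs", desired_capabilities=caps)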
driver.set_window_size(800,600) 

# load index page and click through 
driver.get(index_url) 
print("loaded") 
driver.find_element_by_name("agreement").click() 
print("clicked agreement") 
link = driver.find_element_by_partial_link_text(link_text) 
sha = driver.find_element_by_xpath("//*[contains(text(), '{0}')]/following::*[contains(text(), 'sum:')]/following-sibling::*".format(link_text)).text 
file_url = link.get_attribute("href") 
filename = basename(urlparse(file_url).path) 
print("filename: {0}".format(filename)) 
print("checksum: {0}".format(sha)) 
link.click() 
print("clicked link") 
driver.find_element_by_id("sso_username").send_keys(username) 
driver.find_element_by_id("ssopassword").send_keys(password) 
driver.find_element_by_xpath("//input[contains(@title,'Please click here to sign in')]").click() 
print("submitted") 

# we should be logged in now 

def progressBar(title, value, endvalue, bar_length=60): 
    percent = float(value)/endvalue 
    arrow = '-' * int(round(percent * bar_length)-1) + '>' 
    spaces = ' ' * (bar_length - len(arrow)) 
    sys.stdout.write("\r{0}: [{1}] {2}%".format(title, arrow + spaces, int(round(percent * 100)))) 
    sys.stdout.flush() 

# transfer the cookies to a new session and request the file 
session = Session() 
session.headers = {"user-agent": user_agent} 
for cookie in driver.get_cookies(): 
    session.cookies.set(cookie["name"], cookie["value"]) 
driver.quit() 
r = session.get(file_url, stream=True) 
# now we should have gotten the url with param 
new_url = r.url 
print("final url {0}".format(new_url)) 
r = session.get(new_url, stream=True) 
print("requested") 
length = int(r.headers['Content-Length']) 
title = "Downloading ({0})".format(length) 
sha_file = sha256() 
chunk_size = 2048 
done = 0 
with open(filename, "wb") as f:
    for chunk in r.iter_content(chunk_size):
        # write each chunk, feed it to the hash and update the progress bar
        f.write(chunk)
        sha_file.update(chunk)
        done = done + len(chunk)
        progressBar(title, done, length)
print() 

# check integrity 
if (sha_file.hexdigest() == sha): 
    print("checksums match") 
    sys.exit(0) 
else: 
    print("checksums do NOT match!") 
    sys.exit(1) 
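
The write-and-hash loop above can also be factored into a small helper. The following is only a sketch under stated assumptions: `save_and_hash` and `checksums_match` are hypothetical names that do not appear in the script, and the chunk iterable is assumed to be whatever `r.iter_content(chunk_size)` yields. It uses only the standard library, so it can be tried without touching the network.

```python
import hashlib
import io


def save_and_hash(chunks, fileobj):
    """Write each chunk to fileobj; return (bytes_written, sha256_hexdigest)."""
    digest = hashlib.sha256()
    written = 0
    for chunk in chunks:
        if not chunk:  # skip empty chunks
            continue
        fileobj.write(chunk)
        digest.update(chunk)
        written += len(chunk)
    return written, digest.hexdigest()


def checksums_match(actual, expected):
    """Compare two hex digests, ignoring surrounding whitespace and letter case."""
    return actual.strip().lower() == expected.strip().lower()


if __name__ == "__main__":
    # Stand-in for a streamed response: a list of byte chunks and an in-memory file
    buffer = io.BytesIO()
    size, digest = save_and_hash([b"hello ", b"", b"world"], buffer)
    print(size, checksums_match(digest, hashlib.sha256(b"hello world").hexdigest()))
```

In the script this would be called as `save_and_hash(r.iter_content(chunk_size), f)` inside the `with open(...)` block. Normalizing whitespace and case before comparing is a precaution, on the assumption that the checksum text scraped from the page may not be formatted exactly like `hexdigest()` output.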

So the final idea is to use Selenium + PhantomJS for the login, and then reuse the cookies in a plain request.
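
The cookie hand-off in the script copies only each cookie's name and value. If the target site turns out to be sensitive to domain, path or the secure flag (an assumption; nothing above confirms it), a variant that keeps those attributes might look like the sketch below. `to_cookiejar` is a hypothetical helper, and its input is assumed to be a list of dicts shaped like the output of `driver.get_cookies()` (keys `name` and `value`, optionally `domain`, `path`, `secure`, `httpOnly` and `expiry`).

```python
from http.cookiejar import Cookie, CookieJar


def to_cookiejar(selenium_cookies):
    """Build a standard-library CookieJar from Selenium-style cookie dicts."""
    jar = CookieJar()
    for c in selenium_cookies:
        domain = c.get("domain", "")
        expiry = c.get("expiry")  # seconds since the epoch, or absent for session cookies
        jar.set_cookie(Cookie(
            version=0,
            name=c["name"],
            value=c["value"],
            port=None,
            port_specified=False,
            domain=domain,
            domain_specified=bool(domain),
            domain_initial_dot=domain.startswith("."),
            path=c.get("path", "/"),
            path_specified="path" in c,
            secure=bool(c.get("secure", False)),
            expires=int(expiry) if expiry is not None else None,
            discard=expiry is None,
            comment=None,
            comment_url=None,
            rest={"HttpOnly": None} if c.get("httpOnly") else {},
        ))
    return jar


if __name__ == "__main__":
    # Made-up cookie data for illustration only
    jar = to_cookiejar([{"name": "sid", "value": "abc", "domain": ".example.com", "path": "/"}])
    for cookie in jar:
        print(cookie.name, cookie.domain, cookie.path)
```

Each resulting cookie can then be handed to the requests session with `session.cookies.set_cookie(cookie)` instead of `session.cookies.set(name, value)`.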