2016-11-21 49 views

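Strange behavior of curl and the Python requests library

I tried to delete this question, but on second thought I will keep it: it is a live demonstration that, as a developer, I should pay more attention to detail.

I want to fetch some data from a website. The requested URL looks at the content type the request asks for and responds accordingly.

So this is the curl command I tried: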

curl --header "Accept: application/json, text/javascript, */*; q=0.01\r\nX-Requested-With: XMLHttpRequest\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36\r\n" http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/margin_bal_result.php\?l\=en-us\&d\=2016/11/15\&_\=1479700586981 -v 
* About to connect() to www.tpex.org.tw port 80 (#0) 
* Trying 210.63.162.130... connected 
> GET /web/stock/margin_trading/margin_balance/margin_bal_result.php?l=en-us&d=2016/11/15&_=1479700586981 HTTP/1.1 
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3 
> Host: www.tpex.org.tw 
> Accept: application/json, text/javascript, */*; q=0.01\r\nX-Requested-With: XMLHttpRequest\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36\r\nAccept-Encoding: gzip,deflate,sdch\r\n 
> 
* HTTP 1.0, assume close after body 
< HTTP/1.0 200 OK 
< Date: Mon, 21 Nov 2016 07:35:56 GMT 
< Server: Apache 
< Content-Type: text/html; charset=utf-8 
< X-Cache: MISS from localhost 
< X-Cache-Lookup: MISS from localhost:3128 
< Via: 1.0 localhost (squid/3.1.19) 
< Connection: close 
< 
{"reportDate":"2016\/11\/15","iTotalRecords":610,"aaData":[["006201","YA HORNG ELECTRONIC CO.","6","0","0","0","6","0","0.09","6,361","0","0","0","0","0","0","0.0","6,361","0",""],...} 

The response is truncated, but it is basically JSON.

However, here is my Python code, which I don't think is much different. Yet the response is HTML...

import datetime

import requests

g_tpex_headers = {
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': (
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
        ' (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120'
        ' Chrome/37.0.2062.120 Safari/537.36'
    ),
    'X-Requested-With': 'XMLHttpRequest',
}
data_link = (
    'http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/'
    'margin_bal.php?l=en-us&d={}&_=1479700586981'
)
target_dt = datetime.date(2016, 11, 18)  # the date that appears in the log below
data = []
with requests.Session() as session:
    # Replace the session's default headers with the browser-like set above.
    session.headers = g_tpex_headers
    res = session.get(data_link.format(target_dt.strftime('%Y/%m/%d')))
    print(res.content[:400])

Log:

send: 'GET /web/stock/margin_trading/margin_balance/margin_bal.php?l=en-us&d=2016/11/18&_=1479700586981 HTTP/1.1\r\nHost: www.tpex.org.tw\r\nX-Requested-With: XMLHttpRequest\r\nAccept-Encoding: gzip,deflate,sdch\r\nAccept: application/json, text/javascript, */*; q=0.01\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36\r\n\r\n' 

And the response:

<!DOCTYPE HTML> 
<html> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 
<meta http-equiv="X-UA-Compatible" content="IE=Edge"> 
<meta name="viewport" content="width=device-width, initial-scale=1.0" /> 
<title> HOME&nbsp;&gt;&nbsp;Mainboard&nbsp;&gt;&nbsp;Margin Trading&nbsp;&gt;&nbsp;Margin Balance</title> 
<link rel="icon" type="image/ico" href="/web/images/favicon.ic 

I can't see much difference. So why doesn't Python requests get the JSON response?

Answers

2

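The paths you are requesting are different. In the cURL command the final path component is margin_bal_result.php, while in the Python script it is margin_bal.php. Once you change the path in the Python script to match the one in the cURL command, you will get the JSON response.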
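
A difference like this is easy to miss by eye, so it can help to compare the two URLs mechanically. The following is only a minimal sketch using the standard library: the two URLs are copied from the request lines shown in the question, while `url_differences` is a hypothetical helper name introduced here for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# URLs taken from the curl request line and the Python "send:" log in the question.
CURL_URL = (
    "http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/"
    "margin_bal_result.php?l=en-us&d=2016/11/15&_=1479700586981"
)
PYTHON_URL = (
    "http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/"
    "margin_bal.php?l=en-us&d=2016/11/18&_=1479700586981"
)


def url_differences(a, b):
    """Return a dict mapping each differing URL part to its (a, b) values."""
    pa, pb = urlsplit(a), urlsplit(b)
    diffs = {}
    if pa.netloc != pb.netloc:
        diffs["host"] = (pa.netloc, pb.netloc)
    if pa.path != pb.path:
        diffs["path"] = (pa.path, pb.path)
    # Compare query parameters key by key so ordering does not matter.
    qa, qb = parse_qs(pa.query), parse_qs(pb.query)
    for key in sorted(set(qa) | set(qb)):
        if qa.get(key) != qb.get(key):
            diffs["query:" + key] = (qa.get(key), qb.get(key))
    return diffs


diffs = url_differences(CURL_URL, PYTHON_URL)
for part, (curl_value, python_value) in diffs.items():
    print(f"{part}: curl={curl_value!r} python={python_value!r}")
```

For these two URLs the sketch reports the differing path as well as the differing `d` date parameter; only the path explains the HTML response.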

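Update: With cURL you need to specify each header separately instead of joining them into a single string. So, in your example, you should use the following command: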

curl --header "Accept: application/json, text/javascript, */*; q=0.01" --header "X-Requested-With: XMLHttpRequest" --header "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36" http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/margin_bal_result.php\?l\=en-us\&d\=2016/11/15\&_\=1479700586981 -v > httpres.txt 

This results in the following request being sent:

* Hostname was NOT found in DNS cache 
    % Total % Received % Xferd Average Speed Time Time  Time Current 
           Dload Upload Total Spent Left Speed 
    0  0 0  0 0  0  0  0 --:--:-- --:--:-- --:--:--  0* Trying 210.63.162.130... 
* Connected to www.tpex.org.tw (210.63.162.130) port 80 (#0) 
> GET /web/stock/margin_trading/margin_balance/margin_bal_result.php?l=en-us&d=2016/11/15&_=1479700586981 HTTP/1.1 
> Host: www.tpex.org.tw 
> Accept: application/json, text/javascript, */*; q=0.01 
> X-Requested-With: XMLHttpRequest 
> User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36 
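
In the verbose output from the question, the `\r\n` sequences arrive literally inside one long Accept header, which would explain why curl still sent its own default User-Agent there. One way to avoid hand-assembling the command is to generate one `--header` option per header from a dictionary. This is only a sketch: `build_curl_argv` is a hypothetical helper name, and `shlex.join` assumes Python 3.8 or newer.

```python
import shlex

URL = (
    "http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/"
    "margin_bal_result.php?l=en-us&d=2016/11/15&_=1479700586981"
)

HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
        " (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120"
        " Chrome/37.0.2062.120 Safari/537.36"
    ),
}


def build_curl_argv(url, headers):
    """Return a curl argument list with one --header option per header."""
    argv = ["curl", "-v"]
    for name, value in headers.items():
        argv += ["--header", f"{name}: {value}"]
    argv.append(url)
    return argv


argv = build_curl_argv(URL, HEADERS)
# Shell-quoted command line that can be pasted into a terminal.
print(shlex.join(argv))
```

The argument list can also be passed directly to `subprocess.run`, which sidesteps shell quoting altogether.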

Thanks for your help. I was so careless that I could not get past it.

I have one doubt: in the curl output, notice there is a line showing curl's own User-Agent. Isn't that line sent to the server?

@JunchaoGu I updated the answer regarding cURL usage – niemmi

1

Try to make the request in Python exactly the same as your curl one. Your code:

data_link = (
    'http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/' 
    'margin_bal.php?l=en-us&d={}&_=1479700586981' 
) 

Change: I corrected data_link:

data_link = (
    'http://www.tpex.org.tw/web/stock/margin_trading/margin_balance/' 
    'margin_bal_result.php?l=en-us&d={}&_=1479700586981' 
) 

After that, I found it actually works.

... I don't think that is the key.

@JunchaoGu It is the key, as I promised. You use __margin_bal_result.php__ in curl but __margin_bal.php__ in the Python code. – Jing

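@JunchaoGu If the HTTP requests are identical, the results are identical. Which library makes the request does not matter, because the remote site cannot tell anyway. So the trick is always to make the requests the same. There are tools such as Wireshark that let you inspect the outgoing HTTP request in raw form as it is sent.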
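
Short of running a packet capture, a rough way to see the raw bytes of a request from Python is to record what the standard library's `http.client` would write to the socket. The sketch below is an illustration under assumptions, not what requests does internally: `RecordingConnection` is a hypothetical class that overrides `send` so nothing touches the network, and requests itself may add or order headers differently.

```python
import http.client


class RecordingConnection(http.client.HTTPConnection):
    """HTTPConnection that records outgoing bytes instead of sending them."""

    def __init__(self, host):
        super().__init__(host)
        self.sent = []

    def send(self, data):
        # Overridden on purpose: no socket is opened, the bytes are only stored.
        self.sent.append(data)


# Headers and path copied from the Python code and log in the question.
HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
        " (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120"
        " Chrome/37.0.2062.120 Safari/537.36"
    ),
}
PATH = (
    "/web/stock/margin_trading/margin_balance/"
    "margin_bal.php?l=en-us&d=2016/11/18&_=1479700586981"
)

conn = RecordingConnection("www.tpex.org.tw")
conn.request("GET", PATH, headers=HEADERS)
raw_request = b"".join(conn.sent).decode("latin-1")
print(raw_request)
```

Placing this text next to the `>` lines of `curl -v` makes a differing path or a malformed header stand out quickly.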