2015-06-19 64 views

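Defining specific PHP cURL options to retrieve a particular web page

I am trying to scrape this web page, SiriusXMU, to get the "Now Playing" information. Here is the code I have so far: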

    $timeout = 60;
    $url = 'http://www.siriusxm.com/siriusxmu'; 
    $agent= 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0'; 
    $referer = 'http://www.siriusxm.com/channellineup/'; 

    $header = array();//Initialize the header list explicitly before appending to it.
    $header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $header[] = "Cache-Control: max-age=0"; 
    $header[] = "Connection: keep-alive"; 
    //$header[] = "Keep-Alive: 300"; 
    //$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
    $header[] = "Accept-Language: en-US,en;q=0.5"; 

    $ch = curl_init(); 
    curl_setopt($ch, CURLOPT_URL, $url);//The URL to fetch. This can also be set when initializing a session with curl_init(). 
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);//The contents of the "User-Agent: " header to be used in a HTTP request. 
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);//An array of HTTP header fields to set, in the format array('Content-type: text/plain', 'Content-length: 100') 
    curl_setopt($ch, CURLOPT_HEADER, true);//TRUE to include the header in the output. 
    curl_setopt($ch, CURLOPT_REFERER, $referer);//The contents of the "Referer: " header to be used in a HTTP request. 
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');//The contents of the "Accept-Encoding: " header. This enables decoding of the response. Supported encodings are "identity", "deflate", and "gzip". If an empty string, "", is set, a header containing all supported encoding types is sent. 
    //curl_setopt($ch, CURLOPT_AUTOREFERER, true);//TRUE to automatically set the Referer: field in requests where it follows a Location: redirect. 
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);//TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set). 
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);//The maximum number of seconds to allow cURL functions to execute. 
    //curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);//FALSE to stop cURL from verifying the peer's certificate. Alternate certificates to verify against can be specified with the CURLOPT_CAINFO option or a certificate directory can be specified with the CURLOPT_CAPATH option. 
    //curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);//1 to check the existence of a common name in the SSL peer certificate. 2 to check the existence of a common name and also verify that it matches the hostname provided. In production environments the value of this option should be kept at 2 (default value).
    //curl_setopt($ch, CURLOPT_VERBOSE, true);//TRUE to output verbose information. Writes output to STDERR, or the file specified using CURLOPT_STDERR. 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);//if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure. 
    //  
    $result = curl_exec($ch);//Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure. 
    curl_close($ch); 

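I have been studying the HTTP headers my browser sends when it successfully renders the page's "On the Air" section, which shows what is playing now. However, when I mimic those headers with curl, the "On the Air" section of the page returns "Sorry, program information is not available for the selected platform."

The Firefox add-on HttpFox shows the following for the main page: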

00:00:03.904 0.163 1524 209 GET 200 text/html http://www.siriusxm.com/siriusxmu 

(Request-Line) GET /siriusxmu HTTP/1.1 
Host www.siriusxm.com 
User-Agent Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 
Accept-Language en-US,en;q=0.5 
Accept-Encoding gzip, deflate 
Referer http://www.siriusxm.com/channellineup/ 
Cookie mmcore.tst=0.557; mmid=-318486443%7CBQAAAAo2JYEzEgwAAA%3D%3D; mmcore.pd=111492824%7CBQAAAAoBQjYlgTMSDPt9EvUCAJ3zFneyeNJIDwAAAIQ4RsgceNJIAAAAAP//////////ABB3d3cuc2lyaXVzeG0uY29tAhIMAgAAAAAAAAAAAAD///////////////8AAAAAAAFF; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-Repeat; s_vnum=1435723200051%26vn%3D2; s_lastvisit=1434723660883; s_vi=[CS]v1|2AC1956485078C76-6000010E20030C67[CE]; mm_pc=%7B%22vehiclenewness%22%3A%22new%22%2C%22PC2%22%3A%22%22%7D; sxm_platform=xm; __utmv=1.|5=serviceType=xm=1; _hjUserId=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=aHR0cDovL3d3dy5zaXJpdXN4bS5jb20vc3RyZWFtaW5n; __insp_norec_sess=true; _hjIncludedInSample=1; __utmc=1; s_cc=true; SC_LINKS=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; s_sv_sid=797366592635; QSI_HistorySession=http%3A%2F%2Fwww.siriusxm.com%2Fstreaming~1434659533837%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fchannellineup%2F%23~1434659556190%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665 
Connection keep-alive 

and the following when requesting the JavaScript for the "On the Air" section:

00:00:05.293 1.186 1609 (137) GET 304 text/javascript http://www.siriusxm.com/static/app/js/sxm-channel-ontheair.js 

(Request-Line) GET /static/app/js/sxm-channel-ontheair.js HTTP/1.1 
Host www.siriusxm.com 
User-Agent Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 
Accept */* 
Accept-Language en-US,en;q=0.5 
Accept-Encoding gzip, deflate 
Referer http://www.siriusxm.com/siriusxmu 
Cookie mmcore.tst=0.557; mmid=-318486443%7CBQAAAAo2JYEzEgwAAA%3D%3D; mmcore.pd=111492824%7CBQAAAAoBQjYlgTMSDPt9EvUCAJ3zFneyeNJIDwAAAIQ4RsgceNJIAAAAAP//////////ABB3d3cuc2lyaXVzeG0uY29tAhIMAgAAAAAAAAAAAAD///////////////8AAAAAAAFF; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-Repeat; s_vnum=1435723200051%26vn%3D2; s_lastvisit=1434723660883; s_vi=[CS]v1|2AC1956485078C76-6000010E20030C67[CE]; mm_pc=%7B%22vehiclenewness%22%3A%22new%22%2C%22PC2%22%3A%22%22%7D; sxm_platform=xm; __utmv=1.|5=serviceType=xm=1; _hjUserId=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=aHR0cDovL3d3dy5zaXJpdXN4bS5jb20vc3RyZWFtaW5n; __insp_norec_sess=true; _hjIncludedInSample=1; __utmc=1; s_cc=true; SC_LINKS=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; s_sv_sid=797366592635; QSI_HistorySession=http%3A%2F%2Fwww.siriusxm.com%2Fstreaming~1434659533837%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fchannellineup%2F%23~1434659556190%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665 
Connection keep-alive 
If-Modified-Since Fri, 22 May 2015 02:06:57 GMT 
If-None-Match "ab841364-8501-516a21d70499b" 
Cache-Control max-age=0 

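The web server appears to be deciding that my curl request is not valid and is not enabling the "On the Air" JavaScript content; the page just says "Sorry, program information is not available for the selected platform."

How can I get curl to work properly and mimic my browser, so that this web server returns a valid web page result?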

Answer


It looks like you need to run a client that has a JavaScript interpreter.

The HTML includes the following:

<div id="on-the-air-unavailable"><p>Sorry, program information is not available for the selected platform.</p></div> 

The JS includes the following (not next to each other):

$("#on-the-air-unavailable").hide(); 
$("#on-the-air-unavailable").show(); 

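For the JavaScript and the HTML to interact, you need to run them together.

There are some headless HTTP clients with a JS interpreter that you could use, or a browser automation tool like Selenium.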
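One way to confirm this from the PHP side is to check whether the markup that cURL returns still contains the fallback block. The following is only a minimal sketch: `hasUnavailableNotice()` is a hypothetical helper name, and the sample string stands in for the `$result` returned by `curl_exec()`. Since cURL never executes the script that calls `.hide()`, the fallback text is what a scraper sees whenever the element exists in the static HTML.

```php
<?php
// Hypothetical helper (not part of cURL or any library): reports whether the
// static HTML contains the fallback block that the page's JavaScript would
// normally hide in a real browser.
function hasUnavailableNotice(string $html): bool
{
    if ($html === '') {
        return false; // Nothing to parse, e.g. an empty response body.
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Look the element up by the id quoted in the answer above.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="on-the-air-unavailable"]');

    return $nodes !== false && $nodes->length > 0;
}

// Stand-in for the body returned by curl_exec(); assumed for illustration.
$sample = '<html><body><div id="on-the-air-unavailable"><p>Sorry, program '
        . 'information is not available for the selected platform.</p></div>'
        . '</body></html>';

var_dump(hasUnavailableNotice($sample)); // bool(true)
```

If this returns true for the real response, changing request headers alone is unlikely to help, because showing or hiding the block happens in script after the page loads.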


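Please suggest some headless HTTP clients that have a JS interpreter: it looks like SimpleTest's PHP Scriptable Web Browser (http://www.simpletest.org/en/browser_documentation.html) does not include a JS interpreter. Edit: I found some in the answer here: http://stackoverflow.com/a/814929/5006730 – BartmanEH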