2012-11-02 102 views
0

Grabbing HTML from a page that blocks cURL

I've been asked to grab a certain line from a page, but it looks like the site blocks cURL requests?

The site in question is http://www.habbo.com/home/Intricat

I tried changing the user agent to see whether that was what they were blocking, but it didn't seem to do the trick.

The code I'm using is as follows:

<?php 

$curl_handle=curl_init(); 
//Send a browser-like user agent 
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0"); 
//This is the URL you would like the content grabbed from 
curl_setopt($curl_handle,CURLOPT_URL,'http://www.habbo.com/home/Intricat'); 
//This is the amount of time in seconds until it times out, this is useful if the server you are requesting data from is down. This way you can offer a "sorry page" 
curl_setopt($curl_handle,CURLOPT_CONNECTTIMEOUT,2); 

curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1); 
$buffer = curl_exec($curl_handle); 
//This Keeps everything running smoothly 
curl_close($curl_handle); 

// Change the message below as you wish; keep in mind that the message must stay within the " " quotes. 
if (empty($buffer)) 
{ 
    print "Sorry, It seems our weather resources are currently unavailable, please check back later."; 
} 
else 
{ 
    print $buffer; 
} 
?> 

Any ideas on another way I can grab a line of code from that page if they have blocked cURL requests?

EDIT: On running curl -i from my server, it appears that the site sets a cookie first?

+0

Try using a proxy and setting the referrer – Waygood

+0

*"our weather resources"*? - I'm sure you mean habbo.com's weather resources, right? – hakre

+0

It's just code from a random site. Ignore that part :P – Tenatious

Answers

1

You are not very specific about the kind of block you are talking about.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> 
<html> 
<head> 
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
    <meta http-equiv="Content-Script-Type" content="text/javascript"> 
    <script type="text/javascript">function setCookie(c_name, value, expiredays) { 
     var exdate = new Date(); 
     exdate.setDate(exdate.getDate() + expiredays); 
     document.cookie = c_name + "=" + escape(value) + ((expiredays == null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/"; 
    } 
    function getHostUri() { 
     var loc = document.location; 
     return loc.toString(); 
    } 
    setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '179.222.19.192', 10); 
    setCookie('DOAReferrer', document.referrer, 10); 
    location.href = getHostUri();</script> 
</head> 
<body> 
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your 
    browser. 
</noscript> 
</body> 
</html> 

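The website in question, http://www.habbo.com/home/Intricat, first of all checks whether the browser has JavaScript enabled, as the response above shows. Since cURL has no JavaScript support, you either need to use an HTTP client that has it, or you need to mimic that script and create the cookie and the new request URI on your own.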
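A minimal sketch of what mimicking that script could look like in PHP, assuming the challenge page keeps the shape shown above. The helper names (extract_js_cookies, build_cookie_header, fetch_past_cookie_check) are hypothetical, the regex only picks up setCookie() calls whose name and value are both string literals, and whether the site accepts the replayed cookie has not been verified.

```php
<?php
// Hypothetical helper: collect setCookie('name', 'value', ...) calls whose
// name and value are both string literals. Calls with a non-literal value,
// such as setCookie('DOAReferrer', document.referrer, 10), are skipped.
function extract_js_cookies($html)
{
    $cookies = array();
    $pattern = "~setCookie\\(\\s*'([^']+)'\\s*,\\s*'([^']*)'~";
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $cookies[$match[1]] = $match[2];
        }
    }
    return $cookies;
}

// Hypothetical helper: join name/value pairs into the string CURLOPT_COOKIE
// expects. The page's script uses escape(); rawurlencode() is a stand-in.
function build_cookie_header(array $cookies)
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode($value);
    }
    return implode('; ', $pairs);
}

// Request the page once, and if it answers with the JavaScript challenge,
// request the same URL again with the cookie the script would have set.
function fetch_past_cookie_check($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    $body = curl_exec($ch);
    if ($body === false) {
        return false;
    }
    $cookies = extract_js_cookies($body);
    if (!empty($cookies)) {
        curl_setopt($ch, CURLOPT_COOKIE, build_cookie_header($cookies));
        // The script reloads the same URI, so the second request repeats it.
        $body = curl_exec($ch);
    }
    return $body;
}
```

Calling fetch_past_cookie_check('http://www.habbo.com/home/Intricat') would then return either the real page or, if the site checks more than this cookie, the challenge page again.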

+0

How would I go about mimicking this? – Tenatious

+1

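You can mimic it by reading the JavaScript code and understanding what it does. Then you translate that knowledge into PHP code and into the curl request configuration. So to say, you just do the work that the JavaScript does in the browser, only in PHP instead of JavaScript and in a way that is compatible with curl. You probably need to parse the HTML and the JavaScript. For HTML parsing I strongly recommend PHP's 'DOMDocument'. The first lesson here is to extract '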

1

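Use a browser and copy the exact headers that are being sent. Because the request then looks exactly the same, the site will not be able to tell that you are using cURL. If cookies are used, attach them as headers.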
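As a rough sketch of that idea, the helper below turns a header dump pasted from the browser's network panel into the array that CURLOPT_HTTPHEADER expects. The function names and the sample headers in the usage note are hypothetical. Host, Content-Length and Accept-Encoding are dropped on purpose so that cURL can manage them itself; setting CURLOPT_ENCODING to an empty string lets cURL advertise and decode the encodings it supports.

```php
<?php
// Hypothetical helper: convert raw request headers copied from a browser
// into a list of "Name: value" strings for CURLOPT_HTTPHEADER.
function headers_from_dump($raw)
{
    $skip = array('host', 'content-length', 'accept-encoding');
    $headers = array();
    foreach (preg_split('/\r\n|\r|\n/', trim($raw)) as $line) {
        $line = trim($line);
        // Skip blanks, HTTP/2 pseudo-headers (":authority") and the
        // request line ("GET /path HTTP/1.1").
        if ($line === '' || $line[0] === ':' || preg_match('~^[A-Z]+ \S+ HTTP/~', $line)) {
            continue;
        }
        if (strpos($line, ':') === false) {
            continue;
        }
        list($name, $value) = explode(':', $line, 2);
        $name = trim($name);
        if (in_array(strtolower($name), $skip, true)) {
            continue;
        }
        $headers[] = $name . ': ' . trim($value);
    }
    return $headers;
}

// Send a GET request that carries the copied headers, cookies included.
function fetch_like_browser($url, $rawHeaders)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, headers_from_dump($rawHeaders));
    curl_setopt($ch, CURLOPT_ENCODING, ''); // accept and decode compressed bodies
    return curl_exec($ch);
}
```

Usage would be along the lines of fetch_like_browser($url, "User-Agent: ...\nCookie: name=value"), with the header text taken from a real browser request to the same page.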

+0

Could you go into a bit more detail on this for me? – Tenatious

1

This is a cut and paste from my Curl class that I wrote a few years ago; hopefully you can pick some gems out of it for yourself.

function get_url($url) 
{ 
    curl_setopt ($this->ch, CURLOPT_URL, $url); 
    curl_setopt ($this->ch, CURLOPT_USERAGENT, $this->user_agent); 
    curl_setopt ($this->ch, CURLOPT_COOKIEFILE, $this->cookie_name); 
    curl_setopt ($this->ch, CURLOPT_COOKIEJAR, $this->cookie_name); 
    if(!is_null($this->referer)) 
    { 
     curl_setopt ($this->ch, CURLOPT_REFERER, $this->referer); 
    } 
    curl_setopt ($this->ch, CURLOPT_SSL_VERIFYHOST, 2); 
    curl_setopt ($this->ch, CURLOPT_HEADER, 0); 
    if($this->follow) 
    { 
     curl_setopt ($this->ch, CURLOPT_FOLLOWLOCATION, 1); 
    } 
    else 
    { 
     curl_setopt ($this->ch, CURLOPT_FOLLOWLOCATION, 0); 
    } 
    curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt ($this->ch, CURLOPT_HTTPHEADER, array("Accept: text/html,text/vnd.wap.wml,*/*")); 
    curl_setopt ($this->ch, CURLOPT_SSL_VERIFYPEER, FALSE); // skips certificate verification so https works without a CA bundle (insecure) 

    $try=0; 
    $result=""; 
    while(($try<=$this->retry_attempts) && (empty($result))) // retry while the result is empty, up to $this->retry_attempts extra times 
    { 
     $try++; 
     $result = curl_exec($this->ch); 
     $this->response=curl_getinfo($this->ch); 
     // $this->response['http_code'] 4xx is an error 
    } 
    // set refering URL to current url for next page. 
    if($this->referer_to_last) $this->set_referer($url); 

    return $result; 
} 
+0

$cookie_name = "./cookie"; make sure your script has write permission to the directory you choose – Waygood

+0

Fatal error: Using $this when not in object context – Tenatious

+1

_cut and paste from my Curl class_ – Waygood
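The fatal error appears because get_url() is a method and only works inside a class. A minimal sketch of such a wrapper follows, with the property names taken from the snippet above; the class name SimpleCurl, the default values and the lazy curl_init() call are assumptions, and the method body is a condensed version of the original rather than the author's actual class.

```php
<?php
// Sketch of a wrapper class so that $this is valid inside get_url().
class SimpleCurl
{
    public $ch = null;
    public $user_agent = 'Mozilla/5.0';
    public $cookie_name = './cookie';   // file must be writable by the script
    public $referer = null;
    public $follow = true;
    public $retry_attempts = 5;
    public $referer_to_last = true;
    public $response = array();

    public function set_referer($url)
    {
        $this->referer = $url;
    }

    public function get_url($url)
    {
        if ($this->ch === null) {
            $this->ch = curl_init();    // create the handle on first use
        }
        curl_setopt_array($this->ch, array(
            CURLOPT_URL            => $url,
            CURLOPT_USERAGENT      => $this->user_agent,
            CURLOPT_COOKIEFILE     => $this->cookie_name,
            CURLOPT_COOKIEJAR      => $this->cookie_name,
            CURLOPT_FOLLOWLOCATION => (bool) $this->follow,
            CURLOPT_RETURNTRANSFER => true,
        ));
        if (!is_null($this->referer)) {
            curl_setopt($this->ch, CURLOPT_REFERER, $this->referer);
        }

        $try = 0;
        $result = '';
        // Repeat the request while the result is empty.
        while ($try <= $this->retry_attempts && empty($result)) {
            $try++;
            $result = curl_exec($this->ch);
            $this->response = curl_getinfo($this->ch);
        }
        // Use this URL as the referer for the next page.
        if ($this->referer_to_last) {
            $this->set_referer($url);
        }
        return $result;
    }
}
```

With that in place, the call becomes $curl = new SimpleCurl(); $html = $curl->get_url('http://www.habbo.com/home/Intricat');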

0

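I know this is a very old post, but since I had to answer the same question for myself today, I'm sharing it here for anyone who comes along; it might be useful to them. I'm also fully aware that the OP asked specifically about curl, but, like me, some people may be interested in a solution whether or not it uses curl.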

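The page I wanted to fetch with curl blocked it. If the block is not because of JavaScript but because of the agent (which was my case, and setting the agent in curl did not help), then wget may be a solution: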

wget -O output.txt --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://example.com/page"