2014-12-27 68 views
0

cURL scraping: error setting cookie

I am trying to scrape some data from a website with cURL. Here is my code:

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois'); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent()); 
$result7 = htmlspecialchars_decode(curl_exec ($ch)); 
curl_close($ch); 

$html7 = new simple_html_dom(); 
$html7->load($result7); 

But I get the following warning:

Warning: file_get_contents(<!DOCTYPE html> <html xmlns:mml=" http://www.w3.org/1998/Math/MathML&quot ; lang="en" > <head> <script type="text/javascript"> var JiffyParams = { jsStart: (new Date()).getTime()}; </script> <meta name="robots" content="noarchive,noindex,nofollow,NOODP" /> <meta name="MSSmartTagsPreventParsing" content="true"/> <title>JSTOR: An Error Occurred Setting Your User Cookie</title> <meta charset="UTF-8"/> <link rel="shortcut icon" href="/templates/jsp/favicon.ico" type="image/vnd.microsoft.icon" /> <link rel="stylesheet" type="text/css" media="screen" href="/jawrcss/N815843185/bundles/jstor.css" /> <link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Roboto:400,5 in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76

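I don't understand what I am doing wrong; I am a beginner with cURL... Maybe I have to set some cookies from JSTOR, but I don't know how to do that. Thanks for your help.

Edit:

I just added the two cookie options below, and the error changed: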

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent());
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
$result7 = htmlspecialchars_decode(curl_exec ($ch));
curl_close($ch);

Error:

Warning: file_get_contents(<!DOCTYPE html> <!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]--> <head> <script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"VwACUF9VGwsGXVRbAwA="};window.NREUM||(NREUM={}),function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o?o:e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({function(t,e){function n(t){function e(e,n,a){t&&t(e,n,a),a||(a={});for(var c=s(e),f=c.length,u=i(a,o,r),d=0;f>d;d++)c[d].apply(u,n);return u}function a(t,e){f[t]=s(t).concat(e)}function s(t){return f[t]||[]}function c(){return n(e)}var f={};return{on:a,emit:e,create:c,listeners:s,_events: in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76

Here is the piece of code from simple_html_dom around line 76:

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 
{ 
    // We DO force the tags to be terminated. 
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText); 
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done. 
    $contents = file_get_contents($url, $use_include_path, $context, $offset); 
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout. 
    //$contents = retrieve_url_contents($url); 
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) 
    { 
     return false; 
    } 
    // The second parameter can force the selectors to all be lowercase. 
    $dom->load($contents, $lowercase, $stripRN); 
    return $dom; 
} 

Answers

0

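Are you sure file_get_html() is the right way to do this? That function calls file_get_contents(), which expects a URI to open, but you are passing it a string (the HTML you already downloaded). That is why the warning prints the whole page source as if it were a file name.

I think str_get_html() from PHP Simple HTML DOM would be the right way, since it parses a string directly.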
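A minimal sketch of that idea, assuming simple_html_dom.php is on your include path. The helper names looks_like_markup() and parse_with_simple_html_dom() are made up for illustration and are not part of the library; only str_get_html() and file_get_html() come from Simple HTML DOM.

```php
<?php
// Hypothetical helper: true when the value is page source rather than a URL or path.
function looks_like_markup(string $input): bool
{
    // Markup starts with '<' once leading whitespace is ignored.
    return strncmp(ltrim($input), '<', 1) === 0;
}

// Hypothetical wrapper: pick the Simple HTML DOM entry point that matches the input.
function parse_with_simple_html_dom(string $input)
{
    require_once 'simple_html_dom.php'; // assumed location of the library

    return looks_like_markup($input)
        ? str_get_html($input)   // input is HTML that was already fetched (e.g. by cURL)
        : file_get_html($input); // input is a URL or file path to open
}
```

With the code from the question, this would be called as parse_with_simple_html_dom($result7), which ends up in str_get_html() because $result7 already holds the page source.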

+0

I added the code snippet from simple_html_dom. Thanks, it works! ;) – AlphaNico 2014-12-27 23:00:45

0

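Cookies are a browser thing.

curl is a system thing (bash, Linux or whatever).

PHP wraps curl (sometimes the library is actually compiled in). It is more or less a system call, with no browser involved.

So you need to set the cookies with curl yourself:

http://curl.haxx.se/docs/http-cookies.html

But you are right -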
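A rough sketch of setting that up from PHP, assuming the curl extension is loaded. The function names build_cookie_curl_options() and fetch_with_cookies() are hypothetical; the CURLOPT_* constants and curl_setopt_array() are the standard PHP cURL API. Whether JSTOR accepts the session this produces is not something this snippet can guarantee.

```php
<?php
// Hypothetical helper: options that make cURL store and resend cookies.
function build_cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects, which may set cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here and sent with requests
    ];
}

// Hypothetical wrapper: fetch a page with the cookie options applied.
function fetch_with_cookies(string $url, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_cookie_curl_options($url, $cookieFile));
    $body = curl_exec($ch); // false on failure
    curl_close($ch);
    return $body;
}
```

An absolute path such as __DIR__ . '/cookies.txt' is probably safer than a bare 'cookies.txt', because a relative path is resolved against the current working directory, which may not be writable or may not be the script's folder.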

+0

Thanks. How do I get the cookies from JSTOR so I can set them in cURL? Can I use CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE after that? – AlphaNico 2014-12-27 20:43:01

+0

Why do you need this: "CURLOPT_FOLLOWLOCATION"? Maybe it is a cookie thing - more to follow. Your code worked for me - fine. However, I did not use new simple_html_dom(). I did set the user_agent – terary 2014-12-27 20:44:48

+0

I updated my question. I thought that might be the original problem, but when I remove the follow-location option, it does not change anything. – AlphaNico 2014-12-27 20:50:27