Final Update It appears that the targeted website blocked DO IPs and are giving the problems which I've been resolving for days. I spinned a EC2 instance and manage to work the code working, together with caching etc so as to reduce the hit on the website and allow my user to share the website.PHP卷曲405不允许
-
UPDATE: I manage to get the Html by setting curl error to off, however the website other than returning 405 error is also not setting some cookies which are required for the website content to be loaded.
curl_setopt($ CH,CURLOPT_FAILONERROR,FALSE);
我使用下面的代码为ajax-> PHP来检索og:元网站。但是,有1或2个特定网站返回错误,并不会检索信息。有以下错误。该代码可以为大多数网站无缝工作。
Warning: DOMDocument::loadHTML(): Empty string supplied as input in /my/home/path/getUrlMeta.php on line 58
从curl_error在我的error_log
The requested URL returned error: 405 Not Allowed
而且
Failed to connect to www.something.com port 443: Connection refused
我没有问题得到当我用卷曲我的服务器控制台上的网站的HTML和没有问题的检索大部分使用以下代码的网站所需信息
function file_get_contents_curl($url)
{
$ch = curl_init();
$header[0] = "Accept: text/html, text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: no-cache";
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
//curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 ");
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
//The following 2 set up lines work with sites like www.nytimes.com
//Update: Added option for cookie jar since some websites recommended it. cookies.txt is set to permission 777. Still doesn't work.
$cookiefile = '/home/my/folder/cookies.txt';
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile);
$data = curl_exec($ch);
if(curl_error($ch))
{
error_log(curl_error($ch));
}
curl_close($ch);
return $data;
}
$html = file_get_contents_curl($url);
libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = substr($meta->getAttribute('property'),3);
$content = $meta->getAttribute('content');
$rmetas[$property] = $content;
}
/*below code retrieves the next bigger than 600px image should og:image be empty.*/
if (empty($rmetas['image'])) {
//$src = $xpath->evaluate("string(//img/@src)");
//echo "src=" . $src . "\n";
$query = '//*/img';
$srcs = $xpath->query($query);
foreach ($srcs as $src) {
$property = $src->getAttribute('src');
if (substr($property,0,4) == 'http' && in_array(substr($property,-3), array('jpg','png','peg'), true)) {
if (list($width, $height) = getimagesize($property)) {
do if ($width > 600) {
$rmetas['image'] = $property;
break;
} while (0);
}
}
}
}
echo json_encode($rmetas);
die();
UPDATE: Error on my part that website is not https enabled so I still have the 405 not allowed error.
卷曲信息
{
"url": "http://www.example.com/",
"content_type": null,
"http_code": 405,
"header_size": 0,
"request_size": 458,
"filetime": -1,
"ssl_verify_result": 0,
"redirect_count": 0,
"total_time": 0.326782,
"namelookup_time": 0.004364,
"connect_time": 0.007725,
"pretransfer_time": 0.007867,
"size_upload": 0,
"size_download": 0,
"speed_download": 0,
"speed_upload": 0,
"download_content_length": -1,
"upload_content_length": -1,
"starttransfer_time": 0.326634,
"redirect_time": 0,
"redirect_url": "",
"primary_ip": "SOME IP",
"certinfo": [],
"primary_port": 80,
"local_ip": "SOME IP",
"local_port": 52966
}
Update: If I do a curl -i from console I get the following response. A error 405 but it follows by all the HTML that I need.
Home> curl -i http://www.domain.com
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Wed, 22 Feb 2017 17:57:03 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Vary: Accept-Encoding
Vary: Accept-Encoding
Set-Cookie: PHPSESSID2=ko67tfga36gpvrkk0rtqga4g94; path=/; domain=.domain.com
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: __PAGE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com
Set-Cookie: __PAGE_SITE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com
X-Repository: legacy
X-App-Server: production-web23:8018
X-App-Server: distil2-kvm:80
如果它只在某些站点停止工作,这是服务器端问题。我们无能为力。 – miken32
@ miken32但可从网络浏览器访问URL。不卷曲模拟浏览器?这是一个公开访问的网站,不需要登录,不需要SSL等。 –
删除'CURLOPT_FAILONERROR',你将得到405的全部内容,就像你展示的命令行一样。 –