Empty attribute in DOM returns unexpected fallback value

I retrieved the contents of this webpage, http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369, and saved them to $webpage.

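Please note:

This webpage contains many <meta> tags. One of them is the culprit and is causing problems: <meta property="og:description" content="" />. Note that the value of content is an empty string.

This is how I read the contents of the webpage: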

<?php 

$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369'; 

$webpage = file_get_contents($url); 

$og_entry_title = ""; 
$og_entry_content = ""; 

$doc = new DOMDocument; 
$doc->loadHTML($webpage); 

$meta_tags = $doc->getElementsByTagName('meta'); 

foreach ($meta_tags as $meta_tag) { 

    if ($meta_tag->getAttribute('property') == 'og:title') { 
     $og_entry_title = $meta_tag->getAttribute('content'); 
    } 

    if ($meta_tag->getAttribute('property') == 'og:description') { 
     $og_entry_content = $meta_tag->getAttribute('content'); 
    } 

} 

// print the results 
echo 
'$og_entry_title: ' . $og_entry_title 
.PHP_EOL. 
'$og_entry_content: ' . $og_entry_content;

When the script finishes, $og_entry_title and $og_entry_content have the following values:

$og_entry_title: TOP STORIES | DW.COM 
$og_entry_content: News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment.

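Please note the following about the results:

$og_entry_title is correct and contains the page title, so no problem here.

$og_entry_content holds a value different from what I expected. I expected an empty string to be saved in $og_entry_content; instead, the string "News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment." was saved. This string looks like a fallback (default) value that is returned whenever the meta tag contains an empty string.

After further investigation, it turned out that og:description is getting its value from the http://www.dw.com page. It seems that because my page contains an empty string, the returned value is retrieved from the site's root page.

I have the following questions about $og_entry_content:

How can I make sure the empty string (not the fallback value) is saved to $og_entry_content?
Why is this fallback value from the root page returned at all?

Thanks.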

Source

2016-06-14 Greeso

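I can't reproduce this. For me, 'var_dump($og_entry_content);' at the end of the script results in 'string(0) ""' –

Have you tried the alternative 'get_meta_tags'? Looking at this on my end, it should yield an empty string – Ghost

@RodrigoDuterte - 'get_meta_tags' results in the same problem. – Greeso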

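Answer

Your URL contains special characters that need to be URL encoded.

Explanation

First, the assumption that...

$og_entry_title is correct and contains the page title, so no problem here

...is false.

This title: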

<meta property="og:title" content="تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006" />

is not the same as this title:

<meta property="og:title" content="TOP STORIES | DW.COM" />

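Second, most modern browsers are awesome enough to do URL encoding on the fly while still displaying the special characters in the address bar.

You can see the response headers from the web server to learn more.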

<?php 
$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369'; 
$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, "$url"); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_VERBOSE, 1); 
curl_setopt($ch, CURLOPT_HEADER, 1); 
$response = curl_exec($ch); 

// Then, after your curl_exec call: 
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE); 
echo ' 
header 
------ 
'.substr($response, 0, $header_size);

The result shows that the server does not recognize the association between the URL and the page:

header 
------ 
HTTP/1.1 301 Moved Permanently 
Server: Apache-Coyote/1.1 
Location: /
Content-Length: 0 
Accept-Ranges: bytes 
X-Varnish: 99639238 
Date: Thu, 16 Jun 2016 15:42:51 GMT 
Connection: keep-alive

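HTTP response code 301 is a notice of a (permanent) redirect to another page. Location: / means you are being sent to the home page. Sending someone to the home page when the server doesn't know what to do with the request is a common, sloppy practice.

By default, cURL does not follow redirects, which is how we were able to inspect the 301 response header. file_get_contents, however, does follow redirects, which is why you got content other than what you expected. (Possible exception: there is a bug report in which some people note that it does not always follow redirects.)

Note that the home page does have content in its og:description: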
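The redirect can also be observed with file_get_contents itself by turning off redirect following. The following is only a sketch under stated assumptions: status_code_from_headers and fetch_without_redirects are hypothetical helper names, and the code relies on the 'follow_location' HTTP context option and on $http_response_header, which the HTTP wrapper populates in the calling scope.

```php
<?php
// Hypothetical helper: extract the status code from the first status line,
// e.g. "HTTP/1.1 301 Moved Permanently" gives 301.
function status_code_from_headers(array $headers): ?int
{
    foreach ($headers as $line) {
        if (preg_match('~^HTTP/\d(?:\.\d)?\s+(\d{3})~', $line, $m)) {
            return (int) $m[1];
        }
    }
    return null; // no status line found
}

// Hypothetical helper: fetch a URL without following redirects, so that a
// 301/302 response is reported instead of being silently followed.
function fetch_without_redirects(string $url): array
{
    $context = stream_context_create([
        'http' => [
            'follow_location' => 0,    // do not follow Location: headers
            'ignore_errors'   => true, // still return the body on 4xx/5xx
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    $headers = $http_response_header ?? [];

    return [
        'status'  => status_code_from_headers($headers),
        'headers' => $headers,
        'body'    => $body,
    ];
}
```

Calling fetch_without_redirects($url) and checking whether 'status' is in the 3xx range before parsing the body would have exposed the redirect to the home page instead of silently parsing the wrong page.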

<?php 
echo file_get_contents('http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369');

results in this output:

...

<meta property="og:description" content="News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment. " />

...

<meta property="og:title" content="TOP STORIES | DW.COM" />

...

Solution

The first thing you need to do is rawurlencode the URL:

$url = rawurlencode($url);

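Then realize that rawurlencode is poorly named, because a valid URL contains the protocol prefix http:// or https:// and may also contain slashes to separate its parts. This is a problem because rawurlencode converts the colon : to %3A and the slash / to %2F, which produces an invalid URL such as http%3A%2F%2Fwww.dw.com%2Far%2F.... It should have been named rawurlencode_parts_of_URL, but they didn't ask me :) And in their defense, to quote Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

So convert the slashes and colons back to their original form: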

$url = str_replace('%3A',':',str_replace('%2F','/',$url));
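Encoding the whole URL and then restoring ':' and '/' works for this URL, but it would also encode characters such as '?', '=' and '&' if the URL had a query string. As an alternative sketch (not the method used in this answer), only the path segments can be encoded. encode_url_path is a hypothetical helper name, and the regular expression is a simplification that assumes an absolute URL of the form scheme://host/path.

```php
<?php
// Hypothetical helper: percent-encode only the path segments of an absolute
// URL, leaving the scheme, host, query string and fragment untouched.
function encode_url_path(string $url): string
{
    // Split into "scheme://host", "path", and "everything from ? or # on".
    if (!preg_match('~^([a-z][a-z0-9+.-]*://[^/?#]+)([^?#]*)(.*)$~i', $url, $m)) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Decode first so that an already-encoded segment is not encoded twice.
    $segments = array_map(
        function (string $segment): string {
            return rawurlencode(rawurldecode($segment));
        },
        explode('/', $m[2])
    );

    return $m[1] . implode('/', $segments) . $m[3];
}
```

Because each segment is decoded before it is encoded, passing an already-encoded URL through the function again leaves it unchanged.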

Finally, the last thing you need to do is send a header to your clients to let them know what character encoding to expect.

header("content-type: text/html; charset=utf-8");

Otherwise, your clients may end up reading gobbledygook that looks something like this:

ØªÙ,Ø±USO±Ø§Ø³ØªØ®Ø¨Ø§Ø±我们Ø§Ù... USO±ÙƒÙŠ：Ø§Ù“Ù,Ø§Ø¹Ø¯Ø©ØªØ³ÙŠØ·Ø±Ø¹Ù“U‰ØºØ±ØØ§Ù“Ø¹Ø±Ø§Ù

Final product

<?php 

// let's see error output on screen while in development 
// remove these lines for production, and use log files only 
error_reporting(-1); 
ini_set('display_errors', 'On'); 

$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369'; 

// URL encode special chars 
$url = rawurlencode($url); 

// fix colons and slashes for valid URL 
$url = str_replace('%3A',':',str_replace('%2F','/',$url)); 

// make request 
$webpage = file_get_contents($url); 

$og_entry_title = ""; 
$og_entry_content = ""; 

$doc = new DOMDocument; 
$doc->loadHTML($webpage); 

$meta_tags = $doc->getElementsByTagName('meta'); 

foreach ($meta_tags as $meta_tag) { 

    if ($meta_tag->getAttribute('property') == 'og:title') { 
     $og_entry_title = $meta_tag->getAttribute('content'); 
    } 

    if ($meta_tag->getAttribute('property') == 'og:description') { 
     $og_entry_content = $meta_tag->getAttribute('content'); 
    } 

} 

// set the character set for the client 
header("content-type: text/html; charset=utf-8"); 

// print the results 
echo 
'$og_entry_title: ' . $og_entry_title 
.PHP_EOL. 
'$og_entry_content: ' . $og_entry_content;

Resulting output:

$og_entry_title: تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006 
$og_entry_content:

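Appendix

If you had been looking at your error logs, and you really should always be looking at your error logs while developing, you would have noticed a barrage of warnings: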

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 4 in ... 

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 5 in ... 

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 6 in ... 

Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 7 in ... 

Warning: DOMDocument::loadHTML(): ID topMetaInner already defined in Entity, line: 300 in ... 

Warning: DOMDocument::loadHTML(): ID langSelectTrigger already defined in Entity, line: 315 in ... 

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ... 

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ... 

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ... 

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...

This is because you are trying to use the DOMDocument class with invalid HTML and XML documents that are not well-formed. But that is a topic for another question.

Source
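One common way to keep such warnings out of the output while still being able to inspect them is libxml_use_internal_errors. This is a minimal sketch, not part of the script above; extract_og_tags is a hypothetical helper name, and the HTML in the usage example below is made up for illustration.

```php
<?php
// Hypothetical helper: collect all og:* meta tags from an HTML string while
// buffering libxml parse warnings instead of emitting them as PHP warnings.
function extract_og_tags(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer parser errors
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    $warnings = libxml_get_errors();              // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            // An empty content attribute is stored as an empty string.
            $tags[$property] = $meta->getAttribute('content');
        }
    }

    return ['tags' => $tags, 'warnings' => $warnings];
}
```

For example, extract_og_tags('<html><head><meta property="og:description" content="" /></head><body></body></html>')['tags']['og:description'] is an empty string, which is the value the question expected once the correct page is fetched.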

2016-06-16 19:14:29

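Thank you for the amazingly detailed answer. I have done everything you mentioned, but I am still having the issue. I think the problem is that the server is not sending me back the proper page to begin with. I will investigate further. – Greeso

Really? You *don't* get the same output that I show when running the "Final product" script? I have updated the answer to show errors on screen. What is your output? –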
