Parsing the source code of a web page given its URL

-1

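How can I parse the source code of a web page, given its URL? I want to find the author, the title, and the last-modified time in the source.

My idea is to fetch the source code with file_get_contents(). Then, for the author, I would look for <meta name="author" content="[...]"> in the source and extract the content attribute. For the title, I would look for <title></title> and extract what is inside. I don't know how I would find the last-modified time.

Are these approaches workable? Is there a better way?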


2014-10-17 cycloidistic

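PHP's [DOM](http://php.net/manual/en/book.dom.php) extension offers plenty of options for parsing and manipulating HTML and XML. You can use 'file_get_contents' or curl to retrieve the page. – 2014-10-17 10:03:42

I want to find the title, the author, and the last-modified time. – cycloidistic 2014-10-17 10:07:08

Web pages vary a lot. You need to give a sample of the page you are trying to parse. – 2014-10-17 10:10:14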

You can use file_get_contents.

For example:

$content = file_get_contents('http://www.external-site.com/page.php');

The variable $content will then hold the content of the external site.
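For the last-modified time the question asks about, one possible source is the HTTP Last-Modified response header, which many servers do not send at all. The sketch below assumes the header lines are available as an array of strings, which is how PHP exposes them in $http_response_header after file_get_contents on an HTTP URL. The helper name parseLastModified is made up for this illustration.

```php
<?php
// Hypothetical helper: find the Last-Modified header in a list of raw
// response header lines and return it as a Unix timestamp, or null if
// the header is missing or cannot be parsed.
function parseLastModified(array $headerLines): ?int
{
    $result = null;
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match with stripos().
        if (stripos($line, 'Last-Modified:') === 0) {
            $value = trim(substr($line, strlen('Last-Modified:')));
            $timestamp = strtotime($value);
            // After redirects the list holds headers for every response,
            // so keep the last parseable match.
            if ($timestamp !== false) {
                $result = $timestamp;
            }
        }
    }
    return $result;
}

// Usage sketch (needs network access and allow_url_fopen enabled):
// $content = file_get_contents('http://www.external-site.com/page.php');
// $lastModified = parseLastModified($http_response_header ?? []);
```

If the header is absent, the function returns null, and the only remaining option is to look for a date inside the page itself, which depends on the site.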


2014-10-17 10:03:46

You need to parse the DOM.

Try a parser such as this one: http://simplehtmldom.sourceforge.net/


2014-10-17 10:16:47 Nausik

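Use cURL instead (it still works when the "allow_url_fopen" directive is off, and it is more flexible).

To parse the page source, use the DOM library, but disable libxml error output before loading the HTML content.

How you parse it depends on what you want to do with it. For example: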

<?php
$url = 'http://stackoverflow.com/';

// Fetch the page with cURL and return the body as a string.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// curl_exec() returns false (not null) on failure.
if ($content === false || $httpCode >= 400) {
    die();
}

// Collect libxml warnings about malformed HTML instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);

// Read the text of the <title> element if the page has exactly one.
$title = null;
$titleNodes = $dom->getElementsByTagName('title');
if ($titleNodes->length === 1) {
    $title = $titleNodes->item(0)->textContent;
}
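The same DOM approach can be extended to the author, which the question expects in a <meta name="author"> tag. The following is a minimal sketch that works on an HTML string that has already been downloaded. The function name extractTitleAndAuthor is hypothetical, and the XPath match on name="author" is case-sensitive, so pages that write the attribute value differently would need a looser expression.

```php
<?php
// Hypothetical helper: pull the <title> text and the content attribute of
// <meta name="author"> out of a non-empty HTML string.
function extractTitleAndAuthor(string $html): array
{
    // Collect libxml warnings about malformed HTML instead of printing them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // item(0) returns null when the query matches nothing.
    $titleNode = $xpath->query('//head/title')->item(0);
    $authorNode = $xpath->query('//meta[@name="author"]/@content')->item(0);

    return [
        'title'  => $titleNode !== null ? trim($titleNode->textContent) : null,
        'author' => $authorNode !== null ? trim($authorNode->nodeValue) : null,
    ];
}

// Usage sketch with an inline document instead of a downloaded page.
$sample = '<html><head><title> Example page </title>'
        . '<meta name="author" content="Jane Doe"></head>'
        . '<body><p>Hi</p></body></html>';
$info = extractTitleAndAuthor($sample);
```

Either key is null when the corresponding element is missing, which is common for the author tag.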


2014-10-17 10:28:07
