2013-09-26

How do I get HTML data as an array or in XML format?

I want to get my HTML data as an array or as XML so that I can easily save it to a database. Here is my work so far:

$url = "http://www.example.com/";

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

$html = curl_exec($ch);
// release the handle whether or not the request succeeded
curl_close($ch);

if ($html) {
    // parse the html into a DOMDocument
    $dom = new DOMDocument();

    $dom->recover = true;
    $dom->strictErrorChecking = false;

    // @ suppresses the warnings libxml raises on malformed HTML
    @$dom->loadHTML($html);

    // collect every <div> element in the page
    $divs = $dom->getElementsByTagName('div');
} else {
    echo "The website could not be reached.";
}

What should I do to get the HTML as an array or in XML format? The incoming HTML looks like this:

<div> 
<ul> 
    <li>Product Name</li> 
    <li>Category</li> 
    <li>Subcategory</li> 
    <li>Product Price</li> 
    <li>Product Company</li> 
</ul> 
</div> 

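Possible duplicate of your earlier question: How to add scraped website data in database? (http://stackoverflow.com/questions/18997932/how-to-add-scraped-website-data-in-database). Please take care not to ask the same question more than once. – halfer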

Answers


For XML output, just do something like the following:

function download_page($path){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $path);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $retValue = curl_exec($ch);
    curl_close($ch);
    return $retValue;
}

$sXML = download_page('http://example.com');
// SimpleXMLElement requires well-formed XML and throws an Exception otherwise
$oXML = new SimpleXMLElement($sXML);

// send the header once, before any output
header('Content-type: application/xml');
foreach ($oXML->entry as $oEntry) {
    echo $oEntry->title . "\n";
}

But it gives me this error: Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in D:\wampserver\www\test1.php:17 Stack trace: #0 D:\wampserver\www\test1.php(17): SimpleXMLElement->__construct('<!DOCTYPE html ...') #1 {main} thrown in D:\wampserver\www\test1.php on line 17 – Aashi
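
The error reported in the comment above is consistent with how SimpleXMLElement works: its constructor only accepts well-formed XML, and the stack trace shows it was handed an HTML page ('<!DOCTYPE html ...'). One possible way around this is to parse the page with DOMDocument, which tolerates HTML, and then build the array and the XML from the parsed tree. The following is a minimal sketch, assuming PHP 7 or later with the DOM extension enabled and markup shaped like the sample in the question. The function names html_lists_to_array and rows_to_xml, and the element names products, product and field, are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns one array per
// <div><ul> list, holding the trimmed text of each <li> it contains.
function html_lists_to_array(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//div/ul') as $ul) {
        $row = [];
        foreach ($xpath->query('./li', $ul) as $li) {
            $row[] = trim($li->textContent);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Hypothetical helper: serialises the rows as XML. The element names
// "products", "product" and "field" are assumptions made for this sketch.
function rows_to_xml(array $rows): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElement('products'));
    foreach ($rows as $row) {
        $product = $root->appendChild($doc->createElement('product'));
        foreach ($row as $value) {
            $field = $product->appendChild($doc->createElement('field'));
            // A text node escapes &, < and > automatically.
            $field->appendChild($doc->createTextNode($value));
        }
    }
    return $doc->saveXML();
}

// Sample markup copied from the question.
$html = <<<HTML
<div>
<ul>
    <li>Product Name</li>
    <li>Category</li>
    <li>Subcategory</li>
    <li>Product Price</li>
    <li>Product Company</li>
</ul>
</div>
HTML;

$rows = html_lists_to_array($html);
print_r($rows);          // array form
echo rows_to_xml($rows); // XML form
```

In a real script, the string returned by curl_exec would take the place of the hard-coded $html. Each inner array corresponds to one product, so it could be bound to the parameters of a prepared INSERT statement; the column order is assumed to match the order of the list items.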
