How do I get HTML data as an array or in XML format?

I want to get my HTML data as an array or in XML format so that I can easily save it to a database. Here is what I have so far:

$url = "http://www.example.com/";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

$html = curl_exec($ch);
curl_close($ch); // close the handle whether or not the request succeeded

if ($html) {
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    $dom->recover = true;
    $dom->strictErrorChecking = false;

    @$dom->loadHTML($html); // @ suppresses warnings from malformed markup

    // collect every <div> element in the page
    $divs = $dom->getElementsByTagName('div');
} else {
    echo "The website could not be reached.";
}

What should I do to get the HTML as an array or as XML? The incoming HTML looks like this:

<div> 
<ul> 
    <li>Product Name</li> 
    <li>Category</li> 
    <li>Subcategory</li> 
    <li>Product Price</li> 
    <li>Product Company</li> 
</ul> 
</div>
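
One way to turn markup of this shape into an array is sketched below, assuming each `<ul>` inside a `<div>` describes one product and its `<li>` items always appear in the same order. The sample string and the `$fields` column names are hypothetical, chosen only for illustration; they are not part of the original question.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div>
<ul>
    <li>Product Name</li>
    <li>Category</li>
    <li>Subcategory</li>
    <li>Product Price</li>
    <li>Product Company</li>
</ul>
</div>';

// Hypothetical column names, one per <li> position.
$fields = ['name', 'category', 'subcategory', 'price', 'company'];

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$products = [];

// Treat each <ul> directly inside a <div> as one product record.
foreach ($xpath->query('//div/ul') as $ul) {
    $values = [];
    foreach ($xpath->query('./li', $ul) as $li) {
        $values[] = trim($li->textContent);
    }
    // Keep only lists that have exactly one value per field.
    if (count($values) === count($fields)) {
        $products[] = array_combine($fields, $values);
    }
}

print_r($products);
```

Each element of `$products` is then an associative array keyed by field name, which maps naturally onto one row of a database table.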

Source

2013-09-26 Aashi

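Possible duplicate of your earlier question: How to add scraped website data in database? (http://stackoverflow.com/questions/18997932/how-to-add-scraped-website-data-in-database). Please take care not to ask the same question multiple times. – halfer

For XML output, just do something like the following: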

function download_page($path){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $path);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);     // treat HTTP status >= 400 as a failure
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $retValue = curl_exec($ch);
    curl_close($ch);
    return $retValue;
}

$sXML = download_page('http://example.com');
// SimpleXMLElement requires well-formed XML; the loop below expects <entry> elements with a <title>
$oXML = new SimpleXMLElement($sXML);

// send the header once, before any output
header('Content-type: application/xml');
foreach($oXML->entry as $oEntry){
    echo $oEntry->title . "\n";
}

Source

2013-09-26 06:15:36

But it gives me this error: Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in D:\wampserver\www\test1.php:17 Stack trace: #0 D:\wampserver\www\test1.php(17): SimpleXMLElement->__construct('<!DOCTYPE html...') #1 {main} thrown in D:\wampserver\www\test1.php on line 17 – Aashi
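
The exception in this comment is what SimpleXMLElement raises when it is handed an ordinary HTML page, because its constructor accepts only well-formed XML. A possible workaround, sketched here with a hypothetical sample string in place of the downloaded page, is to parse the HTML with DOMDocument first and then either serialise the tree as XML or hand it to SimpleXML through `simplexml_import_dom`.

```php
<?php
// Hypothetical sample: valid HTML, but the unclosed <br> makes it invalid as XML.
$html = '<!DOCTYPE html><html><body><div><ul>'
      . '<li>Product Name</li><li>Category</li>'
      . '</ul><br></div></body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);            // lenient HTML parser, tolerates non-XML markup
libxml_clear_errors();

// Serialise the parsed tree as XML, starting at the <html> element.
$xmlString = $dom->saveXML($dom->documentElement);
echo $xmlString, "\n";

// Or read the same tree through the SimpleXML API.
$sxml = simplexml_import_dom($dom);
foreach ($sxml->body->div->ul->li as $li) {
    echo $li, "\n";
}
```

The path `body->div->ul->li` matches only this sample's structure; a real page would need its own path or an XPath query.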
