2013-02-17 120 views
2

I'm writing a PHP web crawler script that will run as a weekly cron job, fetching data from 2000+ web pages (TED website as an example)

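The main purpose of this script is to get the details of all the TED talks available on the TED website (used as an example, to make this question easier to understand).

The script takes about 70 minutes to run and goes over about 2000 web pages.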

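My questions are:

1) Is there a better/faster way to fetch each web page than the function I'm using now:

file_get_contents_curl($url)

2) Is it good practice to keep all the talks in an array (which can get quite large)?

3) Is there a better way to get, for example, all the TED talk details from the website? What is the best way to "crawl" the TED website to get all the talks?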

**I have looked into using the RSS feed, but it lacks some of the details I need.

Thanks

<?php 
define("START_ID", 1); 
define("STOP_TED_QUERY",20); 
define ("VALID_PAGE","TED | Talks"); 
/** 
* this script will run as a cron job and will go over all pages 
* on TED http://www.ted.com/talks/view/id/ 
* from id 1 till there are no more pages 
*/ 

/** 
* function get a file using curl (fast) 
* @param $url - url which we want to get its content 
* @return the data of the file 
* @author XXXXX 
*/ 
function file_get_contents_curl($url) 
{ 
    $ch = curl_init(); 

    curl_setopt($ch, CURLOPT_HEADER, 0); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($ch, CURLOPT_URL, $url); 
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); 

    $data = curl_exec($ch); 
    curl_close($ch); 

    return $data; 
} 

//will hold all talks in array 
$tedTalks = array(); 

//id to start the query from 
$id=START_ID; 

//counts consecutive missing pages; used to stop the query once we have passed the last id on the TED website 
$endOFQuery=0; 

//get the time 
$time_start = microtime(true); 

//start the query on TED website 
//if we query 20 pages in a row that do not exist we stop the queries and assume there are no more 
while ($endOFQuery < STOP_TED_QUERY){ 

    //get the page of the talk 
    $html = file_get_contents_curl("http://www.ted.com/talks/view/id/$id"); 

    //parsing begins here: 
    $doc = new DOMDocument(); 
    @$doc->loadHTML($html); 
    $nodes = $doc->getElementsByTagName('title'); 

    //get and display what you need: 
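    //a failed download has no <title>, so treat it like a missing page 
    $titleNode = $nodes->item(0); 
    $title = $titleNode ? $titleNode->nodeValue : VALID_PAGE; 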


    //check if this a valid page 
    if (! strcmp ($title , VALID_PAGE)) 
     //this is a removed ted talk or the end of the query so raise a flag (if we get enough of these in a row we will stop) 
     $endOFQuery++; 
    else { 
     //this is a valid TED talk get its details 

     //reset the flag for end of query 
     $endOFQuery = 0; 

     //get meta tags 
     $metas = $doc->getElementsByTagName('meta'); 

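     //get the tag we need (keywords); reset first so a page without 
     //the tag does not reuse the previous talk's keywords 
     $keywords = ''; 
     for ($i = 0; $i < $metas->length; $i++) 
     { 
      $meta = $metas->item($i); 
      if($meta->getAttribute('name') == 'keywords') 
       $keywords = $meta->getAttribute('content'); 
     } 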

     //create new talk object and populate it 
     $talk = new Talk(); 
     //set its ted id from ted web site 
     $talk->setID($id); 
     //parse the name (name has un-needed char's in the end) 
     $talk->setName(substr($title, 0, strpos($title, '|'))); 

     //parse the String of tags to array (trim the spaces left around each tag) 
     $keywords = array_map('trim', explode(",", $keywords)); 
     //remove un-needed items from it 
     $keywords=array_diff($keywords, array("TED","Talks")); 

     //add the filters tags to the talk 
     $talk->setTags($keywords); 

     //add to the total talks array 
     $tedTalks[]=$talk; 
    } 

    //move to the next ted talk ID to query 
    $id++; 
} //end of the while 

$time_end = microtime(true); 
$execution_time = ($time_end - $time_start); 
echo "this took (sec) : ".$execution_time; 

?> 
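
On question 2, one alternative to holding every talk in `$tedTalks` is to write each talk out as soon as it is parsed. The following is only a minimal sketch under assumptions: `append_talk` and `read_talks` are hypothetical helper names, the one-JSON-object-per-line file layout is my choice, and the helpers take plain values because the `Talk` class is not shown in the question.

```php
<?php
/**
 * Hypothetical helper: append one talk to an open file handle as a single
 * JSON line, so the crawler never has to keep all talks in memory.
 */
function append_talk($handle, int $id, string $name, array $tags): void
{
    $line = json_encode(array(
        'id'   => $id,
        'name' => $name,
        // array_diff() keeps the original keys, so reindex to get a JSON list
        'tags' => array_values($tags),
    ));
    fwrite($handle, $line . "\n");
}

/**
 * Hypothetical reader: yields the stored talks one at a time (generator),
 * so reading the file back does not need one big array either.
 */
function read_talks(string $path): Generator
{
    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line !== '') {
            yield json_decode($line, true);
        }
    }
    fclose($handle);
}
```

In the loop above, the `$tedTalks[]=$talk;` line would become a call to `append_talk()` on a handle opened once before the `while`. For roughly 2000 small objects an array is unlikely to be a memory problem, so this matters mostly if the per-talk data grows or the script should survive being interrupted halfway.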

You can use curl multi mode to fetch pages in parallel. You could also look into Yahoo Pipes, which can fetch the pages and parse out the specific data you need. – 2013-02-18 03:42:10


Henley Chiu - can you show a code snippet of curl multi mode? – Nimrod007 2013-02-24 07:51:17


I think there are good examples here: http://php.net/manual/en/function.curl-multi-exec.php – 2013-03-01 13:57:39
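
Since a curl multi snippet was asked for in the comments, here is a minimal sketch of what it could look like. It is not taken from the linked manual page: the function name `fetch_all_parallel`, the batch size of 10 and the 30 second timeout are assumptions chosen for illustration, and the curl options simply mirror `file_get_contents_curl()` from the question.

```php
<?php
/**
 * Sketch: download several URLs concurrently with the curl_multi API.
 * Returns an array with the same keys as $urls; each value is the page
 * body as a string, or false if curl returned no content for it.
 * $batchSize (an assumed default) limits how many requests run at once.
 */
function fetch_all_parallel(array $urls, int $batchSize = 10): array
{
    $results = array();

    // process the URLs in batches so we do not open 2000 connections at once
    foreach (array_chunk($urls, $batchSize, true) as $batch) {
        $mh = curl_multi_init();
        $handles = array();

        foreach ($batch as $key => $url) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            curl_multi_add_handle($mh, $ch);
            $handles[$key] = $ch;
        }

        // drive all transfers in this batch until none is still running
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running && curl_multi_select($mh, 1.0) === -1) {
                // select can fail on some systems; sleep briefly instead of spinning
                usleep(1000);
            }
        } while ($running && $status === CURLM_OK);

        // collect the bodies and release the handles
        foreach ($handles as $key => $ch) {
            $body = curl_multi_getcontent($ch);
            $results[$key] = is_string($body) ? $body : false;
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }

    return $results;
}
```

One way to use it with the script from the question would be to build the URLs for the next batch of ids (for example ids `$id` to `$id + 9`, keyed by id), call `fetch_all_parallel()` once, and then run the existing DOM parsing over each returned body in id order. The batch size is worth keeping modest so the crawl stays polite to the target site.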

Answers