2011-10-10
0

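Simplifying code to speed up a PHP scraper

The code simply goes to a page, grabs all the contents of a specified table, and inserts them into the corresponding columns of my database.

It is very slow, and I need ideas for simplifying it so that it runs faster.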

<?php 

// Set up the loop

$pagenumber = 1001; 

while ($pagenumber <= 5000) { 

// Get the content

$url = "http://www.example.com/info.php?num=$pagenumber"; 
$raw = file_get_contents($url); 

$newlines = array("\t","\n","\r","&nbsp;","\0","\x0B"); 
$content = str_replace($newlines, '', $raw); 

$start = strpos($content,'>Details<'); 
$end = strpos($content,'</table>',$start); 
$table1 = substr($content,$start,$end-$start); 
// $table1 = strip_tags($table1); 

// Get first names

$start = strpos($table1,'<td'); 
$end = strpos($table1,'<br />',$start); 
$fnames = substr($table1,$start,$end-$start); 
$fnames = strip_tags($fnames); 
$fnames = preg_replace('/\s\s+/', '', $fnames); 

// Get last names

$start = strpos($table1,'<br />'); 
$end = strpos($table1,'</td>',$start); 
$lnames = substr($table1,$start,$end-$start); 
$lnames = strip_tags($lnames); 
$lnames = preg_replace('/\s\s+/', '', $lnames); 

// Get phone

$start = strpos($table1,'Phone:'); 
$end = strpos($table1,'</td>    </tr>    <tr>',$start); 
$phone = substr($table1,$start,$end-$start); 
$phone = strip_tags($phone); 
$phone = str_replace("Phone:", "" ,$phone); 
$phone = preg_replace('/\s\s+/', '', $phone); 

// Get address

$start = strpos($table1,'Address:'); 
$end = strpos($table1,'</td>    </tr>    <tr>',$start); 
$ad = substr($table1,$start,$end-$start); 
$ad = strip_tags($ad); 
$ad = str_replace("Address:", "" ,$ad); 
$ad = preg_replace('/\s\s+/', '', $ad); 

// Get apartment number

$start = strpos($table1,'Apt:'); 
$end = strpos($table1,'</td>    </tr>    <tr>',$start); 
$apt = substr($table1,$start,$end-$start); 
$apt = strip_tags($apt); 
$apt = str_replace("Apt:", "" ,$apt); 
$apt = preg_replace('/\s\s+/', '', $apt); 

// Get country

$start = strpos($table1,'Country:'); 
$end = strpos($table1,'</td>    </tr>    <tr>',$start); 
$country = substr($table1,$start,$end-$start); 
$country = strip_tags($country); 
$country = str_replace("Country:", "" ,$country); 
$country = preg_replace('/\s\s+/', '', $country); 

// Get city

$start = strpos($table1,'City:<br />     State/Province:'); 
$end = strpos($table1,'</td>    </tr>    <tr>',$start); 
$city = substr($table1,$start,$end-$start); 
$city = strip_tags($city); 
$city = str_replace("City:     State/Province:", "" ,$city); 
$city = preg_replace('/\s\s+/', '', $city); 

// Get zip

$start = strpos($table1,'Zip:'); 
$end = strpos($table1,'</td>    </tr>    <tr>',$start); 
$zip = substr($table1,$start,$end-$start); 
$zip = strip_tags($zip); 
$zip = str_replace("Zip:", "" ,$zip); 
$zip = preg_replace('/\s\s+/', '', $zip); 

// Get email

$start = strpos($table1,'email:'); 
$end = strpos($table1,'</td>    </tr>',$start); 
$email = substr($table1,$start,$end-$start); 
$email = strip_tags($email); 
$email = str_replace("email:", "" ,$email); 
$email = preg_replace('/\s\s+/', '', $email); 

// Echo the row

echo "<tr> 
<td><a href='http://www.example.com/info.php?num=$pagenumber'>link</a></td> 
<td>$fnames</td> 
<td>$lnames</td> 
<td>$phone</td> 
<td>$ad</td> 
<td>$apt</td> 
<td>$country</td> 
<td>$city</td> 
<td>$zip</td> 
<td>$email</td> 
</tr>"; 

// Include DB info

include("inf.php"); 
$tablename = 'list'; 

$fnames = mysql_real_escape_string($fnames); 
$lnames = mysql_real_escape_string($lnames); 
$phone = mysql_real_escape_string($phone); 
$ad = mysql_real_escape_string($ad); 
$apt = mysql_real_escape_string($apt); 
$country = mysql_real_escape_string($country); 
$city = mysql_real_escape_string($city); 
$zip = mysql_real_escape_string($zip); 
$email = mysql_real_escape_string($email); 

// Insert the row into the DB

$query = "INSERT INTO $tablename VALUES('', '$pagenumber', '$fnames', '$lnames', '$phone', '$ad', 

'$apt','$country','$city','$zip', '$email')"; 
mysql_query($query) or die(mysql_error()); 

// Advance the loop

$pagenumber = $pagenumber + 1; 
} 

?> 
+1

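Could you not build the data up in, say, an array, and then save it to the database in one go? Instead of scrape > save > scrape, you would do scrape > scrape > scrape > one big save, and so save the DB round trips. – dougajmcdonald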

+0

How slow is "slow"? Reading a web page and saving it to a database 4000 times is never going to be particularly fast. – Widor

+0

I can't get my head around how I would do that. – Dasa

Answers

0

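You could take a look at cURL.

After fetching the page, we can grab all the required fields with a pattern. The matching can be done with preg_match_all.

Also, is there no XML/RSS feed available for the data you are looking for? See whether you can get the site to show more results per page; that would reduce the number of pages you need to scrape.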
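The pattern-matching idea can be sketched roughly as follows. The sample markup, the label names and the regular expression are assumptions made for illustration, since the real page layout is not shown in the thread; the pattern would need adjusting to the actual HTML.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<table><tr><td>Phone:</td><td>555-0100</td></tr>'
      . '<tr><td>Zip:</td><td>90210</td></tr></table>';

// Capture each "Label:" cell together with the value cell that follows it.
preg_match_all(
    '~<td>\s*([^<:]+):\s*</td>\s*<td>(.*?)</td>~si',
    $html,
    $matches,
    PREG_SET_ORDER
);

// Turn the matches into a label => value map.
$fields = array();
foreach ($matches as $m) {
    $fields[strtolower(trim($m[1]))] = trim(strip_tags($m[2]));
}

print_r($fields);
```

One call replaces the repeated strpos/substr/strip_tags blocks per field, which is mainly a readability gain; the network fetch is still likely to dominate the running time.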

Edit: a simple example, as requested:

Make sure cURL is enabled on your server:

echo 'cURL is '.(function_exists('curl_init') ? '' : ' not').' enabled'; 

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'http://example.com');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');   // accept compressed responses
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 5);                 // give up after 5 seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the body instead of printing it

$page = curl_exec($ch);
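Since most of the time in a scraper like this is probably spent waiting on the network, one possible extension of the snippet above is to fetch several pages at once with PHP's curl_multi functions. This is only a sketch: the function name fetchAll and the timeout default are made up for illustration, and it has not been tuned against the real site.

```php
<?php
// Fetch several URLs concurrently and return their bodies, keyed like $urls.
// fetchAll is a hypothetical helper name, not part of cURL.
function fetchAll(array $urls, $timeout = 5) {
    if (!$urls) {
        return array();               // nothing to fetch
    }

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // wait for activity instead of spinning
        }
    } while ($running && $status === CURLM_OK);

    // Collect the response bodies and detach the handles.
    $pages = array();
    foreach ($handles as $key => $ch) {
        $pages[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    return $pages;
}
```

In practice the 4000 URLs could be split with array_chunk into small batches (say 10 at a time, an arbitrary figure) and each batch passed to fetchAll, so the target server is not hit with every request at once.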
+0

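I would love to use cURL, but I have no idea how it works and there are no simple tutorials, so I never got the hang of it. The people who designed the website I'm scraping are just as bad at web design as I am. – Dasa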

+0

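I switched to cURL and set it up to scrape into a huge two-dimensional array. Now I need to insert it into the MySQL DB. Any ideas on how to go about it? – Dasa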
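One way to save a two-dimensional array in a single round trip is a multi-row INSERT with placeholders. The sketch below swaps the mysql_* calls from the question for a PDO prepared statement; the helper name buildBulkInsert, the column names and the sample rows are all assumptions, because the thread does not show the table schema.

```php
<?php
// Build one parameterised multi-row INSERT.
// $rows is a list of value lists, each with one value per column.
// buildBulkInsert is a hypothetical helper name.
function buildBulkInsert($table, array $columns, array $rows) {
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = 'INSERT INTO ' . $table . ' (' . implode(', ', $columns) . ') VALUES '
         . implode(', ', array_fill(0, count($rows), $rowPlaceholder));

    // Flatten the rows into one parameter list, in placeholder order.
    $params = array();
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }
    return array($sql, $params);
}

// Made-up sample data and column names.
$rows = array(
    array(1001, 'Ann', 'Lee'),
    array(1002, 'Bob', 'Ray'),
);
list($sql, $params) = buildBulkInsert('list', array('num', 'fname', 'lname'), $rows);

// With an existing PDO connection in $pdo (not created here):
// $pdo->prepare($sql)->execute($params);
```

Because the values travel as bound parameters, no manual escaping is needed. For thousands of rows it may be safer to insert in chunks of a few hundred, as the server limits the size of a single statement.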

1

Don't use regular expressions on HTML. You should use XPath and, for PHP specifically, DOMXPath.
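A minimal sketch of the DOMXPath approach, assuming each label sits in one table cell with its value in the next cell. The sample markup and the helper name fieldByLabel are made up for illustration; the XPath expression would need adapting to the real page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><table>'
      . '<tr><td>Phone:</td><td>555-0100</td></tr>'
      . '<tr><td>Zip:</td><td>90210</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Return the text of the cell that follows the cell containing $label, or null.
// Assumes $label contains no single quote, since it is placed inside the expression.
function fieldByLabel(DOMXPath $xpath, $label) {
    $nodes = $xpath->query("//td[normalize-space(.)='$label']/following-sibling::td[1]");
    return ($nodes && $nodes->length > 0) ? trim($nodes->item(0)->textContent) : null;
}

$phone = fieldByLabel($xpath, 'Phone:');
$zip   = fieldByLabel($xpath, 'Zip:');
```

Unlike the strpos offsets in the question, this keeps working when whitespace or attribute order in the markup changes, as long as the label text stays the same.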