2010-09-21 115 views
0

我不能似乎得到这个代码peice的工作:为什么我无法使用Perl的LWP :: Simple获取www.google.com?

$self->{_current_page} = $href; 
    my $response = $ua->get($href); 
    my $responseCode = $response->code; 
    if($responseCode ne "404") { 
     my $content = LWP::Simple->get($href); 
     die "get failed: " . $href if (!defined $content); 
    } 

将返回错误:get failed: http://www.google.com

完整的代码如下:

#!/usr/bin/perl 
use strict; 
use URI; 
use URI::http; 
use File::Basename; 
use DBI; 
use LWP::Simple; 
require LWP::UserAgent; 
my $ua = LWP::UserAgent->new; 
$ua->timeout(10); 
$ua->env_proxy; 
$ua->max_redirect(0); 


package Crawler; 
sub new { 
    my $class = shift; 
    my $self = { 
     _url => shift, 
     _max_link => 0, 
     _local => 1 
    }; 
    bless $self, $class; 
    return $self; 

} 
sub trim{ 
    my($self, $string) = @_; 
    $string =~ s/^\s+//; 
    $string =~ s/\s+$//; 
    return $string; 
} 
sub process_image { 
    my ($self, $process_image) = @_; 
    $self->{_process_image} = $process_image; 
} 
sub local { 
    my ($self, $local) = @_; 
    $self->{_local} = $local; 
} 
sub max_link { 
    my ($self, $max_link) = @_; 
    $self->{_max_link} = $max_link; 
} 
sub x_more { 
    my ($self, $x_more) = @_; 
    $self->{_x_more} = $x_more; 
} 
sub resolve_href { 
    my ($base, $href) = @_; 
    my $uri = URI->new($href); 
    return $uri->rel($base);  
} 
sub write { 
    my ($self, $ref, $data) = @_; 
    open FILE, '>c:/perlscripts/' . $ref . '_' . $self->{_process_image} . '.txt'; 
    foreach($data) { 
     print FILE $self->trim($_) . "\n"; 
    } 
    close(FILE); 
} 
sub scrape { 
    my @m_error_array; 
    my @m_href_array; 
    my @href_array; 
    my ($self, $DBhost, $DBuser, $DBpass, $DBname) = @_; 
    my ($dbh, $query, $result, $array); 
    my $DNS = "dbi:mysql:$DBname:$DBhost:3306"; 
    $dbh = DBI->connect($DNS, $DBuser, $DBpass) or die $DBI::errstr; 
    if(defined($self->{_process_image}) && (-e 'c:/perlscripts/href_w_' . $self->{_process_image} . ".txt")) { 
     open ERROR_W, "<c:/perlscripts/error_w_" . $self->{_process_image} . ".txt"; 
     open M_HREF_W, "<c:/perlscripts/m_href_w_" . $self->{_process_image} . ".txt"; 
     open HREF_W, "<c:/perlscripts/href_w_" . $self->{_process_image} . ".txt"; 
     @m_error_array = <ERROR_W>; 
     @m_href_array = <M_HREF_W>; 
     @href_array = <HREF_W>; 
     close (ERROR_W); 
     close (M_HREF_W); 
     close (HREF_W); 
    }else{ 
     @href_array = ($self->{_url}); 
    } 
    my $z = 0; 
    while(@href_array){ 
     if(defined($self->{_x_more}) && $z == $self->{_x_more}) { 
      last; 
     } 
     if(defined($self->{_process_image})) { 
      $self->write('m_href_w', @m_href_array); 
      $self->write('href_w', @href_array); 
      $self->write('error_w', @m_error_array); 
     } 
     $self->{_link_count} = scalar @m_href_array; 
     my $href = shift(@href_array); 
     my $info = URI::http->new($href); 
     my $host = $info->host; 
     $host =~ s/^www\.//; 
     $result = $dbh->prepare("INSERT INTO `". $host ."` (URL) VALUES ('$href')"); 
     if(! $result->execute()){ 
      $result = $dbh->prepare("CREATE TABLE `" . $host . "` (`ID` INT(255) NOT NULL AUTO_INCREMENT , `URL` VARCHAR(255) NOT NULL , PRIMARY KEY (`ID`)) ENGINE = MYISAM ;"); 
      $result->execute(); 
     } 
     $self->{_current_page} = $href; 
     my $response = $ua->get($href); 
     my $responseCode = $response->code; 
     if($responseCode ne "404") { 
      my $content = LWP::Simple->get($href); 
      die "get failed: " . $href if (!defined $content); 
     } 
     #print $responseCode; 
    } 
} 
1; 

#$query = "SELECT * FROM `actwebdesigns.co.uk` ORDER BY ID DESC"; 
#$result = $dbh->prepare($query); 
#$result->execute(); 
#while($array = $result->fetchrow_hashref()) { 
# print $array->{'URL'} . "\n"; 
#} 

编辑:

仍然无法处理重定向。

my $redirect_limit = 10; 
    my $y = 0; 
    while(1 && $y le $redirect_limit) { 
     my $response = $ua->get($href); 
     my $responseCode = $response->code; 
     if($responseCode == 200 || $responseCode == 301 || $responseCode == 302) { 
      if($responseCode == 301 || $responseCode == 302) { 
       $href = $response->header('Location'); 
      }else{ 
       last; 
      } 
     }else{ 
      push(@m_error_array, $href); 
      last; 
     } 
     $y++; 
    } 
    if($y ne $redirect_limit) { 
     if(! defined($self->{_url_list})) { 
      my @url_list = ($href); 
     }else{ 
      my @url_list = $self->{_url_list}; 
      push(@url_list, $href); 
      $self->{_url_list} = @url_list; 
     } 
     my $content = LWP::Simple->get($href); 
     die "get failed: " . $href if (!defined $content); 

     #$result = $dbh->prepare("INSERT INTO `". $host ."` (URL) VALUES ('$href')"); 
     #if(! $result->execute()){ 
     # $result = $dbh->prepare("CREATE TABLE `" . $host . "` (`ID` INT(255) NOT NULL AUTO_INCREMENT , `URL` VARCHAR(255) NOT NULL , PRIMARY KEY (`ID`)) ENGINE = MYISAM ;"); 
     # $result->execute(); 
     #} 
     print "good"; 
    }else{ 
     push(@m_error_array, $href); 
    } 
+1

你为什么要抓取的网页两次?使用'$ response-> content'而不是'LWP :: Simple-> get($ href)''。 – cjm 2010-09-21 07:37:27

+0

谢谢你,对钱! – 2010-09-21 07:56:29

+2

如果遇到问题,请将其缩小为演示问题的最短示例脚本。在你的程序中删除其他所有内容。找出哪些数据触发问题。通常,通过这个过程,你会发现你的答案 – 2010-09-21 11:55:12

回答

2

您应该检查响应代码以查看发生了什么(您已在检查404)。我收到一个302 - 重定向。

例如:

die "get failed ($responseCode): " . $href if (!defined $content); 

结果消息:

get failed (302): http://www.google.com at goog.pl line 20. 
+0

哈,这是一个什么东西!受信任的google.com本地化为.co.uk ...应该知道这一点。问候! – 2010-09-21 06:24:44

+0

仍然无法正常工作,请参阅上述修正。 – 2010-09-21 07:03:21

2

一对夫妇的想法。

1 /您似乎在使用字符串比较运算符(le,ne)来比较数字。您应该使用数字比较运算符(< =,!=)代替。

2您从LWP::UserAgent::get调用返回的值是HTTP::Response对象。明智地使用该类的“is_foo”方法可能会使您的代码更清晰一些。

我不知道这些是否能解决您的问题。但他们会提高你的代码的质量。

1

这是你的问题:

my $content = LWP::Simple->get($href); 

这将字符串“LWP ::简单”作为第一个参数“得到”。你想:

my $content = LWP::Simple::get($href); 
-1

检查你的SELinux设置。

已启用SELINUX的系统将不允许来自Web代理(httpd)的传出连接。

此页面可以告诉你更多关于SELinux和HTTPD设置: http://wiki.centos.org/TipsAndTricks/SelinuxBooleans

在Perl脚本启动Apache的出站Web连接:

# setsebool -P httpd_can_network_connect on 
+0

如果您正在回答一个问题,请评论原因。 – 2013-02-12 22:33:39