2013-02-13

Perl WWW::Mechanize: store a URL unless it has already been found

I have a problem that I hope you can help with.

foreach my $url (keys %{$newURLs}) { 
    # first get the base URL and save its content length 
    $mech->get($url); 
    my $content_length = $mech->response->header('Content-Length'); 

    # now iterate all the 'child' URLs 
    foreach my $child_url (@{ $newURLs->{$url} }) { 
        # get the content 
        $mech->get($child_url); 

        # compare 
        if ($mech->response->header('Content-Length') != $content_length) { 
            print "$child_url: different content length: $content_length vs " 
                . $mech->response->header('Content-Length') . "!\n"; 
            #HERE I want to store the urls that are found to have different content 
            #lengths to the base url 
            #only if the same url has not already been stored 
        } elsif ($mech->response->header('Content-Length') == $content_length) { 
            print "Content lengths are the same\n"; 
            #HERE I want to store the urls that are found to have the same content 
            #length as the base url 
            #only if the same url has not already been stored 
        } 
    } 
} 

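The problem I'm having:

As you can see in the code above, I want to store the URLs depending on whether the content length is the same or different. I will end up with one group of URLs whose content length differs from that of their base URL, and another group of URLs whose content length is the same as that of their base URL.

I know how to do this easily with arrays: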

push (@differentContentLength, $url); 
push (@sameContentLength, $url); 

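But how would I go about doing this with a hash (or another preferred method)?

I'm still getting to grips with hashes, so your help would be much appreciated.

Many thanks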


You should add the closing braces to your loop. – simbabque 2013-02-13 12:18:09


@simbabque - Yes, you're right, apologies. – 2013-02-13 12:25:07

Answers


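You can create a hashref outside the loop to store all the URLs. Let's call it $content_lengths. It is a scalar because it is a reference to a hash. Inside your $child_url loop, add the content length to that data structure. We first use the base URL, which gives us another hashref inside $content_lengths->{$url}. There we decide whether we need 'equal' or 'different'. Inside each of those two keys there is another hashref that holds the $child_url entries, and each of those in turn has its content length as the value. Because hash keys are unique, storing the same URL again just overwrites the existing entry, so no URL is stored twice. Of course, if you don't want to save the length, you could just say ++ here instead.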

my $content_lengths; # this is at the top 
foreach my $url (# ... more stuff 

# compare 
if ($mech->response->header('Content-Length') != $content_length) { 
    print "$child_url: different content length: $content_length vs " 
    . $mech->response->header('Content-Length') . "!\n"; 

    # store the urls that are found to have different content 
    # lengths to the base url only if the same url has not already been stored 
    $content_lengths->{$url}->{'different'}->{$child_url} = $mech->response->header('Content-Length'); 

} elsif ($mech->response->header('Content-Length') == $content_length) { 
    print "Content lengths are the same\n"; 

    # store the urls that are found to have the same content length as the base 
    # url only if the same url has not already been stored 
    $content_lengths->{$url}->{'equal'}->{$child_url} = $mech->response->header('Content-Length'); 
} 
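
To read the stored URLs back out of that structure, something like the following sketch might help. It is only an illustration: the sample URLs and lengths are invented, and @report is a hypothetical helper array, not part of the answer's code. It also shows that assigning to an existing key does not create a duplicate.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape the loop above would build;
# the URLs and lengths are made up for illustration.
my $content_lengths = {
    'http://example.com/' => {
        different => { 'http://example.com/a' => 1200 },
        equal     => {
            'http://example.com/b' => 1500,
            'http://example.com/c' => 1500,
        },
    },
};

# Assigning to the same key again overwrites instead of duplicating.
$content_lengths->{'http://example.com/'}{equal}{'http://example.com/b'} = 1500;

my @report;
for my $url (sort keys %$content_lengths) {
    for my $kind (qw(different equal)) {
        # "|| {}" guards against a base URL that has no entries of this kind
        my $children = $content_lengths->{$url}{$kind} || {};
        for my $child_url (sort keys %$children) {
            push @report, "$url $kind $child_url ($children->{$child_url})";
        }
    }
}
print "$_\n" for @report;
```

With this sample data the report has three lines, one per distinct child URL, even though 'http://example.com/b' was assigned twice.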

When you say 'use ++ if you don't want the length stored', should that be written as '$content_lengths->{$url}->{'different'}->{$child_url}++;'? And to clarify, what exactly does '++' do? – 2013-02-13 14:04:56


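@perl-user Yes, that's what I meant. It is the increment shorthand operator. It adds 1 to the variable on its left and assigns the result. So they will all have the value 1. If one of the sites is seen twice, its value will be 2. This is how counters and 'remember the name, but don't care how many' are implemented. You can access the names with 'keys'. Think of it like 'GROUP BY' in SQL. – simbabque 2013-02-13 15:53:45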
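
The counter idiom from the comment above can be tried in isolation. This is a minimal sketch, not code from the thread: @urls, %seen and @unique are illustrative names, and the URLs are invented. It also shows how the same counter lets the array approach from the question skip URLs that were already stored.

```perl
use strict;
use warnings;

# Made-up list with one repeated URL, standing in for the child URLs.
my @urls = qw(
    http://example.com/a
    http://example.com/b
    http://example.com/a
);

my (%seen, @unique);
for my $url (@urls) {
    # Post-increment returns the old value: undef (false) on the first
    # visit, 1 or more afterwards, so each URL is pushed only once.
    push @unique, $url unless $seen{$url}++;
}

print "unique: @unique\n";
print "$_ seen $seen{$_} time(s)\n" for sort keys %seen;
```

Here @unique ends up holding each URL once, while %seen records that 'http://example.com/a' was encountered twice.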


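Oh, I see now how this prevents duplicates (using Data::Dumper): instead of adding another copy of a duplicate URL, it just registers its presence by incrementing the number assigned to that URL, which doesn't matter because we're not interested in that part. Thanks, your comment above explained it very well :) – 2013-02-13 16:40:17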


Please check this solution:

my %content_length; 

foreach my $url (keys %{$newURLs}) { 
    # first get the base URL and save its content length 
    $mech->get($url); 
    my $content_length = $mech->response->header('Content-Length'); 

    # now iterate all the 'child' URLs 
    foreach my $child_url (@{ $newURLs->{$url} }) { 
        # get the content 
        $mech->get($child_url); 
        my $new_content_length = $mech->response->header('Content-Length'); 

        if (!defined $content_length{$child_url}) { 
            # first time this URL is seen 
            print "New URL! url: $child_url\n"; 
        } elsif ($new_content_length != $content_length{$child_url}) { 
            # seen before, but its length differs from the stored one 
            print "Different content_length! url: $child_url, " 
                . "old_content_length: $content_length{$child_url}, " 
                . "new_content_length: $new_content_length\n"; 
        } 

        # store in hash (overwrites any earlier entry for this URL) 
        $content_length{$child_url} = $new_content_length; 
    } 
}