Perl WWW::Mechanize: store a URL unless it has already been found

I have a question that I hope you can help with.

foreach my $url (keys %{$newURLs}) {
    # first get the base URL and save its content length
    $mech->get($url);
    my $content_length = $mech->response->header('Content-Length');

    # now iterate all the 'child' URLs
    foreach my $child_url (@{ $newURLs->{$url} }) {
        # get the content
        $mech->get($child_url);

        # compare
        if ($mech->response->header('Content-Length') != $content_length) {
            print "$child_url: different content length: $content_length vs "
                . $mech->response->header('Content-Length') . "!\n";
            #HERE I want to store the urls that are found to have different content
            #lengths to the base url
            #only if the same url has not already been stored
        } elsif ($mech->response->header('Content-Length') == $content_length) {
            print "Content lengths are the same\n";
            #HERE I want to store the urls that are found to have the same content
            #length as the base url
            #only if the same url has not already been stored
        }
    }
}

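The problem I'm having:

As you can see in the code above, I want to store the URLs depending on whether the content length is the same or different. I should end up with one group of URLs whose content length differs from that of their base URL, and another group of URLs whose content length is the same as that of their base URL.

I know how to do this easily using arrays: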

push (@differentContentLength, $url); 
push (@sameContentLength, $url);

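But how would I go about this using a hash (or another preferred method)?

I'm still getting to grips with hashes, so your help would be much appreciated.

Many thanks

Source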

2013-02-13 perl-user

You should add the closing braces to your loops. – simbabque 2013-02-13 12:18:09

@simbabque - You're right, apologies – 2013-02-13 12:25:07

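You can create a hashref, declared outside the loops, to store all the URLs in. Let's call it $content_lengths. It is a scalar because it is a reference to a hash. In your $child_url loop, add the content length to that data structure. We use the base URL first, which gives us another hashref inside $content_lengths->{$url}. There we decide whether we need equal or different. Inside each of those two keys there is another hashref that holds the $child_url keys, and each of those has its content length as its value. Of course, if you don't want the length saved, we could just say ++ here instead.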

my $content_lengths; # this is at the top 
foreach my $url (# ... more stuff 

# compare 
if ($mech->response->header('Content-Length') != $content_length) { 
    print "$child_url: different content length: $content_length vs " 
    . $mech->response->header('Content-Length') . "!\n"; 

    # store the urls that are found to have different content 
    # lengths to the base url only if the same url has not already been stored 
    $content_lengths->{$url}->{'different'}->{$child_url} = $mech->response->header('Content-Length'); 

} elsif ($mech->response->header('Content-Length') == $content_length) { 
    print "Content lengths are the same\n"; 

    # store the urls that are found to have the same content length as the base 
    # url only if the same url has not already been stored 
    $content_lengths->{$url}->{'equal'}->{$child_url} = $mech->response->header('Content-Length'); 
}
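To see what the resulting structure looks like without making any HTTP requests, here is a minimal sketch. The URLs, the lengths in %fake_length, and the contents of $newURLs are made-up stand-ins for what $mech would return; only the hash handling is the point.

```perl
use strict;
use warnings;

# Hypothetical data: base URL => list of child URLs (same shape as $newURLs)
my $newURLs = {
    'http://example.com/' => [
        'http://example.com/a',
        'http://example.com/b',
        'http://example.com/a',    # duplicate on purpose
    ],
};

# Hypothetical Content-Length values standing in for
# $mech->response->header('Content-Length')
my %fake_length = (
    'http://example.com/'  => 100,
    'http://example.com/a' => 100,
    'http://example.com/b' => 250,
);

my $content_lengths;    # undef for now; becomes a hashref on first use

foreach my $url (keys %{$newURLs}) {
    my $content_length = $fake_length{$url};

    foreach my $child_url (@{ $newURLs->{$url} }) {
        my $child_length = $fake_length{$child_url};
        my $bucket = $child_length == $content_length ? 'equal' : 'different';

        # Hash keys are unique, so a repeated URL just overwrites its own entry
        $content_lengths->{$url}{$bucket}{$child_url} = $child_length;
    }
}

# Read the stored URLs back with keys
foreach my $url (sort keys %{$content_lengths}) {
    foreach my $bucket (sort keys %{ $content_lengths->{$url} }) {
        my @urls = sort keys %{ $content_lengths->{$url}{$bucket} };
        printf "%s (%s): %s\n", $url, $bucket, join(', ', @urls);
    }
}
```

Because the child URL is used as a hash key, the "only if it has not already been stored" requirement needs no explicit check: storing the same URL again replaces the existing entry rather than adding a second one.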

Source

2013-02-13 12:25:34 simbabque

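When you say 'use ++ if you don't want the length stored', should that be written as '$content_lengths->{$url}->{'different'}->{$child_url}++;'? And to clarify, what exactly is the '++' doing? – 2013-02-13 14:04:56

@perl-user Yes, that's what I meant. It is the increment shorthand operator. It adds 1 to the variable on its left and assigns the result back to it. So they will all have the value 1. If one of the sites is seen twice, its value will be 2. This is how counters and 'remember the name, but don't care how many' are implemented. You can access the names with 'keys'. Think of it like 'GROUP BY' in SQL. – simbabque 2013-02-13 15:53:45

Oh, I see now how that prevents duplicates (using Data::Dumper): instead of adding another copy of the URL, it just registers its presence by incrementing the number assigned to that URL, which doesn't matter because we aren't interested in that part. Thanks, your comment above explains it well :) – 2013-02-13 16:40:17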
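The '++' counter idiom discussed in these comments can be sketched on its own as follows. The URL list is hypothetical, and %seen is just an illustrative name; Data::Dumper is used only to inspect the result, as mentioned above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical list of child URLs, with one repeat
my @child_urls = (
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/a',
);

my %seen;
foreach my $child_url (@child_urls) {
    # Post-increment returns the old value, which is false on the first visit
    if ( !$seen{$child_url}++ ) {
        print "first time: $child_url\n";
    }
}

# The keys are the unique URLs; the values are how often each one was seen
my @unique = sort keys %seen;
print "unique: ", join(', ', @unique), "\n";

$Data::Dumper::Sortkeys = 1;    # stable key order for readable output
print Dumper( \%seen );
```

If the counts are of no interest, only the keys need to be read back; the values can simply be ignored.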

Please check this solution:

my %content_length;

foreach my $url (keys %{$newURLs}) {
    # first get the base URL and save its content length
    $mech->get($url);
    my $content_length = $mech->response->header('Content-Length');

    # now iterate all the 'child' URLs
    foreach my $child_url (@{ $newURLs->{$url} }) {
        # get the content
        $mech->get($child_url);
        my $new_content_length = $mech->response->header('Content-Length');

        if (!defined $content_length{$child_url}) {
            # first time this child URL is seen
            print "New URL! url: $child_url\n";
        } elsif ($new_content_length != $content_length{$child_url}) {
            # seen before, but the length differs from the stored one
            print "Different content_length! url: $child_url, "
                . "old_content_length: $content_length{$child_url}, "
                . "new_content_length: $new_content_length\n";
        }

        # store in hash; keyed by child URL, so each URL is stored only once
        $content_length{$child_url} = $new_content_length;
    }
}

Source

2013-02-13 11:11:24 user1126070
