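Parsing and normalizing a file with 2-3 million lines in Perl

I have a log file with a few million lines (2-4 million) that contains special tokens such as IPs, ports, email IDs, domains, PIDs, and so on.

I need to parse and normalize the file so that all of the special tokens above are replaced with constant strings such as IP, PORT, EMAIL, DOMAIN, etc., and I need to report a count for all of the resulting duplicate lines.

That is, for a file with content like the following -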

Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.3.1 is not reachable 
Aug 19 10:22:48 user 10.1.4.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.5 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable

The normalized output would be -

MONTH DAY TIME user IP is not reachable =======> Count = 14

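A log line can contain several tokens that need to be searched for and replaced, such as domains and email IDs. The code I have written below takes 16 minutes for a 10 MB log file (mail server logs).

Is it possible to reduce the time Perl spends on regex search-and-replace operations when that many lines have to be run through several regular expressions?

The code snippet I have written is -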

use strict; 
use warnings; 

use Tie::Hash::Sorted; 
use Getopt::Long; 
use Regexp::Common qw(net URI Email::Address); 
use Email::Address; 

my $ignore = 0; 
my $threshold = 0; 
my $normalize = 0; 
GetOptions(
    'ignore=s' => \$ignore, 
    'threshold=i' => \$threshold, 
    'normalize=i' => \$normalize, 
); 

my (%initial_log, %Logs, %final_logs); 
my ($total_lines, $threshold_value); 
my $file = shift or die "Usage: $0 FILE\n"; 

open my $fh, '<', $file or die "Could not open '$file' $!"; 

#Sort the results according to frequency 
my $sort_by_numeric_value = sub { 
    my $hash = shift; 
    [ sort { $hash->{$b} <=> $hash->{$a} } keys %$hash ]; 
}; 

#Ignore "ignore" number fields from each line 
while (my $line = <$fh>) { 
    my $skip_words = $ignore; 

    chomp $line; 
    $total_lines++; 

    if ($ignore) { 
     my @arr = split(/[\s\t]+/smx, $line); 
     while ($skip_words-- != 0) { shift @arr; } 
     my $n_line = join(' ', @arr); 
     $line = $n_line; 
    } 

    $initial_log{$line}++; 
} 

close $fh or die "unable to close: $!"; 

$threshold_value = int(($total_lines/100) * $threshold); 

tie my %sorted_init_logs, 'Tie::Hash::Sorted', 
    'Hash'   => \%initial_log, 
    'Sort_Routine' => $sort_by_numeric_value; 

%final_logs = %sorted_init_logs; 

if ($normalize) { 
    # Normalize the logs 
    while (my ($line, $count) = (each %final_logs)) { 
     $line = normalize($line); 
     $Logs{$line} += $count; 
    } 
    %final_logs = %Logs; 
} 

tie my %sorted_logs, 'Tie::Hash::Sorted', 
    'Hash'   => \%final_logs, 
    'Sort_Routine' => $sort_by_numeric_value; 

my $reduced_lines = values(%final_logs); 
my $reduction = int(100 - ((values(%final_logs)/$total_lines) * 100)); 

print("Number of lines in the original logs   = $total_lines\n"); 
print("Number of lines in the normalized logs = $reduced_lines\n"); 
print("Logs reduced after normalization  = $reduction%\n"); 

# Show only the logs whose count is at or above the threshold value 
while (my ($log, $count) = (each %sorted_logs)) { 

    if ($count >= $threshold_value) { 
     printf "%-80s ===========> [%s]\n", $log, $sorted_logs{$log}; 
    } 
} 

sub normalize { 
    my $input = shift; 

    # Remove unwanted characters 
    $input =~ s/[()]//smxg; 

    # Normalize the URI 
    $input =~ s/$RE{URI}{HTTP}/URI/smxg; 

    # Normalize the IP Addresses 
    $input =~ s/$RE{net}{IPv4}/IP/smgx; 
    $input =~ s/IP(\W+)\d+/IP$1PORT/smxg; 
    $input =~ s/$RE{net}{IPv4}{hex}/HEX_IP/smxg; 
    $input =~ s/$RE{net}{IPv4}{bin}/BINARY_IP/smxg; 
    $input =~ s/\b$RE{net}{MAC}\b/MAC/smxg; 

    # Normalize the Email Addresses 
    $input =~ s/(\w+)=$RE{Email}{Address}/$1=EMAIL/smxg; 
    $input =~ s/$RE{Email}{Address}/EMAIL/smxg; 

    # Normalize the Domain name 
    $input =~ s/[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(?:\.[A-Za-z]{2,})/HOSTNAME/smxg; 
    return $input; 
}

Source

2014-08-28 CodeQuestor

You could try Devel::NYTProf (https://metacpan.org/pod/Devel::NYTProf). – Biffen 2014-08-28 06:12:20
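
Devel::NYTProf is a CPAN module that is normally run as `perl -d:NYTProf script.pl FILE` followed by `nytprofhtml`. Where installing it is not an option, a rough core-only alternative is to time each phase with Time::HiRes. The sketch below is only an illustration of that idea, not a replacement for a real profiler; the helper name `timed` and the in-memory sample lines are assumptions made for the example.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference, report elapsed wall-clock
# time on STDERR, and pass the code's return value through.
sub timed {
    my ($label, $code) = @_;
    my $start   = [gettimeofday];
    my @result  = $code->();
    my $elapsed = tv_interval($start);    # seconds elapsed since $start
    printf STDERR "%-10s %.6f s\n", $label, $elapsed;
    return wantarray ? @result : $result[0];
}

# Tiny in-memory sample standing in for the real log file.
my @lines = map { sprintf 'Aug 19 10:22:48 user 10.1.1.%d is not reachable', $_ }
    1 .. 5, 1 .. 5;

# Phase 1: count identical raw lines.
my %seen;
timed( 'count', sub { $seen{$_}++ for @lines; return } );

# Phase 2: any later step (here just counting the distinct lines).
my $unique = timed( 'unique', sub { scalar keys %seen } );
print "unique lines: $unique\n";
```

Wrapping the read loop, the sort, and the normalize() pass separately in this way should at least show which phase dominates before any regex is touched.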

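I have removed Tie::Hash::Sorted, and the timing seems to have improved now. Is there anything that can be done in the normalize() method? – CodeQuestor 2014-08-28 07:34:32

The "smx" regex flags are useless in your case. You could try the o flag so that the regexes are compiled only once. – Sorin 2014-08-28 09:53:17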
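
One possible way to follow up on the comments above is to compile each pattern once with qr// (the usual modern alternative to the o flag) and to run the substitutions only on distinct lines. The sketch below is a minimal illustration under assumptions: the two patterns are simplified, hand-written stand-ins for the Regexp::Common ones in the question, and the names `@rules`, `normalize_line`, and `count_normalized` are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical, simplified patterns (NOT the Regexp::Common ones),
# compiled once with qr// instead of being rebuilt on every call.
my @rules = (
    [ qr/^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}/, 'MONTH DAY TIME' ],
    [ qr/\b(?:\d{1,3}\.){3}\d{1,3}\b/,                  'IP' ],
);

# Apply every rule in order to a single line.
sub normalize_line {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($re, $token) = @$rule;
        $line =~ s/$re/$token/g;
    }
    return $line;
}

# Count identical raw lines first, then normalize each distinct line
# only once and merge the counts.
sub count_normalized {
    my ($lines) = @_;
    my %raw;
    for my $line (@$lines) {
        (my $copy = $line) =~ s/\s+\z//;    # strip trailing whitespace
        $raw{$copy}++;
    }
    my %count;
    $count{ normalize_line($_) } += $raw{$_} for keys %raw;
    return \%count;
}

my @sample = (
    "Aug 19 10:22:48 user 10.1.1.1 is not reachable \n",
    "Aug 19 10:22:48 user 10.1.3.1 is not reachable \n",
    "Aug 19 10:22:48 user 10.1.1.1 is not reachable \n",
);

my $counts = count_normalized( \@sample );
printf "%s =======> Count = %d\n", $_, $counts->{$_} for sort keys %$counts;
# prints: MONTH DAY TIME user IP is not reachable =======> Count = 3
```

Whether this helps with the real mail logs depends on how many distinct lines remain after the first pass, so it is worth measuring rather than assuming.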

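Especially if you do not know the exact kinds of queries you will need to run, you would be much better off putting the parsed log data into an SQLite database. The following example illustrates this using a temporary in-memory database. If you want to run several different queries against the same data, parse it once, load it into the database, and then query to your heart's content. This should be faster than what you are doing now but, obviously, I have not measured anything: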

#!/usr/bin/env perl 

use strict; 
use warnings; 

use DBI; 

my $dbh = DBI->connect('dbi:SQLite::memory:', undef, undef, 
    { 
     RaiseError => 1, 
     AutoCommit => 0, 
    } 
); 

$dbh->do(q{ 
    CREATE TABLE 'status' (
     id  integer primary key, 
     month char(3), 
     day  char(2), 
     time char(8), 
     agent varchar(100), 
     ip  char(15), 
     status varchar(100) 
    ) 
}); 

$dbh->commit; 

my @cols = qw(month day time agent ip status); 

my $inserter = $dbh->prepare(sprintf 
    q{INSERT INTO 'status' (%s) VALUES (%s)}, 
    join(',', @cols), 
    join(',', ('?') x @cols) 
); 

while (my $line = <DATA>) { 
    $line =~ s/\s+\z//; 
    $inserter->execute(split ' ', $line, scalar @cols); 
} 

$dbh->commit; 

my $summarizer = $dbh->prepare(q{ 
    SELECT 
     month, 
     day, 
     time, 
     agent, 
     ip, 
     status, 
     count(*) as count 
    FROM status 
    GROUP BY month, day, time, agent, ip, status 
    } 
); 

$summarizer->execute; 
my $result = $summarizer->fetchall_arrayref; 
print "@$_\n" for @$result; 

$dbh->disconnect; 

__DATA__ 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.3.1 is not reachable 
Aug 19 10:22:48 user 10.1.4.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.5 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 
Aug 19 10:22:48 user 10.1.1.1 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 
Aug 19 10:22:48 user 10.1.1.6 is not reachable

Output:

Aug 19 10:22:48 user 10.1.1.1 is not reachable 4 
Aug 19 10:22:48 user 10.1.1.4 is not reachable 3 
Aug 19 10:22:48 user 10.1.1.5 is not reachable 1 
Aug 19 10:22:48 user 10.1.1.6 is not reachable 5 
Aug 19 10:22:48 user 10.1.3.1 is not reachable 1 
Aug 19 10:22:48 user 10.1.4.1 is not reachable 1
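
The query above groups by every column, including ip, so each address keeps its own row. The "parse once, group many ways" idea can also be sketched in plain Perl without a database, which makes it easy to see the effect of leaving ip out of the grouping. This is only an illustration that swaps SQL's GROUP BY for a hash; the helper names `parse_line` and `group_count` are assumptions, and the six-field split mirrors the one used for the INSERT.

```perl
use strict;
use warnings;

my @cols = qw(month day time agent ip status);

# Split a line into the same six fields used for the INSERT above.
sub parse_line {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    my %row;
    @row{@cols} = split ' ', $line, scalar @cols;
    return \%row;
}

# Count rows grouped by the named columns (a plain-hash stand-in for GROUP BY).
sub group_count {
    my ($rows, @by) = @_;
    my %count;
    $count{ join ' ', @{$_}{@by} }++ for @$rows;
    return \%count;
}

my @rows = map { parse_line($_) } (
    'Aug 19 10:22:48 user 10.1.1.1 is not reachable ',
    'Aug 19 10:22:48 user 10.1.3.1 is not reachable ',
    'Aug 19 10:22:48 user 10.1.1.1 is not reachable ',
);

# Grouping by every column keeps one entry per IP address ...
my $by_ip = group_count( \@rows, @cols );

# ... while leaving ip out collapses them into a single entry.
my $no_ip = group_count( \@rows, grep { $_ ne 'ip' } @cols );

print "$_ $by_ip->{$_}\n" for sort keys %$by_ip;
print "$_ $no_ip->{$_}\n" for sort keys %$no_ip;
```

In the SQLite version, the equivalent change would be to drop ip from both the SELECT list and the GROUP BY clause.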

Source

2015-02-11 22:09:10
