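Parse and normalize a file with 2-3 million lines in Perl
I have a log file with a few million lines (2-4 million) containing special tokens such as IPs, ports, email IDs, domains, and PIDs.
I need to parse and normalize the file so that each of these tokens is replaced with a constant string such as IP, PORT, EMAIL, or DOMAIN, and I need a count of all the resulting duplicate lines.
That is, for a file with content like the following -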
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
The normalized output would be -
MONTH DAY TIME user IP is not reachable =======> Count = 14
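To make the target concrete, the transformation described above could be sketched roughly as follows. This is only an illustration, not the code from the question: `normalize_line` is a hypothetical helper, the timestamp and IPv4 patterns are simplified assumptions (octet ranges are not validated), and only three of the sample lines are used.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces the syslog-style timestamp and any
# dotted-quad IPv4 address with constant placeholders.
sub normalize_line {
    my ($line) = @_;
    # "Aug 19 10:22:48" -> "MONTH DAY TIME"
    $line =~ s/^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}/MONTH DAY TIME/;
    # Simplified IPv4 pattern (assumption: octet ranges are not checked)
    $line =~ s/\b\d{1,3}(?:\.\d{1,3}){3}\b/IP/g;
    return $line;
}

my @sample = (
    'Aug 19 10:22:48 user 10.1.1.1 is not reachable',
    'Aug 19 10:22:48 user 10.1.3.1 is not reachable',
    'Aug 19 10:22:48 user 10.1.1.5 is not reachable',
);

# Count identical lines after normalization
my %count;
$count{ normalize_line($_) }++ for @sample;

printf "%s =======> Count = %d\n", $_, $count{$_} for sort keys %count;
```

With these three sample lines, all of them collapse into the single key `MONTH DAY TIME user IP is not reachable` with a count of 3.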
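A log line can contain several tokens to search and replace, such as domains and email IDs. The code I have written below takes 16 minutes for a 10MB log file (mail server logs).
When many regular expressions have to be run over that many lines, is it possible to minimize the time Perl spends on the search-and-replace operations?
The code snippet I have written is -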
use strict;
use warnings;
use Tie::Hash::Sorted;
use Getopt::Long;
use Regexp::Common qw(net URI Email::Address);
use Email::Address;
my $ignore = 0;
my $threshold = 0;
my $normalize = 0;
GetOptions(
    'ignore=s'    => \$ignore,
    'threshold=i' => \$threshold,
    'normalize=i' => \$normalize,
);
my (%initial_log, %Logs, %final_logs);
my ($total_lines, $threshold_value);
my $file = shift or die "Usage: $0 FILE\n";
open my $fh, '<', $file or die "Could not open '$file' $!";
# Sort the results according to frequency
my $sort_by_numeric_value = sub {
    my $hash = shift;
    [ sort { $hash->{$b} <=> $hash->{$a} } keys %$hash ];
};
# Drop the first "ignore" fields from each line, then count identical lines
while (my $line = <$fh>) {
    my $skip_words = $ignore;
    chomp $line;
    $total_lines++;
    if ($ignore) {
        my @arr = split(/\s+/smx, $line);
        while ($skip_words-- != 0) { shift @arr; }
        my $n_line = join(' ', @arr);
        $line = $n_line;
    }
    $initial_log{$line}++;
}
close $fh or die "unable to close: $!";
$threshold_value = int(($total_lines/100) * $threshold);
tie my %sorted_init_logs, 'Tie::Hash::Sorted',
    'Hash'         => \%initial_log,
    'Sort_Routine' => $sort_by_numeric_value;
%final_logs = %sorted_init_logs;
if ($normalize) {
    # Normalize the logs
    while (my ($line, $count) = (each %final_logs)) {
        $line = normalize($line);
        $Logs{$line} += $count;
    }
    %final_logs = %Logs;
}
tie my %sorted_logs, 'Tie::Hash::Sorted',
    'Hash'         => \%final_logs,
    'Sort_Routine' => $sort_by_numeric_value;
my $reduced_lines = keys %final_logs;
my $reduction = int(100 - (($reduced_lines / $total_lines) * 100));
print("Number of lines in the original logs = $total_lines\n");
print("Number of lines in the normalized logs = $reduced_lines\n");
print("Logs reduced after normalization = $reduction%\n");
# Show only the logs at or above the threshold value
while (my ($log, $count) = (each %sorted_logs)) {
    if ($count >= $threshold_value) {
        printf "%-80s ===========> [%s]\n", $log, $sorted_logs{$log};
    }
}
sub normalize {
    my $input = shift;
    # Remove unwanted characters
    $input =~ s/[()]//smxg;
    # Normalize the URI
    $input =~ s/$RE{URI}{HTTP}/URI/smxg;
    # Normalize the IP Addresses
    $input =~ s/$RE{net}{IPv4}/IP/smgx;
    $input =~ s/IP(\W+)\d+/IP$1PORT/smxg;
    $input =~ s/$RE{net}{IPv4}{hex}/HEX_IP/smxg;
    $input =~ s/$RE{net}{IPv4}{bin}/BINARY_IP/smxg;
    $input =~ s/\b$RE{net}{MAC}\b/MAC/smxg;
    # Normalize the Email Addresses
    $input =~ s/(\w+)=$RE{Email}{Address}/$1=EMAIL/smxg;
    $input =~ s/$RE{Email}{Address}/EMAIL/smxg;
    # Normalize the Domain name
    $input =~ s/[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(?:\.[A-Za-z]{2,})/HOSTNAME/smxg;
    return $input;
}
You could try Devel::NYTProf (https://metacpan.org/pod/Devel::NYTProf). – Biffen 2014-08-28 06:12:20
I have removed Tie::Hash::Sorted, and the timing seems to have improved. Is there anything that can be done in the normalize() method? – CodeQuestor 2014-08-28 07:34:32
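One way to get the same frequency ordering without a tied hash is to sort the keys of a plain hash once and walk the resulting list. The sketch below is an assumption about what such a replacement might look like, not the commenter's actual change; the `%count` data and the `$threshold_value` of 5 are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical counts, standing in for %final_logs in the question
my %count = (
    'user IP is not reachable' => 14,
    'connect from HOSTNAME'    => 3,
    'disconnect from HOSTNAME' => 7,
);

# Sort the keys once by descending count (ties broken alphabetically)
my @by_count = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

my $threshold_value = 5;    # assumed cut-off, for illustration only
for my $line (@by_count) {
    # The list is sorted, so everything after this point is below the cut-off
    last if $count{$line} < $threshold_value;
    printf "%-80s ===========> [%d]\n", $line, $count{$line};
}
```

Because the sort happens exactly once, each later lookup is an ordinary hash access rather than a call into the tie interface.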
The "smx" regex flags are useless in your case. You could try the o flag to compile the regexes. – Sorin 2014-08-28 09:53:17
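A related approach, swapped in here in place of the o flag, is to compile each pattern once with `qr//` outside the per-line loop and reuse the compiled objects. With Regexp::Common this may also avoid repeating the `%RE` lookups for every line, though profiling would be needed to confirm how much that matters. The sketch below is only an illustration: `normalize_fast` and `@rules` are hypothetical names, and the patterns are simplified hand-written stand-ins rather than the Regexp::Common ones.

```perl
use strict;
use warnings;

# Compile each pattern once with qr//, outside the per-line loop.
# Assumption: these are simplified stand-ins, not the Regexp::Common patterns.
# Order matters: emails are replaced before bare host names.
my @rules = (
    [ qr/\b[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/, 'EMAIL' ],
    [ qr/\b\d{1,3}(?:\.\d{1,3}){3}\b/, 'IP' ],
    [ qr/\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b/, 'HOSTNAME' ],
);

# Hypothetical replacement for normalize(): applies every rule in order
sub normalize_fast {
    my ($input) = @_;
    for my $rule (@rules) {
        my ($re, $token) = @$rule;
        $input =~ s/$re/$token/g;
    }
    return $input;
}

print normalize_fast('from=<bob@example.com> relay=mail.example.org[10.1.1.4]'), "\n";
```

For the made-up input line above, this prints `from=<EMAIL> relay=HOSTNAME[IP]`.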