2015-04-06

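Extract part of a log in Python to import into Excel

I want to use a regular expression to extract part of a log (a txt file), but I need some help. The log is generated like this: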

Tue Feb 24 17:51:10.835 SRV02 NOTICE Event Loop - noop 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  Exponential histogram: 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 0]: <  0.001: 728941854 
Tue Feb 24 17:51:10.835 SRV02 NOTICE Event Loop - noop: samples: 728941854; avg: 0.00; min: 0.00; max: 0.00 
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  Exponential histogram: 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 4]: <  0.016:   3 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 5]: <  0.032:  23 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 6]: <  0.064:  14 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 7]: <  0.128:   4 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 8]: <  0.256:   6 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 9]: <  0.512:   1 
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[10]: <  1.024:   2 
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time: samples: 53; avg: 0.08; min: 0.01; max: 0.67 
Tue Feb 24 17:51:10.835 SRV02 NOTICE Client Hugepage Memory: 649/4096 MB 
Tue Feb 24 17:51:10.836 SRV02 NOTICE DQM: Num R: 0 RD: 0 ED: 0 W: 0 WH: 0 Q: 0 D: 0 DF: 0 
Tue Feb 24 17:51:10.836 SRV02 NOTICE Num G: 0 M: 0 S: 0 D: 0 U: 0 R: 0 N: 0 
Tue Feb 24 17:51:10.836 SRV02 NOTICE num_template_allocs      =   4 
Tue Feb 24 17:51:10.836 SRV02 NOTICE num_template_frees      =   0 
Tue Feb 24 17:51:10.836 SRV02 NOTICE num_internal_book_allocs     =   24 

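I need the information about each "Exponential histogram". In this example, that means finding the string "Exponential histogram" and importing all of the "hist[..." lines that follow it into a spreadsheet. I also need this information: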

samples: XX; avg: X.XX; min: X.XX; max: X.XX 

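So, for the example above, I need to extract and rearrange the data like this, where "Event Loop - noop" and "Data Quality Monitor Thread Processing Time" are repeated on every row to identify the histogram: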

Event Loop - noop;hist[ 0];0.001;728941854 
Event Loop - noop;samples;728941854;avg;0.00;min;0.00;max;0.00 
Data Quality Monitor Thread Processing Time;hist[ 4];0.016;3 
Data Quality Monitor Thread Processing Time;hist[ 5];0.032;23 
Data Quality Monitor Thread Processing Time;hist[ 6];0.064;14 
(...) 
Data Quality Monitor Thread Processing Time;hist[ 10];1.024;2 
Data Quality Monitor Thread Processing Time;samples;53;avg;0.08;min;0.01;max;0.67 

Can someone help me with how to do this? Thanks!

Answers


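Your sample output contains data that is not present in your sample input. Specifically, it has more "Data Quality Monitor Thread Processing Time" strings than your data does. It looks like you want to carry the most recent header down onto each indented line?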

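In any case, I think it is easier to use a few separate regular expressions than to try to build one catch-all expression that pulls the data out directly: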

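import re
# log_text is assumed to hold the full contents of the log file as one string
# \s* (rather than a single \s) lets two-digit buckets such as hist[10] match too
hists = re.findall(r'(hist\[\s*\d+\]).*?(\d+\.\d+).*?(\d+)', log_text)
sample_avg_etc = re.findall(r'(samples): (\d+); (avg): (\d+\.\d+); (min): (\d+\.\d+); (max): (\d+\.\d+)', log_text)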
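To make the two-pattern idea concrete, here is a minimal sketch run against a few lines copied from the question's log. The names `log_text`, `HIST_RE` and `STATS_RE` are only illustrative, and the patterns are slightly stricter variants written for this sketch (they anchor on the `:` and `<` separators and allow `\s*` inside the brackets so `hist[10]` matches as well as `hist[ 9]`).

```python
import re

# A few lines copied from the question's sample log
log_text = """\
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time
Tue Feb 24 17:51:10.835 SRV02 NOTICE  Exponential histogram:
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[ 9]: <  0.512:   1
Tue Feb 24 17:51:10.835 SRV02 NOTICE  hist[10]: <  1.024:   2
Tue Feb 24 17:51:10.835 SRV02 NOTICE Data Quality Monitor Thread Processing Time: samples: 53; avg: 0.08; min: 0.01; max: 0.67
"""

# Bucket label, upper bound, count
HIST_RE = re.compile(r'(hist\[\s*\d+\]):\s*<\s*(\d+\.\d+):\s*(\d+)')
# Sample count, average, minimum, maximum
STATS_RE = re.compile(r'samples: (\d+); avg: (\d+\.\d+); min: (\d+\.\d+); max: (\d+\.\d+)')

hists = HIST_RE.findall(log_text)   # list of (label, bound, count) tuples
stats = STATS_RE.findall(log_text)  # list of (samples, avg, min, max) tuples

print(hists)
print(stats)
```

Note that `findall` returns the matches without saying which header each one belongs to, which is why this approach alone cannot produce the per-row header from the question.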

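If you need to keep the local headers the way your sample output shows, I don't think regular expressions alone are what you want. Instead, just write a small parser to extract the data.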

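You can start by stripping the "Tue Feb 24 17:51:10.835 SRV02 NOTICE" prefix from every line, then locate the data line by line while keeping track of the last header seen. See the comments; the code below returns exactly what you listed above: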

import re 

def parse(data):
    # Width of the logging prefix; assumes every line starts with a prefix of exactly this length
    prefix_len = len("Tue Feb 24 17:51:10.835 SRV02 NOTICE ")
    lines = [line[prefix_len:] for line in data.split('\n')]  # drop the prefix from each line
    out = []
    header = ''  # most recent header seen so far
    for line in lines:
        if line.startswith(' '):  # indented line: histogram title or bucket
            if line.strip().startswith('hist'):
                out.append(header + ";" + extract_hist_data(line))
        else:  # header or samples line
            if all(i in line for i in ("samples", "avg", "min", "max")):
                out.append(header + ";" + extract_stat_data(line))
            elif line.strip():  # treat any other non-empty line as a header
                header = line.strip()
    return '\n'.join(out)

def extract_hist_data(line):
    # Capture the bucket label, its upper bound, and the count
    data = re.findall(r'(hist\[\s*?\d+\]).*?(\d+\.\d+).*?(\d+)', line)
    if not data:
        return ""
    return ';'.join(data[0])

def extract_stat_data(line):
    # Capture each keyword followed by its value: samples, avg, min, max
    data = re.findall(r'(samples).*?(\d+).*?(avg).*?(\d+\.\d+).*?(min).*?(\d+\.\d+).*?(max).*?(\d+\.\d+)', line)
    if not data:
        return ""
    return ';'.join(data[0])

def parse_log_file(log_file_path):
    with open(log_file_path, 'r') as f:
        content = f.read()  # read the whole log into one string
    return parse(content)

print(parse_log_file('test.log'))  # print() call form, as required by Python 3
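Since the goal in the question is to get the data into Excel, one possible last step is to turn the semicolon-separated text into a CSV file with the standard `csv` module. The following is only a sketch: `to_rows`, `write_csv` and the file name `histograms.csv` are hypothetical names, and the `parsed` string is a sample copied from the desired output in the question rather than real parser output.

```python
import csv
import io

# Sample of the semicolon-separated text in the shape the question asks for
parsed = (
    "Event Loop - noop;hist[ 0];0.001;728941854\n"
    "Event Loop - noop;samples;728941854;avg;0.00;min;0.00;max;0.00\n"
    "Data Quality Monitor Thread Processing Time;hist[ 4];0.016;3\n"
)

def to_rows(text):
    """Split semicolon-separated lines into lists of fields, skipping blank lines."""
    return [line.split(';') for line in text.splitlines() if line.strip()]

def write_csv(rows, out_file):
    """Write the rows to an open text file object as comma-separated values."""
    writer = csv.writer(out_file)
    writer.writerows(rows)

rows = to_rows(parsed)

# Write to an in-memory buffer here so the sketch runs without touching the disk
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())

# To write a real file instead (hypothetical file name):
# with open('histograms.csv', 'w', newline='') as f:
#     write_csv(rows, f)
```

Rows have different lengths here (four fields for `hist` rows, nine for `samples` rows); `csv.writer` accepts that, and the shorter rows simply occupy fewer columns in the spreadsheet.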

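Hey Joe, thank you very much for your amazing script. I edited the question to make it easier to understand. Your script is very close to the ideal solution; the only thing missing is repeating the header (the histogram name, i.e. "Event Loop - noop" and "Data Quality Monitor Thread Processing Time") on every row, so I can tell which histogram each piece of data belongs to. Running the script I get output like this: hist[ 4]: < 0.016: 3; samples; 53; avg; 0.08; min; 0.01; max; 0.67; but I need something like Data Quality Monitor Thread Processing Time;hist[ 4];0.016;3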


@user179589 Did you run the second piece of code? It keeps the correct header. If you are saying it is missing, I don't think you have used all of my answer.


Oh, you're right, I missed the last part. Thank you very much, problem solved!