2016-03-07 50 views

I have a Rails server log file in the following format and need to create regular expressions to parse it.

Started <REQUEST_TYPE_1> <URL_1> for <IP_1> at <TIMESTAMP_1> 
    Processing by <controller#action_1> as <REQUEST_FORMAT_1> 
    Parameters: <parameters_1> 
<Some logs from code> 
Rendered <some_template_1> (<timetaken_1>) 
Completed <RESPONSE_CODE_1> in <TIME_1> 


Started <REQUEST_TYPE_2> <URL_2> for <IP_2> at <TIMESTAMP_2> 
    Processing by <controller#action_2> as <REQUEST_FORMAT_2> 
    Parameters: <parameters_2> 
<Some logs from code> 
Completed <RESPONSE_CODE_2> in <TIME_2> 

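Now I need to parse this log and extract every REQUEST_TYPE, URL, IP, TIMESTAMP, REQUEST_FORMAT and RESPONSE_CODE from it. I am struggling to write a good regex for this in Java/Ruby. The <> do not appear in the actual input; I added them for readability and to mask the real data.

Sample requests: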

Started GET "/google.com/2" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015 
    Processing by MyController#method as JS 
    Parameters: {"abc" => "xyz"} 
[LOG] 3 : User text log 
Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms) 


Started POST "/google.com/543" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015 
    Processing by MyController#method_2 as JSON 
    Parameters: {"efg" => "uvw"} 
Completed 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms) 

Expected output:

request_types = ['GET', 'POST'] 
urls = ['/google.com/2','/google.com/543'] 
ips = ['127.0.0.1','127.0.1.1'] 
timestamps = ['Tue Dec 01 12:01:13 +0530 2015','Tue Dec 01 13:13:16 +0530 2015'] 
request_formats = ['JS','JSON'] 
response_codes = ['200 OK','404 Not Authorized'] 

I managed to write the regexes below, but they do not work as expected.

request_types = /Started \w+/ //Expected array of all request types 
urls = /"\/.*\/"/ //Expected array of all urls types 
ips = /"d{1,3}.d{1,3}.d{1,3}.d{1,3}"/ //Expected array of all ips types 
timestamps = /at \w+/ 
request_formats =/as \w+/ 
response_codes = /Completed \w+/ 

I would appreciate some help building regexes in Java or Ruby that extract these fields from the given input. I prefer Java if possible.
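For comparison, here is a minimal sketch of how the per-field attempts above might be repaired. It is not taken from the thread: the class name PerFieldPatterns, the helper all, and the exact patterns are illustrative assumptions based only on the two sample requests. The main differences from the attempts are a capture group around the value (so keywords such as "Started " are not returned), a backslash in front of d, escaped dots, and no quotes around the IP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, written against the sample log only.
class PerFieldPatterns {
    // Group 1 holds the value, so the "Started " keyword stays out of the result.
    static final Pattern REQUEST_TYPE = Pattern.compile("(?m)^Started (\\w+)");
    // The URL is everything between the double quotes; it need not end with a slash.
    static final Pattern URL = Pattern.compile("(?m)^Started \\w+ \"([^\"]*)\"");
    // The IP is not quoted in the log; \d needs its backslash and each dot is escaped.
    static final Pattern IP = Pattern.compile(" for (\\d{1,3}(?:\\.\\d{1,3}){3}) at ");
    // The timestamp contains spaces, so \w+ alone would stop after the first word.
    static final Pattern TIMESTAMP = Pattern.compile("(?m)^Started .* at (.+?)\\s*$");
    static final Pattern REQUEST_FORMAT =
        Pattern.compile("(?m)^\\s*Processing by \\S+ as (\\w+)");
    // Lazily take the status code and text up to " in <digit>".
    static final Pattern RESPONSE_CODE =
        Pattern.compile("(?m)^Completed (\\d+ .*?) in \\d");

    // Collects group 1 of every match of the pattern in the log text.
    static List<String> all(Pattern pattern, String log) {
        List<String> values = new ArrayList<String>();
        Matcher m = pattern.matcher(log);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String log =
            "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
            + "    Processing by MyController#method as JS\n"
            + "Completed 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n";
        System.out.println(all(REQUEST_TYPE, log));
        System.out.println(all(URL, log));
        System.out.println(all(IP, log));
        System.out.println(all(TIMESTAMP, log));
        System.out.println(all(REQUEST_FORMAT, log));
        System.out.println(all(RESPONSE_CODE, log));
    }
}
```

Each pattern is scanned independently, so the resulting lists are only aligned by index if every request contains every line; a request without a Completed line would shift the response codes.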


Does your original log file contain these brackets ('<>') as well? – Jan


No. They are only there to mask the actual data. – Abhishek


Something like https://regex101.com/r/uI6oV1/3? – Jan

Answers


Here is a Java snippet showing how to extract the log details into separate array lists:

String re = "(?sm)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+(?<ip>\\d+(?:\\.\\d+)+)\\s+at\\s+(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})\\s+(?:Processing\\s+by\\s+\\S+)\\s+as\\s+(?<requestformat>\\S+)(?:\\s+Parameters:\\s+\\S+)?(?:(?:(?:(?!\nStarted).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))?"; 
String str = "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n Processing by MyController#method as JS\n Parameters: {\"abc\" => \"xyz\"}\n[LOG] 3 : User text log\nCompleted 200 OK in 26ms (Views: 3.3ms | ActiveRecord: 2.9ms)\n\n\nStarted POST \"/google.com/543\" for 127.0.1.1 at Tue Dec 01 13:13:16 +0530 2015\n Processing by MyController#method_2 as JSON\n Parameters: {\"efg\" => \"uvw\"}\nCompleted 404 Not Authorized in 65ms (Views: 1.5ms | ActiveRecord: 1.0ms)"; 
Pattern pattern = Pattern.compile(re); 
Matcher matcher = pattern.matcher(str); 
List<String> requesttypes = new ArrayList<String>(); 
List<String> urls = new ArrayList<String>(); 
List<String> ips = new ArrayList<String>(); 
List<String> timestamps = new ArrayList<String>(); 
List<String> requestformats = new ArrayList<String>(); 
List<String> responsecodes = new ArrayList<String>(); 
while (matcher.find()){ 
    requesttypes.add(matcher.group("requesttype")); 
    urls.add(matcher.group("url")); 
    ips.add(matcher.group("ip")); 
    timestamps.add(matcher.group("tsp")); 
    requestformats.add(matcher.group("requestformat")); 
    responsecodes.add(matcher.group("responsecode")); 
    System.out.println("-----------------------"); 
    System.out.println(matcher.group("requesttype")); 
    System.out.println(matcher.group("url")); 
    System.out.println(matcher.group("ip")); 
    System.out.println(matcher.group("tsp")); 
    System.out.println(matcher.group("requestformat")); 
    System.out.println(matcher.group("responsecode")); 
} 

See the IDEONE demo. Once matching is done, you can also print the lists, e.g. System.out.println(urls):

System.out.println(requesttypes); 
System.out.println(urls); 
System.out.println(ips); 
System.out.println(urls); 
System.out.println(timestamps); 
System.out.println(requestformats); 
System.out.println(responsecodes); 

See this demo. The output is:

[GET, POST] 
[/google.com/2, /google.com/543] 
[127.0.0.1, 127.0.1.1] 
[/google.com/2, /google.com/543] 
[Tue Dec 01 12:01:13 +0530 2015, Tue Dec 01 13:13:16 +0530 2015] 
[JS, JSON] 
[200 OK, 404 Not Authorized] 

The regex matches:

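  • (?sm)^ - the start of a line (^ matches at line starts because of the m option; the s option lets . match newlines)
  • Started\\s+ - the literal string Started and 1+ whitespace characters
  • (?<requesttype>\\S+) - group "requesttype", holding 1+ non-whitespace characters
  • \\s+\" - 1+ whitespace, followed by "
  • (?<url>\\S+) - group "url", holding 1+ non-whitespace characters
  • \"\\s+for\\s+ - " followed by 1+ whitespace + for + 1+ whitespace
  • (?<ip>\\d+(?:\\.\\d+)+) - group "ip", holding digits followed by one or more repetitions of . + digits
  • \\s+at\\s+ - the word at surrounded by whitespace
  • (?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}) - group "tsp" (the timestamp), holding letters and digits in a fixed sequence separated by whitespace; it is tailored to the input example
  • \\s+ - 1+ whitespace
  • (?:Processing\\s+by\\s+\\S+)\\s+as\\s+ - Processing by followed by some word (1+ non-whitespace characters), then the word as surrounded by whitespace
  • (?<requestformat>\\S+) - group "requestformat", holding 1+ non-whitespace characters
  • (?:\\s+Parameters:\\s+\\S+)? - an optional group: Parameters: followed by whitespace and some word
  • (?:(?:(?:(?!\nStarted).)*Completed\\s)(?<responsecode>\\d+(?:(?!\\sin\\s).)*))? - an optional group (because it is enclosed in (?:...)?) that matches any characters up to Completed without running into the next Started (thanks to the tempered greedy token (?:(?!\nStarted).)*), then matches Completed followed by a whitespace, and then (?<responsecode>\\d+(?:(?!\\sin\\s).)*) matches and captures into group "responsecode" the digits followed by any characters up to the whole word in surrounded by whitespace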
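The tempered greedy token from the last bullet can be tried in isolation. The following is a minimal sketch, not part of the original answer: the class name TemperedTokenDemo, the helper pairs, the shortened patterns and the sample log (whose first request has no Completed line) are assumptions made for illustration. It compares a lazy .*? under the s option with the tempered token (?:(?!\nStarted).)*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are shortened versions of the answer's regex.
class TemperedTokenDemo {
    // Lazy dot with the s option: free to run past the next "Started" line.
    static final Pattern LAZY_DOT = Pattern.compile(
        "(?sm)^Started\\s+(\\S+)(?:.*?Completed\\s(\\d+))?");
    // Tempered token: a character is consumed only if "\nStarted" does not begin there.
    static final Pattern TEMPERED = Pattern.compile(
        "(?sm)^Started\\s+(\\S+)(?:(?:(?!\\nStarted).)*Completed\\s(\\d+))?");

    // Returns "<request type> -> <status code>" for every match; the code is
    // "null" when the optional Completed part did not take part in the match.
    static List<String> pairs(Pattern pattern, String log) {
        List<String> result = new ArrayList<String>();
        Matcher m = pattern.matcher(log);
        while (m.find()) {
            result.add(m.group(1) + " -> " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String log = "Started GET \"/a\"\n"
                   + "[LOG] request died before completing\n"
                   + "Started POST \"/b\"\n"
                   + "Completed 404 Not Found in 65ms\n";
        System.out.println(pairs(LAZY_DOT, log)); // [GET -> 404]
        System.out.println(pairs(TEMPERED, log)); // [GET -> null, POST -> 404]
    }
}
```

With the lazy dot, the GET request borrows the status code of the following POST request and the POST request is swallowed entirely. With the tempered token, each match stays inside its own request, which is also why group "responsecode" in the full regex can be null for a request that never completed.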

Nice. Could you share more about how you built this regex? I mean, how should I read it? – Abhishek


Suppose I don't want the timestamp. So I removed '(?<tsp>[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4})' from the regex, and now it reports no match :|. https://regex101.com/r/iN7yO3/3. How is that possible? What am I doing wrong? – Abhishek


You can't just remove it. Make it optional by wrapping it in '(?:' and ')?'. –
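A minimal sketch of what this comment suggests, under some assumptions: the class name OptionalTimestampDemo and the helper requestFormat are made up for illustration, and the regex is cut down to the first two log lines. Deleting the timestamp sub-pattern leaves nothing to consume the timestamp text between at and Processing, so the match fails. Wrapping the same sub-pattern in (?: and )? still consumes the text but no longer captures it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; HEAD, TSP and TAIL are pieces of the answer's regex.
class OptionalTimestampDemo {
    static final String HEAD =
        "(?m)^Started\\s+(?<requesttype>\\S+)\\s+\"(?<url>\\S+)\"\\s+for\\s+"
        + "(?<ip>\\d+(?:\\.\\d+)+)\\s+at\\s+";
    // The timestamp sub-pattern without its named group.
    static final String TSP =
        "[a-zA-Z]+\\s+[a-zA-Z]+\\s+\\d+\\s+\\d+:\\d+:\\d+\\s+\\+\\d+\\s\\d{4}";
    static final String TAIL =
        "\\s+Processing\\s+by\\s+\\S+\\s+as\\s+(?<requestformat>\\S+)";

    // Timestamp part deleted: the timestamp text in the log is left unmatched.
    static final Pattern REMOVED = Pattern.compile(HEAD + TAIL);
    // Timestamp part kept, but non-capturing and optional.
    static final Pattern WRAPPED = Pattern.compile(HEAD + "(?:" + TSP + ")?" + TAIL);

    // Returns the request format of the first match, or null if nothing matches.
    static String requestFormat(Pattern pattern, String log) {
        Matcher m = pattern.matcher(log);
        return m.find() ? m.group("requestformat") : null;
    }

    public static void main(String[] args) {
        String log =
            "Started GET \"/google.com/2\" for 127.0.0.1 at Tue Dec 01 12:01:13 +0530 2015\n"
            + "    Processing by MyController#method as JS\n";
        System.out.println(requestFormat(REMOVED, log)); // null
        System.out.println(requestFormat(WRAPPED, log)); // JS
    }
}
```

Because the group is no longer named, WRAPPED has no "tsp" group to read; the other named groups behave as before.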