2015-06-22 48 views
2

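I have a log file containing timestamps and data, separated by ','. I would like a Python script that parses the log file and counts how many entries occur in each hour. What is the best way to search for the hour in the file?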

Here is an example of the log file contents:

2015-06-18 09:04:04.377,54954.418 
2015-06-18 09:04:48.863,54965.438 
2015-06-18 09:05:29.080,49.813 
2015-06-18 09:06:04.697,45.187 
2015-06-18 09:06:40.719,45.238 
2015-06-18 09:07:09.693,38.768 
2015-06-18 09:07:35.856,36.315 
2015-06-18 09:08:06.961,39.789 
2015-06-18 09:08:33.241,36.147 
2015-06-18 09:09:02.801,38.473 
2015-06-18 09:09:36.559,44.839 
2015-06-18 09:10:13.222,46.165 
2015-06-18 09:10:47.867,44.115 
2015-06-18 09:11:25.807,46.985 
2015-06-18 09:12:00.512,43.607 
2015-06-18 09:12:37.513,46.552 
2015-06-18 09:13:10.408,41.507 
2015-06-18 10:13:44.107,43.269 
2015-06-18 10:14:20.501,47.001 
2015-06-18 10:15:00.061,52.589 
2015-06-18 11:15:33.501,42.148 
2015-06-18 11:16:07.558,43.919 
2015-06-18 11:16:41.851,43.369 
2015-06-18 11:17:15.159,43.336 
2015-06-18 11:17:47.217,40.965 
2015-06-18 11:18:23.135,44.12 
2015-06-18 11:18:55.547,41.432 
2015-06-18 12:19:32.362,45.522 
2015-06-18 12:20:04.456,42.339 
2015-06-18 12:20:36.559,40.555 
2015-06-18 12:21:08.409,40.534 
2015-06-18 12:21:38.170,38.706 
2015-06-18 12:22:09.108,38.653 
2015-06-18 12:22:34.420,33.234 
2015-06-18 12:23:01.319,35.665 

So for this sample, there are 17 entries in the 9 a.m. hour, 3 in the 10 a.m. hour, and so on. How can I go about doing this?

+2

What have you tried yourself so far? Please provide a minimal example that demonstrates your problem! –

+0

Do I understand correctly: you want to count the lines that share the same date and hour? – Wolf

Answers

0

This can be done easily using pandas:

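import pandas as pd 
# The log has no header row, so the column names are given explicitly.
data = pd.read_csv('log.csv', names=['time', 'value']) 
data['time'] = pd.to_datetime(data['time']) 
data = data.set_index('time') 
# Count the rows in each hourly bin. The old resample(..., how='count')
# form is gone from later pandas; recent releases spell the alias 'h', older ones 'H'.
hour_count = data['value'].resample('h').count() 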
+0

*'This can be done easily'* - well, it definitely does not look that easy. See [Peter's answer](http://stackoverflow.com/a/30975312/2932052) – Wolf

5

You can use collections.Counter, which works like a histogram.

You are really only interested in the first 13 characters of each line. You can slice them out, e.g.:

>>> line = '2015-06-18 09:11:25.807,46.985' 
>>> line[:13] 
'2015-06-18 09' 

Putting it together:

data = """2015-06-18 09:11:25.807,46.985 
2015-06-18 09:12:00.512,43.607 
2015-06-18 09:12:37.513,46.552 
2015-06-18 09:13:10.408,41.507 
2015-06-18 10:13:44.107,43.269 
2015-06-18 10:14:20.501,47.001 
2015-06-18 10:15:00.061,52.589 
2015-06-18 11:15:33.501,42.148 
2015-06-18 11:16:07.558,43.919""" 

from collections import Counter 
c = Counter(line[:13] for line in data.split('\n')) 
print(c) 

Output:

Counter({'2015-06-18 09': 4, '2015-06-18 10': 3, '2015-06-18 11': 2}) 
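
To apply the same idea to a file rather than an inline string, a minimal sketch might look like the following. The helper name `count_per_hour` is an assumption made for illustration, and `io.StringIO` merely stands in for an opened log file. The sketch also assumes that every non-blank line starts with a fixed-width `YYYY-MM-DD HH` timestamp.

```python
import io
from collections import Counter


def count_per_hour(lines):
    """Count non-blank lines per 'YYYY-MM-DD HH' prefix."""
    return Counter(line[:13] for line in lines if line.strip())


# io.StringIO stands in for open('log.csv'); any iterable of lines works.
sample = io.StringIO(
    "2015-06-18 09:13:10.408,41.507\n"
    "2015-06-18 10:13:44.107,43.269\n"
    "2015-06-18 10:14:20.501,47.001\n"
)
counts = count_per_hour(sample)

# Print the hours in chronological order.
for hour, n in sorted(counts.items()):
    print(hour, n)
```

Because the generator expression consumes the lines one at a time, passing an open file object instead of the sample should not require loading the whole file into memory.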
1

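The following uses plain Python with no additional libraries and should work. It is also a better fit if your CSV file is very large, since it does not try to load the whole file into memory.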

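sHour = "" 
nThisHour = 1 

with open('log.csv') as ff: 
    for line in ff: 
        sCurHour = line[11:13]  # hour field of 'YYYY-MM-DD HH:...' 

        if sHour == sCurHour: 
            nThisHour += 1 
        else: 
            if sHour: 
                print(nThisHour)  # the previous hour is complete 

            nThisHour = 1 
            sHour = sCurHour 

    print(nThisHour)  # count for the last hour in the file 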

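This gives the following output, in the same order as the input:

17 
3 
7 
8 

If the date also matters, the line slice can be widened to include it (for example `line[:13]`). That would be relevant if the log went a whole day without a new entry, since two consecutive lines could then share the same hour on different dates.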
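
The widened slice can be sketched with `itertools.groupby`, which groups consecutive lines by a key in the same streaming fashion. This swaps in a different technique from the loop above, so treat it as an illustration only. The helper name `runs_per_hour` and the sample lines are made up for the example, and `io.StringIO` stands in for the opened file.

```python
import io
from itertools import groupby


def runs_per_hour(lines):
    """Yield (date-and-hour, count) for each run of consecutive lines."""
    non_blank = (line for line in lines if line.strip())
    # The key includes the date, so the same hour on two days stays separate.
    for key, group in groupby(non_blank, key=lambda line: line[:13]):
        yield key, sum(1 for _ in group)


# Made-up sample: the hour is 23 on two different dates.
sample = io.StringIO(
    "2015-06-18 23:58:10.408,41.507\n"
    "2015-06-19 23:01:44.107,43.269\n"
    "2015-06-19 23:14:20.501,47.001\n"
)
for key, n in runs_per_hour(sample):
    print(key, n)
```

With `line[11:13]` as the key, all three sample lines would fall into a single run of hour 23; with `line[:13]` they split into one run per date.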

+0

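Provides more insight, but the date part of the timestamp is effectively treated as irrelevant. That might be acceptable, but I doubt it is intended. – Wolf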

1

If values are not deduplicated within the same hour, identical data such as:

2015-06-18 09:06:04.697,45.187 
2015-06-18 09:06:40.719,45.187 

is counted twice.

The simplest approach:

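from collections import defaultdict 

path = 'log.csv'  # path to the log file 
d = defaultdict(int)  # missing keys start at 0 
with open(path, 'r') as f: 
    # Iterate over the file directly; f.xreadlines() is deprecated. 
    for line in f: 
        d[line.strip()[:13]] += 1 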
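
If the intent really were to count each data value only once per hour, one possible sketch is to collect the values in a set per hour. This is only an interpretation of the remark above. The variable names are hypothetical, and the sample combines the two lines quoted in this answer with one line from the question's log.

```python
from collections import defaultdict

lines = [
    "2015-06-18 09:06:04.697,45.187",
    "2015-06-18 09:06:40.719,45.187",
    "2015-06-18 09:07:09.693,38.768",
]

# Map each 'YYYY-MM-DD HH' prefix to the set of data values seen in that hour.
values_per_hour = defaultdict(set)
for line in lines:
    timestamp, value = line.strip().split(",", 1)
    values_per_hour[timestamp[:13]].add(value)

# The set size is the number of distinct values in that hour.
distinct_counts = {hour: len(values) for hour, values in values_per_hour.items()}
print(distinct_counts)
```

Here the repeated value 45.187 contributes once, so the hour holds two distinct values even though it has three lines.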
+0

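*'consider the same data in the same hour'* I would leave that discussion out; it is confusing, because you do not know what the data means, and it could just as well be an event identifier. And yes, the rest is simple :-) – Wolf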

+0

By the way: ['xreadlines' has been deprecated since Python 2.3](https://docs.python.org/release/2.3/lib/module-xreadlines.html) – Wolf

+0

Thanks for the reminder, I'm using py2.7 – LittleQ

0

Here is a complete API for day/hour/minute/second/ms counters, which also takes the path of the log file.

from collections import defaultdict, Counter 
import re 

def _get(pattern, line): 
    return re.findall(pattern, line) 

def get(infile, days=False, hours=True, mils=False, min_=False, sec=False): 
    # Raw strings keep the regex backslashes intact. 
    days_pattern = r"\d{4}-\d{1,2}-\d{1,2}" 
    days_hours_pattern = days_pattern + r"\s?\d{1,2}" 
    days_min_pattern = days_pattern + r"\s?\d{1,2}:\d{1,2}" 
    day_hours_min_s_pattern = days_pattern + r"\s?\d{1,2}:\d{1,2}:\d{1,2}" 
    day_hours_min_ms_pattern = day_hours_min_s_pattern + r"\.\d+,\d+" 

    # One bucket of matches per granularity. 
    result = dict() 
    result['days'] = defaultdict(list) 
    result['hours'] = defaultdict(list) 
    result['ms'] = defaultdict(list) 
    result['min'] = defaultdict(list) 
    result['sec'] = defaultdict(list) 

    with open(infile) as fh: 
        for line in fh: 
            # Loop variables are named so they do not shadow the flags or builtins. 
            if days: 
                for cdays in _get(days_pattern, line): 
                    result['days'][cdays].append(cdays) 
            if hours: 
                for chour in _get(days_hours_pattern, line): 
                    result['hours'][chour].append(chour) 
            if min_: 
                for cmin in _get(days_min_pattern, line): 
                    result['min'][cmin].append(cmin) 
            if sec: 
                for csec in _get(day_hours_min_s_pattern, line): 
                    result['sec'][csec].append(csec) 
            if mils: 
                for cmils in _get(day_hours_min_ms_pattern, line): 
                    result['ms'][cmils].append(cmils) 

    # Collapse every bucket into a Counter keyed by the matched prefix. 
    summary = dict() 
    for k in result: 
        for i in result[k]: 
            summary[i] = Counter(result[k][i]) 
    return result, summary 

fin = "./in.txt" 
# Named 'summary' rather than 'sum' so the builtin is not shadowed. 
result, summary = get(fin, days=True, mils=True, min_=True, hours=True, sec=True) 

# these lookups work; each one returns a Counter for that key 
summary['2015-06-18'] 
summary['2015-06-18 09'] 
summary['2015-06-18 09:04'] 
summary['2015-06-18 09:04:04'] 
summary["2015-06-18 09:04:04.377,54954"] 
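
For comparison, if the timestamps are strictly fixed-width (an assumption the regex version does not need), the same per-granularity counts could be sketched with plain string prefixes and `collections.Counter`. The names `PREFIX_LENGTHS` and `count_by_prefix` are hypothetical, and the sample lines are taken from the log in the question.

```python
from collections import Counter

# Prefix lengths of a fixed-width 'YYYY-MM-DD HH:MM:SS.mmm' timestamp.
PREFIX_LENGTHS = {"days": 10, "hours": 13, "min": 16, "sec": 19}


def count_by_prefix(lines):
    """Return one Counter per granularity, keyed by the timestamp prefix."""
    counts = {name: Counter() for name in PREFIX_LENGTHS}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for name, length in PREFIX_LENGTHS.items():
            counts[name][line[:length]] += 1
    return counts


sample = [
    "2015-06-18 09:04:04.377,54954.418",
    "2015-06-18 09:04:48.863,54965.438",
    "2015-06-18 09:05:29.080,49.813",
]
counts = count_by_prefix(sample)
print(counts["hours"]["2015-06-18 09"])
print(counts["min"]["2015-06-18 09:04"])
```

This trades the flexibility of the regular expressions (which tolerate one-digit fields) for simpler code, so it only fits logs whose fields are always zero-padded.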