2017-02-27 102 views
3

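Converting a pandas DataFrame to a frequency matrix

I am trying to convert a pandas DataFrame with three columns (Date, Start, End) into a frequency matrix. My input DataFrame looks like this: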

Date,    Start, End 
2016-09-02 09:16:00 18  16 
2016-09-02 16:14:10 16  1 
2016-09-02 06:17:21 18  17 
2016-09-02 05:51:07 23  17 
2016-09-02 18:34:44 18  17 
2016-09-02 05:44:44 20  4 
2016-09-02 09:25:22 18  17 
2016-09-02 22:27:44 18  17 
2016-09-02 16:02:46 0  18 
2016-09-02 15:35:07 17  17 
2016-09-02 16:06:42 8  17 
2016-09-02 14:47:04 16  23 
2016-09-02 07:47:24 20  1 
... 

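The values of 'Start' and 'End' are integers between 0 and 23. 'Date' is a datetime. The frequency matrix I am trying to create is a 24-by-24 CSV in which the entry at row i and column j is the number of times 'End' = i and 'Start' = j occur together in the input. For example, the data above would produce: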

0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0 
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 
5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 
17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 1 
18, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 

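As a bonus question: can this be done in a way that creates a separate matrix for each 15-minute interval? That would be 672 matrices, since the date range is one week. I am a self-taught beginner and I really can't figure out how to solve this in a Pythonic way; any solutions or suggestions would be greatly appreciated.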

Answer

5

Build the matrix with a simple count, then unstack one of the two grouping columns:

mat = df.groupby(['Start', 'End']).count().unstack(level=0) 

Drop the Date level from the columns:

mat.columns = mat.columns.droplevel(0) 

Now reindex the rows and columns to cover 0 through 23, fill the gaps with zeros, and cast to integers:

mat = mat.reindex(index=range(24), columns=range(24)).fillna(0).astype(int)
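Putting the three steps together, here is a minimal end-to-end sketch that can be run as-is. It assumes the sample rows from the question have been loaded into a DataFrame named df; the variable csv_text is only an illustration and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question: (Date, Start, End).
rows = [
    ("2016-09-02 09:16:00", 18, 16),
    ("2016-09-02 16:14:10", 16, 1),
    ("2016-09-02 06:17:21", 18, 17),
    ("2016-09-02 05:51:07", 23, 17),
    ("2016-09-02 18:34:44", 18, 17),
    ("2016-09-02 05:44:44", 20, 4),
    ("2016-09-02 09:25:22", 18, 17),
    ("2016-09-02 22:27:44", 18, 17),
    ("2016-09-02 16:02:46", 0, 18),
    ("2016-09-02 15:35:07", 17, 17),
    ("2016-09-02 16:06:42", 8, 17),
    ("2016-09-02 14:47:04", 16, 23),
    ("2016-09-02 07:47:24", 20, 1),
]
df = pd.DataFrame(rows, columns=["Date", "Start", "End"])
df["Date"] = pd.to_datetime(df["Date"])

# Count each (Start, End) pair, then move Start into the columns.
mat = df.groupby(["Start", "End"]).count().unstack(level=0)

# Remove the leftover 'Date' level so the columns are just the Start values.
mat.columns = mat.columns.droplevel(0)

# Expand to the full 24 x 24 grid; pairs that never occur become 0.
mat = mat.reindex(index=range(24), columns=range(24)).fillna(0).astype(int)

# Calling to_csv() without a path returns the CSV as a string.
csv_text = mat.to_csv()
```

Rows of mat are End values and columns are Start values, matching the layout requested in the question. Passing a file path to to_csv would write the matrix to disk instead of returning a string.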

Detailed explanation

First, you count how many times each (Start, End) pair occurs. Grouping by these two columns produces a result whose index is a MultiIndex.

df.groupby(['Start', 'End']).count()
Out[134]:
           Date
Start End
0     18      1
8     17      1
16    1       1
      23      1
17    17      1
18    16      1
      17      4
20    1       1
      4       1
23    17      1

We want one of these index levels to become the column index. unstack does exactly that:

df.groupby(['Start', 'End']).count().unstack(level=0)
Out[135]:
      Date
Start    0    8   16   17   18   20   23
End
1      NaN  NaN  1.0  NaN  NaN  1.0  NaN
4      NaN  NaN  NaN  NaN  NaN  1.0  NaN
16     NaN  NaN  NaN  NaN  1.0  NaN  NaN
17     NaN  1.0  NaN  1.0  4.0  NaN  1.0
18     1.0  NaN  NaN  NaN  NaN  NaN  NaN
23     NaN  NaN  1.0  NaN  NaN  NaN  NaN

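The result of unstack is that the Start level is moved out of the row index and becomes an additional column-index level alongside the existing Date level (see the _.columns output below). That is why we drop level 0 afterwards. An alternative, depending on your current source code, is to select the Date column before unstacking, so that unstack produces a single column level.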
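The alternative mentioned in the previous paragraph could look roughly like the sketch below. The four-row df is a made-up subset of the question's data, used only to keep the example short; selecting 'Date' before count() yields a Series, so unstack returns flat columns and no droplevel call is needed.

```python
import pandas as pd

# Small illustrative subset of the question's data (assumption, for brevity).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00",
        "2016-09-02 06:17:21",
        "2016-09-02 18:34:44",
        "2016-09-02 16:14:10",
    ]),
    "Start": [18, 18, 18, 16],
    "End": [16, 17, 17, 1],
})

# Selecting 'Date' first makes count() return a Series indexed by (Start, End).
counts = df.groupby(["Start", "End"])["Date"].count()

# Unstacking a Series gives a DataFrame with a single column level (Start).
mat = counts.unstack(level="Start", fill_value=0)

# Expand to the full 24 x 24 grid, filling absent pairs with 0.
mat = mat.reindex(index=range(24), columns=range(24), fill_value=0)
```

Because fill_value=0 is supplied at both steps, no NaN values are introduced along the way.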

_.columns 
Out[136]: 
MultiIndex(levels=[['Date'], [0, 8, 16, 17, 18, 20, 23]], 
      labels=[[0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6]], 
      names=[None, 'Start']) 
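The answer above does not cover the question's follow-up about one matrix per 15-minute interval. One possible approach, offered as a sketch rather than a definitive solution, is to floor each timestamp to its 15-minute window and build a matrix per window. The names window, matrices, and hours are hypothetical, and the five-row df is a made-up subset of the question's data.

```python
import pandas as pd

# Small illustrative subset of the question's data (assumption, for brevity).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00",
        "2016-09-02 09:25:22",
        "2016-09-02 16:02:46",
        "2016-09-02 16:06:42",
        "2016-09-02 16:14:10",
    ]),
    "Start": [18, 18, 0, 8, 16],
    "End": [16, 17, 18, 17, 1],
})

hours = range(24)

# Floor each timestamp to the start of the 15-minute window it falls in.
window = df["Date"].dt.floor("15min")

# Map each window start to its own 24 x 24 matrix (rows = End, columns = Start).
matrices = {}
for window_start, chunk in df.groupby(window):
    counts = chunk.groupby(["End", "Start"]).size().unstack(level="Start", fill_value=0)
    matrices[window_start] = counts.reindex(index=hours, columns=hours, fill_value=0)
```

Only windows that contain at least one row appear in matrices; any of the 672 weekly windows missing from the dictionary corresponds to an all-zero matrix. Each entry can then be written out with to_csv under a file name of your choosing.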
+0

Nice solution using 'reindex'! – pansen

+0

Thanks! It works, but I'm a bit lost as to how. Could you explain what unstack does? –

+1

Unstack pivots the table, turning a column into a row. – postoronnim