2016-08-03 79 views
1

Here is my text file, which consists of multiple columns. I want to group it and format the result in pandas.

No.,Time,Source,Destination,Protocol,Length,Info,SrcPort,DstPort,src_dst_pair 
1401,0.397114,145.95.225.186,210.218.218.164,UDP,100,Source port: hsrp Destination port: hsrp,hsrp,1985,"('145.95.225.186', '210.218.218.164')" 
8999,3.229111,145.95.225.186,210.218.218.164,UDP,100,Source port: hsrp Destination port: hsrp,hsrp,1985,"('145.95.225.186', '210.218.218.164')" 
18504,5.877098,145.95.225.186,210.218.218.164,UDP,100,Source port: hsrp Destination port: hsrp,hsrp,1985,"('145.95.225.186', '210.218.218.164')" 
23755,8.695843,145.95.225.186,210.218.218.164,UDP,100,Source port: hsrp Destination port: hsrp,hsrp,1985,"('145.95.225.186', '210.218.218.164')" 
28027,11.24121,145.95.225.186,210.218.218.164,UDP,100,Source port: hsrp Destination port: hsrp,hsrp,1985,"('145.95.225.186', '210.218.218.164')" 
33304,14.117213,145.95.225.186,210.218.218.164,UDP,100,Source port: hsrp Destination port: hsrp,hsrp,1985,"('145.95.225.186', '210.218.218.164')" 
700443,222.305789,145.95.41.251,145.95.81.118,UDP,50,Source port: 36477 Destination port: snmp,36477,161,"('145.95.41.251', '145.95.81.118')" 
700495,222.351933,145.95.41.251,145.95.81.118,UDP,50,Source port: 36477 Destination port: snmp,36477,161,"('145.95.41.251', '145.95.81.118')" 
700496,222.352372,145.95.41.251,145.95.81.118,UDP,50,Source port: 36477 Destination port: snmp,36477,161,"('145.95.41.251', '145.95.81.118')" 
708982,225.913385,145.95.41.251,145.95.81.118,UDP,50,Source port: 36477 Destination port: snmp,36477,161,"('145.95.41.251', '145.95.81.118')" 
709797,226.130847,145.95.41.251,145.95.81.118,UDP,50,Source port: 36477 Destination port: snmp,36477,161,"('145.95.41.251', '145.95.81.118')" 
710340,226.372421,145.95.41.251,145.95.81.118,UDP,50,Source port: 36477 Destination port: snmp,36477,161,"('145.95.41.251', '145.95.81.118')" 

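I want to group the data by Source and Destination, and then:

  1. Sum the Length column within each group

  2. Find the difference between the maximum and minimum Time within each group

I get the results, but I need them formatted the way I show in the expected output. I would also like to know whether there is a better way to do this.

Here is my attempt: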

import pandas as pd

data = pd.read_csv('simple_udp.csv')
# getting the accumulated sum for the group
length = data.groupby(['Source','Destination']).Length.sum()
# getting the difference in time between the max and min in the group
time = data.groupby(['Source','Destination']).Time.max() - data.groupby(['Source','Destination']).Time.min()
# This is where I have a problem. How can I format the result so that
# I get the expected output (shown below)?
print(length, time)

Expected output

Source          Destination      Length  Time
145.95.225.186  210.218.218.164  600     13.720099
145.95.41.251   145.95.81.118    300     4.066632

Answers

2

Use agg:

data.groupby(['Source','Destination']).agg({'Length': 'sum', 'Time': lambda x: x.max() - x.min()}) 
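The call above leaves Source and Destination in the index. To get the flat layout from the question's expected output, they can be moved back into ordinary columns with reset_index(). A minimal sketch, assuming a reasonably recent pandas; the inline sample (four rows and four columns taken from the question's data) is only a stand-in for simple_udp.csv:

```python
import io
import pandas as pd

# First and last row of each group from the question's data,
# trimmed to the columns that matter here (stand-in for simple_udp.csv).
sample = io.StringIO("""Time,Source,Destination,Length
0.397114,145.95.225.186,210.218.218.164,100
14.117213,145.95.225.186,210.218.218.164,100
222.305789,145.95.41.251,145.95.81.118,50
226.372421,145.95.41.251,145.95.81.118,50
""")
data = pd.read_csv(sample)

result = (
    data.groupby(['Source', 'Destination'])
        .agg({'Length': 'sum', 'Time': lambda x: x.max() - x.min()})
        .reset_index()  # move Source/Destination from the index back into columns
)

# to_string(index=False) omits the 0, 1, ... row labels when printing
print(result.to_string(index=False))
```

Because the sample keeps only two rows per group, the Length sums here are smaller than in the full file; the Time differences are the same, since the first and last packet of each group are included.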


0

My first guess would be

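import pandas as pd
data = pd.read_csv('simple_udp.csv')
# Creating a DataFrameGroupBy object
group = data.groupby(['Source', 'Destination'])
df_length = group['Length'].sum()
df_time = group['Time'].max() - group['Time'].min()
# Put the two Series side by side as columns, then turn the
# Source/Destination index back into ordinary columns
df = pd.concat([df_length, df_time], axis=1).reset_index()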

Or, if you want it in fewer lines (though it is also less readable), use the agg method of group.
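One way that agg variant could look, as a sketch rather than the only spelling: it uses the keyword ("named aggregation") form, which assumes pandas 0.25 or later. The small DataFrame and its addresses are made up for illustration and stand in for simple_udp.csv.

```python
import pandas as pd

# Hypothetical data standing in for simple_udp.csv
data = pd.DataFrame({
    'Source':      ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.2'],
    'Destination': ['10.0.0.9', '10.0.0.9', '10.0.0.9', '10.0.0.9'],
    'Time':        [1.0, 3.5, 10.0, 10.25],
    'Length':      [100, 100, 50, 50],
})

group = data.groupby(['Source', 'Destination'])

# Each keyword names an output column: (input column, aggregation)
df = group.agg(
    Length=('Length', 'sum'),
    Time=('Time', lambda x: x.max() - x.min()),
).reset_index()  # Source/Destination become regular columns again

print(df)
```

The keyword form fixes the output column names and their order explicitly, which can be easier to read than a dict once more than a couple of aggregations are involved.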