
Below is a simplified chunk of my dataframe. I want to extract values from one particular column and, based on them, process the values in the other columns using pandas.

first.csv

No.,Time,Source,Destination,Protocol,Length,Info,src_dst_pair 
325778,112.305107,02:e0,Broadcast,ARP,64,Who has 253.244.230.77? Tell 253.244.230.67,"('02:e0', 'Broadcast')" 
801130,261.868118,02:e0,Broadcast,ARP,64,Who has 253.244.230.156? Tell 253.244.230.67,"('02:e0', 'Broadcast')" 
700094,222.055094,02:e0,Broadcast,ARP,60,Who has 253.244.230.77? Tell 253.244.230.156,"('02:e0', 'Broadcast')" 
766543,247.796156,100.118.138.150,41.177.26.176,TCP,66,32222 > http [SYN] Seq=0,"('100.118.138.150', '41.177.26.176')" 
767405,248.073313,100.118.138.150,41.177.26.176,TCP,64,32222 > http [ACK] Seq=1,"('100.118.138.150', '41.177.26.176')" 
767466,248.083268,100.118.138.150,41.177.26.176,HTTP,380,Continuation [Packet capture],"('100.118.138.150', '41.177.26.176')" 
891394,294.989813,105.144.38.121,41.177.26.15,TCP,66,48852 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.121', '41.177.26.15')" 
892285,295.320654,105.144.38.121,41.177.26.15,TCP,64,48852 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.121', '41.177.26.15')" 
892287,295.321003,105.144.38.121,41.177.26.15,HTTP,350,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.121', '41.177.26.15')" 
893306,295.652079,105.144.38.121,41.177.26.15,TCP,64,48852 > http [ACK] Seq=293 Ack=609 Win=64928 Len=0,"('105.144.38.121', '41.177.26.15')" 
893307,295.652233,105.144.38.121,41.177.26.15,TCP,64,"48852 > http [FIN, ACK] Seq=293 Ack=609 Win=64928 Len=0","('105.144.38.121', '41.177.26.15')" 
885501,294.070377,105.144.38.139,41.177.26.15,TCP,66,48810 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.139', '41.177.26.15')" 
887786,294.402349,105.144.38.139,41.177.26.15,TCP,64,48810 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.139', '41.177.26.15')" 
887788,294.402642,105.144.38.139,41.177.26.15,HTTP,371,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.139', '41.177.26.15')" 
890133,294.732297,105.144.38.139,41.177.26.15,TCP,64,"48810 > http [FIN, ACK] Seq=314 Ack=629 Win=64907 Len=0","('105.144.38.139', '41.177.26.15')" 
890154,294.733413,105.144.38.139,41.177.26.15,TCP,64,48810 > http [ACK] Seq=315 Ack=630 Win=64907 Len=0,"('105.144.38.139', '41.177.26.15')" 
902758,297.792645,105.144.38.164,41.177.26.15,TCP,66,49005 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.164', '41.177.26.15')" 
903926,298.123157,105.144.38.164,41.177.26.15,TCP,64,49005 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.164', '41.177.26.15')" 
903932,298.123369,105.144.38.164,41.177.26.15,HTTP,350,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.164', '41.177.26.15')" 
905269,298.455368,105.144.38.164,41.177.26.15,TCP,64,49005 > http [ACK] Seq=293 Ack=609 Win=64928 Len=0,"('105.144.38.164', '41.177.26.15')" 
905273,298.455557,105.144.38.164,41.177.26.15,TCP,64,"49005 > http [FIN, ACK] Seq=293 Ack=609 Win=64928 Len=0","('105.144.38.164', '41.177.26.15')" 
906162,298.714281,105.144.38.204,41.177.26.15,TCP,66,49050 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.204', '41.177.26.15')" 
907292,299.025951,105.144.38.204,41.177.26.15,TCP,64,49050 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.204', '41.177.26.15')" 
907294,299.026985,105.144.38.204,41.177.26.15,HTTP,354,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.204', '41.177.26.15')" 
907811,299.362918,105.144.38.204,41.177.26.15,TCP,64,49050 > http [ACK] Seq=297 Ack=613 Win=64924 Len=0,"('105.144.38.204', '41.177.26.15')" 
907812,299.362951,105.144.38.204,41.177.26.15,TCP,64,"49050 > http [FIN, ACK] Seq=297 Ack=613 Win=64924 Len=0","('105.144.38.204', '41.177.26.15')" 

How can I do the following in pandas? For each unique df.src_dst_pair (the last element in each row):

  1. Check df.Info for [SYN]. If it is not there, skip the row.

  2. If df.Info contains [SYN], store df.Time (this marks the start time).

  3. Start accumulating df.Length from the [SYN] row until we find a [FIN, ACK].

  4. Once we find [FIN, ACK] in df.Info, store df.Time (this marks the stop time). If no [FIN, ACK] is found in df.Info for a given df.src_dst_pair, skip that df.src_dst_pair.

  5. Finally, summarize the results in the form below (a rough sketch of the whole scan follows the expected output):

df.src_dst_pair: flow number, (accumulated) df.Length, df.Time(stop)-df.Time(start) 

Expected output for first.csv:

('105.144.38.121', '41.177.26.15') : flow 1, 1118, 0.66242 
('105.144.38.139', '41.177.26.15') : flow 1, 565, 0.028527 
('105.144.38.139', '41.177.26.15') : flow 2, 608, 0.662912 
('105.144.38.204', '41.177.26.15') : flow 1, 612, 0.64867 
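
To make the intent concrete, here is a rough plain-Python sketch of the scan I have in mind (the variable names are only illustrative); I am looking for the idiomatic pandas equivalent:

import csv
from collections import defaultdict

# Group the rows by src_dst_pair, keeping their original order
rows_by_pair = defaultdict(list)
with open('first.csv') as f:
    for row in csv.DictReader(f):
        rows_by_pair[row['src_dst_pair']].append(row)

for pair, rows in rows_by_pair.items():
    flow = 0
    in_flow = False
    for row in rows:
        if not in_flow and '[SYN]' in row['Info']:
            # Steps 1-2: a [SYN] opens a flow; remember the start time
            in_flow = True
            start = float(row['Time'])
            total_length = 0
        if in_flow:
            # Step 3: accumulate Length until a [FIN, ACK] shows up
            total_length += int(row['Length'])
            if '[FIN, ACK]' in row['Info']:
                # Steps 4-5: [FIN, ACK] closes the flow; report it.
                # Pairs that never see a [FIN, ACK] are silently skipped.
                flow += 1
                duration = float(row['Time']) - start
                print('%s : flow %d, %d, %g' % (pair, flow, total_length, duration))
                in_flow = False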

My approach:

import pandas
import numpy


data = pandas.read_csv('first.csv')
print(data)

uniq_src_dst_pair = numpy.unique(data['src_dst_pair'].values)
print(uniq_src_dst_pair)
print(len(uniq_src_dst_pair))

# So far I can only sum Length per src_dst_pair; I still need the per-flow info.
result = data.groupby('src_dst_pair').Length.sum()
print(result)
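
As a partial step I can also count how many flows start in each pair by counting the [SYN] packets, but I do not know how to match each [SYN] with its [FIN, ACK] and accumulate the lengths in between:

# Count [SYN] packets per pair; this gives the number of flows that start,
# but not their lengths or durations
is_syn = data['Info'].str.contains(r'\[SYN\]')
print(data[is_syn].groupby('src_dst_pair').size())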

Answer
import pandas as pd


def extract_flows(g):
    # Find the location of SYN packets
    is_syn = g['Info'].fillna('').str.contains(r'\[SYN\]')
    syn = g[is_syn].index.values

    # Find the location of the FIN-ACK packets
    is_finack = g['Info'].fillna('').str.contains(r'\[FIN, ACK\]')
    finack = g[is_finack].index.values

    # Loop over SYN packets
    runs = []
    for num, start in enumerate(syn, start=1):
        try:
            # Find the first FIN-ACK packet after each SYN packet.
            # If there is none, this raises IndexError.
            stop = finack[finack > start][0]
            runs.append([
                # The flow number counter
                num,
                # The time difference between the packets
                g.loc[stop, 'Time'] - g.loc[start, 'Time'],
                # The accumulated length
                g.loc[start:stop, 'Length'].sum(),
            ])
        except IndexError:
            break

    # The output must be a DataFrame
    output = (pd.DataFrame(runs, columns=['Flow number', 'Time', 'Length'])
                .set_index('Flow number'))
    return output


df = pd.read_csv('first.csv', usecols=['src_dst_pair', 'Info', 'Time', 'Length'])

result = df.groupby('src_dst_pair').apply(extract_flows)
print(result)

Output:

                                                    Time  Length
src_dst_pair                       Flow number
('105.144.38.121', '41.177.26.15') 1            0.662420   608.0
('105.144.38.139', '41.177.26.15') 1            0.661920   565.0
                                   2            0.662912   608.0
('105.144.38.204', '41.177.26.15') 1            0.648670   612.0
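
If you need the exact one-line-per-flow text format shown in the question, something along these lines should work on top of result (note that Length comes before the duration in the requested format):

# Print one line per flow in the format requested in the question
for (pair, flow), row in result.iterrows():
    print('%s : flow %d, %g, %g' % (pair, flow, row['Length'], row['Time']))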

N.B.: the sample data in the question is not consistent with the sample data in the linked first.csv. Some of the numbers in the output above therefore differ from the OP's expected output for first.csv, and some of that expected output assumes a different treatment, where I believe mine is the correct one.


Thanks, I will check and let you know. – user2532296


Hi Alberto, I tried your code. It works for the smaller sample, but when I try it with the following dataset (https://github.com/ecenm/data/blob/master/sorted.csv.zip) it gives a 'ValueError: cannot index with vector containing NA / NaN values' at line 37 of file "first.py", in result = df.groupby('src_dst_pair').apply(extract_flows). – user2532296


The file you linked to contains 'NaN' in the 'Info' column (your sample does not), which coerces 'is_syn' and 'is_finack' to 'float' dtype rather than 'bool'. Filling those 'NaN' with empty strings fixes it; see the edit. –
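
For reference, a tiny made-up two-row frame reproduces the problem and the fix (the exact error message varies between pandas versions):

import pandas as pd

# Toy frame with one missing Info value, like the rows in the linked capture
g = pd.DataFrame({'Info': ['32222 > http [SYN] Seq=0', None]})

# Without fillna the mask contains NaN, so it is not a clean boolean mask
# and cannot be used to index the frame
try:
    print(g[g['Info'].str.contains(r'\[SYN\]')])
except ValueError as err:
    print(err)

# Filling the missing values first yields a proper boolean mask
print(g[g['Info'].fillna('').str.contains(r'\[SYN\]')])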