2017-04-02

New to pandas. Any help extracting specific data is appreciated.

I need to pull fields out of io['payload'].

Snapshot of the dataset

import pandas as pd

def csv_reader(fileName):
    reqcols = ['_id__$oid', 'payload', 'channel']
    io = pd.read_csv(fileName, sep=",", usecols=reqcols)
    print(io['payload'].values)
    return io
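Note that each value in io['payload'] is a JSON string, not a dict, which is why plain column access is not enough. A minimal sketch of parsing one value (using a shortened record based on the sample output below):

```python
import json

# A shortened payload value, as read from the CSV: a JSON string, not a dict
raw = '{"destination_ip": "172.31.14.66", "proto": "UDP", "source_port": "53"}'

record = json.loads(raw)          # parse the JSON string into a dict
print(record["destination_ip"])   # -> 172.31.14.66
```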

Output row:

{ 
    "destination_ip": "172.31.14.66", 
    "date": "2014-10-19T01:32:36.669861", 
    "classification": "Potentially Bad Traffic", 
    "proto": "UDP", 
    "source_ip": "172.31.0.2", 
    "priority": "2", 
    "header": "1:2003195:5", 
    "signature": "ET POLICY Unusual number of DNS No Such Name Responses ", 
    "source_port": "53", 
    "destination_port": "34638", 
    "sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e" 
} 

I am trying to extract specific fields from this ndarray object. What methods can be used to extract them from the DataFrame?

"destination_ip": "172.31.13.124", 
"proto": "ICMP", 
"source_ip": "201.158.32.1", 
"date": "2014-09-28T14:49:43.391463", 
"sensor": "139cfdf2-471e-11e4-9ee4-0a0b6e7c3e9e" 
Show us a sample of your input data. – JohnZwinck

@JohnZwinck Please check the updated question. – user1208523

Answers

I think you need to first convert the string representation of the dicts in column payload to dictionaries in each row with json.loads or ast.literal_eval, then create a new DataFrame via the constructor, filter the columns by a subset, and if necessary add the original columns back with concat:

d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']} 
reqcols=['_id__$oid','payload','channel'] 
df = pd.DataFrame(d) 
print (df) 
    _id__$oid  channel           payload 
0  542f8 snort_alert {"destination_ip":"172.31.14.66","date": "2014... 
1  542f8 snort_alert {"destination_ip":"172.31.14.66","date": "2014... 
2  542f8 snort_alert {"destination_ip":"172.31.14.66","date": "2014... 

import json
import ast

df.payload = df.payload.apply(json.loads)
# another, slower solution:
# df.payload = df.payload.apply(ast.literal_eval)

required = ["destination_ip", "proto", "source_ip", "date", "sensor"] 
df1 = pd.DataFrame(df.payload.values.tolist())[required] 
print (df1) 
    destination_ip proto source_ip      date \ 
0 172.31.14.66 UDP 172.31.0.2 2014-10-19T01:32:36.669861 
1 172.31.14.66 UDP 172.31.0.2 2014-10-19T01:32:36.669861 
2 172.31.14.66 UDP 172.31.0.2 2014-10-19T01:32:36.669861 

           sensor 
0 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 
1 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 
2 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 

df2 = pd.concat([df[['_id__$oid','channel']], df1], axis=1) 
print (df2) 
    _id__$oid  channel destination_ip proto source_ip \ 
0  542f8 snort_alert 172.31.14.66 UDP 172.31.0.2 
1  542f8 snort_alert 172.31.14.66 UDP 172.31.0.2 
2  542f8 snort_alert 172.31.14.66 UDP 172.31.0.2 

         date        sensor 
0 2014-10-19T01:32:36.669861 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 
1 2014-10-19T01:32:36.669861 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 
2 2014-10-19T01:32:36.669861 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 

Timings:

#[30000 rows x 3 columns] 
df = pd.concat([df]*10000).reset_index(drop=True) 
print (df) 

In [38]: %timeit pd.DataFrame(df.payload.apply(json.loads).values.tolist())[required] 
1 loop, best of 3: 379 ms per loop 

In [39]: %timeit pd.read_json('[{}]'.format(df.payload.str.cat(sep=',')))[required] 
1 loop, best of 3: 528 ms per loop 

In [40]: %timeit pd.DataFrame(df.payload.apply(ast.literal_eval).values.tolist())[required] 
1 loop, best of 3: 1.98 s per loop 
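As a side note (my addition, not part of the original answer): newer pandas versions (0.25+) ship pd.json_normalize, which can replace the DataFrame-constructor step after json.loads. A hedged sketch:

```python
import json
import pandas as pd

df = pd.DataFrame({'payload': [
    '{"destination_ip": "172.31.14.66", "proto": "UDP"}',
    '{"destination_ip": "172.31.13.124", "proto": "ICMP"}',
]})

# Parse each JSON string into a dict, then flatten the dicts into columns
parsed = pd.json_normalize(df['payload'].apply(json.loads).tolist())
print(parsed['proto'].tolist())  # -> ['UDP', 'ICMP']
```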
Thanks for the solution. – user1208523

Accessing pandas columns is very straightforward. Simply pass a list of the columns you need:

Code:

columns = ["destination_ip", "proto", "source_ip", "date", "sensor"] 
extracted_data = df[columns] 

Test Code:

data = { 
    "destination_ip": "172.31.14.66", 
    "date": "2014-10-19T01:32:36.669861", 
    "classification": "Potentially Bad Traffic", 
    "proto": "UDP", 
    "source_ip": "172.31.0.2", 
    "priority": "2", 
    "header": "1:2003195:5", 
    "signature": "ET POLICY Unusual number of DNS No Such Name Responses ", 
    "source_port": "53", 
    "destination_port": "34638", 
    "sensor": "5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e" 
} 
df = pd.DataFrame([data, data]) 

columns = ["destination_ip", "proto", "source_ip", "date", "sensor"] 
print(df[columns]) 

Results:

destination_ip proto source_ip      date \ 
0 172.31.14.66 UDP 172.31.0.2 2014-10-19T01:32:36.669861 
1 172.31.14.66 UDP 172.31.0.2 2014-10-19T01:32:36.669861 

           sensor 
0 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 
1 5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e 
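One caveat (my addition, not part of this answer): df[columns] raises a KeyError when any requested column is missing, which can happen here since e.g. ICMP records carry no port fields. DataFrame.reindex fills missing columns with NaN instead:

```python
import pandas as pd

df = pd.DataFrame([{"proto": "UDP", "source_port": "53"},
                   {"proto": "ICMP"}])          # no source_port here

wanted = ["proto", "source_port", "sensor"]     # 'sensor' absent entirely
subset = df.reindex(columns=wanted)             # missing columns become NaN
print(subset["sensor"].isna().all())            # -> True
```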

The problem is that payload is one column of the CSV input data, and it is a JSON string. So you can first use read_csv() to parse the whole file as you did, but then you need to parse each JSON object. Let's use this example data:

payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}']) 

Now make a single JSON string:

json = ','.join(payload).join('[]') 

which gives:

'[{"a":1, "b":2}, {"b":4, "c":5}]' 

Then parse it:

pd.read_json(json) 

to get:

 a b c 
0 1.0 2 NaN 
1 NaN 4 5.0 
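The steps above can be combined into one expression; a small sketch (io.StringIO added because newer pandas versions warn on passing a raw string to read_json):

```python
import io
import pandas as pd

payload = pd.Series(['{"a":1, "b":2}', '{"b":4, "c":5}'])

# join the JSON objects with commas, wrap in [...], and parse once
combined = ','.join(payload).join('[]')
df = pd.read_json(io.StringIO(combined))
print(df['b'].tolist())  # -> [2, 4]
```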

Using @jezrael's sample df:

d = {'_id__$oid': ['542f8', '542f8', '542f8'], 'channel': ['snort_alert', 'snort_alert', 'snort_alert'], 'payload': ['{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}', '{"destination_ip":"172.31.14.66","date": "2014-10-19T01:32:36.669861","classification":"Potentially Bad Traffic","proto":"UDP","source_ip":"172.31.0.2","priority":"2","header":"1:2003195:5","signature":"ET POLICY Unusual number of DNS No Such Name Responses ","source_port":"53","destination_port":"34638","sensor":"5cda4a12-4730-11e4-9ee4-0a0b6e7c3e9e"}']} 
df = pd.DataFrame(d) 

Solution:

  • smash all the payload strings together with a vectorized str.cat
  • wrap the result in '[' and ']'
  • parse the whole thing at once with pd.read_json

cols = 'destination_ip proto source_ip date sensor'.split() 
df.drop(
    'payload', axis=1 
).join(
    pd.read_json('[{}]'.format(df.payload.str.cat(sep=',')))[cols] 
) 


Interesting, I thought your solution would be faster, but it isn't. – jezrael

Over 3 rows? Or more? – piRSquared

Check my answer. – jezrael