
Pandas: normalize within group

Let's say we have the following dataset:

import pandas as pd 

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48), 
     ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)] 

df = pd.DataFrame(data) 
df.columns = ['word', 'rel_word', 'weight'] 

df then looks like:

     word rel_word  weight 
0   apple      red     155 
1   apple    green     102 
2   apple   iphone      48 
3  tomato      red     175 
4  tomato  ketchup      96 
5  tomato      gun      12 

I want to recompute the weights so that they sum to 1.0 within each group (apple, tomato in the example) while preserving the relative weights (e.g. apple/red to apple/green should still be 155/102).
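For reference, the result this implies: the apple group totals 155 + 102 + 48 = 305, so its weights become 155/305 ≈ 0.5082, 102/305 ≈ 0.3344 and 48/305 ≈ 0.1574, and the ratio between red and green stays 155/102.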

Can you add the desired output? – jezrael

Please show the expected output in a separate column for better understanding – JKC

Answers


You can use groupby to compute each group's total weight, then apply a normalizing lambda function to each row:

# total weight per group, indexed by word (selecting the column 
# avoids trying to sum the string column rel_word) 
group_weights = df.groupby('word')['weight'].sum() 
df['normalized_weights'] = df.apply(lambda row: row['weight']/group_weights.loc[row['word']], axis=1) 

Output:

     word rel_word  weight  normalized_weights 
0   apple      red     155            0.508197 
1   apple    green     102            0.334426 
2   apple   iphone      48            0.157377 
3  tomato      red     175            0.618375 
4  tomato  ketchup      96            0.339223 
5  tomato      gun      12            0.042403 

Nice solution that wraps imperative programming into Pandas thinking. Thanks! –


Use transform, which is faster than apply plus a per-row lookup:

In [3849]: df['weight']/df.groupby('word')['weight'].transform('sum') 
Out[3849]: 
0 0.508197 
1 0.334426 
2 0.157377 
3 0.618375 
4 0.339223 
5 0.042403 
Name: weight, dtype: float64 

In [3850]: df['norm_w'] = df['weight']/df.groupby('word')['weight'].transform('sum') 

In [3851]: df 
Out[3851]: 
    word rel_word weight norm_w 
0 apple  red  155 0.508197 
1 apple green  102 0.334426 
2 apple iphone  48 0.157377 
3 tomato  red  175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun  12 0.042403 
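To see why this works: transform('sum') returns a Series aligned with the original index, so each row holds its own group's total (305 for apple, 283 for tomato) and the division is a single vectorized operation:

df.groupby('word')['weight'].transform('sum') 
# 0 305 
# 1 305 
# 2 305 
# 3 283 
# 4 283 
# 5 283 
# Name: weight, dtype: int64 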

Alternatively,

In [3852]: df.groupby('word')['weight'].transform(lambda x: x/x.sum()) 
Out[3852]: 
0 0.508197 
1 0.334426 
2 0.157377 
3 0.618375 
4 0.339223 
5 0.042403 
Name: weight, dtype: float64 

Timings

In [3862]: df.shape 
Out[3862]: (12000, 4) 

In [3864]: %timeit df['weight']/df.groupby('word')['weight'].transform('sum') 
100 loops, best of 3: 2.44 ms per loop 

In [3866]: %timeit df.groupby('word')['weight'].transform(lambda x: x/x.sum()) 
100 loops, best of 3: 5.16 ms per loop 

In [3868]: %%timeit 
     ...: group_weights = df.groupby('word').aggregate(sum) 
     ...: df.apply(lambda row: row['weight']/group_weights.loc[row['word']][0],axis=1) 
1 loop, best of 3: 2.5 s per loop 
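The gap is expected: apply invokes a Python lambda once per row, each call doing a label lookup, while transform('sum') aggregates in vectorized pandas/NumPy code.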

Looks like a smarter, more pandas-fu approach. Thanks! –


Use np.bincount & pd.factorize
This should be very fast and scalable:

import numpy as np 

# integer group labels f for each word; u holds the unique words 
f, u = pd.factorize(df.word.values) 
w = df.weight.values 

# per-label weight sums, broadcast back to the rows via [f] 
df.assign(norm_w=w/np.bincount(f, w)[f]) 

    word rel_word weight norm_w 
0 apple  red  155 0.508197 
1 apple green  102 0.334426 
2 apple iphone  48 0.157377 
3 tomato  red  175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun  12 0.042403
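To unpack the trick, here are the intermediate values for this dataset: pd.factorize encodes each word as an integer label, np.bincount(f, w) sums the weights per label, and indexing the totals with [f] broadcasts each group's sum back onto its rows:

f                       # array([0, 0, 0, 1, 1, 1])  (apple=0, tomato=1) 
np.bincount(f, w)       # array([ 305., 283.])  per-group totals 
np.bincount(f, w)[f]    # array([ 305., 305., 305., 283., 283., 283.]) 
w/np.bincount(f, w)[f]  # the norm_w column above 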