
Pandas: normalize within group

Let's say we have the following dataset:

import pandas as pd 

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48), 
     ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)] 

df = pd.DataFrame(data) 
df.columns = ['word', 'rel_word', 'weight'] 

df then looks like:

     word rel_word  weight 
0   apple      red     155 
1   apple    green     102 
2   apple   iphone      48 
3  tomato      red     175 
4  tomato  ketchup      96 
5  tomato      gun      12 

I want to recompute the weights so that they sum to 1.0 within each group (apple, tomato in the example) while preserving the relative weights (e.g. apple/red to apple/green should still be 155/102).
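For reference, the result this implies: the apple group totals 155 + 102 + 48 = 305, so its weights become 155/305 ≈ 0.5082, 102/305 ≈ 0.3344 and 48/305 ≈ 0.1574, and the ratio between red and green stays 155/102.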

Can you add the desired output? – jezrael

Please show the expected output in a separate column for better understanding – JKC

Answers


You can use groupby to compute each group's total weight, then apply a normalizing lambda function to each row:

# total weight per group, indexed by word (selecting the column 
# avoids trying to sum the string column rel_word) 
group_weights = df.groupby('word')['weight'].sum() 
df['normalized_weights'] = df.apply(lambda row: row['weight']/group_weights.loc[row['word']], axis=1) 

Output:

     word rel_word  weight  normalized_weights 
0   apple      red     155            0.508197 
1   apple    green     102            0.334426 
2   apple   iphone      48            0.157377 
3  tomato      red     175            0.618375 
4  tomato  ketchup      96            0.339223 
5  tomato      gun      12            0.042403 

Nice solution that wraps imperative programming into Pandas thinking. Thanks! –


Use transform, which is faster than apply plus a per-row lookup:

In [3849]: df['weight']/df.groupby('word')['weight'].transform('sum') 
Out[3849]: 
0 0.508197 
1 0.334426 
2 0.157377 
3 0.618375 
4 0.339223 
5 0.042403 
Name: weight, dtype: float64 

In [3850]: df['norm_w'] = df['weight']/df.groupby('word')['weight'].transform('sum') 

In [3851]: df 
Out[3851]: 
    word rel_word weight norm_w 
0 apple  red  155 0.508197 
1 apple green  102 0.334426 
2 apple iphone  48 0.157377 
3 tomato  red  175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun  12 0.042403 
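To see why this works: transform('sum') returns a Series aligned with the original index, so each row holds its own group's total (305 for apple, 283 for tomato) and the division is a single vectorized operation:

df.groupby('word')['weight'].transform('sum') 
# 0 305 
# 1 305 
# 2 305 
# 3 283 
# 4 283 
# 5 283 
# Name: weight, dtype: int64 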

Alternatively,

In [3852]: df.groupby('word')['weight'].transform(lambda x: x/x.sum()) 
Out[3852]: 
0 0.508197 
1 0.334426 
2 0.157377 
3 0.618375 
4 0.339223 
5 0.042403 
Name: weight, dtype: float64 

Timings

In [3862]: df.shape 
Out[3862]: (12000, 4) 

In [3864]: %timeit df['weight']/df.groupby('word')['weight'].transform('sum') 
100 loops, best of 3: 2.44 ms per loop 

In [3866]: %timeit df.groupby('word')['weight'].transform(lambda x: x/x.sum()) 
100 loops, best of 3: 5.16 ms per loop 

In [3868]: %%timeit 
     ...: group_weights = df.groupby('word').aggregate(sum) 
     ...: df.apply(lambda row: row['weight']/group_weights.loc[row['word']][0],axis=1) 
1 loop, best of 3: 2.5 s per loop 
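The gap is expected: apply invokes a Python lambda once per row, each call doing a label lookup, while transform('sum') aggregates in vectorized pandas/NumPy code.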

Looks like a smarter, more pandas-fu approach. Thanks! –


Use np.bincount & pd.factorize
This should be very fast and scalable:

import numpy as np 

# integer group labels f for each word; u holds the unique words 
f, u = pd.factorize(df.word.values) 
w = df.weight.values 

# per-label weight sums, broadcast back to the rows via [f] 
df.assign(norm_w=w/np.bincount(f, w)[f]) 

    word rel_word weight norm_w 
0 apple  red  155 0.508197 
1 apple green  102 0.334426 
2 apple iphone  48 0.157377 
3 tomato  red  175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun  12 0.042403
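To unpack the trick, here are the intermediate values for this dataset: pd.factorize encodes each word as an integer label, np.bincount(f, w) sums the weights per label, and indexing the totals with [f] broadcasts each group's sum back onto its rows:

f                       # array([0, 0, 0, 1, 1, 1])  (apple=0, tomato=1) 
np.bincount(f, w)       # array([ 305., 283.])  per-group totals 
np.bincount(f, w)[f]    # array([ 305., 305., 305., 283., 283., 283.]) 
w/np.bincount(f, w)[f]  # the norm_w column above 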