Python：逐行替换数据帧值

我想在pandas数据框中逐行条件替换值，以便max（row）将保留，而行中的所有其他值将被设置为None 。我的直觉走向apply()，但我不确定这是正确的选择，还是如何去做。Python：逐行替换数据帧值

实施例（但可能有多个列）：

tmp= pd.DataFrame({ 
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
}) 

tmp 
    A B 
0 1 3 
1 2 4 
2 3 1 
3 4 33 
4 5 10 
5 6 9 
6 7 7 
7 8 3 
8 9 10 
9 10 10

求购输出：

somemagic(tmp) 
    A  B 
0 None 3 
1 None 4 
2 3  None 
3 None 33 
4 None 10 
5 None 9 
6 7  None # on tie I don't really care which one is set to None 
7 8  None 
8 None 10 
9 10  None # on tie I don't really care which one is set to None

，关于如何实现这一任何建议？

来源

2016-05-30 Ruslan

您可以比较DataFrame与max值由eq：

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])

mask = (tmp.eq(tmp.max(axis=1), axis=0)) 
print (mask) 
     A  B 
0 False True 
1 False True 
2 True False 
3 False True 
4 False True 
5 False True 
6 True True 
7 True False 
8 False True 
9 True True 

df = (tmp[mask]) 
print (df) 
     A  B 
0 NaN 3.0 
1 NaN 4.0 
2 3.0 NaN 
3 NaN 33.0 
4 NaN 10.0 
5 NaN 9.0 
6 7.0 7.0 
7 8.0 NaN 
8 NaN 10.0 
9 10.0 10.0

，然后你可以添加NaN如果列中的值相等：

mask = (tmp.eq(tmp.max(axis=1), axis=0)) 
mask['B'] = mask.B & (tmp.A != tmp.B) 
print (mask) 
     A  B 
0 False True 
1 False True 
2 True False 
3 False True 
4 False True 
5 False True 
6 True False 
7 True False 
8 False True 
9 True False 

df = (tmp[mask]) 
print (df) 
     A  B 
0 NaN 3.0 
1 NaN 4.0 
2 3.0 NaN 
3 NaN 33.0 
4 NaN 10.0 
5 NaN 9.0 
6 7.0 NaN 
7 8.0 NaN 
8 NaN 10.0 
9 10.0 NaN

计时（len(df)=10）：

In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 
1000 loops, best of 3: 974 µs per loop 

In [235]: %timeit (gh(tmp)) 
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 1.64 ms per loop

（len(df)=100k）：

In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 
100 loops, best of 3: 7.42 ms per loop 

In [245]: %timeit (gh(t1)) 
1 loop, best of 3: 8.81 s per loop

代码时序：

import pandas as pd 

tmp= pd.DataFrame({ 
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
}) 


tmp = pd.concat([tmp]*10000).reset_index(drop=True) 
t1 = tmp.copy() 

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)]) 


def top(row): 
    data = row.tolist() 
    return [d if d == max(data) else None for d in data] 

def gh(tmp1): 
    return tmp1.apply(top, axis=1) 

print (gh(t1))

来源

2016-05-30 11:39:13 jezrael

我曾在我的脑海里完全一样：'TMP [tmp.eq（tmp.max（轴= 1），轴= 0）]':) –

谢谢你们！非常感激！ :) – Ruslan

我会建议你使用apply()。您可以如下使用它：

In [1]: import pandas as pd 

In [2]: tmp= pd.DataFrame({ 
    ...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)), 
    ...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10)) 
    ...: }) 

In [3]: tmp 
Out[3]: 
    A B 
0 1 3 
1 2 4 
2 3 1 
3 4 33 
4 5 10 
5 6 9 
6 7 7 
7 8 3 
8 9 10 
9 10 10 

In [4]: def top(row): 
    ...:   data = row.tolist() 
    ...:   return [d if d == max(data) else None for d in data] 
    ...: 

In [5]: df2 = tmp.apply(top, axis=1) 

In [6]: df2 
Out[6]: 
    A B 
0 NaN 3 
1 NaN 4 
2 3 NaN 
3 NaN 33 
4 NaN 10 
5 NaN 9 
6 7 7 
7 8 NaN 
8 NaN 10 
9 10 10

来源

2016-05-30 11:51:36

谢谢！这正是*我正在寻找的东西。 *和*可以在任意数量的列上工作:) – Ruslan

我认为使用'apply'对于矢量化解决方案来说比较慢，请参阅我的解决方案中的时序。 – jezrael

Python：逐行替换数据帧值

回答

相关问题