2013-02-25
3

How can I return a boolean DataFrame? I want np.nan to map to False and any positive real value to map to True.

import numpy as np 
import pandas as pd 
DF = pd.DataFrame({'a':[1,2,3,4,np.nan],'b':[np.nan,np.nan,np.nan,5,np.nan]}) 

DF.apply(bool) # Does not work 
DF.where(DF.isnull() == False) # Does not work 
DF[DF.isnull() == False] # Does not work 

Answers

2

Odd, but it looks like - np.isnan(df) outperforms pd.notnull(df) by a wide margin:

In [1]: import pandas as pd 

In [2]: import numpy as np 

In [3]: df = pd.DataFrame({'a':[1,2,3,4,np.nan],'b':[np.nan,np.nan,np.nan,5,np.nan]}) 


In [4]: - np.isnan(df) 
Out[4]: 
     a  b 
0 True False 
1 True False 
2 True False 
3 True True 
4 False False 

In [5]: %timeit - np.isnan(df) 
10000 loops, best of 3: 159 us per loop 

In [6]: %timeit pd.notnull(df) 
1000 loops, best of 3: 1.22 ms per loop 
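
As a side note on the negation used above: for boolean arrays and DataFrames, `~` is the usual logical-NOT operator, and recent NumPy versions reject unary `-` on plain boolean arrays. The following is a minimal sketch, assuming an all-float frame like the one in the session, that checks the two masks agree. The variable names are illustrative only, and no timings are claimed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                   'b': [np.nan, np.nan, np.nan, 5, np.nan]})

# ~ inverts a boolean DataFrame element-wise
mask_isnan = ~np.isnan(df)

# pd.notnull builds the same mask directly
mask_notnull = pd.notnull(df)

# On an all-float frame the two masks should be identical
print(mask_isnan.equals(mask_notnull))
```

Timings depend heavily on the pandas and NumPy versions, so the speed gap reported above is worth re-measuring with `%timeit` on your own setup.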
2

There is a convenience function for "not isnull", called notnull:

In [11]: pd.notnull(df) 
Out[11]: 
     a  b 
0 True False 
1 True False 
2 True False 
3 True True 
4 False False 
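
The question asks for True only on positive values, whereas `notnull` also returns True for zero and negative numbers. A minimal sketch of the difference follows, assuming a reasonably recent pandas where the `DataFrame.notna` method is available. The frame `other` is a made-up example, not data from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                   'b': [np.nan, np.nan, np.nan, 5, np.nan]})

# Method form of pd.notnull(df): True wherever a value is present
present = df.notna()

# Strictly "positive -> True"; comparisons against NaN evaluate to False
positive = df > 0

# Hypothetical frame with zero and a negative value to show where the two differ
other = pd.DataFrame({'x': [0.0, -1.0, 2.0, np.nan]})
other_present = other.notna()
other_positive = other > 0
```

For the frame in the question both masks coincide, because every non-missing value is positive. They diverge only once zeros or negatives appear, as in `other`.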
+1

+1 for pointing out 'notnull'. However, 'np.isnan(df)' seems to be about 8 times faster :S – root 2013-02-25 14:37:15

+0

@root Interesting! I suspect this is partly or mainly because 'notnull' supports more 'dtypes' than just 'float'? – 2013-02-25 14:42:36
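
The dtype point in the comment above can be illustrated with a small sketch. It assumes an object-dtype column of strings; the frame `mixed` and the flag `isnan_ok` are hypothetical names for this example. `np.isnan` is only defined for numeric input, while `pd.notnull` also handles object columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one float column and one object (string) column
mixed = pd.DataFrame({'a': [1, 2, np.nan],
                      'c': ['fish', 'bear', np.nan]})

# np.isnan has no implementation for object arrays, so a TypeError is expected
try:
    np.isnan(mixed['c'].to_numpy())
    isnan_ok = True
except TypeError:
    isnan_ok = False

# pd.notnull works across dtypes, treating NaN/None as missing
mask = pd.notnull(mixed)
print(isnan_ok)
print(mask)
```

This does not prove that dtype handling explains the timing gap, but it shows the extra generality that `notnull` provides.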

0

Comparing notnull() and isnan() on a somewhat malformed df:

df = pd.DataFrame({'a':[1,2,3,4,np.nan],'b':[np.nan,np.nan,np.nan,5,np.nan],'c':['fish','bear','cat','dog',np.nan]}) 

%%timeit 
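# df<="" relies on Python 2 ordering (numbers sort before strings), so string cells are masked to NaN 
legit_dexes = np.isnan(df[df<=""].astype(float)) == False 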

1000 loops, best of 3: 632 us per loop

%%timeit 
legit_dexes = pd.notnull(df) 

1000 loops, best of 3: 751 us per loop

This variation, which ignores the malformed column entirely, performs similarly:

%%timeit 
# x.values>="" uses the same Python 2 ordering to detect, and drop, columns that contain strings 
legit_dexes = np.isnan(df[df.columns[df.apply(lambda x: not np.any(x.values>=""))]]) == False 

1000 loops, best of 3: 681 us per loop
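
The string comparisons in the two snippets above depend on Python 2 semantics, and on Python 3 comparing numbers with `""` is expected to raise a TypeError. Below is a rough sketch of how the same two ideas might be written today, assuming `pd.to_numeric` with `errors='coerce'` and `DataFrame.select_dtypes` as stand-ins for the original tricks. These are substitutes, not what was timed above, so the figures above do not carry over.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                   'b': [np.nan, np.nan, np.nan, 5, np.nan],
                   'c': ['fish', 'bear', 'cat', 'dog', np.nan]})

# Variant 1: coerce every column to numeric; non-numeric strings become NaN
coerced = df.apply(pd.to_numeric, errors='coerce')
legit_all = ~np.isnan(coerced)

# Variant 2: ignore the malformed column by keeping numeric columns only
numeric_only = df.select_dtypes(include='number')
legit_numeric = ~np.isnan(numeric_only)

print(legit_all)
print(legit_numeric)
```

As with the original first snippet, variant 1 marks the string cells in column `c` as False, which differs from `pd.notnull(df)`, where those cells are True.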