How to select values in one pandas column such that a condition on another column holds everywhere the value occurs

The title is confusing.

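So, suppose I have a DataFrame with a column, id, whose values occur multiple times throughout the DataFrame. Then I have another column, let's call it cumulativeOccurrences.

How do I select all unique values of id for which the other column satisfies some condition, e.g. cumulativeOccurrences > 20, for every instance of that id?

The beginning of the code might look like this: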

dataframe.groupby('id')

But I can't figure out the rest.

Here is a small sample dataset that should return zero values:

id   cumulativeOccurrences 
5494178  136 
5494178  71 
5494178  18 
5494178  83 
5494178  57 
5494178  181 
5494178  13 
5494178  10 
5494178  90 
5494178  4484
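For reference, the sample above can be rebuilt as a DataFrame and checked directly. The following is only a sketch of one way to do it, using groupby(...).filter, which keeps the rows of every group for which the supplied function returns True. The variable names kept and ids are chosen here for illustration.

```python
import pandas as pd

# Sample data from the question: a single id with ten values.
df = pd.DataFrame({
    "id": [5494178] * 10,
    "cumulativeOccurrences": [136, 71, 18, 83, 57, 181, 13, 10, 90, 4484],
})

# Keep only the groups in which every value exceeds 20.
kept = df.groupby("id").filter(
    lambda g: (g["cumulativeOccurrences"] > 20).all()
)

# Unique ids that survive the filter; the values 18, 13 and 10 fail the
# test, so nothing is left for this sample.
ids = kept["id"].unique()
print(len(ids))  # 0
```

Because filter calls a Python function once per group, it is convenient for small data but is unlikely to be the fastest choice on a large frame.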

OK, here is what I ended up with after messing around some more:

res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]}) 
ids = res[res.cumulativeOccurrences['<lambda>']==True].index

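This gives me the list of IDs that satisfy the condition. However, there is probably a better way to write the agg function than a lambda with a list comprehension. Any ideas?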

Source

2017-10-28 Jeremy Schutte

Can you add a data sample and the desired output? – jezrael

Compare first to get a boolean Series, then group it by id and use SeriesGroupBy.all:

res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all() 
ids = res.index[res] 
print (ids) 
Int64Index([5494172], dtype='int64', name='id')
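To make the difference between "every row passes" and "at least one row passes" concrete, here is a small self-contained sketch. The data is hypothetical (ids 1 and 2 are not from the question), and the last line shows one possible way to get back the matching rows rather than only the ids.

```python
import pandas as pd

# Hypothetical data: id 1 has every value above 20, id 2 has one value below.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "cumulativeOccurrences": [25, 40, 300, 136, 18, 71],
})

mask = df["cumulativeOccurrences"] > 20   # one boolean per row
per_id = mask.groupby(df["id"])

all_ids = per_id.all()   # True only if every row of the id passes
any_ids = per_id.any()   # True if at least one row of the id passes

ids_all = all_ids.index[all_ids].tolist()
ids_any = any_ids.index[any_ids].tolist()
print(ids_all)  # [1]
print(ids_any)  # [1, 2]

# Rows (not just ids) that belong to ids passing the "all" test.
rows = df[per_id.transform("all")]
```

Here transform("all") broadcasts the per-id result back to the original row index, so it can be used directly as a boolean row selector.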

EDIT1:

Timings, first with unsorted id and then with sorted id.

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000

# 10 million rows, 1000 distinct ids in random (unsorted) order
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
        'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
        columns=['id','cumulativeOccurrences'])
print(df.head())

In [125]: %%timeit 
    ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all() 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 1.22 s per loop 

In [126]: %%timeit 
    ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]}) 
    ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index 
    ...: 
1 loop, best of 3: 3.69 s per loop 

In [128]: %%timeit 
    ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 3.63 s per loop

np.random.seed(123) 
N = 10000000 

df = pd.DataFrame({'id': np.random.randint(1000, size=N), 
        'cumulativeOccurrences':np.random.randint(19,5000,size=N)}, 
        columns=['id','cumulativeOccurrences']).sort_values('id').reset_index(drop=True) 
print (df.head())

In [130]: %%timeit 
    ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all() 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 795 ms per loop 

In [131]: %%timeit 
    ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]}) 
    ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index 
    ...: 
1 loop, best of 3: 3.23 s per loop 

In [132]: %%timeit 
    ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 3.15 s per loop

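Conclusion: sorting by id and having a unique index can improve performance. In both runs the vectorized comparison followed by groupby(...).all() was roughly three to four times faster than the lambda-based agg versions. The timings were measured under Python 3.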
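Two further formulations of the same test may be worth timing on real data. They are sketches on hypothetical data and were not part of the measurements above, so no speed claim is made for them. Both assume the column has no missing values, since NaN is treated differently by min and by the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only id 1 has every value above 20.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "cumulativeOccurrences": [25, 40, 136, 18, 20, 500],
})

# Option 1: every value exceeds 20 exactly when the group minimum does.
mins = df.groupby("id")["cumulativeOccurrences"].min()
ids_min = mins.index[mins > 20]

# Option 2: collect ids with at least one failing row, then exclude them.
bad = df.loc[df["cumulativeOccurrences"] <= 20, "id"].unique()
ids_excl = np.setdiff1d(df["id"].unique(), bad)

print(list(ids_min), list(ids_excl))
```

Option 2 avoids groupby entirely, which may help when only a small fraction of rows fail the condition, but that would have to be confirmed by measurement.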

Source

2017-10-28 17:08:42 jezrael

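This filter keeps the IDs that have at least one cumulativeOccurrences value above 20. I'm trying to keep only the IDs for which all cumulativeOccurrences values are above 20. –

Thanks for the data, I edited the answer. – jezrael

Hey, this works, I'll accept it as the answer, thanks. I'm wondering whether there is any way to do this faster, because it is very slow on my dataset (about 42 million). –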
