2016-06-09

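Pandas groupby on two text columns, returning the max rows based on counts

I'm trying to find the (First_Word, Group) pair with the highest count for each First_Word.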

import pandas as pd 

df = pd.DataFrame({'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'], 
      'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'], 
      'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice', 
       'apple fell out of the tree', 'partrige in a pear tree']}, 
      columns=['First_Word', 'Group', 'Text']) 

  First_Word         Group                        Text
0      apple    apple bins     where to buy apple bins
1      apple   apple trees         i see an apple tree
2     orange  orange juice         i like orange juice
3      apple   apple trees  apple fell out of the tree
4       pear     pear tree     partrige in a pear tree

Then I do a groupby:

grouped = df.groupby(['First_Word', 'Group']).count() 
                         Text
First_Word Group
apple      apple bins       1
           apple trees      2
orange     orange juice     1
pear       pear tree        1
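Now I would like to filter this down to only the index rows that have the maximum Text count within each First_Word. Below you'll notice that apple bins has been removed, because apple trees has the higher count.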

                         Text
First_Word Group
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

The question "max value of group" is similar, but when I try something like this:

df.groupby(["First_Word", "Group"]).count().apply(lambda t: t[t['Text']==t['Text'].max()]) 

I get an error: KeyError: ('Text', 'occurred at index Text'). If I add axis=1 to the apply call, I get IndexError: ('index out of bounds', 'occurred at index (apple, apple bins)').
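
As a possible explanation of both errors, here is a small sketch that records what DataFrame.apply actually passes to the function. With the default axis=0, the function receives each column as a Series indexed by (First_Word, Group), so t['Text'] is treated as an index label lookup rather than a column lookup. With axis=1 it receives one row at a time, so t['Text'] is a single number and the comparison produces a single boolean instead of a row mask. The names counts, record_column and record_row are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])

counts = df.groupby(['First_Word', 'Group']).count()

by_column = []  # what apply passes with the default axis=0
by_row = []     # what apply passes with axis=1


def record_column(t):
    # t is a whole column: a Series named 'Text' with a two-level index
    by_column.append((type(t).__name__, t.name, list(t.index.names)))
    return 0


def record_row(t):
    # t is a single row: a Series named after the row label, indexed by column name
    by_row.append((type(t).__name__, t.name, list(t.index)))
    return 0


counts.apply(record_column)
counts.apply(record_row, axis=1)

print(by_column)
print(by_row[:2])
```

In neither case does the lambda see the whole DataFrame, which is why a row filter written as t[t['Text'] == t['Text'].max()] cannot work inside this apply.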

Answer


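Given grouped, you now want to group by the First_Word index level and find the index label of the row with the maximum Text count in each group (using idxmax):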

In [39]: grouped.groupby(level='First_Word')['Text'].idxmax() 
Out[39]: 
First_Word 
apple  (apple, apple trees) 
orange (orange, orange juice) 
pear   (pear, pear tree) 
Name: Text, dtype: object 

You can then use grouped.loc to select the rows of grouped by those index labels:

import pandas as pd 
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'], 
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'], 
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice', 
       'apple fell out of the tree', 'partrige in a pear tree']}, 
    columns=['First_Word', 'Group', 'Text']) 

grouped = df.groupby(['First_Word', 'Group']).count() 
result = grouped.loc[grouped.groupby(level='First_Word')['Text'].idxmax()] 
print(result) 

which yields

                         Text
First_Word Group
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1
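
One caveat: idxmax returns only the first label when several Group values tie for the maximum within a First_Word. If every tied row should be kept, one possible variant (a sketch, not part of the answer above) compares each count with its group maximum using transform('max'). The extra 'apple bins for sale' row is hypothetical data added only to create a tie, and the names counts, group_max and result are illustrative.

```python
import pandas as pd

# Same data as above plus one hypothetical row so that
# 'apple bins' and 'apple trees' tie with a count of 2.
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear', 'apple'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees',
               'pear tree', 'apple bins'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree',
              'apple bins for sale']},
    columns=['First_Word', 'Group', 'Text'])

counts = df.groupby(['First_Word', 'Group']).count()

# Broadcast each First_Word's maximum count back onto every row of that group.
group_max = counts.groupby(level='First_Word')['Text'].transform('max')

# Keep every row whose count equals its group's maximum, ties included.
result = counts[counts['Text'] == group_max]
print(result)
```

With the original five rows this selects the same three rows as the idxmax approach; the two only differ when there is a tie.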