Pandas GroupBy on two text columns, returning the rows with the max count

I'm trying to find, for each First_Word, the (First_Word, Group) pair that occurs most often.

import pandas as pd 

df = pd.DataFrame({'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'], 
      'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'], 
      'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice', 
       'apple fell out of the tree', 'partrige in a pear tree']}, 
      columns=['First_Word', 'Group', 'Text']) 

   First_Word         Group                        Text
0       apple    apple bins     where to buy apple bins
1       apple   apple trees         i see an apple tree
2      orange  orange juice         i like orange juice
3       apple   apple trees  apple fell out of the tree
4        pear     pear tree     partrige in a pear tree

Then I did a groupby:

grouped = df.groupby(['First_Word', 'Group']).count()

                         Text
First_Word Group
apple      apple bins       1
           apple trees      2
orange     orange juice     1
pear       pear tree        1

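Now I want to filter this down to the rows that have the largest Text count within each First_Word. In the desired output below, apple bins has been removed because apple trees has the larger count.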

                         Text
First_Word Group
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

The "max value of group" question is similar, but when I try something like this:

df.groupby(["First_Word", "Group"]).count().apply(lambda t: t[t['Text']==t['Text'].max()])

I get an error: KeyError: ('Text', 'occurred at index Text'). If I add axis=1 to apply, I get IndexError: ('index out of bounds', 'occurred at index (apple, apple bins)').
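A likely reason for the KeyError, sketched below: with the default axis, DataFrame.apply calls the function once per column and passes that column as a Series, so inside the lambda t is the Text column itself and t['Text'] is treated as an index-label lookup. The names record and seen are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
             'apple fell out of the tree', 'partrige in a pear tree'],
})
grouped = df.groupby(['First_Word', 'Group']).count()

seen = []  # hypothetical helper list: what apply() hands to the function

def record(col):
    # Note the type and name of each object passed in, then return it unchanged.
    seen.append((type(col).__name__, col.name))
    return col

grouped.apply(record)  # default axis=0: one call per column
print(seen)            # each entry is a Series named 'Text', not a DataFrame

# The Series is indexed by (First_Word, Group) pairs, so there is no
# 'Text' label to look up, which matches the KeyError in the question.
try:
    grouped['Text']['Text']
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

With axis=1 the function receives one row at a time instead, so t['Text'] is a single count and cannot be compared against the other rows of its group.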

Source: Jarad, 2016-06-09

Given grouped, you now want to group by the First_Word index level and find the index label of the row with the largest count in each group (using idxmax):

In [39]: grouped.groupby(level='First_Word')['Text'].idxmax()
Out[39]:
First_Word
apple           (apple, apple trees)
orange        (orange, orange juice)
pear               (pear, pear tree)
Name: Text, dtype: object

You can then use grouped.loc to select the rows of grouped by those index labels:

import pandas as pd 
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'], 
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'], 
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice', 
       'apple fell out of the tree', 'partrige in a pear tree']}, 
    columns=['First_Word', 'Group', 'Text']) 

grouped = df.groupby(['First_Word', 'Group']).count() 
result = grouped.loc[grouped.groupby(level='First_Word')['Text'].idxmax()] 
print(result)

yields

                         Text
First_Word Group
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1
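One thing to keep in mind is that idxmax returns a single label per group, so if two Group values tie for the largest count under the same First_Word, only the first is kept. A possible variation, not part of the answer above, is to compare each count with its group maximum via transform('max') and filter with the resulting boolean mask, which keeps every tied row. The variable name is_max is chosen just for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
    'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
    'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
             'apple fell out of the tree', 'partrige in a pear tree'],
})

grouped = df.groupby(['First_Word', 'Group']).count()

# Broadcast each First_Word's largest count back onto every row of that group,
# then keep the rows whose own count equals that maximum.
is_max = grouped['Text'] == grouped.groupby(level='First_Word')['Text'].transform('max')
result = grouped[is_max]
print(result)
```

On the sample data this selects the same three rows as the idxmax approach, since there are no ties.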

Source: unutbu, 2016-06-09 21:52:25
