2016-08-25

List the most common members of a pandas group?

I have a DataFrame with columns like this:

    id        lead_sponsor                           lead_sponsor_class
    02837692  Janssen Research & Development, LLC    Industry
    02837679  Aarhus University Hospital             Other
    02837666  Universidad Autonoma de Ciudad Juarez  Other
    02837653  Universidad Autonoma de Madrid         Other
    02837640  Beirut Eye Specialist Hospital         Other

I want to find the most common lead sponsors. I can list the size of each group using:

df.groupby(['lead_sponsor', 'lead_sponsor_class']).size() 

which gives me this:

lead_sponsor        lead_sponsor_class 
307 Hospital of PLA      Other     1 
3E Therapeutics Corporation    Industry    1 
3M          Industry    4 
4SC AG         Industry    8 
5 Santé         Other     1 

But how do I find the top 10 most common groups? If I do:

df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().sort_values(ascending=False).head(10) 

then I get an error:

AttributeError: 'Series' object has no attribute 'sort_values'

Your solution works for me too. – jezrael
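Since the same call works for the commenter, one likely explanation for the `AttributeError` is an older pandas release: `Series.sort_values` was added in pandas 0.17.0. The sketch below is only a way to check that assumption; the small `counts` Series is made-up data reusing a few names from the question's output.

```python
import pandas as pd

# Series.sort_values exists from pandas 0.17.0 onward; on older releases
# the attribute is missing, which matches the error in the question.
print(pd.__version__)
has_sort_values = hasattr(pd.Series, "sort_values")
print(has_sort_values)

# Illustrative counts only, not the asker's real data.
counts = pd.Series(
    [1, 4, 8, 1],
    index=["307 Hospital of PLA", "3M", "4SC AG", "5 Santé"],
)

if has_sort_values:
    top = counts.sort_values(ascending=False).head(2)
else:
    # Fallback that does not rely on sort_values.
    top = counts.nlargest(2)

print(top)
```

If `has_sort_values` prints `False`, upgrading pandas or switching to `nlargest` should avoid the error.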

Answers

I think you can use Series.nlargest:

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().nlargest(10)) 

The docs note:

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Sample:

import pandas as pd 

df = pd.DataFrame({'id': {0: 2837692, 1: 2837679, 2: 2837666, 3: 2837653, 4: 2837640}, 
        'lead_sponsor': {0: 'a', 1: 'a', 2: 'a', 3: 's', 4: 's'}, 
        'lead_sponsor_class': {0: 'Industry', 1: 'Other', 2: 'Other', 3: 'Other', 4: 'Other'}}) 

print (df) 
     id lead_sponsor lead_sponsor_class 
0 2837692   a   Industry 
1 2837679   a    Other 
2 2837666   a    Other 
3 2837653   s    Other 
4 2837640   s    Other 

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size()) 
lead_sponsor lead_sponsor_class 
a    Industry    1 
       Other     2 
s    Other     2 
dtype: int64 

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().sort_values(ascending=False).head(2)) 
lead_sponsor lead_sponsor_class 
s    Other     2 
a    Other     2 
dtype: int64 

print (df.groupby(['lead_sponsor', 'lead_sponsor_class']).size().nlargest(2)) 
lead_sponsor lead_sponsor_class 
a    Other     2 
s    Other     2 
dtype: int64 
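The two outputs above list the tied groups in a different order. That is expected: both groups have size 2, `sort_values` with its default sort algorithm makes no promise about the order of ties, and `nlargest` resolves ties through its `keep` parameter (default `'first'`). A minimal sketch of the tie handling, using a flat index with hypothetical joined labels instead of the real MultiIndex:

```python
import pandas as pd

# Same counts as the sample, with the two group keys joined for brevity.
sizes = pd.Series([1, 2, 2], index=["a/Industry", "a/Other", "s/Other"])

# With a tie for the largest value, keep decides which row is returned.
first = sizes.nlargest(1, keep="first")  # prefers the earlier tied row
last = sizes.nlargest(1, keep="last")    # prefers the later tied row

print(first.index[0])
print(last.index[0])
```

So if the order among equally sized groups matters, it is safer to set `keep` explicitly than to rely on the sort.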

Yes! Thank you! – Richard

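Just so I understand this: is the result of calling `.size()` a Series? I think I was confused because it looks like a DataFrame rather than a Series (the way it prints two columns on the left). – Richard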

Yes, it is a `Series`. You can test it with `print(type(df.groupby(['lead_sponsor', 'lead_sponsor_class']).size()))`. – jezrael
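To expand on that: the two "columns" on the left are the levels of the Series' MultiIndex, not data columns. A small sketch using the sample data from the answer; the column name `count` is just a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "lead_sponsor": ["a", "a", "a", "s", "s"],
    "lead_sponsor_class": ["Industry", "Other", "Other", "Other", "Other"],
})

sizes = df.groupby(["lead_sponsor", "lead_sponsor_class"]).size()

print(type(sizes))        # the result is a Series
print(type(sizes.index))  # indexed by a MultiIndex
print(sizes.index.names)  # whose levels are the two group keys

# If a real DataFrame is preferred, move the index levels into columns.
top = sizes.nlargest(2).reset_index(name="count")
print(top)
```

After `reset_index`, the group keys become ordinary columns, so the result can be filtered or merged like any other DataFrame.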