2016-11-06 50 views
2

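Counting and sorting with pandas

I have a DataFrame built from a file. I group it by two columns, which returns an aggregated count. Now I want to sort by the largest count value, but I get the following error: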

KeyError: 'count'

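It looks like the count column produced by agg on the group is some kind of index, so I'm not sure how to do this. I'm a beginner with Python and pandas. Here is the actual code; please let me know if you need more details: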

def answer_five(): 
    df = census_df#.set_index(['STNAME']) 
    df = df[df['SUMLEV'] == 50] 
    df = df[['STNAME','CTYNAME']].groupby(['STNAME']).agg(['count']).sort(['count']) 
    #df.set_index(['count']) 
    print(df.index) 
    # get sorted count max item 
    return df.head(5) 

Answers

10

I think you need to add reset_index and then pass ascending=False to sort_values, because sort raises:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  .sort_values(['count'], ascending=False)

# Parentheses replace the backslash continuations, which break if followed by a space
df = (df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME']
          .count()
          .reset_index(name='count')
          .sort_values(['count'], ascending=False)
          .head(5))

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'), 
        'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]}) 

print (df) 
    CTYNAME STNAME 
0   4  a 
1   5  b 
2   6  s 
3   5  c 
4   6  s 
5   2  c 
6   3  b 
7   4  c 
8   5  d 
9   6  b 
10  4  c 
11  5  s 
12  4  s 
13  3  c 
14  6  a 
15  5  e 

# Parentheses replace the backslash continuations, which break if followed by a space
df = (df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME']
          .count()
          .reset_index(name='count')
          .sort_values(['count'], ascending=False)
          .head(5))

print (df) 
    STNAME count 
2  c  5 
5  s  4 
1  b  3 
0  a  2 
3  d  1 
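
As for why the original call raised KeyError: 'count', here is a minimal sketch, assuming the same toy data as the sample above. With agg(['count']) the result has two-level (MultiIndex) columns, so the label is the tuple ('CTYNAME', 'count') rather than the plain string 'count'. The names agg and top are only for illustration.

```python
import pandas as pd

# Same toy data as the sample above.
df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# Passing a list to agg() produces two-level column labels.
agg = df[['STNAME', 'CTYNAME']].groupby(['STNAME']).agg(['count'])
print(agg.columns.tolist())  # [('CTYNAME', 'count')]

# Sorting by the full (column, aggregation) tuple works,
# whereas sorting by the bare string 'count' raises KeyError.
top = agg.sort_values(('CTYNAME', 'count'), ascending=False).head(5)
print(top)
```

Calling .count() on the selected column and then reset_index(name='count'), as in the answer, avoids the two-level labels altogether.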

But it seems you need Series.nlargest:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].count().nlargest(5) 

Or:

df = df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME'].size().nlargest(5) 

The difference between size and count is:

size counts NaN values, count does not.
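
To make that difference concrete, here is a small sketch with made-up data (the frame below is hypothetical and not taken from the census file). Two of the CTYNAME values are missing, so the two methods disagree.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing county names.
df_nan = pd.DataFrame({
    'STNAME': ['a', 'a', 'b', 'b', 'b'],
    'CTYNAME': ['x', np.nan, 'y', 'z', np.nan],
})

grouped = df_nan.groupby('STNAME')['CTYNAME']

sizes = grouped.size()    # rows per group, NaN included: a=2, b=3
counts = grouped.count()  # non-NaN values per group:     a=1, b=2

print(sizes)
print(counts)
```

If the counted column never contains NaN, both calls give the same numbers.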

Sample:

df = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'), 
        'CTYNAME':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]}) 

print (df) 
    CTYNAME STNAME 
0   4  a 
1   5  b 
2   6  s 
3   5  c 
4   6  s 
5   2  c 
6   3  b 
7   4  c 
8   5  d 
9   6  b 
10  4  c 
11  5  s 
12  4  s 
13  3  c 
14  6  a 
15  5  e 

# Parentheses are required so the chained calls can span several lines
df = (df[['STNAME','CTYNAME']].groupby(['STNAME'])['CTYNAME']
          .size()
          .nlargest(5)
          .reset_index(name='top5'))
print (df) 
    STNAME top5 
0  c  5 
1  s  4 
2  b  3 
3  a  2 
4  d  1 

Great, thanks for explaining the various options. – Rubans

2

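I don't know exactly what your df looks like. But if you need to sort by the frequency counts of several categories, it is easy to slice a Series out of the df and sort that Series: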

series = df.count().sort_values(ascending=False) 
series.head() 

Note that this Series will use the category names as its index!
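
A related sketch, assuming the goal is the frequency of each value in a single column: Series.value_counts is a different technique from the df.count() call above, but it also returns a Series indexed by category and already sorted in descending order. The toy data and the name freq are assumptions for illustration.

```python
import pandas as pd

# Toy data: one row per (state, county) pair.
df = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                   'CTYNAME': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# Slice one column as a Series and count how often each value occurs.
# value_counts() sorts by frequency, largest first, by default.
freq = df['STNAME'].value_counts()

print(freq.head())
```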