2014-09-02 86 views
0

Pandas: multiple index levels with multiple aggregate functions

Given this sample DataFrame:

+------+--------+------+-------+------+--------+
| NAME | JOB    | YEAR | MONTH | DAYS | SALARY |
+------+--------+------+-------+------+--------+
| Bob  | Worker | 2013 |    12 |    3 |     17 |
| Mary | Employ | 2013 |    12 |    5 |     23 |
| Bob  | Worker | 2014 |     1 |   10 |    100 |
| Bob  | Worker | 2014 |     1 |   11 |    110 |
| Mary | Employ | 2014 |     1 |   15 |    200 |
| Bob  | Worker | 2014 |     2 |    8 |     80 |
| Mary | Employ | 2014 |     2 |    5 |    190 |
+------+--------+------+-------+------+--------+

Is there a simple way to get the following output without manually building every part of the pivot?

index=JOB,MAX(YEAR),NAME,SUM(DAYS) 
columns=MONTH 
values=SUM(SALARY) 

                                +-----------+-------------+-------------+
                                |   MONTH   |      1      |      2      |
    +--------+-----------+------+-----------+-------------+-------------+
    | JOB    | MAX(YEAR) | NAME | SUM(DAYS) | SUM(SALARY) | SUM(SALARY) |
    +--------+-----------+------+-----------+-------------+-------------+
    | Employ |      2014 | Mary |        20 |         200 |         190 |
    | Worker |      2014 | Bob  |        29 |         210 |          80 |
    +--------+-----------+------+-----------+-------------+-------------+
+1

Could you post the data in a copy-pasteable form? Say, as a dictionary? – Korem 2014-09-02 11:25:58

+0

[ {'NAME':'Bob','JOB':'Worker','YEAR':2013,'MONTH':12,'DAYS':3,'SALARY':17}, {'NAME':'Mary','JOB':'Employ','YEAR':2013,'MONTH':12,'DAYS':5,'SALARY':23}, {'NAME':'Bob','JOB':'Worker','YEAR':2014,'MONTH':1,'DAYS':10,'SALARY':100}, {'NAME':'Bob','JOB':'Worker','YEAR':2014,'MONTH':1,'DAYS':11,'SALARY':110}, {'NAME':'Mary','JOB':'Employ','YEAR':2014,'MONTH':1,'DAYS':15,'SALARY':200}, {'NAME':'Bob','JOB':'Worker','YEAR':2014,'MONTH':2,'DAYS':8,'SALARY':80}, {'NAME':'Mary','JOB':'Employ','YEAR':2014,'MONTH':2,'DAYS':5,'SALARY':190} ] – user3999503 2014-09-03 06:54:05

Answers

1

Starting from:

In [179]: df
Out[179]:
   NAME     JOB  YEAR  MONTH  DAYS  SALARY
0   Bob  Worker  2013     12     3      17
1  Mary  Employ  2013     12     5      23
2   Bob  Worker  2014      1    10     100
3   Bob  Worker  2014      1    11     110
4  Mary  Employ  2014      1    15     200
5   Bob  Worker  2014      2     8      80
6  Mary  Employ  2014      2     5     190
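
For anyone who wants to reproduce the frame above, here is a minimal sketch that builds it from the records given in the question. The variable name `records` is illustrative; only the column names and values come from the post.

```python
import pandas as pd

# Records copied from the sample table in the question.
records = [
    {'NAME': 'Bob',  'JOB': 'Worker', 'YEAR': 2013, 'MONTH': 12, 'DAYS': 3,  'SALARY': 17},
    {'NAME': 'Mary', 'JOB': 'Employ', 'YEAR': 2013, 'MONTH': 12, 'DAYS': 5,  'SALARY': 23},
    {'NAME': 'Bob',  'JOB': 'Worker', 'YEAR': 2014, 'MONTH': 1,  'DAYS': 10, 'SALARY': 100},
    {'NAME': 'Bob',  'JOB': 'Worker', 'YEAR': 2014, 'MONTH': 1,  'DAYS': 11, 'SALARY': 110},
    {'NAME': 'Mary', 'JOB': 'Employ', 'YEAR': 2014, 'MONTH': 1,  'DAYS': 15, 'SALARY': 200},
    {'NAME': 'Bob',  'JOB': 'Worker', 'YEAR': 2014, 'MONTH': 2,  'DAYS': 8,  'SALARY': 80},
    {'NAME': 'Mary', 'JOB': 'Employ', 'YEAR': 2014, 'MONTH': 2,  'DAYS': 5,  'SALARY': 190},
]

# Passing `columns` fixes the column order to match the table in the question.
df = pd.DataFrame(records, columns=['NAME', 'JOB', 'YEAR', 'MONTH', 'DAYS', 'SALARY'])
print(df)
```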

we can get most of the data we want with

result = df.groupby(['JOB', 'NAME', 'MONTH', 'YEAR']).sum().reset_index(['MONTH']) 

#                   MONTH  DAYS  SALARY
# JOB    NAME YEAR
# Employ Mary 2014      1    15     200
#             2014      2     5     190
#             2013     12     5      23
# Worker Bob  2014      1    21     210
#             2014      2     8      80
#             2013     12     3      17

To this, we add the sum of days:

total_days = df.groupby(['JOB', 'NAME', 'YEAR'])[['DAYS']].sum() 
total_days.columns = ['SUM(DAYS)'] 

#                   SUM(DAYS)
# JOB    NAME YEAR
# Employ Mary 2013          5
#             2014         20
# Worker Bob  2013          3
#             2014         29

result = result.join(total_days) 
del result['DAYS'] 
#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR
# Employ Mary 2013     12      23          5
#             2014      1     200         20
#             2014      2     190         20
# Worker Bob  2013     12      17          3
#             2014      1     210         29
#             2014      2      80         29
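
As a side note, several aggregates can be computed in one `groupby` call with named aggregation. This is only a sketch and assumes pandas 0.25 or later, which postdates this thread. The output name `SUM(SALARY)` and the variable `yearly` are illustrative. Because the output names contain parentheses, they are passed by unpacking a dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

# Named aggregation: each output column maps to (input column, function).
# Assumption: pandas >= 0.25.
yearly = df.groupby(['JOB', 'NAME', 'YEAR']).agg(**{
    'SUM(DAYS)': ('DAYS', 'sum'),
    'SUM(SALARY)': ('SALARY', 'sum'),
})
print(yearly)
```

This gives yearly totals only; the per-month salary columns still need the reshaping steps that follow.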

To select the rows associated with the maximum YEAR, we compute

max_year = df.groupby(['JOB', 'NAME'])[['YEAR']].max() 
max_year = max_year.set_index('YEAR', drop=False, append=True) 

#                   YEAR
# JOB    NAME YEAR
# Employ Mary 2014  2014
# Worker Bob  2014  2014

That selection can be expressed as a left join:

result = max_year.join(result) 
del result['YEAR'] 

#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR
# Employ Mary 2014      1     200         20
#             2014      2     190         20
# Worker Bob  2014      1     210         29
#             2014      2      80         29
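
The same left-join idea can be sketched on ordinary columns with `merge`, which avoids the index juggling. This is an alternative formulation, not the code used above; the names `max_year` and `latest` are illustrative, and it works on the raw rows rather than on the aggregated `result`.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

# Latest YEAR per (JOB, NAME), kept as ordinary columns rather than index levels.
max_year = df.groupby(['JOB', 'NAME'], as_index=False)['YEAR'].max()

# Left join from the latest-year keys: only rows of df whose
# (JOB, NAME, YEAR) matches one of those keys survive.
latest = max_year.merge(df, on=['JOB', 'NAME', 'YEAR'], how='left')
print(latest)
```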

Now we can move MONTH into a hierarchical column level like this:

result = result.set_index(['SUM(DAYS)', 'MONTH'], append=True) 
result = result.unstack('MONTH') 
result = result.reset_index(['SUM(DAYS)']) 

which yields

                 SUM(DAYS) SALARY
MONTH                           1    2
JOB    NAME YEAR
Employ Mary 2014        20    200  190
Worker Bob  2014        29    210   80
+0

Thanks, this is a good example. But this way I still have to build the pivot parts manually. Is there no way to do it directly with pivot or pivot_table? – user3999503 2014-09-03 06:58:01
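
One possible way to lean on `pivot_table` for the final reshaping is sketched below. It does not remove the two pre-computations, because `pivot_table` applies its aggregation to the values, not to the index columns. The sketch assumes that SUM(DAYS) means the days within each person's latest year, as in the answer above; the names `keys`, `latest` and `table` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

keys = ['JOB', 'NAME']

# Keep only the rows from each (JOB, NAME) group's latest YEAR.
latest = df[df['YEAR'] == df.groupby(keys)['YEAR'].transform('max')].copy()

# Attach the total DAYS of that year to every remaining row.
latest['SUM(DAYS)'] = latest.groupby(keys + ['YEAR'])['DAYS'].transform('sum')

# A single pivot_table call then sums SALARY per MONTH.
table = latest.pivot_table(
    index=['JOB', 'YEAR', 'NAME', 'SUM(DAYS)'],
    columns='MONTH',
    values='SALARY',
    aggfunc='sum',
)
print(table)
```

Since SUM(DAYS) is constant within each group after the `transform`, using it as an index level does not split any rows.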