2017-02-03 93 views

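Given the issues with groupby() and nlargest() described here and here, I am trying to work around them. The goal is to slice the original DF using the result of a groupby().nlargest(x) operation.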

Note: for simplicity I use nlargest(1), but it could be any number of selections.

{'city1': {0: 'Chicago', 
    1: 'Chicago', 
    2: 'Chicago', 
    3: 'Chicago', 
    4: 'Miami', 
    5: 'Houston', 
    6: 'Austin'}, 
'city2': {0: 'Toronto', 
    1: 'Detroit', 
    2: 'St.Louis', 
    3: 'Miami', 
    4: 'Dallas', 
    5: 'Dallas', 
    6: 'Dallas'}, 
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0}, 
'plant1_type': {0: 'COMBCYCL', 
    1: 'COMBCYCL', 
    2: 'NUKE', 
    3: 'COAL', 
    4: 'NUKE', 
    5: 'COMBCYCL', 
    6: 'COAL'}, 
'plant2_type': {0: 'COAL', 
    1: 'COAL', 
    2: 'COMBCYCL', 
    3: 'COMBCYCL', 
    4: 'COAL', 
    5: 'NUKE', 
    6: 'NUKE',}} 
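For anyone who wants to reproduce this, here is a minimal sketch of building the frame. It assumes the dictionary above is what df.to_dict() printed, so the same values are written as plain lists and the default 0..6 index stands in for the integer keys.

```python
import pandas as pd

# Same values as the dictionary above, written as lists; the default
# RangeIndex 0..6 matches the integer keys of the original dict.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
print(df)
```

Passing the original nested dictionary straight to pd.DataFrame(...) should give the same frame.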

A) Group by city1 and return the selected rows from the original DF

cols2 = ['city1','plant1_type','plant2_type'] 
df.loc[df2.groupby(cols2)['p234_r_c'].nlargest(1).reset_index().level_3] 

    city1 city2 p234_r_c plant1_type plant2_type 
6 Austin Dallas  3.0 COAL  NUKE 
3 Chicago Miami  0.5 COAL  COMBCYCL 
0 Chicago Toronto  5.0 COMBCYCL COAL 
2 Chicago St.Louis  2.0 NUKE  COMBCYCL 
5 Houston Dallas  4.0 COMBCYCL NUKE 
4 Miami Dallas  1.0 NUKE  COAL 

The above looks fine.

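B) Group by city2 and return the selected rows from the original DF

Since the same code used in #A produces bogus results when grouping by city2, the following workaround was suggested: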

cols = ['city2','plant1_type','plant2_type'] 
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1) 


city2  plant1_type plant2_type 
Toronto COMBCYCL  COAL   5.0 
Detroit COMBCYCL  COAL   4.0 
St.Louis NUKE   COMBCYCL  2.0 
Miami  COAL   COMBCYCL  0.5 
Dallas NUKE   COAL   1.0 
      COMBCYCL  NUKE   4.0 
      COAL   NUKE   3.0 

Now how do I use this result to return the selected rows from the original DF, as I did in #A?

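Note: had the original DF contained an additional row for city2, so that at least one group in the groupby.nlargest() result had a size larger than 1, then the code in #A could be used for #B.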

Answer


Unless I'm missing something (and I agree there are bugs lurking in the pandas code here), we can get around any difficulties relatively simply.

Method 1: use loc and idxmax

In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()] 
Out[21]: 
    city1  city2 p234_r_c plant1_type plant2_type 
6 Austin Dallas  3.0  COAL  NUKE 
3 Chicago  Miami  0.5  COAL COMBCYCL 
0 Chicago Toronto  5.0 COMBCYCL  COAL 
2 Chicago St.Louis  2.0  NUKE COMBCYCL 
5 Houston Dallas  4.0 COMBCYCL  NUKE 
4 Miami Dallas  1.0  NUKE  COAL 

In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()] 
Out[22]: 
    city1  city2 p234_r_c plant1_type plant2_type 
6 Austin Dallas  3.0  COAL  NUKE 
5 Houston Dallas  4.0 COMBCYCL  NUKE 
4 Miami Dallas  1.0  NUKE  COAL 
1 Chicago Detroit  4.0 COMBCYCL  COAL 
3 Chicago  Miami  0.5  COAL COMBCYCL 
2 Chicago St.Louis  2.0  NUKE COMBCYCL 
0 Chicago Toronto  5.0 COMBCYCL  COAL 
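To make the mechanics of Method 1 explicit, here is a small self-contained sketch. The frame is rebuilt from the question's data, and the variable names labels_city1 and labels_city2 are my own. groupby(...).idxmax() returns one original-index label per group, which is why the result can be fed straight to df.loc.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]
cols = ["city2", "plant1_type", "plant2_type"]

# One label per group: the original index of the row holding the group's maximum.
labels_city1 = df.groupby(cols2)["p234_r_c"].idxmax()
labels_city2 = df.groupby(cols)["p234_r_c"].idxmax()

# The labels select whole rows of the original frame, in group-key order.
print(df.loc[labels_city1])
print(df.loc[labels_city2])
```

Note that idxmax picks a single row per group; on ties it returns the first occurrence of the maximum.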

Method 2: sort by p234_r_c and use last

In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last() 
Out[17]: 
    city1 plant1_type plant2_type  city2 p234_r_c 
0 Austin  COAL  NUKE Dallas  3.0 
1 Chicago  COAL COMBCYCL  Miami  0.5 
2 Chicago COMBCYCL  COAL Toronto  5.0 
3 Chicago  NUKE COMBCYCL St.Louis  2.0 
4 Houston COMBCYCL  NUKE Dallas  4.0 
5 Miami  NUKE  COAL Dallas  1.0 

In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last() 
Out[18]: 
     city2 plant1_type plant2_type city1 p234_r_c 
0 Dallas  COAL  NUKE Austin  3.0 
1 Dallas COMBCYCL  NUKE Houston  4.0 
2 Dallas  NUKE  COAL Miami  1.0 
3 Detroit COMBCYCL  COAL Chicago  4.0 
4  Miami  COAL COMBCYCL Chicago  0.5 
5 St.Louis  NUKE COMBCYCL Chicago  2.0 
6 Toronto COMBCYCL  COAL Chicago  5.0 
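One caveat with Method 2, which does not affect the data in the question because it has no missing values: last() takes the last non-null value of each column separately, so with NaNs present it can combine values from different rows. The toy frame below is hypothetical and only illustrates that difference; tail(1) keeps the row intact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the row with the largest "v" has a missing "x".
toy = pd.DataFrame({"g": ["a", "a"], "v": [1.0, 2.0], "x": [10.0, np.nan]})
ordered = toy.sort_values("v")

# last() works column by column and skips nulls, so "x" comes from the other row.
via_last = ordered.groupby("g", as_index=False).last()

# tail(1) returns the last row of each group as it is, original index included.
via_tail = ordered.groupby("g").tail(1)

print(via_last)
print(via_tail)
```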

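If you want to be able to get multiple rows per group as well, then while nlargest and nsmallest are both broken, I think the simplest approach is to sort and then use head or tail. For example: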

In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2) 
Out[27]: 
    city1  city2 p234_r_c plant1_type plant2_type 
3 Chicago  Miami  0.5  COAL COMBCYCL 
4 Miami Dallas  1.0  NUKE  COAL 
2 Chicago St.Louis  2.0  NUKE COMBCYCL 
6 Austin Dallas  3.0  COAL  NUKE 
1 Chicago Detroit  4.0 COMBCYCL  COAL 
5 Houston Dallas  4.0 COMBCYCL  NUKE 
0 Chicago Toronto  5.0 COMBCYCL  COAL 
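The same sort-then-tail idea works with a coarser grouping. As a sketch, assuming we group by city1 alone and want the top n = 2 values per city (the names n and top_n are mine):

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

n = 2
# Sort ascending, then keep the last n rows of each city1 group.
# tail() keeps the original index, so these are rows of df itself.
top_n = df.sort_values("p234_r_c").groupby("city1").tail(n)
print(top_n)
```

Chicago contributes two rows here (5.0 and 4.0), while cities with a single row contribute just that row. If values tie within a group, which tied rows survive depends on the sort; passing kind="stable" to sort_values makes that deterministic.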

Say I use Method #1 with a groupby on only cols = ['city1'] and want the largest 2 (or N) values of p234_r_c. I tried the following with N = 2 and the result is the same as with N = 1: df.loc[df.groupby(cols2)["p234_r_c"].idxmax(2)] – codingknob

For N = 2 we should have 2 rows for Chicago, i.e. the following row is missing: Chicago  Detroit  4.0  COMBCYCL  COAL – codingknob

@codingknob: idxmax doesn't have an n parameter, so if there is documentation somewhere implying that it does, please file a bug, because we need to fix it. :-( – DSM