2017-04-18 69 views
2

Get combinations of elements from different pandas rows

Suppose I have a DataFrame like this:

Date Artist   percent_gray percent_blue percent_black percent_red 
33 Leonardo    22   33   36   46 
45 Leonardo    23   47   23   14 
46 Leonardo    13   34   33   12 
23 Michelangelo   28   19   38   25 
25 Michelangelo   24   56   55   13 
26 Michelangelo   21   22   45   13 
13 Titian    24   17   23   22 
16 Titian    45   43   44   13 
19 Titian    17   45   56   13 
24 Raphael    34   34   34   45 
27 Raphael    31   22   25   67 
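For anyone who wants to reproduce this, the table above can be loaded roughly as follows. This is only a sketch: reading from an inline string through `io.StringIO` is an assumption made for illustration, and the real data would presumably come from a file.

```python
import io

import pandas as pd

# The sample table from the question, pasted as whitespace-separated text
raw = """Date Artist percent_gray percent_blue percent_black percent_red
33 Leonardo 22 33 36 46
45 Leonardo 23 47 23 14
46 Leonardo 13 34 33 12
23 Michelangelo 28 19 38 25
25 Michelangelo 24 56 55 13
26 Michelangelo 21 22 45 13
13 Titian 24 17 23 22
16 Titian 45 43 44 13
19 Titian 17 45 56 13
24 Raphael 34 34 34 45
27 Raphael 31 22 25 67
"""

# Split on runs of whitespace, since the columns are not aligned consistently
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```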

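I want the maximum color difference between different pictures by the same artist. I can also compare across colors, for example percent_gray with percent_blue. For Leonardo, the largest difference is percent_red (date: 46) - percent_blue (date: 45) = 12 - 47 = -35. I want to see how this evolves over time, so I only want to compare an artist's newer pictures with older ones (here, I can compare the third picture with the first and second, and the second picture only with the first) and get the largest difference. So the DataFrame should look like: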

Date Artist   max_d 
33 Leonardo   NaN 
45 Leonardo   -32 
46 Leonardo   -35  
23 Michelangelo  NaN 
25 Michelangelo  37 
26 Michelangelo  -43 
13 Titian   NaN 
16 Titian   28 
19 Titian   43 
24 Raphael   NaN 
27 Raphael   33 

I think I have to use groupby, but I can't manage to get the output I want.
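To make the expected output checkable, here is a brute-force sketch of the rule as described: for each picture, take every (newer value - older value) pair across all colors and all older pictures by the same artist, and keep the one with the largest absolute value. The function name `brute_force_max_d` and its parameters are hypothetical, and only two artists from the sample are used to keep it short.

```python
import itertools

import pandas as pd


def brute_force_max_d(df, color_cols, group_col="Artist", order_col="Date"):
    """Signed difference with the largest absolute value between any color of a
    picture and any color of any older picture by the same artist."""
    result = pd.Series(float("nan"), index=df.index, name="max_d")
    for _, group in df.groupby(group_col):
        older_values = []  # every color value seen so far for this artist
        for idx, row in group.sort_values(order_col).iterrows():
            new_values = [row[c] for c in color_cols]
            if older_values:  # an artist's first picture stays NaN
                diffs = [n - o for n, o in itertools.product(new_values, older_values)]
                result.loc[idx] = max(diffs, key=abs)
            older_values.extend(new_values)
    return result


df = pd.DataFrame({
    "Date": [33, 45, 46, 13, 16, 19],
    "Artist": ["Leonardo"] * 3 + ["Titian"] * 3,
    "percent_gray": [22, 23, 13, 24, 45, 17],
    "percent_blue": [33, 47, 34, 17, 43, 45],
    "percent_black": [36, 23, 33, 23, 44, 56],
    "percent_red": [46, 14, 12, 22, 13, 13],
})
color_cols = [c for c in df.columns if c.startswith("percent_")]
max_d = brute_force_max_d(df, color_cols)
print(df.assign(max_d=max_d)[["Date", "Artist", "max_d"]])
```

This loops over every pair, so it is only meant for checking small samples, not for a large dataset.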

+0

Can you explain more? Why isn't the Titian value '-43', between max '56' and min '13'? Why is the first value 'NaN'? How do you get '33'? Thanks. – jezrael

+0

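Oh, sorry, for Titian it is -43; I just did it by hand. The first value is NaN because those are the first pictures they painted, and I only want to compare against older pictures –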

+0

OK, how do you get '-34,37,27,33'? – jezrael

Answers

2

You can use:

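import numpy as np

#with real data, sort first so each artist's pictures are in date order
df = df.sort_values(['Artist', 'Date'])
#smallest and largest color value in each picture (columns 2 onward are the colors)
mi = df.iloc[:,2:].min(axis=1)
ma = df.iloc[:,2:].max(axis=1)
#same values for the artist's previous picture (NaN for the first one)
ma1 = ma.groupby(df['Artist']).shift()
mi1 = mi.groupby(df['Artist']).shift()
#two candidates: current min - previous max, current max - previous min
mad1 = mi - ma1
mad2 = ma - mi1
#keep whichever candidate has the larger absolute value
df['max_d'] = np.where(mad1.abs() > mad2.abs(), mad1, mad2)
print (df)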
    Date  Artist percent_gray percent_blue percent_black \ 
0  33  Leonardo   22   33    36 
1  45  Leonardo   23   47    23 
2  46  Leonardo   13   34    33 
3  23 Michelangelo   28   19    38 
4  25 Michelangelo   24   56    55 
5  26 Michelangelo   21   22    45 
6  13  Titian   24   17    23 
7  16  Titian   45   43    44 
8  19  Titian   17   45    56 
9  24  Raphael   34   34    34 
10 27  Raphael   31   22    25 

    percent_red max_d 
0   46 NaN 
1   14 -32.0 
2   12 -35.0 
3   25 NaN 
4   13 37.0 
5   13 -43.0 
6   22 NaN 
7   13 28.0 
8   13 43.0 
9   45 NaN 
10   67 33.0 

Explanation (with new columns):

#get min and max per row, using only the four color columns
#(so helper columns added below, or an existing max_d, are not included)
df['min'] = df.iloc[:,2:6].min(axis=1)
df['max'] = df.iloc[:,2:6].max(axis=1)
#get the previous picture's min and max within each Artist
df['max1'] = df.groupby('Artist')['max'].shift()
df['min1'] = df.groupby('Artist')['min'].shift()
#get differences
df['max_d1'] = df['min'] - df['max1']
df['max_d2'] = df['max'] - df['min1']
#pick the difference with the larger absolute value
df['max_d'] = np.where(df['max_d1'].abs() > df['max_d2'].abs(), df['max_d1'], df['max_d2'])
print (df)
    percent_red min max max1 min1 max_d1 max_d2 max_d 
0   46 22 46 NaN NaN  NaN  NaN NaN 
1   14 14 47 46.0 22.0 -32.0 25.0 -32.0 
2   12 12 34 47.0 14.0 -35.0 20.0 -35.0 
3   25 19 38 NaN NaN  NaN  NaN NaN 
4   13 13 56 38.0 19.0 -25.0 37.0 37.0 
5   13 13 45 56.0 13.0 -43.0 32.0 -43.0 
6   22 17 24 NaN NaN  NaN  NaN NaN 
7   13 13 45 24.0 17.0 -11.0 28.0 28.0 
8   13 13 56 45.0 13.0 -32.0 43.0 43.0 
9   45 34 45 NaN NaN  NaN  NaN NaN 
10   67 22 67 45.0 34.0 -23.0 33.0 33.0 

And if you use the second solution, the one from the explanation, remove the helper columns afterwards:

df = df.drop(['min','max','max1','min1','max_d1', 'max_d2'], axis=1) 
print (df) 
    Date  Artist percent_gray percent_blue percent_black \ 
0  33  Leonardo   22   33    36 
1  45  Leonardo   23   47    23 
2  46  Leonardo   13   34    33 
3  23 Michelangelo   28   19    38 
4  25 Michelangelo   24   56    55 
5  26 Michelangelo   21   22    45 
6  13  Titian   24   17    23 
7  16  Titian   45   43    44 
8  19  Titian   17   45    56 
9  24  Raphael   34   34    34 
10 27  Raphael   31   22    25 

    percent_red max_d 
0   46 NaN 
1   14 -32.0 
2   12 -35.0 
3   25 NaN 
4   13 37.0 
5   13 -43.0 
6   22 NaN 
7   13 28.0 
8   13 43.0 
9   45 NaN 
10   67 33.0 
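One caveat: `shift()` only looks at the immediately preceding picture, while the question asks to compare each picture with all older ones. On the sample data both give the same numbers, but that is not guaranteed in general. A possible variant, sketched here under the assumption that a running maximum and minimum over earlier pictures is what is wanted, uses `cummax()`/`cummin()` before shifting. The function name `max_d_vs_all_older` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def max_d_vs_all_older(df, color_cols, group_col="Artist", order_col="Date"):
    """Like the shift-based solution, but compares against every older picture."""
    out = df.sort_values([group_col, order_col]).copy()
    row_min = out[color_cols].min(axis=1)
    row_max = out[color_cols].max(axis=1)
    key = out[group_col]
    # Running extremes within each artist, shifted so the current row is excluded
    prev_max = row_max.groupby(key).cummax().groupby(key).shift()
    prev_min = row_min.groupby(key).cummin().groupby(key).shift()
    d1 = row_min - prev_max  # most negative possible difference
    d2 = row_max - prev_min  # most positive possible difference
    out["max_d"] = np.where(d1.abs() > d2.abs(), d1, d2)
    return out


df = pd.DataFrame({
    "Date": [33, 45, 46, 23, 25, 26],
    "Artist": ["Leonardo"] * 3 + ["Michelangelo"] * 3,
    "percent_gray": [22, 23, 13, 28, 24, 21],
    "percent_blue": [33, 47, 34, 19, 56, 22],
    "percent_black": [36, 23, 33, 38, 55, 45],
    "percent_red": [46, 14, 12, 25, 13, 13],
})
color_cols = ["percent_gray", "percent_blue", "percent_black", "percent_red"]
result = max_d_vs_all_older(df, color_cols)
print(result[["Date", "Artist", "max_d"]])
```

Everything stays vectorized, so it should scale about as well as the shift-based version.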
+0

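If I sort by artist and date, will that automatically give the right result? I will run the code, but since the dataset is large it will take a long time. –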

+0

Yes, sorting first is necessary. – jezrael

1

How about a custom apply function? Does this work?

import itertools
import pandas

# Whitespace-separated file holding the table from the question
p = pandas.read_csv('Artits.tsv', sep=r'\s+')

def max_any_color(cols):
    # Collect each color's values across all pictures of one artist
    grey = cols['percent_gray'].tolist()
    blue = cols['percent_blue'].tolist()
    black = cols['percent_black'].tolist()
    red = cols['percent_red'].tolist()

    # Largest absolute difference for each pair of colors, over all picture pairs
    gb = max([abs(a[0] - a[1]) for a in itertools.product(grey,blue)])
    gblack = max([abs(a[0] - a[1]) for a in itertools.product(grey,black)])
    gr = max([abs(a[0] - a[1]) for a in itertools.product(grey,red)])
    bb = max([abs(a[0] - a[1]) for a in itertools.product(blue,black)])
    br = max([abs(a[0] - a[1]) for a in itertools.product(blue,red)])
    blackr = max([abs(a[0] - a[1]) for a in itertools.product(black,red)])

    l = [gb,gblack,gr,bb,br,blackr]
    c = ['grey/blue','grey/black','grey/red','blue/black','blue/red','black/red']
    # Return the color pair with the largest difference, and that difference
    max_ = max(l)
    between_colors_index = l.index(max_)
    return c[between_colors_index], max_

p.groupby('Artist').apply(max_any_color)

Output:

Leonardo   (blue/red, 35) 
Michelangelo  (blue/red, 43) 
Raphael   (blue/red, 45) 
Titian   (black/red, 43) 
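The same idea can be written without naming each color pair by hand. The following is a sketch only: `max_cross_color_diff` is a hypothetical helper, it labels pairs with the full column names rather than the short names above, and like the function above it compares different colors across all of an artist's pictures without regard to date order.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def max_cross_color_diff(group, color_cols):
    """Return (pair label, largest absolute difference) over all color pairs."""
    best_label, best_value = None, float("-inf")
    for a, b in combinations(color_cols, 2):
        # |a_i - b_j| for every pair of pictures (i, j), via broadcasting
        diffs = np.abs(group[a].to_numpy()[:, None] - group[b].to_numpy()[None, :])
        value = diffs.max()
        if value > best_value:  # strict '>' keeps the first pair on ties
            best_label, best_value = f"{a}/{b}", value
    return best_label, best_value


df = pd.DataFrame({
    "Date": [33, 45, 46, 13, 16, 19],
    "Artist": ["Leonardo"] * 3 + ["Titian"] * 3,
    "percent_gray": [22, 23, 13, 24, 45, 17],
    "percent_blue": [33, 47, 34, 17, 43, 45],
    "percent_black": [36, 23, 33, 23, 44, 56],
    "percent_red": [46, 14, 12, 22, 13, 13],
})
color_cols = ["percent_gray", "percent_blue", "percent_black", "percent_red"]
results = {artist: max_cross_color_diff(g, color_cols) for artist, g in df.groupby("Artist")}
print(results)
```

Iterating over the groups directly, rather than using `groupby(...).apply`, keeps the return value a plain dict of tuples.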
+0

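Actually, in my real dataset they aren't really numbers; I only wrote numbers for simplicity, so you won't know which difference is the largest before computing the differences –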

+0

The function finds the maximum, whatever it is – jwillis0720