2017-09-16 93 views
2

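Pandas: select columns from a second DataFrame where another column's value exists in the main DataFrame

I'm struggling with a rather specific problem. I have two pandas DataFrames of different lengths and with different indexes. For each item in df1, I want to look in df2 and pull some columns (not present in df1) from the rows where one df2 column equals a column in df1. For example: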

import pandas as pd 

data_1 = {'TARGET_NAME':['fishinghook', 'doorlock', 'penguin', 'ashtray', 'cat', 'elephant', 'cupcake', 'exercisebench'], 
      'FOOBAR':['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'], 
      'ix':[320, 321, 322, 323, 324, 325, 326, 328]} 

data_2 = {'IMAGE_NAME':['cat', 'penguin', 'jewelrybox', 'exercisebench', 'doorlock', 'jar', ], 
      'VALUES_1':['h', 'h', 'c', 'm', 'h', 'f'], 
      'VALUES_2':['hm', 'hl', 'cm', 'ml', 'hh', 'fl'], 
      'ix':[616, 617, 618, 619, 620, 621]} 

desired = {'TARGET_NAME':['fishinghook', 'doorlock', 'penguin', 'ashtray', 'cat', 'elephant', 'cupcake', 'exercisebench'], 
      'FOOBAR':['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'], 
      'PRODUCED_VALUES_1':['DROPPED', 'h', 'h', 'DROPPED', 'h', 'DROPPED', 'DROPPED', 'm'], 
      'ix':[320, 321, 322, 323, 324, 325, 326, 328]} 

df1 = pd.DataFrame(data_1, index=data_1['ix']) 
df2 = pd.DataFrame(data_2, index=data_2['ix']) 
desired_df = pd.DataFrame(desired, index=desired['ix']) 

df1 
Out[2]: 
     FOOBAR    TARGET_NAME   ix
320     foo    fishinghook  320
321     bar       doorlock  321
322     foo        penguin  322
323     bar        ashtray  323
324     foo            cat  324
325     bar       elephant  325
326     foo        cupcake  326
328     bar  exercisebench  328

df2 
Out[3]: 
        IMAGE_NAME  VALUES_1  VALUES_2   ix
616            cat         h        hm  616
617        penguin         h        hl  617
618     jewelrybox         c        cm  618
619  exercisebench         m        ml  619
620       doorlock         h        hh  620
621            jar         f        fl  621

desired_df 
Out[4]: 
     FOOBAR  PRODUCED_VALUES_1    TARGET_NAME   ix
320     foo            DROPPED    fishinghook  320
321     bar                  h       doorlock  321
322     foo                  h        penguin  322
323     bar            DROPPED        ashtray  323
324     foo                  h            cat  324
325     bar            DROPPED       elephant  325
326     foo            DROPPED        cupcake  326
328     bar                  m  exercisebench  328

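I want to look at each value in df1['TARGET_NAME'] and, where it equals a value in df2['IMAGE_NAME'], take the VALUES_1 and VALUES_2 columns from df2 and add those details to df1 (or to a copy of df1). If a value has no match anywhere in df2 (the row positions all differ between the two frames), I'd like something else written instead (e.g. 'DROPPED'). Ideally, the df1 index should stay unchanged.

Any help is appreciated!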

Answers

3

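You can merge the data after renaming the key column, then rename the columns to the names you want, fill PRODUCED_VALUES_1 with 'Dropped', and drop the remaining NaN rows. Finally, set the index back to df1's index.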

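# Align the key name, then outer-merge; df2-only rows get NaN in df1's columns
ndf = df1.merge(df2.rename(columns={'IMAGE_NAME': 'TARGET_NAME'}), how='outer', on='TARGET_NAME')
# Drop the unneeded columns (axis passed by keyword) and rename to the target names
ndf = ndf.drop(['ix_y', 'VALUES_2'], axis=1).rename(columns={'ix_x': 'ix', 'VALUES_1': 'PRODUCED_VALUES_1'})

ndf['PRODUCED_VALUES_1'] = ndf['PRODUCED_VALUES_1'].fillna('Dropped')
# dropna() removes the df2-only rows; set_index assumes the rest are still in df1's order
ndf = ndf.dropna().set_index(df1.index)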
 
     FOOBAR    TARGET_NAME     ix  PRODUCED_VALUES_1
320     foo    fishinghook  320.0            Dropped
321     bar       doorlock  321.0                  h
322     foo        penguin  322.0                  h
323     bar        ashtray  323.0            Dropped
324     foo            cat  324.0                  h
325     bar       elephant  325.0            Dropped
326     foo        cupcake  326.0            Dropped
328     bar  exercisebench  328.0                  m
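
If the float ix column and the dropna() step are unwanted, a left merge is one possible variant. This is only a sketch: it assumes IMAGE_NAME is unique in df2 (validate='many_to_one' asks pandas to raise otherwise), and the names lookup and left_df are made up for illustration.

```python
import pandas as pd

ix1 = [320, 321, 322, 323, 324, 325, 326, 328]
ix2 = [616, 617, 618, 619, 620, 621]
df1 = pd.DataFrame(
    {'TARGET_NAME': ['fishinghook', 'doorlock', 'penguin', 'ashtray',
                     'cat', 'elephant', 'cupcake', 'exercisebench'],
     'FOOBAR': ['foo', 'bar'] * 4,
     'ix': ix1},
    index=ix1)
df2 = pd.DataFrame(
    {'IMAGE_NAME': ['cat', 'penguin', 'jewelrybox', 'exercisebench',
                    'doorlock', 'jar'],
     'VALUES_1': ['h', 'h', 'c', 'm', 'h', 'f'],
     'VALUES_2': ['hm', 'hl', 'cm', 'ml', 'hh', 'fl'],
     'ix': ix2},
    index=ix2)

# Keep only the key and the wanted column, already under their final names
# ("lookup" is an illustrative name, not from the original answer).
lookup = df2[['IMAGE_NAME', 'VALUES_1']].rename(
    columns={'IMAGE_NAME': 'TARGET_NAME', 'VALUES_1': 'PRODUCED_VALUES_1'})

# how='left' keeps df1's rows in df1's order, so no df2-only rows appear
# and ix is never filled with NaN (it stays an integer column).
left_df = df1.merge(lookup, how='left', on='TARGET_NAME',
                    validate='many_to_one')
left_df['PRODUCED_VALUES_1'] = left_df['PRODUCED_VALUES_1'].fillna('Dropped')

# merge() returns a fresh 0..n-1 index, so restore df1's index afterwards.
left_df.index = df1.index
```

Because the right-hand frame carries no ix column here, there are no ix_x / ix_y suffixes to clean up either.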
+0

Perfect! Thank you. – fffrost

1
In [34]: df1['PRODUCED_VALUES_1'] = \ 
      df1['TARGET_NAME'].map(df2.set_index('IMAGE_NAME')['VALUES_1']) \ 
           .fillna('DROPPED') 

In [35]: df1 
Out[35]: 
     FOOBAR    TARGET_NAME   ix  PRODUCED_VALUES_1
320     foo    fishinghook  320            DROPPED
321     bar       doorlock  321                  h
322     foo        penguin  322                  h
323     bar        ashtray  323            DROPPED
324     foo            cat  324                  h
325     bar       elephant  325            DROPPED
326     foo        cupcake  326            DROPPED
328     bar  exercisebench  328                  m
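
The question also asks for VALUES_2. Below is a sketch of extending the map approach to both columns. It assumes that keeping the first row is acceptable if IMAGE_NAME ever repeats in df2 (as far as I know, map needs unique lookup keys). The names lookup and mapped are illustrative, and the frames are cut down from the question's data.

```python
import pandas as pd

# Cut-down versions of the question's frames.
df1 = pd.DataFrame(
    {'TARGET_NAME': ['fishinghook', 'doorlock', 'penguin', 'exercisebench'],
     'FOOBAR': ['foo', 'bar', 'foo', 'bar']},
    index=[320, 321, 322, 328])
df2 = pd.DataFrame(
    {'IMAGE_NAME': ['penguin', 'jewelrybox', 'exercisebench', 'doorlock'],
     'VALUES_1': ['h', 'c', 'm', 'h'],
     'VALUES_2': ['hl', 'cm', 'ml', 'hh']},
    index=[617, 618, 619, 620])

# Assumption: on duplicate IMAGE_NAME values the first row wins.
lookup = df2.drop_duplicates('IMAGE_NAME').set_index('IMAGE_NAME')

mapped = df1.copy()  # work on a copy so df1 itself is left untouched
for col in ['VALUES_1', 'VALUES_2']:
    # map() looks each TARGET_NAME up in lookup's index; misses become NaN
    mapped['PRODUCED_' + col] = (
        mapped['TARGET_NAME'].map(lookup[col]).fillna('DROPPED'))
```

Since map works element by element on df1's own column, the df1 index is preserved without any extra step.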

Or a one-liner, similar to @Bharath shetty's solution:

In [26]: df1.merge(df2[['IMAGE_NAME','VALUES_1']].rename(columns={'IMAGE_NAME':'TARGET_NAME'}), 
    ...:   how='left') \ 
    ...: .fillna('DROPPED') \ 
    ...: .rename(columns=lambda c: 'PRODUCED_' + c if c=='VALUES_1' else c) \ 
    ...: .set_index(df1.index) 
    ...: 
Out[26]: 
     FOOBAR    TARGET_NAME   ix  PRODUCED_VALUES_1
320     foo    fishinghook  320            DROPPED
321     bar       doorlock  321                  h
322     foo        penguin  322                  h
323     bar        ashtray  323            DROPPED
324     foo            cat  324                  h
325     bar       elephant  325            DROPPED
326     foo        cupcake  326            DROPPED
328     bar  exercisebench  328                  m
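
One caveat with a frame-wide fillna('DROPPED'): it also overwrites any NaN that was already present in df1. Below is a sketch that uses merge's indicator=True to mark only the unmatched rows. The None in FOOBAR is a hypothetical value added to show the difference, lookup and merged are illustrative names, and the frames are cut down from the question's data.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'TARGET_NAME': ['fishinghook', 'doorlock', 'penguin'],
     'FOOBAR': ['foo', None, 'foo']},  # None is hypothetical, not in the question
    index=[320, 321, 322])
df2 = pd.DataFrame(
    {'IMAGE_NAME': ['penguin', 'jar', 'doorlock'],
     'VALUES_1': ['h', 'f', 'h']},
    index=[617, 621, 620])

lookup = df2.rename(columns={'IMAGE_NAME': 'TARGET_NAME',
                             'VALUES_1': 'PRODUCED_VALUES_1'})

# indicator=True adds a "_merge" column: 'both' for matches,
# 'left_only' for df1 rows that found no partner in df2.
merged = df1.merge(lookup, how='left', on='TARGET_NAME', indicator=True)

# Fill the placeholder only where the merge itself found nothing.
merged.loc[merged['_merge'] == 'left_only', 'PRODUCED_VALUES_1'] = 'DROPPED'

merged = merged.drop(columns='_merge').set_index(df1.index)
```

Here the missing FOOBAR value for doorlock stays missing, while only fishinghook gets 'DROPPED'.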
+1

You forgot the index. – Dark

+0

@Bharathshetty, thanks! It's fixed now... – MaxU

+0

My first approach was map, but I couldn't get it to work, so I went with merge. This is really great – Dark
