2017-08-08 86 views

I have two datasets like the ones below. How can I merge their overlapping columns?

import pandas as pd 
import numpy as np 
df1 = pd.DataFrame({'id': [1, 2,3,4,5], 'first': [np.nan,np.nan,1,0,np.nan], 'second': [1,np.nan,np.nan,np.nan,0]}) 
df2 = pd.DataFrame({'id': [1, 2,3,4,5, 6], 'first': [np.nan,1,np.nan,np.nan,0, 1], 'third': [1,0,np.nan,1,1, 0]}) 

This is the result I want, and how I currently get it:

# Outer join on id; the shared column 'first' becomes first_x / first_y
result = pd.merge(df1, df2, on='id', how='outer')
result['first'] = result[['first_x', 'first_y']].sum(axis=1)
# sum() returns 0 when both sides are NaN, so restore NaN for those rows
result.loc[result['first_x'].isnull() & result['first_y'].isnull(), 'first'] = np.nan
result = result.drop(columns=['first_x', 'first_y'])

   id  second  third  first
0   1     1.0    1.0    NaN
1   2     NaN    0.0    1.0
2   3     NaN    NaN    1.0
3   4     NaN    1.0    0.0
4   5     0.0    1.0    0.0
5   6     NaN    0.0    1.0

The problem is that the real dataset has about 200 variables, and my approach gets very long. How can I make this easier? Thanks.

Answers

1

You should be able to use combine_first:

>>> df1.set_index('id').combine_first(df2.set_index('id')) 
    first  second  third
id
1     NaN     1.0    1.0
2     1.0     NaN    0.0
3     1.0     NaN    NaN
4     0.0     NaN    1.0
5     0.0     0.0    1.0
6     1.0     NaN    0.0
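
As a rough illustration of why this scales to many columns: combine_first aligns the two frames on both the index and the column labels, keeps df1's values, and fills only df1's missing entries from df2. No overlapping column has to be named. A minimal, self-contained sketch using the frames from the question (the name `combined` is just illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'first': [np.nan, np.nan, 1, 0, np.nan],
                    'second': [1, np.nan, np.nan, np.nan, 0]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'first': [np.nan, 1, np.nan, np.nan, 0, 1],
                    'third': [1, 0, np.nan, 1, 1, 0]})

# Use id as the index so rows are matched by id rather than by position.
left = df1.set_index('id')
right = df2.set_index('id')

# Keep values from `left`; fill its NaN cells (and missing rows/columns)
# from `right`. All shared columns are handled in this single call.
combined = left.combine_first(right)

print(combined)
```

One difference from the code in the question: if both frames hold a non-null value for the same id and column, combine_first keeps the value from df1, whereas the sum in the question would add the two. In the sample data that case never occurs, so both approaches give the same values.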
0

You should probably use combine_first, as Alexander mentioned. If you want to keep id as a column, you can simply use:

merged = df1.merge(df2)
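
One caveat, as far as I can tell: called with no arguments, merge performs an inner join on every column the two frames share (here both id and first), so it may not return all six ids. If the goal is the combine_first result with id kept as an ordinary column, one possible sketch is to call reset_index afterwards. The data is copied from the question, and the name `merged` is reused purely for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'first': [np.nan, np.nan, 1, 0, np.nan],
                    'second': [1, np.nan, np.nan, np.nan, 0]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                    'first': [np.nan, 1, np.nan, np.nan, 0, 1],
                    'third': [1, 0, np.nan, 1, 1, 0]})

# Combine on id as the index, then move id back into a regular column.
merged = (
    df1.set_index('id')
       .combine_first(df2.set_index('id'))
       .reset_index()
)

print(merged)
```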