2016-07-27 121 views
2

Compare pandas DataFrames and add a column

I have two DataFrames, as follows:

df1     df2
A       A   C
A1      A1  C1
A2      A2  C2
A3      A3  C3
A1      A4  C4
A2
A3
A4

Column 'A' in df2 determines the value of column 'C'. I want to add a new column 'B' to df1 whose values are taken from column 'C' of df2.

The final df1 should look like this:

df1 
A B 
A1 C1 
A2 C2 
A3 C3 
A1 C1 
A2 C2 
A3 C3 
A4 C4 

I can iterate over df2 and add the values to df1, but that is slow because the data is large.

for index, row in df2.iterrows(): 
      df1.loc[df1.A.isin([row['A']]), 'B']= row['C'] 

Can someone help me understand how to solve this without looping over df2?

Thanks

Answers

1

IIUC, you can merge and then rename the column:

df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 

In [103]: 
import pandas as pd
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']}) 
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']}) 
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 
merged 

Out[103]: 
    A B 
0 A1 C1 
1 A2 C2 
2 A3 C3 
3 A1 C1 
4 A2 C2 
5 A3 C3 
6 A4 C4 
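
A minimal sketch of the same merge with one extra safeguard, in case it helps. The column 'D' is hypothetical and only there to show that a merge carries every other df2 column along. The `validate='many_to_one'` argument is an optional check that the keys in df2 are unique, so the merge cannot silently add rows to df1.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
# 'D' is a hypothetical extra column, added only for illustration.
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4'],
                    'D': [10, 20, 30, 40]})

# A left merge keeps every row of df1 in its original order.
# validate='many_to_one' raises an error if df2 has duplicate keys in 'A'.
merged = (df1.merge(df2, on='A', how='left', validate='many_to_one')
             .rename(columns={'C': 'B'}))
print(merged)
```

If only column 'B' is wanted, selecting `df2[['A', 'C']]` before merging avoids pulling in the other columns.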

Thanks everyone for the suggestions. I am using this solution because it also merges the other columns from df2 into df1. Thanks @EdChum –


There is also a semantic difference between 'merge' and 'map': if a lookup value in df1 does not exist in df2, 'merge' will insert 'NaN', whereas 'map' will raise a 'KeyError' – EdChum
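
The missing-key behaviour described in the comment above is easy to check directly. This is a minimal sketch, assuming a reasonably recent pandas version; the unmatched key 'A5' is made up for illustration. As far as I know, `Series.map` with a Series or a plain dict returns NaN for unmatched keys rather than raising, which is what the pandas documentation describes, so it is worth running this on your own version.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A5']})   # 'A5' has no match in df2
df2 = pd.DataFrame({'A': ['A1', 'A2'], 'C': ['C1', 'C2']})

lookup = df2.set_index('A')['C']

via_merge = df1.merge(df2, on='A', how='left')['C']
via_map_series = df1['A'].map(lookup)
via_map_dict = df1['A'].map(lookup.to_dict())

# Show which rows ended up as missing values under each approach.
print(via_merge.isna().tolist())
print(via_map_series.isna().tolist())
print(via_map_dict.isna().tolist())
```

A difference that does seem to matter in practice is duplicate keys in df2: a merge returns one row per match, while mapping through a Series with a non-unique index typically raises an error.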

1

You can use map with a Series:

df1['B'] = df1.A.map(df2.set_index('A')['C']) 
print (df1) 
    A B 
0 A1 C1 
1 A2 C2 
2 A3 C3 
3 A1 C1 
4 A2 C2 
5 A3 C3 
6 A4 C4 

The same works with map and a dict:

d = df2.set_index('A')['C'].to_dict() 
print (d) 
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'} 

df1['B'] = df1.A.map(d) 
print (df1) 
    A B 
0 A1 C1 
1 A2 C2 
2 A3 C3 
3 A1 C1 
4 A2 C2 
5 A3 C3 
6 A4 C4 

Timings

len(df1)=7

In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 
1000 loops, best of 3: 1.73 ms per loop 

In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C']) 
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 873 µs per loop 

len(df1)=70k

In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 
100 loops, best of 3: 12.8 ms per loop 

In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C']) 
100 loops, best of 3: 6.05 ms per loop 
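
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The way the 70k-row frame is built (repeating the 7-row example 10,000 times) is an assumption; the original timings do not say how the larger df1 was constructed, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4']})
base = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
# Assumption: 70k rows obtained by repeating the 7-row example.
df1 = pd.concat([base] * 10000, ignore_index=True)


def use_merge():
    return df1.merge(df2, on='A', how='left').rename(columns={'C': 'B'})


def use_map():
    return df1['A'].map(df2.set_index('A')['C'])


for name, fn in [('merge', use_merge), ('map', use_map)]:
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{name}: {best * 1e3:.2f} ms per call')
```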

Thanks @jezreal –


Hmm, maybe you could upvote all the solutions, thanks ;) – jezrael

1

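Here are three approaches based on searchsorted, differing only in how the result is indexed. They assume that df2.A is sorted and that every value in df1.A occurs in df2.A; the first two also rely on df2 having the default 0..n-1 index: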

df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values 
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True) 
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
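
Since `searchsorted` is a binary search, it silently gives wrong positions when the keys are unsorted or a value is absent. The following is a sketch of one way to guard against both cases, not part of the original answer. The shuffled df2 and the unmatched key 'A9' are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4', 'A9']})
# df2 is deliberately unsorted here; 'A9' above has no match.
df2 = pd.DataFrame({'A': ['A3', 'A1', 'A4', 'A2'],
                    'C': ['C3', 'C1', 'C4', 'C2']})

keys = df2['A'].to_numpy()
wanted = df1['A'].to_numpy()

# Sort the lookup keys first, remembering the permutation.
order = np.argsort(keys, kind='stable')
sorted_keys = keys[order]

# Binary search, then clip so out-of-range positions stay valid indices.
pos = np.clip(np.searchsorted(sorted_keys, wanted), 0, len(sorted_keys) - 1)

# A position is a real match only if the key found there equals the wanted key.
found = sorted_keys[pos] == wanted

values = df2['C'].to_numpy()[order][pos]
df1['B'] = pd.Series(values, index=df1.index).where(found)
print(df1)
```

The `where(found)` step leaves NaN for keys that are not in df2, which mirrors what a left merge would produce.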