2015-01-31 72 views

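Pandas join: join column not recognized

I don't know what is going on here; the title is only a first-order approximation. I'm trying to join two DataFrames: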

>>> df_sum.head() 
     TUCASEID t070101 t070102 t070103 t070104 t070105 t070199 \ 
0 20030100013280  0  0  0  0  0  0 
1 20030100013344  0  0  0  0  0  0 
2 20030100013352  60  0  0  0  0  0 
3 20030100013848  0  0  0  0  0  0 
4 20030100014165  0  0  0  0  0  0 

    t070201 t070299 shopping year 
0  0  0   0 2003 
1  0  0   0 2003 
2  0  0  60 2003 
3  0  0   0 2003 
4  0  0   0 2003 
>>> emp.head() 
     TUCASEID status 
0 20030100013280 emp 
1 20030100013344 emp 
2 20030100013352 emp 
4 20030100014165 emp 
5 20030100014169 emp 

Those are the DataFrames. I want to join them on the common column TUCASEID, whose values do overlap:

>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID) 
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462, 
     20131212132469, 20131212132475]) 

Now...

>>> df_sum.join(emp, on='TUCASEID', how='inner') 
Traceback (most recent call last): 
    File "<input>", line 1, in <module> 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join 
    rsuffix=rsuffix, sort=sort) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat 
    suffixes=(lsuffix, rsuffix), sort=sort) 
    File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge 
    return op.get_result() 
    File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result 
    rdata.items, rsuf) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix 
    to_rename) 
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object') 

Hmm, that's odd: the only column that appears in both DataFrames is the one being joined on. But fine, let's go along with it [1]:

>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r') 
Empty DataFrame 
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status] 
Index: [] 

The result is empty despite the huge intersection. What is going on here?

>>> pd.__version__ 
'0.15.0' 

[1]: I actually forced the join columns to an integer dtype, because the error message says "object" there, and it made no difference:

>>> emp.dtypes 
TUCASEID  int64 
status  object 
dtype: object 
>>> df_sum.dtypes 
TUCASEID int64 
(...) 
shopping int64 
year  int64 
dtype: object 

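Your index values don't match, which is why the merge comes back empty when called this way. Also, why not just merge them with `df_sum.merge(emp, on='TUCASEID', how='outer')`? Or are you only interested in adding the `status` column to each `TUCASEID` row? In that case do `df_sum['status'] = df_sum['TUCASEID'].map(emp.set_index('TUCASEID')['status'])` – EdChum 2015-01-31 22:24:13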

@EdChum OK, I'll look at the alternatives. The index values don't match? I specified an alternative `on=` column. – FooBar 2015-01-31 22:25:39

Not sure. `join` joins on the index; oddly, I can reproduce the behavior, but the other approaches I suggested should work. – EdChum 2015-01-31 22:27:04

Answer

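df.join generally calls pd.merge (except in a special case where it calls concat). So anything join can do, merge can do as well. Though it may not be strictly correct, I tend to use df.join only when joining on the index, and pd.merge when joining on columns.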
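
As a minimal sketch of that rule of thumb (the frames `left` and `right` and their values are made up for illustration, they are not from the question), the two styles look roughly like this:

```python
import pandas as pd

# Hypothetical toy frames, for illustration only.
left = pd.DataFrame({'key': [1, 2, 3], 'a': [10, 20, 30]}, index=list('xyz'))
right = pd.DataFrame({'key': [1, 2, 4], 'b': [100, 200, 400]}, index=list('xyw'))

# Index against index: df.join aligns the two frames on their row labels.
by_index = left[['a']].join(right[['b']], how='inner')

# Column against column: pd.merge matches the values of 'key' in both frames.
by_column = pd.merge(left, right, on='key', how='inner')

print(by_index)
print(by_column)
```

Both calls keep only the rows that match, but the first compares row labels while the second compares the contents of `key`.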

That said, I can reproduce the problem you describe:

import numpy as np 
import pandas as pd 

df_sum = pd.DataFrame(np.arange(6*2).reshape((6,2)), 
         index=list('ABCDEF'), columns=list('XY')) 
emp = pd.DataFrame(np.arange(6*2).reshape((6,2)), 
        index=list('ABCDEF'), columns=list('XZ')) 
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner')) 

# Empty DataFrame 
# Columns: [X, Y, X_r, Z] 
# Index: [] 

pd.merge works as expected, without needing to supply rsuffix:

print(pd.merge(df_sum, emp, on='X'))

yields

X Y Z 
0 0 1 1 
1 2 3 3 
2 4 5 5 
3 6 7 7 
4 8 9 9 
5 10 11 11 
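
Applied to frames shaped like the ones in the question, the same merge might look like the sketch below. The TUCASEID, shopping and status values are copied from the head() output in the question; all other columns are left out, so this is an illustration rather than the real data. It also includes the map-based alternative from the comments for the case where only the status column is wanted.

```python
import pandas as pd

# A few rows copied from the head() output in the question; other columns omitted.
df_sum = pd.DataFrame({
    'TUCASEID': [20030100013280, 20030100013344, 20030100013352,
                 20030100013848, 20030100014165],
    'shopping': [0, 0, 60, 0, 0],
})
emp = pd.DataFrame({
    'TUCASEID': [20030100013280, 20030100013344, 20030100013352,
                 20030100014165, 20030100014169],
    'status': ['emp'] * 5,
}, index=[0, 1, 2, 4, 5])

# Column-on-column inner merge: keeps only the TUCASEID values present in both.
merged = pd.merge(df_sum, emp, on='TUCASEID', how='inner')

# Alternative from the comments: look up status by TUCASEID and attach it.
# Rows without a match in emp get NaN instead of being dropped.
status_by_id = emp.set_index('TUCASEID')['status']
with_status = df_sum.assign(status=df_sum['TUCASEID'].map(status_by_id))

print(merged)
print(with_status)
```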

Under the hood, df_sum.join calls merge this way:

if isinstance(other, DataFrame): 
     return merge(self, other, left_on=on, how=how, 
        left_index=on is None, right_index=True, 
        suffixes=(lsuffix, rsuffix), sort=sort) 

So even though you are using df_sum.join(emp, on='...'), under the hood pandas converts this to pd.merge(df_sum, emp, left_on='...', right_index=True):

In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True) 
Out[228]: 
Empty DataFrame 
Columns: [X, X_x, Y, X_y, Z] 
Index: [] 
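
One way to see why this comes back empty, sketched with the same toy frames: with right_index=True, the values in df_sum's X column are looked up in emp's index (the labels A to F), not in emp's X column. The variable names `matches_index` and `matches_column` are mine, introduced only for this check.

```python
import numpy as np
import pandas as pd

df_sum = pd.DataFrame(np.arange(6*2).reshape((6, 2)),
                      index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6*2).reshape((6, 2)),
                   index=list('ABCDEF'), columns=list('XZ'))

# What join(on='X') compares: left column values against right index labels.
matches_index = df_sum['X'].isin(emp.index)
# What was intended: left column values against right column values.
matches_column = df_sum['X'].isin(emp['X'])

print(matches_index.any())   # is any X value found among the labels 'A'..'F'?
print(matches_column.all())  # is every X value found in emp['X']?
```

The same reasoning would apply to the question's data: the TUCASEID values are being compared against emp's integer row index, not against emp.TUCASEID.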

For the merge to succeed as desired, left_on='X' would need to be on='X':

In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True) 
Out[233]: 
    X Y Z 
A 0 1 1 
B 2 3 3 
C 4 5 5 
D 6 7 7 
E 8 9 9 
F 10 11 11
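
If you would rather keep using join, one option (a sketch of my own, not something join does for you) is to move the key into emp's index first, so that join's column-against-index matching lines up. Newer pandas releases may reject a call that passes on= together with right_index=True, so this form, or plain pd.merge(df_sum, emp, on='X'), is probably the safer spelling.

```python
import numpy as np
import pandas as pd

df_sum = pd.DataFrame(np.arange(6*2).reshape((6, 2)),
                      index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6*2).reshape((6, 2)),
                   index=list('ABCDEF'), columns=list('XZ'))

# join(on='X') looks up df_sum['X'] in the *index* of the other frame,
# so make X the index of emp before joining. No rsuffix is needed because
# X is no longer a column of the right-hand frame.
joined = df_sum.join(emp.set_index('X'), on='X', how='inner')

print(joined)
```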