2016-04-23 135 views

pandas: DataFrames won't merge

I have the following two DataFrames (which can be found here and here):

df = pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv',
                 engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

I only show the code that reads df, because that file has some unusual formatting issues.

df.dtypes 

SICcode  object 
Catcode  object 
Category object 
SICname  object 
MultSIC  object 
dtype: object 

merged.dtypes 

2012 NAICS Code  float64 
2002to2007 NAICS float64 
SICcode    object 
dtype: object 

df.columns.tolist() 
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC'] 

merged.columns.tolist() 
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode'] 

df.head(3) 

   SICcode Catcode                                Category SICname MultSIC
0      111   A1500    Wheat, corn, soybeans and cash grain   Wheat       X
1      112   A1600  Other commodities (incl rice, peanuts)    Rice       X
2      115   A1500    Wheat, corn, soybeans and cash grain    Corn       X

merged.sort_values('SICcode') 

      2012 NAICS Code  2002to2007 NAICS SICcode
89             212210            212210    1011
93             212234            212234    1021
92             212231            212231    1031
90             212221            212221    1041
91             212222            212222    1044
96             212299            212299    1061
94             212234            212234    1061
119            213114            213114    1081
1770           541360            541360    1081
233            238910            238910    1081
95             212291            212291    1094
97             212299            212299    1099
3              111140            111140     111
6              111160            111160     112
4              111150            111150     115
0              111110            111110     116

I want to merge them with this code:

merged = pd.merge(merged, df, how='right', on='SICcode')

The result looks like this:

2012 NAICS Code  0 
2002to2007 NAICS  0 
SICcode    1007 
Catcode    991 
Category   1007 
SICname    1007 
MultSIC    906 
dtype: int64 

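I suspect the problem lies in the formatting of df, but I don't know how to describe it (I have heard the term "white space", which may be relevant here) or how to fix it. Does anyone have an idea?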

Answers


I believe this is the cause of your problem:

In [47]: merged[merged.SICcode == 'Aux'] 
Out[47]: 
     2012 NAICS Code 2002to2007 NAICS SICcode 
1828   551114.0   551114.0  Aux 
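
In case 'Aux' is not the only offender, here is a minimal sketch of a more general check. It assumes the key column is called SICcode as in the question; `merged_demo` is a made-up stand-in for the real `merged` frame, so the values in it are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the real `merged` frame; only SICcode matters here.
merged_demo = pd.DataFrame({
    "2012 NAICS Code": [111140.0, 111160.0, 551114.0],
    "SICcode": ["111", "112", "Aux"],
})

# Values that cannot be parsed as numbers become NaN under errors='coerce'.
as_number = pd.to_numeric(merged_demo["SICcode"], errors="coerce")

# Keep the rows whose key was present but could not be parsed as a number.
bad_rows = merged_demo[as_number.isna() & merged_demo["SICcode"].notna()]
print(bad_rows)
```

Running the same two lines against the real `merged` should list every row whose SICcode would be lost by a numeric conversion.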

Because of the non-numeric 'Aux' value, the SICcode column ends up with a different dtype in each DataFrame:

In [61]: df.dtypes 
Out[61]: 
SICcode  int64 
Catcode  object 
Category object 
SICname  object 
MultSIC  object 
dtype: object 

In [62]: merged.dtypes 
Out[62]: 
2012 NAICS Code  float64 
2002to2007 NAICS float64 
SICcode    object 
dtype: object 

In [63]: df.SICcode.unique() 
Out[63]: array([ 111, 112, 115, ..., 9711, 9721, 9999], dtype=int64) 

In [64]: merged.SICcode.head(10).unique() 
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object) 
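
To illustrate why nothing matched, here is a small reproduction under stated assumptions: both frames are made up, and both key columns are forced to dtype object so that the merge runs without complaint. I believe recent pandas versions raise a ValueError instead when one key is int64 and the other holds strings, so the exact behaviour depends on your version.

```python
import pandas as pd

# Hypothetical miniature frames: keys are text on one side, integers on the other.
left = pd.DataFrame({
    "SICcode": pd.Series(["111", "112", "Aux"], dtype=object),
    "naics": [111140.0, 111160.0, 551114.0],
})
right = pd.DataFrame({
    "SICcode": pd.Series([111, 112], dtype=object),
    "Catcode": ["A1500", "A1600"],
})

# Both key columns are dtype object, so pandas accepts the merge,
# but the str '111' and the int 111 are different values and never match.
out = pd.merge(left, right, how="right", on="SICcode")

# Non-null counts per column; the column coming from `left` stays empty.
print(out.count())
```

The column that comes from the left frame has no non-null values, which is the same pattern as the zero counts for the two NAICS columns in the question.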

So you can do it this way:

import pandas as pd

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
merged = pd.read_csv(url, index_col=0)

# clean the data: convert SICcode to numeric; values that cannot be
# parsed (such as 'Aux') become NaN
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')

# left merge: keep every row of df and attach the matching NAICS codes
mrg = df.merge(merged, on='SICcode', how='left')

mrg.head()

Output:

In [51]: mrg.head() 
Out[51]: 
   SICcode Catcode                                       Category  \
0      111   A1500           Wheat, corn, soybeans and cash grain
1      112   A1600  Other commodities (incl rice, peanuts, honey)
2      115   A1500           Wheat, corn, soybeans and cash grain
3      116   A1500           Wheat, corn, soybeans and cash grain
4      119   A1500           Wheat, corn, soybeans and cash grain

            SICname MultSIC  2012 NAICS Code  2002to2007 NAICS
0             Wheat       X         111140.0          111140.0
1              Rice       X         111160.0          111160.0
2              Corn       X         111150.0          111150.0
3          Soybeans       X         111110.0          111110.0
4  Cash grains, NEC       X         111120.0          111120.0
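
If you want to confirm how many rows actually found a partner, one option is the `indicator=True` argument of `merge`. The sketch below uses made-up miniature frames (`df_demo`, `merged_demo`, and the Catcode 'Z0000' are all hypothetical), and it adds an optional step that is not in the code above: dropping the rows that failed to parse and casting the key to int64 so both sides share a dtype.

```python
import pandas as pd

# Hypothetical miniature versions of `df` and `merged`; values are illustrative.
df_demo = pd.DataFrame({
    "SICcode": [111, 112, 9999],
    "Catcode": ["A1500", "A1600", "Z0000"],  # 'Z0000' is a made-up code
})
merged_demo = pd.DataFrame({
    "SICcode": ["111", "112", "Aux"],
    "2012 NAICS Code": [111140.0, 111160.0, 551114.0],
})

# Same cleaning step as above: 'Aux' becomes NaN.
merged_demo["SICcode"] = pd.to_numeric(merged_demo["SICcode"], errors="coerce")

# Optional extra step (an assumption, not part of the original code):
# drop unparseable keys and cast to int64 so both key columns match in dtype.
merged_demo = merged_demo.dropna(subset=["SICcode"])
merged_demo["SICcode"] = merged_demo["SICcode"].astype("int64")

# indicator=True adds a `_merge` column saying where each row came from.
check = df_demo.merge(merged_demo, on="SICcode", how="left", indicator=True)
print(check["_merge"].value_counts())
```

Rows marked `left_only` are SIC codes in df that have no NAICS match. Keep in mind that the coerce step discards the 'Aux' row, so if that row matters you would need to handle it separately.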

Thanks, MaxU! –


@MichaelPeddue, always glad to help :) – MaxU