2015-09-05

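Normalizing pandas data into a one-to-many relationship

My data frame has a column in which several comma-separated values are stored in a single cell.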

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd

myst = """india | 905034 | 19:44 | cricket, hockey
USA | 905094 | 19:33 | swimming, running, tennis, football
Russia | 905154 | 21:56 | basketball

"""
u_cols = ['country', 'index', 'current_tm', 'sports']

# sep='|' splits on the pipe only, so string fields keep the spaces around it
df = pd.read_csv(StringIO(myst), sep='|', names=u_cols)

Is it possible to split that cell into several rows, like this?

india cricket 
india hockey 
USA swimming 
USA running 
USA tennis 
USA football 
Russia basketball 

Answer

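You can use str.split, followed by apply(pd.Series).stack(). The apply(pd.Series) call spreads the split elements across separate columns, and stack() turns those columns into rows: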

In [34]: df = df.set_index('country') 

In [36]: s = df['sports'].str.split(',').apply(pd.Series).stack() 

In [37]: s 
Out[37]: 
country
india    0       cricket
         1        hockey
USA      0      swimming
         1       running
         2        tennis
         3      football
Russia   0    basketball
dtype: object
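
To make the intermediate step visible, here is a minimal sketch of the same chain split into its three stages. The data frame is a hand-typed stand-in for the question's data with the whitespace already removed, and the names lists and wide are chosen here for illustration only.

```python
import pandas as pd

# Hand-typed stand-in for the question's data (values already stripped).
df = pd.DataFrame(
    {"sports": ["cricket,hockey", "swimming,running,tennis,football", "basketball"]},
    index=pd.Index(["india", "USA", "Russia"], name="country"),
)

# Stage 1: each cell becomes a Python list of sports.
lists = df["sports"].str.split(",")

# Stage 2: one column per list position; shorter lists are padded with NaN.
wide = lists.apply(pd.Series)

# Stage 3: stack() moves the columns into an inner index level.
# dropna() removes the NaN padding; older pandas versions already drop it
# inside stack(), in which case dropna() changes nothing.
s = wide.stack().dropna()

print(wide)
print(s)
```

The wide frame has one row per country and as many columns as the longest list, which is why stack() is needed to get back to one sport per row.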

Then clean it up a little further:

In [38]: s.reset_index(level=0).reset_index(drop=True) 
Out[38]: 
  country           0
0   india     cricket
1   india      hockey
2     USA    swimming
3     USA     running
4     USA      tennis
5     USA    football
6  Russia  basketball

Note that in recent versions of pandas you can pass expand=True to str.split instead of calling .apply(pd.Series): df['sports'].str.split(',', expand=True).stack()
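
Putting the expand=True variant together end to end might look like the sketch below. It starts from the question's data and additionally strips the whitespace that sep='|' and split(',') leave behind. The column name sport and the variable res are assumptions made for illustration and do not come from the thread.

```python
from io import StringIO

import pandas as pd

myst = """india | 905034 | 19:44 | cricket, hockey
USA | 905094 | 19:33 | swimming, running, tennis, football
Russia | 905154 | 21:56 | basketball
"""
u_cols = ["country", "index", "current_tm", "sports"]
df = pd.read_csv(StringIO(myst), sep="|", names=u_cols)

# Trim the spaces left around '|' before using country as the index.
df["country"] = df["country"].str.strip()
df = df.set_index("country")

res = (
    df["sports"]
    .str.split(",", expand=True)  # one column per sport, missing where a row has fewer
    .stack()                      # columns -> inner index level
    .dropna()                     # no-op where stack() already dropped the missing values
    .str.strip()                  # remove the spaces around each sport name
    .rename("sport")              # 'sport' is a name chosen here for illustration
    .reset_index(level=0)         # bring country back as a regular column
    .reset_index(drop=True)       # discard the per-country position level
)

print(res)
```

The result is a two-column frame with one row per (country, sport) pair, which is the one-to-many layout the question asks for.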

Is it possible to merge in columns such as index and current_tm for their respective countries? – shantanuo

Yes, of course. You can use merge: pd.merge(res, df, on='country') (assuming res is the result above and df still has the country column). – joris

For some reason the on clause did not work. So I tried right_index=True, left_index=True instead, and it gave the correct result. Thanks. – shantanuo
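
A plausible explanation, which the thread does not confirm, is that country had been moved into the index by set_index('country'), so it was no longer a column that on= could find in the pandas version used at the time. The sketch below uses hand-typed stand-ins for res and df and joins the country column of one frame against the index of the other. It is a variation on the index-based merge described in the comment, not a reproduction of it.

```python
import pandas as pd

# Stand-in for the long result: country is a regular column here.
res = pd.DataFrame(
    {
        "country": ["india", "india", "USA", "Russia"],
        "sport": ["cricket", "hockey", "swimming", "basketball"],
    }
)

# Stand-in for the original frame after set_index('country'):
# country lives in the index, not in the columns.
df = pd.DataFrame(
    {"index": [905034, 905094, 905154], "current_tm": ["19:44", "19:33", "21:56"]},
    index=pd.Index(["india", "USA", "Russia"], name="country"),
)

# Match the 'country' column of res against the index of df.
# how='left' keeps every row of res in its original order.
merged = pd.merge(res, df, left_on="country", right_index=True, how="left")

print(merged)
```

An alternative under the same assumption is to call df.reset_index() first, after which on='country' applies to both frames.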