2016-03-03 87 views

Splitting rows quickly

I have the following data frame:

import pandas as pd 
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"], 
        'Genes':["Canx","LOC101056688 /// Wars "], 
        'cv_filter':[ 0.134,0.290], 
        'Organ' :["LN","LV"]} )  
df = df[["Probes","Genes","cv_filter","Organ"]] 

It looks like this:

In [16]: df 
Out[16]: 
     Probes     Genes cv_filter Organ 
0 1415693_at     Canx  0.134 LN 
1 1415693_at LOC101056688 /// Wars  0.290 LV 

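What I want to do is split every row whose entry in the Genes column contains several values separated by '///', so that each value gets its own row.

The result I hope to get is: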

 Probes     Genes cv_filter Organ 
0 1415693_at     Canx  0.134 LN 
1 1415693_at   LOC101056688  0.290 LV 
2 1415693_at     Wars  0.290 LV 

I have about 150,000 rows in total to process. Is there a fast way to do this?

Answers

You can first split the Genes column with str.split to create a new Series, and then join it back to the original df:

import pandas as pd 
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"], 
        'Genes':["Canx","LOC101056688 /// Wars "], 
        'cv_filter':[ 0.134,0.290], 
        'Organ' :["LN","LV"]} )  
df = df[["Probes","Genes","cv_filter","Organ"]] 
print(df)
     Probes     Genes cv_filter Organ 
0 1415693_at     Canx  0.134 LN 
1 1415693_at LOC101056688 /// Wars  0.290 LV 

# split on '///' and strip the spaces left around each piece
s = pd.DataFrame([[g.strip() for g in x.split('///')] for x in df['Genes'].tolist()], index=df.index).stack()
# or use the approach from the comments:
# s = df['Genes'].str.split('///', expand=True).stack().str.strip()

s.index = s.index.droplevel(-1) 
s.name = 'Genes' 
print(s)
0    Canx 
1 LOC101056688 
1   Wars 
Name: Genes, dtype: object 

# drop the original Genes column first; otherwise join raises:
#ValueError: columns overlap but no suffix specified: Index([u'Genes'], dtype='object')  
df = df.drop('Genes', axis=1) 

df = df.join(s).reset_index(drop=True) 
print(df[["Probes","Genes","cv_filter","Organ"]])
     Probes   Genes cv_filter Organ 
0 1415693_at   Canx  0.134 LN 
1 1415693_at LOC101056688  0.290 LV 
2 1415693_at   Wars  0.290 LV 
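
As a possible alternative to the stack/join steps above, newer pandas versions offer DataFrame.explode, which turns each element of a list-valued column into its own row. This is only a minimal sketch and assumes pandas 0.25 or later, which is newer than the version the answer was written against; the variable name out is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Probes': ["1415693_at", "1415693_at"],
    'Genes': ["Canx", "LOC101056688 /// Wars "],
    'cv_filter': [0.134, 0.290],
    'Organ': ["LN", "LV"],
})[["Probes", "Genes", "cv_filter", "Organ"]]

# Split each Genes entry into a list, then give every list element its own row.
# Assumption: DataFrame.explode is available (pandas >= 0.25).
out = df.assign(Genes=df['Genes'].str.split('///')).explode('Genes')

# Strip the spaces left around the '///' separator and renumber the rows.
out['Genes'] = out['Genes'].str.strip()
out = out.reset_index(drop=True)

print(out)
```

The other columns are repeated for every piece automatically, so no separate drop and join is needed.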

Why not df['Genes'].str.split('///', expand=True).stack() instead of df['Genes'].str.split('///').apply(pd.Series, 1).stack()? It is twice as fast.
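
For anyone who wants to check the speed claim on their own machine, here is a rough timing sketch of the three variants mentioned in this thread. The benchmark Series (the two sample values repeated 500 times) and the function names are made up for illustration, the extra argument 1 from the comment is left out of apply, and the numbers will vary with the pandas version and hardware.

```python
import timeit
import pandas as pd

# Hypothetical benchmark data: the two sample Genes values repeated to 1,000 rows.
genes = pd.Series(["Canx", "LOC101056688 /// Wars "] * 500, name='Genes')

def via_constructor():
    # Plain Python split, then one DataFrame built from the list of lists.
    return pd.DataFrame([x.split('///') for x in genes.tolist()], index=genes.index).stack()

def via_expand():
    # Vectorised string split that expands directly into columns.
    return genes.str.split('///', expand=True).stack()

def via_apply():
    # Builds one small Series per row, which is usually the slowest route.
    return genes.str.split('///').apply(pd.Series).stack()

for fn in (via_constructor, via_expand, via_apply):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.4f} s for 3 runs")
```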

@AntonProtopopov - Thanks. I added it to my answer as an alternative solution (it is only slightly slower than the DataFrame constructor). – jezrael

With that solution your s is a DataFrame without a MultiIndex..
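
To see what each step returns, a small check like the one below may help. It is only a sketch on the two sample values; the names wide and long are illustrative, and the dropna() call is an added precaution in case the pandas version in use keeps missing values when stacking.

```python
import pandas as pd

genes = pd.Series(["Canx", "LOC101056688 /// Wars "], name='Genes')

# expand=True alone gives a DataFrame with one column per piece.
wide = genes.str.split('///', expand=True)

# stack() turns it into a Series indexed by (row label, piece number).
long = wide.stack().dropna()

print(type(wide).__name__, wide.shape)
print(type(long).__name__, long.index.nlevels)

# Dropping the inner level leaves the original row labels, ready for join().
long.index = long.index.droplevel(-1)
print(long.index.tolist())
```

So the DataFrame only appears if .stack() is left off; with it, the result is a Series whose inner index level has to be dropped before joining.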