2017-08-29 90 views
1

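Pandas DataFrame replaces strings with NaN when using pd.concat

I have a pandas DataFrame made up of strings, i.e. 'P1', 'P2', 'P3', ..., null.

When I try to concatenate this DataFrame with another one, all of the strings are replaced with NaN.

See my code below: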

import operator

import pandas as pd

# Load the bug descriptions and keep the 'what' field of the first entry in each record
descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)
descriptions['desc'] = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f1 = pd.DataFrame(descriptions['desc'])

# Load the bug priorities the same way
bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)
bugPrior['priority'] = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f2 = pd.DataFrame(bugPrior['priority'])

df = pd.concat([f1, f2])
print(df.head())

The output is as follows:

   desc          priority 
0 Usability issue with external editors (1GE6IRL)  NaN 
1    API - VCM event notification (1G8G6RR)  NaN 
2 Would like a way to take a write lock on a tea...  NaN 
3 getter/setter code generation drops "F" in ".....  NaN 
4 Create Help Index Fails with seemingly incorre...  NaN 

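Any idea how I could stop this from happening?

Ultimately, my goal is to get everything into a single DataFrame so that I can drop all rows with "null" values. That would also help with the code later on.

Thanks.

Answers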

2

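Assuming you want to concatenate these columns horizontally, you need to pass axis=1 to pd.concat, because by default the concatenation is vertical (axis=0). A vertical concat stacks the rows of f2 below those of f1, so the rows that came from f1 have no value in the priority column and show NaN.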

df = pd.concat([f1,f2], axis=1) 
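To see the difference on a small scale, here is a minimal sketch that uses made-up frames in place of the real f1 and f2. The values are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# Toy stand-ins for f1 and f2 (hypothetical values, not the real dataset)
f1 = pd.DataFrame({"desc": ["bug A", "bug B", "bug C"]})
f2 = pd.DataFrame({"priority": ["P3", None, "P1"]})

# Default axis=0: rows are stacked, and each frame's missing column is filled with NaN
stacked = pd.concat([f1, f2])

# axis=1: frames are placed side by side, with rows aligned on the index
side_by_side = pd.concat([f1, f2], axis=1)

print(stacked.shape)       # six rows, two columns
print(side_by_side.shape)  # three rows, two columns
```

In the stacked result the first three rows carry only desc, which is why df.head() in the question shows NaN for priority.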

To drop the rows that contain NaN, you should be able to use df.dropna. Call df.reset_index afterwards to renumber the rows.

df = pd.concat([f1, f2], axis=1)
df = df.dropna().reset_index(drop=True) 
print(df.head(10)) 
               desc priority 
0 Create Help Index Fails with seemingly incorre...  P3 
1 Internal compiler error when compiling switch ...  P3 
2 Default text sizes in org.eclipse.jface.resour...  P3 
3 [Presentations] [ViewMgmt] Holding mouse down ...  P3 
4 Parsing of function declarations in stdio.h is...  P2 
5 CCE in RenameResourceAction while renaming ele...  P3 
6 Option to prevent cursor from moving off end o...  P3 
7  Tasks section in the user doc is very stale  P3 
8 Importing existing project with different case...  P3 
9 Workspace in use --> choose new workspace but ...  P3 
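The dropna and reset_index steps can be sketched on a tiny frame as follows. The data is made up, and the sketch assumes the missing entries are real None/NaN values (as JSON null becomes when loaded) rather than the literal string 'null', which dropna would not remove.

```python
import pandas as pd

# Hypothetical frame with missing values in both columns
df = pd.DataFrame({
    "desc": ["bug A", "bug B", "bug C", None],
    "priority": ["P3", None, "P1", "P2"],
})

# dropna() removes every row that has at least one missing value
kept = df.dropna()
print(list(kept.index))  # the surviving rows keep their old labels

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
cleaned = kept.reset_index(drop=True)
print(cleaned)
```

Without drop=True, reset_index would keep the old labels as an extra column named index.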

Printing df.priority.unique(), we see that there are 5 unique priorities:

print(df.priority.unique()) 
array(['P3', 'P2', 'P4', 'P1', 'P5'], dtype=object) 
+0

Thanks for your help, this dataset has been driving me nuts, and this is only the data import! – JohnWayne360

2

I think it is better not to create DataFrames from the columns:

import operator

import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)

# extract the description text as a Series, f1
f1 = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f1.head())

bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)

# extract the priority as a Series, f2
f2 = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f2.head())

Then use the same solution as in cᴏʟᴅsᴘᴇᴇᴅ's answer:

df = pd.concat([f1,f2], axis=1).dropna().reset_index(drop=True) 
print (df.head()) 
              short_desc priority 
0 Create Help Index Fails with seemingly incorre...  P3 
1 Internal compiler error when compiling switch ...  P3 
2 Default text sizes in org.eclipse.jface.resour...  P3 
3 [Presentations] [ViewMgmt] Holding mouse down ...  P3 
4 Parsing of function declarations in stdio.h is...  P2 
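As a rough illustration of why this works with plain Series: with axis=1, pd.concat uses each Series' name as the column name and aligns rows on the index. The sketch below uses hypothetical values. The third Series, s3, is an invented example of mismatched index labels, which is the situation the reset_index(drop=True) calls above help avoid.

```python
import pandas as pd

# Named Series with the default 0..n-1 index (hypothetical values)
s1 = pd.Series(["bug A", "bug B"], name="short_desc")
s2 = pd.Series(["P3", "P2"], name="priority")

# The Series names become the column names of the result
df = pd.concat([s1, s2], axis=1)
print(list(df.columns))

# A Series whose index labels do not match s1 (invented for illustration)
s3 = pd.Series(["P1", "P4"], index=[5, 6], name="priority")

# Rows are matched by label, so no row ends up with both values filled in
misaligned = pd.concat([s1, s3], axis=1)
print(misaligned.shape)
print(misaligned.dropna().empty)
```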
+0

That is exactly my answer. :) –

+0

It's fine. You don't have to edit, but thanks, I appreciate it. –

+1

@jezrael Thanks for your answer. I think I may apply your suggestion and create the columns. – JohnWayne360
