Pandas DataFrame replaces strings with NaN when using pd.concat

I have a pandas DataFrame made up of strings, i.e. 'P1', 'P2', 'P3', ..., null.

When I try to concatenate this DataFrame with another one, all of the strings are replaced with NaN.

See my code below:

import operator
import pandas as pd

# Load the short descriptions and keep the 'what' field of the first entry in each cell
descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)
descriptions['desc'] = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f1 = pd.DataFrame(descriptions['desc'])

# Load the priorities the same way
bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)
bugPrior['priority'] = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f2 = pd.DataFrame(bugPrior['priority'])

df = pd.concat([f1, f2])
print(df.head())

The output is as follows:

   desc          priority 
0 Usability issue with external editors (1GE6IRL)  NaN 
1    API - VCM event notification (1G8G6RR)  NaN 
2 Would like a way to take a write lock on a tea...  NaN 
3 getter/setter code generation drops "F" in ".....  NaN 
4 Create Help Index Fails with seemingly incorre...  NaN

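Any ideas on how I can stop this from happening?

Ultimately, my goal is to put everything into a single DataFrame so that I can drop all rows that have a "null" value. That would also help with the code that comes later.

Thanks.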

Source

2017-08-29 JohnWayne360

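Assuming you want to concatenate these columns horizontally, you need to pass axis=1 to pd.concat, because by default concatenation is vertical (axis=0).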

df = pd.concat([f1,f2], axis=1)
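
To see why the default produces NaN, here is a minimal sketch with small made-up frames standing in for f1 and f2 (the values are hypothetical, not taken from the dataset). Stacking vertically gives every row from one frame an empty cell in the other frame's column, whereas axis=1 aligns the rows on their index.

```python
import pandas as pd

# Hypothetical stand-ins for f1 and f2: two single-column frames with the same index.
f1 = pd.DataFrame({"desc": ["first bug", "second bug", "third bug"]})
f2 = pd.DataFrame({"priority": ["P3", None, "P1"]})

# Default axis=0 stacks the rows. Each frame lacks the other's column,
# so those cells are filled with NaN.
stacked = pd.concat([f1, f2])
print(stacked.shape)

# axis=1 aligns on the index and places the columns side by side.
side_by_side = pd.concat([f1, f2], axis=1)
print(side_by_side.shape)
```

With the stacked version, head() only shows rows that came from f1, which is why the priority column in the question appears to be entirely NaN.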

To drop the rows that contain NaN, you should be able to use df.dropna. Call df.reset_index afterwards to renumber the remaining rows.

df = pd.concat([f1, f2], axis=1)
df = df.dropna().reset_index(drop=True) 
print(df.head(10)) 
               desc priority 
0 Create Help Index Fails with seemingly incorre...  P3 
1 Internal compiler error when compiling switch ...  P3 
2 Default text sizes in org.eclipse.jface.resour...  P3 
3 [Presentations] [ViewMgmt] Holding mouse down ...  P3 
4 Parsing of function declarations in stdio.h is...  P2 
5 CCE in RenameResourceAction while renaming ele...  P3 
6 Option to prevent cursor from moving off end o...  P3 
7  Tasks section in the user doc is very stale  P3 
8 Importing existing project with different case...  P3 
9 Workspace in use --> choose new workspace but ...  P3
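
As a small illustration of what the two calls do, the sketch below uses made-up data, with None standing in for the "null" entries. dropna keeps the original row labels, so reset_index(drop=True) is what renumbers the result from 0.

```python
import pandas as pd

# Hypothetical data; None plays the role of the "null" priorities.
df_demo = pd.DataFrame({
    "desc": ["a", "b", "c", "d"],
    "priority": [None, "P3", None, "P2"],
})

# dropna removes every row that has at least one missing value,
# but the surviving rows keep their original labels.
kept = df_demo.dropna()
print(list(kept.index))

# reset_index(drop=True) renumbers from 0 and discards the old labels.
cleaned = kept.reset_index(drop=True)
print(list(cleaned.index))
```

If only some columns should be checked for missing values, dropna also accepts a subset argument, for example df_demo.dropna(subset=["priority"]).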

Printing df.priority.unique(), we see that there are 5 unique priorities:

print(df.priority.unique()) 
array(['P3', 'P2', 'P4', 'P1', 'P5'], dtype=object)

Source

2017-08-29 15:53:54

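Thanks for your help, this dataset has been driving me nuts, and this is only the data import! – JohnWayne360

I think it is better not to create DataFrames from the columns, and to work with the Series directly: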

import operator
import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)

# Extract a Series into f1 (it keeps the name 'short_desc')
f1 = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f1.head())

bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)

# Extract a Series into f2 (it keeps the name 'priority')
f2 = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f2.head())

Then use the same solution as in cᴏʟᴅsᴘᴇᴇᴅ's answer:

df = pd.concat([f1,f2], axis=1).dropna().reset_index(drop=True) 
print (df.head()) 
              short_desc priority 
0 Create Help Index Fails with seemingly incorre...  P3 
1 Internal compiler error when compiling switch ...  P3 
2 Default text sizes in org.eclipse.jface.resour...  P3 
3 [Presentations] [ViewMgmt] Holding mouse down ...  P3 
4 Parsing of function declarations in stdio.h is...  P2
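
The following sketch shows the same idea without downloading anything. It assumes, based only on the itemgetter calls above, that each cell is a list whose first element is a dict with a 'what' key; the rows themselves are made up. When Series are concatenated with axis=1, their names become the column names, which is why the first column is called short_desc here rather than desc.

```python
import operator
import pandas as pd

# Hypothetical rows shaped the way the JSON columns are assumed to be:
# each cell is a list of dicts, and the first dict has a 'what' key.
raw = pd.DataFrame({
    "short_desc": [[{"what": "first bug"}], [{"what": "second bug"}]],
    "priority": [[{"what": "P3"}], [{"what": None}]],
})

first = operator.itemgetter(0)      # take the first element of the list
what = operator.itemgetter("what")  # then read its 'what' field

s1 = raw["short_desc"].apply(first).apply(what)  # Series named 'short_desc'
s2 = raw["priority"].apply(first).apply(what)    # Series named 'priority'

# The Series names are used as column labels in the combined frame.
combined = pd.concat([s1, s2], axis=1)
print(list(combined.columns))
```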

Source

2017-08-29 16:03:30 jezrael

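That is exactly my answer. :) –

It's fine. You don't have to edit, but thanks, I appreciate it. –

@jezrael Thanks for your answer. I think I may apply your suggestion and create the columns that way. – JohnWayne360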
