pandas.replace and str.replace regex conflict. Code order

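My task is to remove anything in parentheses and to remove any digits that follow a country name. I also need to rename a few countries.

For example, 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.

My original code order was: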

dict1 = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"}

# rename countries via the dictionary
energy['Country'] = energy['Country'].replace(dict1)
# drop parenthesised text, then digits (regex=True is required since pandas 2.0)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
# check whether the rename worked
energy.loc[energy['Country'] == 'United States']

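The str.replace part works fine; that task is done. But when I use the last line to check whether the country names were changed, it returns nothing, so this original order does not work. However, if I change the order of the code to: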

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)

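then the country names are changed successfully. So there must be something wrong with my regex syntax. How do I resolve this conflict, and why does it happen?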

Source

2017-07-30 Dylan

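There doesn't seem to be a conflict. You first need to remove the unnecessary parts of the strings, and only then replace with the dictionary. Replacing first doesn't work because no dictionary key matches. – jezrael

Sorry, I don't understand. All I did was move the energy['Country'] = energy['Country'].replace(dict1) line. I didn't edit anything in the str.replace part. Why does it suddenly work? – Dylan

Please check my answer – jezrael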

The problem is that replace needs regex=True in order to replace substrings:

import pandas as pd

energy = pd.DataFrame({'Country': ['United States of America4',
                                   'United States of America (aaa)',
                                   'Slovakia']})
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

dict1 = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"}

# nothing is replaced because no value matches a key exactly (digits and parentheses)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []

# starting again from the original data: with regex=True, replace matches substrings
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
               Country
0       United States4
1  United States (aaa)
2             Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States

# starting again from the original data: clean the strings first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

# now the plain replace works, because the cleaned values match the keys exactly
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
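
The clean-first order can also be wrapped in one small helper. This is only a sketch: the function name clean_country, the shortened dictionary and the sample values are made up for illustration, and it assumes a pandas version where str.replace accepts the regex keyword.

```python
import pandas as pd

# Hypothetical sample data, modelled on the values discussed in this thread.
energy = pd.DataFrame({"Country": [
    "United States of America20",
    "Bolivia (Plurinational State of)",
    "Switzerland17",
    "Republic of Korea",
]})

# Shortened version of the renaming dictionary from the question.
dict1 = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
}


def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Strip parenthesised text and digits, then rename whole values."""
    cleaned = (
        names.str.replace(r"\s*\(.*\)", "", regex=True)  # drop " (...)"
             .str.replace(r"\d+", "", regex=True)        # drop digits
             .str.strip()                                # drop stray spaces
    )
    # Series.replace with a dict compares against the whole cell value,
    # so it has to run after the cleaning step.
    return cleaned.replace(renames)


energy["Country"] = clean_country(energy["Country"], dict1)
print(energy["Country"].tolist())
```

Keeping the dictionary lookup as the last step means the keys can stay plain strings and never have to be treated as regular expressions.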

Source

2017-07-30 08:26:39 jezrael

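Thanks! I assumed the data already contained the clean name 'United States of America'. After reading your answer, I used energy.loc[energy['Country'].str.contains('^United', na=False)] to check. I found that the raw value was 'United States of America20', so no wonder it couldn't find a match. – Dylan

Glad I could help! If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/a/5235/295067) it: click the check mark ('✓') next to the answer to toggle it from greyed out to filled in. Thanks. – jezrael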
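
The check described in this comment can be generalised a little. The sketch below lists values that contain a dictionary key as a substring without being exactly equal to it, which is the situation that makes a plain replace silently do nothing. The helper name near_misses and the sample series are hypothetical.

```python
import pandas as pd

# Hypothetical raw column; the trailing digits stand in for footnote markers.
raw = pd.Series(
    ["United States of America20", "Republic of Korea", "Slovakia", None],
    name="Country",
)

dict1 = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
}


def near_misses(values: pd.Series, keys) -> pd.Series:
    """Return values that contain a key as a substring but are not equal to it."""
    contains_key = pd.Series(False, index=values.index)
    for key in keys:
        # regex=False treats the key as literal text; na=False skips missing values
        contains_key |= values.str.contains(key, regex=False, na=False)
    return values[contains_key & ~values.isin(list(keys))]


print(near_misses(raw, dict1).tolist())
```

Any value this returns still needs cleaning before an exact-match replace(dict1) can rename it.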
