2017-07-30 97 views
1

pandas.replace conflicts with str.replace regex: order of code

My task is to remove anything in parentheses, strip any digits that follow a country name, and rename some countries.

For example, 'Bolivia (Plurinational State of)' should become 'Bolivia', and 'Switzerland17' should become 'Switzerland'.

My original code was in this order:

dict1 = { 
"Republic of Korea": "South Korea", 
"United States of America": "United States", 
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom", 
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

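# dictionary replace first: keys must equal the whole cell value
energy['Country'] = energy['Country'].replace(dict1)
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']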

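The str.replace part works fine; that task is done. I use the last line to check whether the country names were changed successfully, and with this original order they were not. However, if I change the order of the code to: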

# clean the strings first, then apply the dictionary
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)

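then the country names are changed successfully. So there must be something wrong with my regex syntax. How do I resolve this conflict, and why does it happen?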

+1

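There doesn't seem to be a conflict. You need to remove the unwanted parts of the strings first, then replace using the dictionary. The first order doesn't work because no dictionary key matches. – jezrael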
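
To illustrate the comment above, here is a minimal sketch of how a dictionary passed to `Series.replace` behaves with and without `regex=True`. The two sample values are hypothetical and chosen only to show the matching rule.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the matching rule.
s = pd.Series(["United States of America", "United States of America20"])
mapping = {"United States of America": "United States"}

# Without regex=True, a dict key must equal the whole cell value,
# so the second entry (with trailing digits) is left untouched.
whole = s.replace(mapping)

# With regex=True, each key is treated as a regular expression and
# matched anywhere in the cell; the trailing digits remain.
partial = s.replace(mapping, regex=True)

print(whole.tolist())
print(partial.tolist())
```

This is why the uncleaned names never matched a dictionary key in the original order.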

+0

Sorry, I don't understand. All I did was move the energy['Country'] = energy['Country'].replace(dict1) line. I didn't edit anything in the str.replace part. Why does it suddenly work? – Dylan

+0

Please check my answer. – jezrael

Answer

3

The problem is that replace needs regex=True to replace substrings; without it, a dictionary key must match the entire value.

import pandas as pd

energy = pd.DataFrame({'Country':['United States of America4',
            'United States of America (aaa)','Slovakia']})
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

dict1 = { 
"Republic of Korea": "South Korea", 
"United States of America": "United States", 
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom", 
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

#nothing is replaced because no key matches the whole value (digits and parentheses are still there)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

#cleaning afterwards removes the extras, but the names were never renamed
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []

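#starting again from the original sample data: regex=True lets the keys match substrings
energy = pd.DataFrame({'Country':['United States of America4',
            'United States of America (aaa)','Slovakia']})
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
               Country
0       United States4
1  United States (aaa)
2             Slovakia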

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States

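#alternative, again from the original sample data: clean first
energy = pd.DataFrame({'Country':['United States of America4',
            'United States of America (aaa)','Slovakia']})
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia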

#now the plain dictionary replace works, because the keys match the whole values
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
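
The clean-first order can also be wrapped in a small helper so the steps cannot be run out of sequence. This is only a sketch: `clean_country` is a hypothetical name, not part of pandas, and it reuses the two patterns from the question unchanged.

```python
import pandas as pd

def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Strip ' (...)' suffixes and digits, then rename whole values."""
    cleaned = (
        names.str.replace(r" \(.*\)", "", regex=True)  # drop parenthesised text
             .str.replace(r"\d+", "", regex=True)       # drop footnote digits
    )
    # Plain dictionary replace: each key must equal the whole cleaned value.
    return cleaned.replace(renames)

sample = pd.Series(["United States of America4",
                    "United States of America (aaa)",
                    "Slovakia"])
result = clean_country(sample, {"United States of America": "United States"})
print(result.tolist())
```

Doing the renaming last keeps the dictionary keys as exact names, which avoids treating them as regular expressions.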
+0

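Thanks! I assumed the data already contained the clean name 'United States of America'. After reading your answer, I checked with energy.loc[energy['Country'].str.contains('^United', na=False)] and found that the original value was 'United States of America20'; no wonder it found no match. – Dylan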
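
The check described in this comment can be sketched as follows. The frame below is a hypothetical stand-in for the real data; the missing value is included only to show what `na=False` does.

```python
import pandas as pd

# Hypothetical stand-in for the real data, including one missing value.
energy = pd.DataFrame({"Country": ["United States of America20", "Slovakia", None]})

# Rows whose name starts with "United"; na=False counts missing values as non-matches.
mask = energy["Country"].str.contains(r"^United", na=False)
suspects = energy.loc[mask, "Country"].tolist()
print(suspects)
```

Listing the raw values this way shows any trailing digits or parenthesised text that would stop a whole-value dictionary key from matching.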

+0

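Glad I could help! If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/a/5235/295067) it: clicking the check mark ('✓') next to the answer toggles it from greyed out to filled in. Thanks. – jezrael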