2017-10-17 77 views

Standardizing values in a DataFrame column

I have a DataFrame, df, that looks like this:

id  colour  response
1   blue    curent
2   red     loaning
3   yellow  current
4   green   loan
5   red     currret
6   green   loan

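As you can see, the values in the response column are not consistent, and I would like to snap them to a standardized set of responses.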

I also have a validation list, validate, which looks like this:

validate 
current 
loan
transfer 

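I want to standardize the response column of df by matching the first three characters of each entry against the validate list, so the final output would look like this: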

id  colour  response
1   blue    current
2   red     loan
3   yellow  current
4   green   loan
5   red     current
6   green   loan

I tried using fnmatch:

pattern = 'cur*' 
fnmatch.filter(df, pattern) = 'current' 

but I could not change the values in df.
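One likely reason the attempt cannot work: `fnmatch.filter` returns a new list, which cannot be assigned to, and iterating over a DataFrame yields its column labels rather than its cell values. A minimal sketch of a per-value wildcard test follows, assuming the sample data from the question. `fnmatchcase` is used here so that matching is case-sensitive on every platform.

```python
import fnmatch

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "colour": ["blue", "red", "yellow", "green", "red", "green"],
    "response": ["curent", "loaning", "current", "loan", "currret", "loan"],
})

# Boolean mask: True where the response matches the wildcard pattern
mask = df["response"].map(lambda value: fnmatch.fnmatchcase(value, "cur*"))

# Overwrite only the matching rows in place
df.loc[mask, "response"] = "current"
```

This handles one pattern at a time, so each entry in the validation list would need its own pattern and assignment.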

Any assistance would be greatly appreciated.

Thanks

Answers

2

You can use map:

In [3664]: mapping = dict(zip(s.str[:3], s)) 

In [3665]: df.response.str[:3].map(mapping) 
Out[3665]: 
0    current
1       loan
2    current
3       loan
4    current
5       loan
Name: response, dtype: object 

In [3666]: df['response2'] = df.response.str[:3].map(mapping) 

In [3667]: df 
Out[3667]: 
   id  colour  response  response2
0   1    blue    curent    current
1   2     red   loaning       loan
2   3  yellow   current    current
3   4   green      loan       loan
4   5     red   currret    current
5   6   green      loan       loan

where s is a Series of the validation values:

In [3650]: s 
Out[3650]: 
0     current
1        loan
2    transfer
Name: validate, dtype: object 

Details:

In [3652]: mapping 
Out[3652]: {'cur': 'current', 'loa': 'loan', 'tra': 'transfer'} 

mapping can also be a Series, indexed by the three-character prefix:

In [3678]: pd.Series(s.values, index=s.str[:3].values) 
Out[3678]: 
cur     current
loa        loan
tra    transfer
dtype: object 

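Thanks, it works for the values that are in the validation dictionary. If for some reason the response column contains a value that is not in the dictionary (say 'transfer'), is there a way to flag it? Thanks again – Stacey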
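A sketch of one way to flag such values, assuming the `map` approach from the answer above: a prefix with no entry in `mapping` comes back as `NaN`, so `isna()` marks the rows that could not be standardized. The sample value `'unknown'` and the column name `unmatched` are made up for illustration.

```python
import pandas as pd

# Validation values, as in the answer
s = pd.Series(["current", "loan", "transfer"], name="validate")

# Hypothetical responses; 'unknown' has no matching prefix
df = pd.DataFrame({"response": ["curent", "loaning", "unknown", "loan"]})

# Three-character prefix -> full validation value
mapping = dict(zip(s.str[:3], s))

# Prefixes missing from the mapping become NaN
df["response2"] = df["response"].str[:3].map(mapping)

# True for rows that could not be standardized
df["unmatched"] = df["response2"].isna()
```

The flagged rows can then be inspected with `df[df["unmatched"]]`, or filled with a placeholder via `fillna`.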

0

Fuzzy matching?

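from fuzzywuzzy import process 

# val is assumed to be a DataFrame whose 'validate' column holds the valid values 
# extractOne returns the best match as a tuple; element [0] is the matched string 
df['response2'] = [process.extractOne(x, val.validate)[0] for x in df.response] 
df 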
Out[867]: 
   id  colour  response  response2
0   1    blue    curent    current
1   2     red   loaning       loan
2   3  yellow   current    current
3   4   green      loan       loan
4   5     red   currret    current
5   6   green      loan       loan
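
If installing fuzzywuzzy is not an option, the standard library offers a comparable (though not identical) similarity search. Below is a minimal sketch using `difflib.get_close_matches`. The helper name `closest`, the sample input `'xyz'`, and the `cutoff` of 0.6 are illustrative choices, not part of the answer above. Inputs scoring below the cutoff return `None`, so they can be flagged.

```python
import difflib

# Validation values from the question
valid = ["current", "loan", "transfer"]


def closest(value, choices=valid, cutoff=0.6):
    """Return the most similar choice, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Responses from the question plus one hypothetical unmatched value
responses = ["curent", "loaning", "current", "loan", "currret", "xyz"]
standardized = [closest(r) for r in responses]
```

With a DataFrame, the same helper could be applied as `df['response2'] = df['response'].map(closest)`.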