2016-01-20 93 views
0

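Select only the non-ASCII characters: Ümeå. Using Python, if I have a string called mystring that holds the string Ümeå, I want to store its non-ASCII characters (Ü and å) in a list.

Below is my code. It almost works, but the list ends up containing hex-escaped bytes (e.g. \xc3\xa6) rather than properly encoded characters: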

try:
    mystring.iloc[i].decode('ascii')
    i+=1
except:
    nonascii_string = str(mystring.iloc[i])
    j=0
    #now we've found the string, isolate the non ascii characters
    for item in str(profile_data_nonascii_string):
        try:
            str(nonascii_string[j].decode('ascii'))
            j+=1
        except:
            # PROBLEM: Need to work out how to encode back to proper UTF8 values
            nonascii_chars_list.append(str(nonascii_string[j]))
            j+=1
    i+=1
    pass

I think what I need to do is something like this:

chr(profile_data_nonascii_string[j].encode('utf-8')) 

But of course this only selects the first byte of my multi-byte character (and therefore raises an error). I'm sure there is a simple solution... :-|
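A minimal sketch of why single bytes show up, assuming the stored values are UTF-8 encoded byte strings (the question does not state the encoding, and `non_ascii_chars` is a hypothetical helper name). Iterating over an encoded string walks it byte by byte, so every multi-byte character is split. Decoding to Unicode first keeps each character whole:

```python
# -*- coding: utf-8 -*-
# Assumption: the stored values are UTF-8 encoded byte strings.
raw = u'Ümeå'.encode('utf-8')

def non_ascii_chars(data, encoding='utf-8'):
    """Return the non-ASCII characters of `data` as a list (hypothetical helper)."""
    # Decode bytes to text first so multi-byte characters stay intact.
    text = data.decode(encoding) if isinstance(data, bytes) else data
    return [ch for ch in text if ord(ch) > 127]

chars = non_ascii_chars(raw)
print(len(raw))  # counts bytes, not characters
print(chars)
```

The list then holds whole characters, which can be encoded back to UTF-8 one at a time if bytes are needed again.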

+0

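Please reduce your code to a short, **complete** program that demonstrates the problem. Copy and paste that program in its entirety into your question. See [ask] and [mcve] for more information. –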

+0

Is this Python 3? –

+0

Use ['codec.decode(string, errors='ignore')'](https://docs.python.org/2/library/codecs.html#codec-base-classes) –

Answers

0

If you want to remove characters from a string, you can create a mapping and str.translate them:

In [29]: tbl = dict.fromkeys(range(128), u"") 

In [30]: s = u'Ümeå' 

In [31]: print(s.translate(tbl)) 
Üå 

In pandas, which you seem to be using, you can use pandas.Series.str.translate:

Series.str.translate(table, deletechars=None)

Map all characters in the string through the given mapping table. Equivalent to standard str.translate(). Note that the optional argument deletechars is only valid if you are using python 2. For python 3, character deletion should be specified via the table argument.

translate will be more efficient than str.join:

In [7]: s = 'Ümeå' * 1000 

In [8]: timeit ''.join([x for x in s if ord(x) > 127]) 
1000 loops, best of 3: 489 µs per loop 

In [9]: timeit s.translate(tbl) 
1000 loops, best of 3: 289 µs per loop 
In [10]: s.translate(tbl) == ''.join([x for x in s if ord(x) > 127]) 
Out[10]: True 
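For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard-library `timeit` module. The function names and the repetition count of 200 are arbitrary choices for illustration, and the timings will vary by machine and Python version, so none are quoted here:

```python
# -*- coding: utf-8 -*-
import timeit

s = u'Ümeå' * 1000
tbl = dict.fromkeys(range(128), u'')  # map every ASCII code point to ''

def with_join():
    # Keep only characters whose code point is above the ASCII range.
    return u''.join([x for x in s if ord(x) > 127])

def with_translate():
    # Remove every character that appears as a key in the table.
    return s.translate(tbl)

same = with_join() == with_translate()
join_time = timeit.timeit(with_join, number=200)
translate_time = timeit.timeit(with_translate, number=200)
print(same, join_time, translate_time)
```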

For pandas on Python 2, you need to pass None as the table and supply deletechars:

In [2]: import pandas as pd 

In [3]: raw_data = {'Name' : pd.Series(['david','åndrëw','calvin'], index=['a', 'b', 'c'])} 

In [4]: df = pd.DataFrame(raw_data, columns = ['Name']) 

In [5]: delete = "".join(map(chr,range(128))) 

In [6]: print df['Name'].str.translate(None, delete) 
a  
b åë 
c  
Name: Name, dtype: object 

Using a dict works fine on Python 3:

In [9]: import pandas as pd 

In [10]: raw_data = {'Name' : pd.Series(['david','åndrëw','calvin'], index=['a', 'b', 'c'])} 

In [11]: df = pd.DataFrame(raw_data, columns = ['Name'])

In [12]: delete = dict.fromkeys(range(128), "") 

In [13]: df['Name'].str.translate(delete) 
Out[13]: 
a  
b åë 
c  
Name: Name, dtype: object 

The different approaches needed for each Python version are documented:

Parameters:

table : dict (python 3), str or None (python 2)
    In python 3, table is a mapping of Unicode ordinals to Unicode ordinals, strings, or None. Unmapped characters are left untouched. Characters mapped to None are deleted. str.maketrans() is a helper function for making translation tables. In python 2, table is either a string of length 256 or None. If the table argument is None, no translation is applied and the operation simply removes the characters in deletechars. string.maketrans() is a helper function for making translation tables.
deletechars : str, optional (python 2)
    A string of characters to delete. This argument is only valid in python 2.
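As a small illustration of the Python 3 side of that description, the sketch below builds the same deletion table two ways: with the `str.maketrans()` helper (its third argument lists characters to map to None) and with a plain dict. The variable names are only for illustration, and the snippet is Python 3 only:

```python
# Python 3 only: str.maketrans() does not exist on Python 2.
ascii_chars = ''.join(map(chr, range(128)))

# Three-argument form: every character in the third argument maps to None.
tbl_from_maketrans = str.maketrans('', '', ascii_chars)

# Equivalent table built by hand: code point -> None means "delete".
tbl_from_dict = dict.fromkeys(range(128))

result = 'Ümeå'.translate(tbl_from_maketrans)
print(result)
```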

+0

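Thanks for your answer. I tried to use series.str.translate but got the error "AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas"... any reason why? I can't seem to get this working in an IPython notebook; it seems to convert my series to a float type and reports all values as NaN (although your str.translate works).... I'd appreciate the help – Calamari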

+0

It would help if you added a snippet of the df and your desired result –

+0

**import pandas as pd** \n **raw_data = {'Name': pd.Series(['david','åndrëw','calvin'], index=['a','b','c'])}** \n **df = pd.DataFrame(raw_data, columns=['Name'])** \n **tbl = dict.fromkeys(range(128), u"")** \n Then I try: **df['Name'].str.translate(tbl)** but it returns NaN values... not sure what I'm doing wrong.. – Calamari

2

Here is an example of how I isolate the non-ASCII characters from a string:

In [7]: s=u'Ümeå' 

In [8]: print s 
Ümeå 

In [9]: s2 = u''.join(x for x in s if ord(x) > 126) 

In [10]: print s2 
Üå 

Or, if you would prefer your answer as a list:

In [15]: s=u'Ümeå' 

In [16]: print s 
Ümeå 

In [17]: s2 = list(x for x in s if ord(x) > 126) 

In [18]: print s2[0] 
Ü 

In [19]: print s2[1] 
å 
+1

ASCII has 128 characters, so you need 'ord(x) > 127'. – ekhumoro

+0

ASCII 127 is DEL. I assumed the OP meant "printable ASCII", hence my 'ord(x) > 126'. In any case, the OP should take your comment into account when adapting my code. –

+0

In that case, you need 'not(31 ekhumoro
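To make the difference between these thresholds concrete, here is a small sketch comparing three filters on a sample string that also contains DEL and a tab. The "printable ASCII" range of 32 to 126 used below is an assumption for illustration, not something recovered from the truncated comment above:

```python
# -*- coding: utf-8 -*-
s = u'Ümeå\x7f\t'  # sample text plus DEL (127) and TAB (9)

# Strictly non-ASCII: code points outside 0..127.
non_ascii = [c for c in s if ord(c) > 127]

# Threshold from the answer: also catches DEL (127).
above_126 = [c for c in s if ord(c) > 126]

# Assumed "not printable ASCII": anything outside 32..126,
# so control characters such as TAB are caught as well.
not_printable = [c for c in s if not (31 < ord(c) < 127)]

print(non_ascii, above_126, not_printable)
```

Which filter is right depends on whether control characters should count as "non-ASCII" for the task at hand.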
