IMDb review encoding error

I am trying to build an RNN model that classifies reviews as expressing positive or negative sentiment.

There is a vocabulary of words, and during preprocessing I convert each review into a sequence of vocabulary indices.
For example,

"This movie was best" --> [2,5,10,3]

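A minimal sketch of that mapping, using a made-up toy vocabulary (the word list, `UNKNOWN_TOKEN`, and the helper name `encode_review` are assumptions for illustration only; the real vocabulary is built from the IMDb reviews):

```python
# Hypothetical toy vocabulary, ordered so that the indices match the example above.
UNKNOWN_TOKEN = "UNKNOWN_TOKEN"
index_to_word = ["the", "a", "this", "best", "and", "movie",
                 "of", "is", "it", "i", "was", UNKNOWN_TOKEN]
word_to_index = {w: i for i, w in enumerate(index_to_word)}


def encode_review(text):
    """Map each lower-cased, whitespace-separated token to its vocabulary index."""
    unk = word_to_index[UNKNOWN_TOKEN]
    # Words missing from the vocabulary fall back to the unknown-token index.
    return [word_to_index.get(tok, unk) for tok in text.lower().split()]


print(encode_review("This movie was best"))  # [2, 5, 10, 3]
```
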
When I take the most frequent vocabulary entries and try to print them, I get this error:

num of reviews 100 
number of unique tokens : 4761 
Traceback (most recent call last): 
  File "preprocess.py", line 47, in <module>
    print(vocab)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)

The code is as follows:

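import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token are assumed to be defined elsewhere in the script.
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    with open("imdbdata/train/pos/" + item, 'r', encoding='utf-8') as f:
        # Strip HTML markup from the raw review text.
        sample = BeautifulSoup(f.read(), 'html.parser').get_text()
    sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d" % len(word_freq))
# Keep the most frequent words, reserving one slot for the unknown token.
vocab = word_freq.most_common(vocab_size - 1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = {w: i for i, w in enumerate(index_to_word)}
print(vocab)  # the call that raises UnicodeEncodeError in the traceback above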

My question is: how can I get rid of this UnicodeEncodeError when working on NLP problems in Python, especially when reading text with the open function?

Source

2017-10-09 Peter Kim

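It looks like your terminal is configured for ASCII. Because the character '\xe9' lies outside the ASCII range (0x00-0x7F), it cannot be printed on an ASCII terminal. It cannot be encoded as ASCII either: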

>>> s = '\xe9' 
>>> s.encode('ascii') 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

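You can work around this by explicitly encoding the string when you print it, and handling encoding errors by replacing unsupported characters with ?: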

>>> print(s.encode('ascii', errors='replace')) 
b'?'
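
If replacing characters with ? loses too much information, the other standard error handlers of `str.encode` may be worth a look. The following is only a sketch; the sample string `s` is an arbitrary example, not taken from the IMDb data:

```python
s = "caf\xe9"  # arbitrary sample containing the problematic character

# 'replace' substitutes a question mark for each unencodable character.
print(s.encode("ascii", errors="replace"))            # b'caf?'

# 'ignore' silently drops unencodable characters.
print(s.encode("ascii", errors="ignore"))             # b'caf'

# 'backslashreplace' keeps the information as an escape sequence.
print(s.encode("ascii", errors="backslashreplace"))   # b'caf\\xe9'

# ascii() returns a printable, ASCII-only repr of any object.
print(ascii(s))                                       # 'caf\xe9'
```

The last two variants keep enough information to tell which character was in the original text.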

The character appears to be the lowercase letter e with an acute accent (é), which is byte 0xE9 in ISO-8859-1.

You can check which encoding is used for standard output. In my case it is UTF-8, and I have no problem printing that character:

>>> import sys 
>>> sys.stdout.encoding 
'UTF-8' 
>>> print('\xe9') 
é

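You might be able to force Python to use a different default encoding; there is some discussion of that elsewhere, but the best approach is to use a terminal that supports UTF-8.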
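
As a rough sketch of what forcing the encoding could look like: on Python 3.7 or later, text streams have a `reconfigure()` method, and setting the `PYTHONIOENCODING=utf-8` environment variable before starting the interpreter is another option. The helper name `force_utf8_stdout` below is made up for illustration. Note that a terminal which really is ASCII-only may still display the UTF-8 bytes as garbage, even though the exception goes away:

```python
import sys


def force_utf8_stdout():
    """Switch sys.stdout to UTF-8 if the stream supports reconfigure().

    Returns True if the stream was reconfigured, False otherwise
    (for example on Python < 3.7, or if stdout has been replaced by
    an object without a reconfigure() method).
    """
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        return False
    # errors="replace" guards against anything that still cannot be encoded.
    reconfigure(encoding="utf-8", errors="replace")
    return True


if force_utf8_stdout():
    print("\xe9")
```

Calling this once near the top of the script, before the first print, should be enough.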

Source

2017-10-09 10:41:17 mhawke

This is the answer I was looking for! Thank you.
