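IMDB review encoding error
I am trying to build an RNN model that classifies reviews as positive or negative sentiment.
There is a vocabulary dictionary, and during preprocessing I convert each review into a sequence of indices.
For example,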
"This movie was best" --> [2,5,10,3]
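To make the mapping above concrete, here is a minimal sketch of how tokens might be turned into index sequences. The helper names `build_vocab` and `encode`, the `<unk>` token, and the toy reviews are all hypothetical and are not taken from the script in this question; the actual indices depend on the token frequencies in the real data.

```python
from collections import Counter

UNKNOWN = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def build_vocab(tokenized_reviews, vocab_size):
    """Map the (vocab_size - 1) most frequent tokens, plus UNKNOWN, to indices."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    index_to_word = [word for word, _ in counts.most_common(vocab_size - 1)]
    index_to_word.append(UNKNOWN)
    return {word: i for i, word in enumerate(index_to_word)}


def encode(tokens, word_to_index):
    """Replace each token by its index, falling back to the UNKNOWN index."""
    unk = word_to_index[UNKNOWN]
    return [word_to_index.get(tok, unk) for tok in tokens]


# Toy data, only for illustration.
reviews = [["this", "movie", "was", "best"], ["this", "movie", "was", "bad"]]
word_to_index = build_vocab(reviews, vocab_size=4)
encoded = encode(["this", "movie", "was", "awful"], word_to_index)
```

Here "awful" never appears in the toy reviews, so it is mapped to the index reserved for the unknown token.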
When I try to get the most frequent vocabulary entries and print their contents, I get this error:
num of reviews 100
number of unique tokens : 4761
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    print(vocab)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)
The code is as follows:
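import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token are assumed to be defined earlier in the script (not shown).
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    with open("imdbdata/train/pos/" + item, 'r', encoding='utf-8') as f:
        # Strip HTML markup, then lowercase and tokenize the review text.
        sample = BeautifulSoup(f.read()).get_text()
        sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))

# Count token frequencies across all reviews.
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d" % (len(word_freq.items())))

# Keep the most frequent tokens and reserve the last index for unknown words.
vocab = word_freq.most_common(vocab_size - 1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict((w, i) for i, w in enumerate(index_to_word))
print(vocab)  # line 47 in the traceback; this is where UnicodeEncodeError is raised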
The question is: how can I get rid of this UnicodeEncodeError when working on NLP problems in Python? Especially when using the open function to read the text.
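Note that the traceback points at print(vocab), not at open: the files are already read as UTF-8, and the failure happens when the text is written to an output stream whose encoding is ASCII. The following is a minimal sketch that reproduces this with an in-memory stream instead of a real terminal. The sample `vocab` and the helper `print_to` are hypothetical and serve only to illustrate the behavior.

```python
import io

# Hypothetical sample containing '\xe9', the character named in the traceback.
vocab = [("caf\u00e9", 3), ("movie", 2)]


def print_to(encoding, errors="strict"):
    """Print vocab to an in-memory text stream that uses the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    print(vocab, file=stream)
    stream.flush()
    return raw.getvalue()


# An ASCII stream fails in the same way as the traceback above.
try:
    print_to("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A UTF-8 stream can encode the character without trouble.
utf8_bytes = print_to("utf-8")

# Alternatively, keep ASCII but escape anything it cannot represent.
escaped_bytes = print_to("ascii", errors="backslashreplace")
```

In the actual script, the same idea would apply to sys.stdout. Assuming Python 3.7 or later, calling sys.stdout.reconfigure(encoding="utf-8") before printing is one option, and setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another. Whether the terminal then displays the characters correctly depends on the terminal itself.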
This is the answer I was looking for! Thanks.