2016-09-16 84 views

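UnicodeDecodeError while reading a custom-created corpus in NLTK

I created a custom corpus for detecting sentence polarity using the nltk module. Here is the corpus hierarchy: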

polar
--polar
---- polar_tweets.txt
--nonpolar
---- nonpolar_tweets.txt

Here is how I load the corpus in my source code:

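from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

polarity = LazyCorpusLoader('polar', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(polar|nonpolar)/.*', encoding='utf-8')
corpus = polarity
print(corpus.words(fileids=['nonpolar/nonpolar_tweets.txt']))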

But it raises the following error:

Traceback (most recent call last): 
    File "E:/Analytics Practice/Social Media Analytics/analyticsPlatform/DataAnalysis/SentimentAnalysis/data/training_testing_data.py", line 9, in <module> 
    print(corpus.words(fileids=['nonpolar/nonpolar_tweets.txt'])) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\util.py", line 765, in __repr__ 
    for elt in self: 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from 
    tokens = self.read_block(self._stream) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block 
    words.extend(self._word_tokenizer.tokenize(stream.readline())) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1135, in readline 
    new_chars = self._read(readsize) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1367, in _read 
    chars, bytes_decoded = self._incr_decode(bytes) 
    File "E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\data.py", line 1398, in _incr_decode 
    return self.decode(bytes, 'strict') 
    File "C:\Users\prabhjot.rai\AppData\Local\Continuum\Anaconda3\lib\encodings\utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 269: invalid continuation byte 

When creating the files polar_tweets.txt and nonpolar_tweets.txt, I decode the file uncleaned_polar_tweets.txt as utf-8 and then write the result to polar_tweets.txt. Here is that code:

with open(path_to_file, "rb") as file: 
    output_corpus = clean_text(file.read().decode('utf-8'))['cleaned_corpus'] 

output_file = open(output_path, "w") 
output_file.write(output_corpus) 
output_file.close() 

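where output_file is polar_tweets.txt or nonpolar_tweets.txt. Where is the error? I start out decoding as utf-8, and I also read the corpus as utf-8 with the line

polarity = LazyCorpusLoader('polar', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(polar|nonpolar)/.*', encoding='utf-8')

If I replace encoding='utf-8' with encoding='latin-1', the code works perfectly. What is the problem? Do I also need to specify utf-8 when creating the corpus?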


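Your terminology is off. When reading, you *decode* from an encoding. The error indicates that the corpus (or part of it) is not valid UTF-8. Without access to the offending data, we can only speculate. What does LC_ALL=C grep -m 1 $'\xC2' nonpolar_tweets.txt produce? (Maybe pipe it to xxd or similar to see the exact bytes.) – tripleee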


... or the equivalent in Python: read the offending line, then examine its repr() – tripleee
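The Python equivalent suggested in the comment could look roughly like the sketch below. The function name find_undecodable_lines is made up for illustration and is not part of NLTK or the standard library. It opens the file in binary mode so nothing is decoded implicitly, then tries to decode each line on its own.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes, error) for every line that fails to decode.

    Hypothetical helper for illustration; the name is an assumption.
    """
    bad = []
    # Binary mode: we get the raw bytes exactly as stored on disk.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                bad.append((number, raw, err))
    return bad
```

Calling it on nonpolar_tweets.txt and printing repr(raw) for each result shows the exact bytes around the position the traceback complains about.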

Answer


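What you need to understand is that in Python's model, unicode is a kind of data, whereas utf-8 is an encoding. They are not the same thing. You are reading your raw text, which is apparently in utf-8, cleaning it, and then writing it out to the new corpus without specifying an encoding. So you are writing it out in... who knows what encoding. When no encoding is given, open() falls back to a platform-dependent default; on Windows that is usually a legacy code page such as cp1252, which would also explain why reading the file back as latin-1 appears to work. Don't bother finding out: just clean the text and generate the corpus again, this time specifying the utf-8 encoding.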
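To make the distinction between data and encoding concrete, here is a small in-memory sketch. The sample string is invented, and cp1252 is only an assumption about what the Windows default might have been on the asker's machine. The same text is encoded two ways, and each byte string is then decoded with different codecs.

```python
text = "na\u00efve caf\u00e9"          # str: a sequence of Unicode code points

utf8_bytes = text.encode("utf-8")      # b'na\xc3\xafve caf\xc3\xa9'
legacy_bytes = text.encode("cp1252")   # b'na\xefve caf\xe9' (assumed default)


def try_decode(data, encoding):
    """Return the decoded str, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return err


# Bytes written in a legacy code page are generally not valid UTF-8.
print(repr(try_decode(legacy_bytes, "utf-8")))
# latin-1 maps every byte to some character, so it never raises ...
print(repr(try_decode(legacy_bytes, "latin-1")))
# ... but applied to real UTF-8 bytes it silently produces mojibake.
print(repr(try_decode(utf8_bytes, "latin-1")))
```

The fact that latin-1 never raises is why switching the reader to encoding='latin-1' hides the error without proving that the file is stored correctly.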

I hope you are doing all this in Python 3; if not, stop right here and switch to Python 3. Then write out the corpus like this:

output_file = open(output_path, "w", encoding="utf-8") 
output_file.write(output_corpus) 
output_file.close() 
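Putting both halves together, the whole regeneration step might be sketched as follows. The function regenerate_corpus and its clean parameter are hypothetical; clean stands in for whatever the question's clean_text(...)['cleaned_corpus'] does, which is not shown in the post. The point is only that both the read and the write name utf-8 explicitly, and that the with blocks close the files even if cleaning raises.

```python
def regenerate_corpus(input_path, output_path, clean=lambda text: text):
    """Read UTF-8 text, clean it, and write it back out as UTF-8.

    Hypothetical helper: `clean` is a placeholder for the asker's own
    cleaning function and defaults to leaving the text unchanged.
    """
    # Decode explicitly on the way in ...
    with open(input_path, "r", encoding="utf-8") as source:
        raw_text = source.read()

    cleaned = clean(raw_text)

    # ... and encode explicitly on the way out, so the platform default
    # never gets a say in how the corpus is stored.
    with open(output_path, "w", encoding="utf-8") as target:
        target.write(cleaned)
    return cleaned
```

After regenerating both polar_tweets.txt and nonpolar_tweets.txt this way, the corpus reader with encoding='utf-8' should be reading bytes that were actually written as UTF-8.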

Thanks for the clarification :) –