Parsing Japanese with Python


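***** Edited to include the complete code *****

I want to parse some Japanese text using Python (version 3.5.3) and the MeCab library on macOS.

I have a TXT file with the following text: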

石の上に三年

I set my text editor's preferences to save using UTF-8, so I believe the file is being saved correctly as UTF-8.

I get the following error:

Traceback (most recent call last):
  File "japanese.py", line 29, in <module>
    words = extractMetadataFromTXT(fileName)
  File "japanese.py", line 14, in extractMetadataFromTXT
    md = extractWordsJP(data)
  File "japanese.py", line 22, in extractWordsJP
    components.append(parsed.surface)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte

Below is my full code. Nothing is missing.

import MeCab
import nltk
from nltk import *
from nltk.corpus import knbc

mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
wordsList = knbc.words()
fdist = nltk.FreqDist(w.lower() for w in wordsList)

def extractMetadataFromTXT(filePath):
    with open(filePath, 'r', encoding='utf-8') as f:
        data = f.read()
        print(data)
    md = extractWordsJP(data)
    print(md)
    return md

def extractWordsJP(wordsJP):
    components = []
    parsed = mt.parseToNode(wordsJP)
    while parsed:
        components.append(parsed.surface)
        parsed = parsed.next
    return components

if __name__ == "__main__":
    fileName = "simple_japanese.txt"
    words = extractMetadataFromTXT(fileName)
    print(words)

Does anyone have any clue why I am getting this error message?

Fun fact: sometimes it works. :o

Thanks in advance,

Israel

Source

2017-06-19 israel.zinc

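The error can only be caused by an encoding problem, so your TextEdit settings may not be taking effect. From a shell, 'cd' to the directory containing the input file and type 'file simple_japanese.txt'. It should say 'UTF-8 Unicode text'. – polm23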
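The same check can also be done from Python, which avoids relying on the editor's settings. This is only a minimal sketch: the helper name is_valid_utf8 is hypothetical, and the file name is the one from the question.

```python
import os


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:  # binary mode: no implicit decoding
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte
        print("invalid UTF-8 at byte offset", err.start)
        return False
    return True


if __name__ == "__main__":
    name = "simple_japanese.txt"  # file name taken from the question
    if os.path.exists(name):
        print(is_valid_utf8(name))
```

If this returns True, the input file itself is probably not the source of the error, and the problem lies somewhere after the file has been read.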

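Solution:

Apparently, the problem is with MeCab, not with the Python code itself. The issue is that when you install it from scratch using make, it sometimes does not install correctly, yet it does not raise any errors.

I don't know why, but if you want to dig deeper and find out exactly what is happening, that would be great. All I know is that I uninstalled and reinstalled it using brew, and it worked.

A similar thing happened on other Macs in the office. I am using brew on OS X, so I will post the command I used to install it correctly:

brew install mecab mecab-ipadic git curl xz

Also, to install it on Linux, use the following commands: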

sudo apt-get install mecab libmecab-dev mecab-ipadic 
sudo apt-get install mecab-ipadic-utf8 
sudo apt-get install python-mecab
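After reinstalling, a quick sanity check is to confirm which MeCab tools are actually on the PATH. The sketch below is an illustration, not a verification that the build is correct: the helper name find_mecab_tools is hypothetical, and it assumes mecab-config was installed alongside mecab.

```python
import shutil
import subprocess


def find_mecab_tools(names=("mecab", "mecab-config")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    tools = find_mecab_tools()
    for name, path in tools.items():
        print(name, "->", path or "not found on PATH")
    if tools["mecab-config"]:
        # "mecab-config --dicdir" prints the directory where dictionaries live
        result = subprocess.run(
            [tools["mecab-config"], "--dicdir"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        print("dictionary directory:", result.stdout.strip())
```

If two different installations show up (for example one built with make and one from brew), that may be worth cleaning up before debugging further.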

I hope this helps people in the future who are trying to tokenize Japanese words.

Source

2017-08-10 02:48:15

When you open the file, specify the encoding:

with open(file, 'r', encoding='utf-8') as f: 
    data = f.read() 

...

By the way, when opening a file, use a context manager as shown in this example.

Source

2017-06-19 08:49:14

TypeError: an integer is required (got type str) –

That should read 'encoding='utf-8'', although it requires Python 3 and probably won't solve the problem. –
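The TypeError above is consistent with the encoding being passed as the third positional argument: in Python 3's built-in open(), the third positional parameter is buffering, which must be an integer. A minimal sketch of the difference, using a throwaway temporary file (an assumption made purely for illustration):

```python
import os
import tempfile

# Create a small UTF-8 file to read back.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("石の上に三年")

# Wrong: 'utf-8' lands in the buffering slot, which expects an int.
try:
    open(path, "r", "utf-8")
    positional_failed = False
except TypeError as err:
    positional_failed = True
    print("TypeError:", err)

# Right: pass the encoding by keyword.
with open(path, "r", encoding="utf-8") as f:
    data = f.read()
print(data)
```

The exact wording of the TypeError message varies between Python versions.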

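Thanks Yann, I was missing the encoding='utf-8' part. However, as you expected, it did not solve the problem. Sometimes it works and sometimes it doesn't. I need something stable, not random. –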

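The error happens because you are feeding invalid UTF-8 to a UTF-8 decoder. This can be caused by splitting on bytes rather than characters, or by mistakenly trying to decode another encoding, such as JIS or EUC, as if it were UTF-8. In Python, sticking to unicode strings is usually the sensible approach, and the system may decode text files for you if the locale parameters are set. Even with proper unicode strings, segmentation is a non-trivial problem, because there are code points that modify other characters, such as accents. Fortunately, Japanese does not have much of this (unless someone happens to encode something as ho + ring, etc.).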
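Both points can be seen in a few lines of Python. This is an illustrative sketch using the sentence from the question, not a reproduction of what MeCab does internally: the first part cuts a UTF-8 byte string in the middle of a character, and the second shows a base character plus a combining mark ("ho + ring") that normalizes to a single code point.

```python
import unicodedata

text = "石の上に三年"
raw = text.encode("utf-8")  # each of these characters takes 3 bytes in UTF-8

# Slicing characters is safe; slicing bytes can cut a character in half.
print(text[:2])
try:
    raw[:4].decode("utf-8")  # 3 bytes of the first character + 1 stray byte
    split_ok = True
except UnicodeDecodeError as err:
    split_ok = False
    print("decode failed:", err.reason)

# A base character followed by a combining mark is two code points
# but is displayed as one character.
# U+307B HIRAGANA LETTER HO + U+309A COMBINING SEMI-VOICED SOUND MARK
decomposed = "\u307b\u309a"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

Normalizing input to NFC before segmentation is one way to avoid splitting a base character from its mark.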

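One potential problem: MeCab's web page states (per Google Translate) "Unless otherwise specified, EUC is used." If MeCab is segmenting under the assumption that it is reading EUC, it will break UTF-8.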
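To see why an EUC/UTF-8 mismatch is fatal, it helps to compare the raw bytes. The sketch below only uses Python's built-in codecs ("euc_jp" and "utf-8"); it does not call MeCab, so it illustrates the kind of mismatch described above rather than MeCab's actual behavior.

```python
text = "石の上に三年"

utf8_bytes = text.encode("utf-8")
euc_bytes = text.encode("euc_jp")

# The same text yields different byte sequences in the two encodings.
print(len(utf8_bytes), len(euc_bytes))

# Decoding with the matching codec round-trips.
assert euc_bytes.decode("euc_jp") == text

# Decoding EUC-JP bytes as UTF-8 is the kind of mismatch that raises
# UnicodeDecodeError.
try:
    euc_bytes.decode("utf-8")
    mismatch_raised = False
except UnicodeDecodeError as err:
    mismatch_raised = True
    print("UnicodeDecodeError:", err.reason)
```

If the dictionary and the input text use different encodings, token boundaries computed on one byte layout will not line up with the other.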

Source

2017-06-19 09:12:03

I set TextEdit's configuration to save only in UTF-8, and the error is still there. :( –
