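*****EDITED WITH FULL CODE***** Parsing Japanese with Python
I'm trying to parse some Japanese text using Python (version 3.5.3) and the MeCab library on macOS.
I have a TXT file containing the following text: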
石の上に三年
I have set my text editor's preferences to save files as UTF-8, so I believe the file is being saved correctly in UTF-8 format.
I get the following error:
Traceback (most recent call last):
  File "japanese.py", line 29, in <module>
    words = extractMetadataFromTXT(fileName)
  File "japanese.py", line 14, in extractMetadataFromTXT
    md = extractWordsJP(data)
  File "japanese.py", line 22, in extractWordsJP
    components.append(parsed.surface)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte
Below is my entire code. Nothing is missing.
import MeCab
import nltk
from nltk import *
from nltk.corpus import knbc

mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
wordsList = knbc.words()
fdist = nltk.FreqDist(w.lower() for w in wordsList)

def extractMetadataFromTXT(filePath):
    with open(filePath, 'r', encoding='utf-8') as f:
        data = f.read()
    print(data)
    md = extractWordsJP(data)
    print(md)
    return md

def extractWordsJP(wordsJP):
    components = []
    parsed = mt.parseToNode(wordsJP)
    while parsed:
        components.append(parsed.surface)
        parsed = parsed.next
    return components


if __name__ == "__main__":
    fileName = "simple_japanese.txt"
    words = extractMetadataFromTXT(fileName)
    print(words)
Does anyone have any clue why I'm getting this error message?
Fun fact: sometimes it works. :o
Thanks in advance,
Israel
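The error can only be caused by an encoding problem, so your TextEdit settings are probably not taking effect. From a shell, 'cd' to the directory containing the input file and type 'file simple_japanese.txt'. It should say 'UTF-8 Unicode text'. – polm23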
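The same check can be done from Python by reading the raw bytes and trying to decode them. This is only a minimal sketch: check_utf8 is a hypothetical helper name, not part of MeCab or nltk, and it only tells you whether the file on disk is valid UTF-8. It does not inspect whatever MeCab hands back in parsed.surface.

```python
from pathlib import Path


def check_utf8(path):
    """Report whether the file's raw bytes decode as UTF-8.

    Returns (True, None) on success, or (False, offset) where offset
    is the byte position of the first undecodable byte.
    check_utf8 is a hypothetical helper written for this example.
    """
    raw = Path(path).read_bytes()  # read bytes, so no decoding happens yet
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return False, err.start
    return True, None
```

Calling check_utf8("simple_japanese.txt") from the directory that holds the file returns (True, None) if the file really was saved as UTF-8. If it does, the bad byte in the traceback is probably not coming from the file itself.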