Decoding/encoding with sklearn's load_files

I'm following the tutorial at https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb to learn about machine learning and text.

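In my case, I'm using tweets that I downloaded, with positive and negative tweets arranged in exactly the same directory structure the tutorial uses (I'm trying to learn sentiment analysis).

Here, in an IPython notebook, I load my data just as they do:

tweets_train = load_files('Path to my training Tweets')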

Then I try to fit them with CountVectorizer:

vect = CountVectorizer().fit(text_train)

I get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte

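Is this because my tweets contain all sorts of non-standard text? I haven't done any cleaning of my tweets (I assume there are libraries that help with that, so that a bag of words works?).

Edit:

Here is the code I used to download the tweets with Twython: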

def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user,
            count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet id's
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()


2017-05-26 Amanda_Panda

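This means that you either stored the data in an encoding other than UTF-8, or the data is corrupted in some way. Please provide details (i.e. code) on how you downloaded the tweets and saved them to disk. – lenz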

Please see the edit for the code used to download the tweets.

Can you also show how you get from 'tweets_train' to 'text_train'? – lenz

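You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding. If that doesn't mean anything to you, make sure you understand the basics of Unicode and text encodings, e.g. with the official Python Unicode HOWTO.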
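As a minimal illustration of such a mismatch (the sample text is made up, not taken from the question's tweets), the sketch below encodes a string as Latin-1 and then tries to decode the resulting bytes as UTF-8. Under that assumption it fails on the same byte value, 0xd8, that appears in the traceback above:

```python
# Hypothetical sample text; "Ø" is the single byte 0xd8 in Latin-1.
text = "Øl og mad"
raw = text.encode("latin-1")

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xd8 starts a two-byte sequence, but the next byte ("l")
    # is not a valid continuation byte, so decoding fails.
    error_message = str(exc)

print(error_message)

# Decoding with the encoding that was actually used round-trips cleanly.
decoded = raw.decode("latin-1")
print(decoded)
```

The bytes on disk are the same in both cases; only the encoding used to interpret them differs.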

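First, you need to find out which encoding was used to store the tweets on disk. When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. You can check this, for example, in an interactive session: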

>>> f = open('/tmp/foo', 'a') 
>>> f 
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>

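Here you can see that the default encoding in my local environment is set to UTF-8. You can also check directly which default encoding open() uses. In Python 3 this comes from the locale, whereas sys.getdefaultencoding() always reports 'utf-8' and does not reflect it:

>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'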

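There are also other ways to find out which encoding was used for the files. For example, if you happen to be working on a Unix platform, the Unix tool file is quite good at guessing the encoding of existing files.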
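If no such tool is at hand, a rough alternative is to try a short list of candidate encodings in Python and keep the first one that decodes the whole file. This is only a sketch: the helper name `first_matching_encoding`, the candidate list and the sample file are assumptions made for illustration. Latin-1 accepts every byte sequence, so it has to come last and a match on it only means that nothing stricter fit.

```python
import os
import tempfile


def first_matching_encoding(path, candidates=("utf-8", "latin-1")):
    """Return the first candidate that decodes the whole file, or None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right, try the next one
        return name
    return None


# Demo on a throwaway file written in Latin-1 (hypothetical sample text).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_user")
    with open(sample_path, "w", encoding="latin-1") as handle:
        handle.write("Øl og mad")
    guessed = first_matching_encoding(sample_path)

print(guessed)
```

A successful decode does not prove that the encoding is the right one, so it is worth looking at a few decoded lines before relying on the result.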

Once you think you know which encoding was used to write the files, you can specify it in the load_files() function:

tweets_train = load_files('path to tweets', encoding='latin-1')

...if you found that Latin-1 is the encoding used for the tweets; otherwise adjust accordingly.
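To avoid the guesswork for future downloads, one option is to pass an explicit encoding when writing the files, so the result no longer depends on the system default. A minimal sketch, in which the `save_tweet` helper, the file name and the sample texts are all hypothetical:

```python
import os
import tempfile


def save_tweet(path, text, encoding="utf-8"):
    """Append one tweet per line to `path` using an explicit encoding."""
    with open(path, "a", encoding=encoding) as handle:
        handle.write(text + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "some_user")
    save_tweet(path, "Øl og mad")
    save_tweet(path, "naïve café")
    # Read the raw bytes back to confirm they really are UTF-8.
    with open(path, "rb") as handle:
        raw = handle.read()

lines = raw.decode("utf-8").splitlines()
print(lines)
```

Files written this way should then load with the matching encoding argument, i.e. encoding='utf-8' instead of 'latin-1' in the load_files() call.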


2017-05-26 13:28:15 lenz

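Thanks, I'll try your suggestions when I get home this afternoon.

If it doesn't work, try the 'encoding=...' parameter in the 'CountVectorizer()' constructor instead of in the 'load_files()' function. – lenz

Thanks! You pointed me in the right direction, and I eventually found that latin-1 was the encoding I needed (it showed up in the f = open check).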
