Hadoop and NLTK: Fails with stopwords

I am trying to run a Python program on Hadoop. The program involves the NLTK library. It also uses the Hadoop Streaming API, as described here.

mapper.py：

#!/usr/bin/env python 
import sys 
import nltk 
from nltk.corpus import stopwords 

#print stopwords.words('english') 

for line in sys.stdin: 
     print line,

reducer.py:

#!/usr/bin/env python 

import sys 
for line in sys.stdin: 
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

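This runs perfectly, and the output simply contains the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the program fails and says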

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked, and in a standalone Python program,

print stopwords.words('english')

works perfectly, so I am absolutely stumped as to why it causes my Hadoop program to fail.

I would greatly appreciate any help! Thank you

Source

2013-09-27 Objc55

You don't have the NLTK corpus in your Hadoop directory. Try this: http://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming – user1525721

Try this: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job – user1525721

@user1525721 Thanks for your reply. I will try it and post back. If I have NLTK on all of the nodes, is this still necessary? – Objc55

Is 'english' in print stopwords.words('english') a file? If so, you also need to use -file to send it to the nodes.
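If it helps, here is a minimal sketch of that idea. It assumes the stopword list is a plain text file with one word per line and that it was shipped with -file, so it sits in the task's working directory under its base name. The function names and the default path "english" are hypothetical, and the filtering is only an illustration of what a mapper might do with the list.

```python
import sys


def load_stopwords(path):
    """Read one stopword per line from a plain text file into a set."""
    with open(path) as handle:
        return set(line.strip() for line in handle if line.strip())


def filter_line(line, stopwords):
    """Return the line with stopwords removed (case-insensitive match)."""
    kept = [word for word in line.split() if word.lower() not in stopwords]
    return " ".join(kept)


def run_mapper(stdin, stdout, stopwords_path="english"):
    # Assumption: the file was sent with -file, so a relative path
    # resolves inside the task's current working directory.
    stopwords = load_stopwords(stopwords_path)
    for line in stdin:
        stdout.write(filter_line(line, stopwords) + "\n")
```

A mapper script built on this would end with a call such as run_mapper(sys.stdin, sys.stdout). Because the list is read as an ordinary file, this sketch does not depend on NLTK finding its corpus directory on the node.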

Source

2013-09-30 22:07:21

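Use these commands to load the zipped packages:

import zipimport

# nltk.zip and yaml.zip must be shipped with the job so they are in the working directory
importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')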

Check the links I pasted above. They mention all the steps.

Source

2013-09-27 23:56:28 user1525721

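Do I need to send these files through the console command, or store them locally on each machine? Also, do I need nltk.zip or nltk_data.zip? How can I find the former? What role does yaml play in this? Thanks! – Objc55

I tried your suggestion, and importing nltk and yaml works without any problems. However, I still cannot use the stopwords. 'from nltk.corpus import stopwords' does not cause the program to fail, but as soon as I add 'print stopwords.words('english')', it fails. Any idea how to fix this? I have already added this to the console command: '-archives ./stopwords.zip'. Thanks! – Objc55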
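One possible way around NLTK's corpus lookup is to read the word list straight out of the archive with the standard zipfile module. This is only a sketch under assumptions: it assumes stopwords.zip is shipped as a plain file (for example with -file) so that it lands in the task's working directory, and that the archive contains a member named stopwords/english with one word per line. The function name and the default member path are hypothetical; check the archive's actual layout before relying on them.

```python
import zipfile


def stopwords_from_zip(zip_path, member="stopwords/english"):
    """Return the set of words stored one per line in a zip member."""
    with zipfile.ZipFile(zip_path) as archive:
        raw = archive.read(member)  # contents of the member as bytes
    text = raw.decode("utf-8")
    return set(word.strip() for word in text.splitlines() if word.strip())
```

In a mapper this could be called once before the loop over sys.stdin, for example stopwords_from_zip('stopwords.zip'), and the resulting set used for membership tests.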
