Hadoop and NLTK: failing on stopwords
I am trying to run a Python program on Hadoop. The program uses the NLTK library, as well as the Hadoop Streaming API, as described here.
mapper.py:
#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords
#print stopwords.words('english')
for line in sys.stdin:
    print line,
reducer.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line,
Console command:
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output
This runs perfectly: the output contains exactly the lines of the input file.
However, when this line (from mapper.py):
#print stopwords.words('english')
is uncommented, the job fails with
Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I have checked, and in a standalone Python program,
print stopwords.words('english')
works perfectly, so I am completely stumped as to why it makes my Hadoop program fail.
I would appreciate any help! Thanks.
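Since the failing import happens on the task nodes rather than on the machine where the job is submitted, one way to see what the task environment looks like is to log diagnostics to stderr, which Hadoop Streaming collects in the per-task logs instead of the job output. A minimal sketch of such an instrumented mapper; the log helper and the particular values logged are illustrative, not part of the original mapper:

```python
#!/usr/bin/env python
import os
import sys

def log(msg):
    # Hadoop Streaming keeps stderr in the task logs, so diagnostics
    # written here never mix with the job's stdout output.
    sys.stderr.write("DEBUG: %s\n" % msg)

if __name__ == "__main__":
    # On a failing node, these lines show whether the task can see
    # the NLTK install and its data directory at all.
    log("cwd=%s" % os.getcwd())
    log("HOME=%s" % os.environ.get("HOME"))
    log("sys.path=%s" % sys.path)
    for line in sys.stdin:
        sys.stdout.write(line)  # unchanged pass-through behavior
```

Comparing the logged paths between a working standalone run and a failing task run should show whether the task user simply cannot find the nltk_data directory.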
You don't have the NLTK corpus in your hadoop directory. Try this: http://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming – user1525721
Try this --- http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job – user1525721
@user1525721 Thanks for the reply. I will try it and post back. Is this still necessary if I have NLTK on all nodes? – Objc55