我有两个小的Python脚本Hadoop的流调用Python脚本
CountWordOccurence_mapper.py
#!/usr/bin/env python
import sys
#print(sys.argv[1])
text = sys.argv[1]
wordCount = text.count(sys.argv[2])
#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)
PrintWordCount_reducer.py
#!/usr/bin/env python
import sys
finalCount = 0
for line in sys.stdin:
line = line.strip()
word, count = line.split('\t')
count=int(count)
finalCount += count
print(word,finalCount)
我执行相同如下:
$ ./CountWordOccurence_mapper.py \
"I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
"Accord" \
| /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)
正如所见,我的目标是愚蠢的 - 算不算。在给定的文本中出现提供的词(在这种情况下,Accord)。
现在,我打算使用Hadoop流式执行相同的操作。在HDFS(部分)的文本文件是:
"message" : "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it. Yesterday I tried to start it unsuccessfully. After hours at the auto mechanics it was found that there was a glitch in the electric/computer system. The news was disappointing enough (and expensive) but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful. When I bought a NEW Honda I thought I bought quality. I was wrong! Will Honda step up?"
我修改了CountWordOccurence_mapper.py
#!/usr/bin/env python
import sys
for text in sys.stdin:
wordCount = text.count(sys.argv[1])
print '%s\t%s' % (sys.argv[1], wordCount)
我的第一个困惑是 - 如何发送到计数如“雅阁”这个词,“本田“作为映射器的参数(-cmdenv name = value)只会让我困惑。我还是说干就干,执行以下命令:
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
-input /random_data/honda_service_within_warranty.txt \
-output /random_op/cnt.txt \
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py
正如预期的那样,作业失败,我得到了以下错误:
Traceback (most recent call last):
File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
请纠正我犯的语法和基本的错误。
感谢和问候!
我很方便忘记对我的python文件进行更改: 感谢您的回答,我明白了:D – 2013-04-30 08:57:47