Hadoop的流调用Python脚本

CountWordOccurence_mapper.py

#!/usr/bin/env python 
import sys 

#print(sys.argv[1]) 

text = sys.argv[1] 

wordCount = text.count(sys.argv[2]) 

#print (sys.argv[2],wordCount) 
print '%s\t%s' % (sys.argv[2], wordCount)

PrintWordCount_reducer.py

#!/usr/bin/env python 
import sys 

finalCount = 0 

for line in sys.stdin: 
    line = line.strip() 
    word, count = line.split('\t') 

    count=int(count) 

    finalCount += count 

    print(word,finalCount)

我执行相同如下：

$ ./CountWordOccurence_mapper.py \ 
    "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \ 
    "Accord" \ 
    | /home/hadoopranch/omkar/PrintWordCount_reducer.py 
('Accord', 4)

正如所见，我的目标是愚蠢的 - 算不算。在给定的文本中出现提供的词（在这种情况下，Accord）。

现在，我打算使用Hadoop流式执行相同的操作。在HDFS（部分）的文本文件是：

"message" : "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." 
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it. Yesterday I tried to start it unsuccessfully. After hours at the auto mechanics it was found that there was a glitch in the electric/computer system. The news was disappointing enough (and expensive) but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful. When I bought a NEW Honda I thought I bought quality. I was wrong! Will Honda step up?"

我修改了CountWordOccurence_mapper.py

#!/usr/bin/env python 
import sys 

for text in sys.stdin: 

    wordCount = text.count(sys.argv[1]) 

    print '%s\t%s' % (sys.argv[1], wordCount)

我的第一个困惑是 - 如何发送到计数如“雅阁”这个词，“本田“作为映射器的参数（-cmdenv name = value）只会让我困惑。我还是说干就干，执行以下命令：

$HADOOP_HOME/bin/hadoop jar \ 
    $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \ 
-input /random_data/honda_service_within_warranty.txt \ 
-output /random_op/cnt.txt \ 
-file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \ 
-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \ 
-file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \ 
-reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py

正如预期的那样，作业失败，我得到了以下错误：

Traceback (most recent call last): 
    File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module> 
    wordCount = text.count(sys.argv[1]) 
IndexError: list index out of range 
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362) 
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576) 
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135) 
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) 
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) 
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) 
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:415) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) 
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

请纠正我犯的语法和基本的错误。

感谢和问候！

来源

2013-04-29 Kaliyug Antagonist

实际上，我曾尝试使用-cmdenv wordToBeCounted =“Accord”，但问题出在我的python文件上 - 我忘记更改它以从环境变量中读取值“Accord”（而不是从参数数组）。我附加代码为CountWordOccurence_mapper.py，以防万一有人想用它来做参考：

#!/usr/bin/env python 
import sys 
import os 

wordToBeCounted = os.environ['wordToBeCounted'] 

for text in sys.stdin: 

    wordCount = text.count(wordToBeCounted) 

    print '%s\t%s' % (wordToBeCounted,wordCount)

感谢和问候！

来源

2013-04-30 08:56:42

我觉得你的问题出在你的命令行调用的以下部分：

-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"

我觉得这里的假设是，字符串“协议”被作为第一个参数传递映射。我非常确定事实并非如此，实际上串流驱动程序入口点类（StreamJob.java）很可能忽略了字符串“Accord”。

要解决这个问题，你需要回到使用-cmdenv参数，然后在你的python代码中提取这个键/值对（我不是Python程序员，但我确信一个快速的Google会指出你朝着你需要的片段）。

来源

2013-04-30 01:00:53

我很方便忘记对我的python文件进行更改：感谢您的回答，我明白了：D – 2013-04-30 08:57:47

Hadoop的流调用Python脚本

回答

相关问题