I've seen examples of people writing EMR output to HDFS, but I haven't been able to find an example of how it's actually done. Worse, this documentation seems to say that the --output argument of an EMR streaming job must be an S3 bucket. How do I write the output of an EMR streaming job to HDFS?
When I actually try to run a script (in this case, using Python streaming with mrjob), it throws an "Invalid S3 URI" error.
Here is the command:
python my_script.py -r emr \
--emr-job-flow-id=j-JOBID --conf-path=./mrjob.conf --no-output \
--output hdfs:///my-output \
hdfs:///my-input-directory/my-files*.gz
And the traceback...
Traceback (most recent call last):
File "pipes/sampler.py", line 28, in <module>
SamplerJob.run()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 483, in run
mr_job.execute()
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 501, in execute
super(MRJob, self).execute()
File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 146, in execute
self.run_job()
File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 206, in run_job
with self.make_runner() as runner:
File "/Library/Python/2.7/site-packages/mrjob/job.py", line 524, in make_runner
return super(MRJob, self).make_runner()
File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 161, in make_runner
return EMRJobRunner(**self.emr_job_runner_kwargs())
File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 585, in __init__
self._output_dir = self._check_and_fix_s3_dir(self._output_dir)
File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 776, in _check_and_fix_s3_dir
raise ValueError('Invalid S3 URI: %r' % s3_uri)
ValueError: Invalid S3 URI: 'hdfs:///input/sample'
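The error comes from a validation step inside mrjob's EMRJobRunner, which insists that the output directory be an S3 URI. A simplified sketch of that check, reconstructed from the traceback (the function name mirrors the traceback, but this is illustrative, not mrjob's exact code):

```python
# Illustrative reconstruction of the check that raised the error above;
# EMRJobRunner rejects any output URI that is not an S3 URI.
def check_and_fix_s3_dir(s3_uri):
    """Reject non-S3 URIs; normalize valid ones to end with a slash."""
    if not (s3_uri.startswith('s3://') or s3_uri.startswith('s3n://')):
        raise ValueError('Invalid S3 URI: %r' % s3_uri)
    # Normalize to a trailing slash so it can be used as a directory.
    if not s3_uri.endswith('/'):
        s3_uri += '/'
    return s3_uri
```

This is why an hdfs:/// path fails before the job is even submitted: the runner never gets past constructing its output directory.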
How can I write the output of an EMR streaming job to HDFS? Is it even possible?
This is an old question, but it may still be relevant. Looking at the mrjob source, EMRJobRunner only accepts S3 buckets as output destinations. Since you're using a "long-lived" cluster, you might be able to work around this by using HadoopJobRunner ('-r hadoop'). That said, I wasn't able to get a working solution... – 2016-03-03 14:09:12
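For reference, a sketch of what that suggestion would look like. This is untested (as the commenter notes), mirrors the flags from the question, and assumes my_script.py and mrjob.conf have been copied to the EMR master node, where HDFS paths are resolvable:

```shell
# Untested sketch: run the same mrjob script with the Hadoop runner
# ('-r hadoop') from the EMR master node instead of the EMR runner,
# so HDFS input/output paths are handled by Hadoop directly.
python my_script.py -r hadoop \
  --conf-path=./mrjob.conf --no-output \
  --output hdfs:///my-output \
  hdfs:///my-input-directory/my-files*.gz
```

The key difference is that HadoopJobRunner submits the job through the cluster's own hadoop binary rather than through the EMR API, so it never runs the S3-only URI check shown in the traceback.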