How do I write the output of an EMR streaming job to HDFS?

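I've seen examples of people writing EMR output to HDFS, but I haven't been able to find an example of how it's actually done. On top of that, this documentation seems to say that the --output parameter for an EMR streaming job must be an S3 bucket.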

When I actually try to run a script (in this case, using Python streaming and mrjob), it throws an "Invalid S3 URI" error.

Here's the command:

python my_script.py -r emr \
--emr-job-flow-id=j-JOBID --conf-path=./mrjob.conf --no-output \
--output hdfs:///my-output \
hdfs:///my-input-directory/my-files*.gz

And the traceback...

Traceback (most recent call last):
  File "pipes/sampler.py", line 28, in <module>
    SamplerJob.run()
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 483, in run
    mr_job.execute()
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 501, in execute
    super(MRJob, self).execute()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 146, in execute
    self.run_job()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 206, in run_job
    with self.make_runner() as runner:
  File "/Library/Python/2.7/site-packages/mrjob/job.py", line 524, in make_runner
    return super(MRJob, self).make_runner()
  File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 161, in make_runner
    return EMRJobRunner(**self.emr_job_runner_kwargs())
  File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 585, in __init__
    self._output_dir = self._check_and_fix_s3_dir(self._output_dir)
  File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 776, in _check_and_fix_s3_dir
    raise ValueError('Invalid S3 URI: %r' % s3_uri)
ValueError: Invalid S3 URI: 'hdfs:///input/sample'
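Judging by the last two frames, the runner seems to validate the output path as an S3 URI before anything is submitted to the cluster. A minimal sketch of that kind of check follows; the regex and the function body are assumptions for illustration, not mrjob's actual code:

```python
import re

# Hypothetical pattern for illustration only; mrjob's real check may differ.
S3_URI_RE = re.compile(r'^s3n?://([A-Za-z0-9.\-]+)/(.*)$')


def check_and_fix_s3_dir(uri):
    """Return uri with a trailing slash, or raise ValueError if it is not an S3 URI."""
    if not S3_URI_RE.match(uri):
        raise ValueError('Invalid S3 URI: %r' % uri)
    return uri if uri.endswith('/') else uri + '/'
```

With a check like this, any hdfs:/// path is rejected on the client side, which would explain why the error appears before the job even starts.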

How do I write the output of an EMR streaming job to HDFS? Is it even possible?


2013-05-08 Abe

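This is an old question, but it may still be relevant. Looking at the mrjob source, EMRJobRunner only accepts an S3 bucket as the output destination. Since you're using a "long-lived" cluster, you might be able to work around this by using HadoopJobRunner ('-r hadoop'). I wasn't able to get a working solution, though... – 2016-03-03 14:09:12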
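For reference, here is a rough, untested sketch of what the '-r hadoop' invocation might look like when launched from a machine that has Hadoop available (for example, the cluster's master node). The helper name and the paths are hypothetical; `-r hadoop`, `--no-output`, and `--output-dir` are standard mrjob command-line options:

```python
import sys


def hadoop_runner_argv(script, input_path, output_dir):
    """Build the argv for running an mrjob script with the Hadoop runner.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    return [
        sys.executable, script,
        '-r', 'hadoop',              # use HadoopJobRunner instead of EMRJobRunner
        '--no-output',               # don't stream results back to stdout
        '--output-dir', output_dir,  # e.g. an hdfs:/// path
        input_path,
    ]


# To actually launch it (assumes Hadoop is installed where this runs):
# import subprocess
# subprocess.check_call(hadoop_runner_argv(
#     'my_script.py', 'hdfs:///my-input-directory/my-files*.gz', 'hdfs:///my-output'))
```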

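It has to be an S3 bucket because an EMR cluster normally isn't kept around after the job completes. So the only way to persist the output is outside the cluster, and the next closest place is S3.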


2013-05-25 00:15:48 kgu87

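I'm running the job flow in "keep alive" mode, so results can persist in HDFS between job flow steps. The structure of my job requires using the same (large) data set as input for many steps in the flow. It would save a lot of time if the data were stored in HDFS rather than re-downloaded from S3 at every step. – Abe 2013-05-25 20:24:44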

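I see. I'm not a Python expert, but the code for MRJobRunner (the superclass of EMRJobRunner) seems to indicate that you don't need to specify 'hdfs://' in the output parameter, just the location: https://github.com/Yelp/mrjob/blob/master/mrjob/emr.py – kgu87 2013-05-25 20:49:24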

I don't know how it can be done with mrjob, but with Hadoop and streaming jobs written in Java, we do the following:

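Start the cluster
Copy the data from S3 to the cluster's HDFS using s3distcp
Execute step 1 of our job with HDFS as the input
Execute step 2 of our job with the same input as above...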

Using the EMR CLI, we do the following:

export jobflow=$(elastic-mapreduce --create --alive --plain-output \
  --master-instance-type m1.small --slave-instance-type m1.xlarge --num-instances 21 \
  --name "Cluster Name" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--mapred-config-file,s3://myBucket/conf/custom-mapred-config-file.xml")

elastic-mapreduce -j $jobflow --jar \
  s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --arg --src --arg 's3://myBucket/input/' --arg --dest --arg 'hdfs:///input'

elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step1.jar \
  --arg hdfs:///input --arg hdfs:///output-step1 --step-name "Step 1"

elastic-mapreduce --jobflow $jobflow --jar s3://myBucket/bin/step2.jar \
  --arg hdfs:///input,hdfs:///output-step1 --arg s3://myBucket/output/ --step-name "Step 2"
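If you would rather drive this from Python, the same three steps could be expressed as step definitions in the shape that boto3's EMR `add_job_flow_steps` call accepts. This is a sketch under assumptions, not what we actually run: boto3 is swapped in for the EMR CLI, the helper names and the `CANCEL_AND_WAIT` failure policy are my own choices, and the jar paths are simply copied from the commands above.

```python
def jar_step(name, jar, args):
    """Build one step definition in the shape add_job_flow_steps expects."""
    return {
        'Name': name,
        'ActionOnFailure': 'CANCEL_AND_WAIT',  # assumed policy; keeps the cluster alive
        'HadoopJarStep': {'Jar': jar, 'Args': list(args)},
    }


def build_steps(bucket='myBucket'):
    """Mirror the three CLI steps: copy to HDFS, run step 1, run step 2."""
    return [
        jar_step(
            'Copy input to HDFS',
            's3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar',
            ['--src', 's3://%s/input/' % bucket, '--dest', 'hdfs:///input'],
        ),
        jar_step(
            'Step 1',
            's3://%s/bin/step1.jar' % bucket,
            ['hdfs:///input', 'hdfs:///output-step1'],
        ),
        jar_step(
            'Step 2',
            's3://%s/bin/step2.jar' % bucket,
            ['hdfs:///input,hdfs:///output-step1', 's3://%s/output/' % bucket],
        ),
    ]


# Submitting the steps to a running cluster (needs boto3 and AWS credentials):
# import boto3
# boto3.client('emr').add_job_flow_steps(JobFlowId='j-JOBID', Steps=build_steps())
```

Only the final step writes to S3; the intermediate output stays in HDFS for as long as the cluster is alive.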


2013-05-30 20:36:01 Amar

Saving the output of an MRJob EMR job to HDFS is currently not possible. There is an open feature request for it at https://github.com/Yelp/mrjob/issues/887.


2015-05-22 22:12:06 user4930682
