How do I process a directory structure recursively with the new Hadoop API?

/indir/somedir1/somefile 
/indir/somedir1/someotherfile... 
/indir/somedir2/somefile 
/indir/somedir2/someotherfile...

I now want to feed all of this recursively into an MR job, and I am using the new API. So I do:

FileInputFormat.setInputPaths(job, new Path("/indir"));

But the job fails with:

Error: java.io.FileNotFoundException: Path is not a file: /indir/somedir1

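I am using Hadoop 2.4, and this post states that the new API in Hadoop 2 does not support recursive input files. But I wonder why that is, because I thought throwing a large nested directory structure at a Hadoop job was the most normal thing in the world...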

So, is this intended, or is it a bug? Either way, is there a workaround other than using the old API?

2014-10-30 rabejens

I found the answer myself. The JIRA linked from the forum post mentioned above has two comments on how to do it correctly:

Set mapreduce.input.fileinputformat.input.dir.recursive to true (the comment says mapred.input.dir.recursive, but that name is deprecated)
Use FileInputFormat.addInputPath to specify the input directory

With these changes, it works.
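The two steps above might look roughly like this in a driver class. This is only a minimal sketch: the class name RecursiveInputDriver, the job name, the output path /outdir, and the map-only identity setup are assumptions made for illustration and are not taken from the JIRA comments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class, named for illustration only.
public class RecursiveInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "recursive-input-sketch");
        job.setJarByClass(RecursiveInputDriver.class);

        // Step 1: enable recursive listing of input directories.
        job.getConfiguration().setBoolean(
                "mapreduce.input.fileinputformat.input.dir.recursive", true);

        // Step 2: register the top-level directory with addInputPath.
        FileInputFormat.addInputPath(job, new Path("/indir"));

        // Assumed for this sketch: a map-only job using the base Mapper,
        // which passes each (offset, line) record through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // "/outdir" is a placeholder output path, not from the question.
        FileOutputFormat.setOutputPath(job, new Path("/outdir"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The property is set on the job's own configuration because Job.getInstance copies the Configuration it is given, so changes made to conf after the job is created would not reach the job.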

2014-10-30 09:03:47 rabejens

Nice find, bro. Why don't you accept the answer and seal it! – blackSmith 2014-10-30 09:12:45

Because StackOverflow only allows me to accept the answer in two days. – rabejens 2014-10-30 13:37:12

Another way to configure this is through the FileInputFormat class.

FileInputFormat.setInputDirRecursive(job, true);

2016-07-13 13:02:40
