Setting the number of Hadoop mappers via input split size is not working

I want to run a Hadoop job several times with different numbers of mappers and reducers. I have set the following configuration properties:

mapreduce.input.fileinputformat.split.maxsize

mapreduce.input.fileinputformat.split.minsize

mapreduce.job.maps

My total input size is 1160421275 bytes, and I try to configure 4 mappers and 3 reducers with this code:

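import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
// Total input size in bytes across both input directories
long size = hdfs.getContentSummary(new Path("input/filea")).getLength();
size += hdfs.getContentSummary(new Path("input/fileb")).getLength();
// Pin both the max and min split size to a quarter of the total input
conf.set("mapreduce.input.fileinputformat.split.maxsize", String.valueOf(size / 4));
conf.set("mapreduce.input.fileinputformat.split.minsize", String.valueOf(size / 4));
conf.setInt("mapreduce.job.maps", 4);
....
job.setNumReduceTasks(3);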

size / 4 gives 290105318. Running the job produces the following output:

2016-11-19 12:30:36,426 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 1 
2016-11-19 12:30:36,535 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 4 
2016-11-19 12:30:36,572 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:7

The number of splits is 7, not 4, and the output of the successful job is:

File System Counters 
    FILE: Number of bytes read=18855390277 
    FILE: Number of bytes written=14653469965 
    FILE: Number of read operations=0 
    FILE: Number of large read operations=0 
    FILE: Number of write operations=0 
Map-Reduce Framework 
    Map input records=39184416 
    Map output records=36751473 
    Map output bytes=787022241 
    Map output materialized bytes=860525313 
    Input split bytes=1801 
    Combine input records=0 
    Combine output records=0 
    Reduce input groups=25064998 
    Reduce shuffle bytes=860525313 
    Reduce input records=36751473 
    Reduce output records=1953960 
    Spilled Records=110254419 
    Shuffled Maps =21 
    Failed Shuffles=0 
    Merged Map outputs=21 
    GC time elapsed (ms)=1124 
    CPU time spent (ms)=0 
    Physical memory (bytes) snapshot=0 
    Virtual memory (bytes) snapshot=0 
    Total committed heap usage (bytes)=6126829568 
Shuffle Errors 
    BAD_ID=0 
    CONNECTION=0 
    IO_ERROR=0 
    WRONG_LENGTH=0 
    WRONG_MAP=0 
    WRONG_REDUCE=0 
File Input Format Counters 
    Bytes Read=0 
File Output Format Counters 
    Bytes Written=77643084

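The counters show 21 shuffled maps. I expected the job to use only 4 mappers. For the reducers, it produces the correct number of output files, 3 in total. Is my mapper split size setting wrong?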

Source

2016-11-19 mkvem

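AFAIK those confs are fine. How many files are in the input location? – mrsrinivas

There is 1 file for file A and there are 4 files for file B. – mkvem

When I use 9, it comes out with 10 splits – mkvem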

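I believe you are using TextInputFormat.

If you have multiple files, at least one mapper is spawned for each file. If an individual file (not the cumulative total) is larger than the split size (the block size as adjusted by the min and max settings), more mappers are spawned for that file.
Try CombineTextInputFormat, which should help you achieve what you want, although you may still not get exactly 4.
Look at the logic of the InputFormat you are using to determine how many mappers will be spawned.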
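To make the per-file rule above concrete, here is a minimal sketch in plain Java that mimics the way FileInputFormat counts splits: the split size is max(minSize, min(maxSize, blockSize)), each file is cut up independently, and the last piece of a file may exceed the split size by up to 10%. The class name SplitCountSketch, the 128 MB block size, and the five per-file sizes are all assumptions made for illustration; the question only gives the total of 1160421275 bytes and the fact that there are 1 + 4 files.

```java
// Hypothetical sketch, not Hadoop code: mimics FileInputFormat's per-file split counting.
class SplitCountSketch {
    // The last split of a file may be up to 10% larger than the target split size.
    static final double SPLIT_SLOP = 1.1;

    // Effective split size: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts splits for splittable files. Files are never merged into one split.
    // Empty files are ignored in this sketch.
    static int countSplits(long[] fileSizes, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        int splits = 0;
        for (long length : fileSizes) {
            long remaining = length;
            while ((double) remaining / splitSize > SPLIT_SLOP) {
                splits++;
                remaining -= splitSize;
            }
            if (remaining != 0) {
                splits++; // tail of the file, or the whole file if it is small
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // ASSUMPTION: made-up per-file sizes that sum to the 1160421275 bytes in the question.
        long[] sizes = {700_000_000L, 115_105_319L, 115_105_319L, 115_105_319L, 115_105_318L};
        long total = 1_160_421_275L;
        long blockSize = 128L * 1024 * 1024; // assumed HDFS block size
        System.out.println(countSplits(sizes, blockSize, total / 4, total / 4));
        System.out.println(countSplits(sizes, blockSize, total / 9, total / 9));
    }
}
```

With these assumed sizes the sketch yields 7 splits for size / 4 and 10 splits for size / 9, the same counts reported in the question and comments, although the real file sizes may differ. The Shuffled Maps=21 counter is also consistent with 7 map tasks each feeding 3 reducers.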

Source

2017-05-23 16:09:15 KrazyGautam
