
I am working with the matrix multiplication example for MapReduce in Hadoop. My question is whether the spilled records should always equal the map input and map output records. In my run, the spilled records differ from both the map input and the map output records. Should "Spilled Records" always equal the map input records or the map output records in a Hadoop MapReduce job?

Here is the output I get from one test run:

Three by three test 
    IB = 1 
    KB = 2 
    JB = 1 
11/12/14 13:16:22 INFO input.FileInputFormat: Total input paths to process : 2 
11/12/14 13:16:22 INFO mapred.JobClient: Running job: job_201112141153_0003 
11/12/14 13:16:23 INFO mapred.JobClient: map 0% reduce 0% 
11/12/14 13:16:32 INFO mapred.JobClient: map 100% reduce 0% 
11/12/14 13:16:44 INFO mapred.JobClient: map 100% reduce 100% 
11/12/14 13:16:46 INFO mapred.JobClient: Job complete: job_201112141153_0003 
11/12/14 13:16:46 INFO mapred.JobClient: Counters: 17 
11/12/14 13:16:46 INFO mapred.JobClient: Job Counters 
11/12/14 13:16:46 INFO mapred.JobClient:  Launched reduce tasks=1 
11/12/14 13:16:46 INFO mapred.JobClient:  Launched map tasks=2 
11/12/14 13:16:46 INFO mapred.JobClient:  Data-local map tasks=2 
11/12/14 13:16:46 INFO mapred.JobClient: FileSystemCounters 
11/12/14 13:16:46 INFO mapred.JobClient:  FILE_BYTES_READ=1464 
11/12/14 13:16:46 INFO mapred.JobClient:  HDFS_BYTES_READ=528 
11/12/14 13:16:46 INFO mapred.JobClient:  FILE_BYTES_WRITTEN=2998 
11/12/14 13:16:46 INFO mapred.JobClient:  HDFS_BYTES_WRITTEN=384 
11/12/14 13:16:46 INFO mapred.JobClient: Map-Reduce Framework 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce input groups=36 
11/12/14 13:16:46 INFO mapred.JobClient:  Combine output records=0 
11/12/14 13:16:46 INFO mapred.JobClient:  Map input records=18 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce shuffle bytes=735 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce output records=15 
11/12/14 13:16:46 INFO mapred.JobClient:  Spilled Records=108 
11/12/14 13:16:46 INFO mapred.JobClient:  Map output bytes=1350 
11/12/14 13:16:46 INFO mapred.JobClient:  Combine input records=0 
11/12/14 13:16:46 INFO mapred.JobClient:  Map output records=54 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce input records=54 
11/12/14 13:16:46 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1 
11/12/14 13:16:46 INFO mapred.JobClient: Running job: job_local_0001 
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1 
11/12/14 13:16:46 INFO mapred.MapTask: io.sort.mb = 100 
11/12/14 13:16:46 INFO mapred.MapTask: data buffer = 79691776/99614720 
11/12/14 13:16:46 INFO mapred.MapTask: record buffer = 262144/327680 
11/12/14 13:16:46 INFO mapred.MapTask: Starting flush of map output 
11/12/14 13:16:46 INFO mapred.MapTask: Finished spill 0 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done. 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.Merger: Merging 1 sorted segments 
11/12/14 13:16:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 128 bytes 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now 
11/12/14 13:16:46 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/tmp/MatrixMultiply/out 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: reduce > reduce 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done. 
11/12/14 13:16:47 INFO mapred.JobClient: map 100% reduce 100% 
11/12/14 13:16:47 INFO mapred.JobClient: Job complete: job_local_0001 
11/12/14 13:16:47 INFO mapred.JobClient: Counters: 14 
11/12/14 13:16:47 INFO mapred.JobClient: FileSystemCounters 
11/12/14 13:16:47 INFO mapred.JobClient:  FILE_BYTES_READ=89412 
11/12/14 13:16:47 INFO mapred.JobClient:  HDFS_BYTES_READ=37206 
11/12/14 13:16:47 INFO mapred.JobClient:  FILE_BYTES_WRITTEN=37390 
11/12/14 13:16:47 INFO mapred.JobClient:  HDFS_BYTES_WRITTEN=164756 
11/12/14 13:16:47 INFO mapred.JobClient: Map-Reduce Framework 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce input groups=9 
11/12/14 13:16:47 INFO mapred.JobClient:  Combine output records=9 
11/12/14 13:16:47 INFO mapred.JobClient:  Map input records=15 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce shuffle bytes=0 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce output records=9 
11/12/14 13:16:47 INFO mapred.JobClient:  Spilled Records=18 
11/12/14 13:16:47 INFO mapred.JobClient:  Map output bytes=180 
11/12/14 13:16:47 INFO mapred.JobClient:  Combine input records=15 
11/12/14 13:16:47 INFO mapred.JobClient:  Map output records=15 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce input records=9 
...........X[0][0]=30, Y[0][0]=9 
Bad Answer 
...........X[0][1]=36, Y[0][1]=36 
...........X[0][2]=42, Y[0][2]=42 
...........X[1][0]=66, Y[1][0]=24 
Bad Answer 
...........X[1][1]=81, Y[1][1]=81 
...........X[1][2]=96, Y[1][2]=96 
...........X[2][0]=102, Y[2][0]=39 
Bad Answer 
...........X[2][1]=126, Y[2][1]=126 
...........X[2][2]=150, Y[2][2]=150 

The example, along with its code, is described here:

http://www.norstad.org/matrix-multiply/index.html

Could you please tell me where the problem is and how to get this working correctly? Thanks,

WL

I would also like to mention that in standalone mode it works fine: the spilled records equal the map input and map output records (both are 18). In pseudo-distributed mode, however, it does not work, and the spilled records no longer equal the map input and map output records. – waqas 2011-12-14 12:48:14

"Spilled" means the records had to be spilled to disk because there was not enough RAM during the sort/shuffle phase. So ideally this should be zero, or at least very low. – 2011-12-14 12:58:40

Answer

According to Hadoop: The Definitive Guide, "Spilled Records" counts the total number of records spilled to disk over the course of the job, including both map-side and reduce-side spills. A "Spilled Records" count of zero is perfectly fine. In general, a spilled record means you have exceeded the amount of memory available in the map output buffer; having a small number of spilled records is usually not a problem. The settings in mapred-site.xml that control the available RAM are io.sort.mb and io.sort.spill.percent. If performance is a concern, you will want to tune these to minimize spilled records. The presentation Optimizing MapReduce Job Performance has more detail, in particular slides 12 and 13.

If you spill more than once, you pay a 3x penalty in IO because the spills have to be merged. If "Spilled Records" is greater than "Map output records" + "Reduce output records", you are spilling more than once. Note that the amount of RAM available is ultimately limited by the Java VM heap size, so to reduce the number of spills you may need to grow your cluster, or increase the number of map tasks by increasing the number of input splits for a given job.
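As a side note on those tuning knobs: here is a minimal sketch (not from the original answer) of how the two properties might also be set programmatically on a Hadoop 0.20-era job. The class and job names are hypothetical, and the values 200 and 0.90 are arbitrary illustrations; only io.sort.mb (default 100, as the log above shows) and io.sort.spill.percent (default 0.80) come from the answer itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Size of the in-memory map output buffer, in MB. A larger buffer
            // means fewer spills (example value; the default is 100).
            conf.setInt("io.sort.mb", 200);
            // Fraction of the buffer that may fill before a background spill
            // to disk starts (example value; the default is 0.80).
            conf.setFloat("io.sort.spill.percent", 0.90f);
            Job job = new Job(conf, "matrix-multiply-tuned"); // hypothetical name
            // ... configure the mapper, reducer, input and output paths as in
            // the linked matrix-multiply example, then submit the job:
            // job.waitForCompletion(true);
        }
    }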

In your specific example, "Spilled Records" is large relative to that sum, so you are spilling more than once.
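Plugging in the counters from the pseudo-distributed run above: "Map output records" + "Reduce output records" = 54 + 15 = 69, while "Spilled Records" = 108. Since 108 > 69, some records were spilled to disk more than once.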