2017-03-04
1

I am running a Spark application on a YARN cluster (on AWS EMR). The application appears to have been killed, and I want to find out why. I tried to make sense of the YARN information shown on the page below (the original post included a screenshot of the YARN application attempt page). The diagnostics line on that page seems to say that YARN killed the application because of memory limits. How should I interpret the YARN appattempt logs and diagnostics?

Diagnostics: Container [pid=1540,containerID=container_1488651686158_0012_02_000001] is running beyond physical memory limits. Current usage: 1.6 GB of 1.4 GB physical memory used; 3.6 GB of 6.9 GB virtual memory used. Killing container

However, the appattempt logs show completely different exceptions, some related to IO/networking. My questions are: should I trust the diagnostics on the page or the appattempt logs? Did an IO exception cause the kill, or did running out of memory cause the IO exceptions in the appattempt logs? Is there another log/diagnostic I should be looking at? Thanks.

17/03/04 21:59:02 ERROR Utils: Uncaught exception in thread task-result-getter-0 
java.lang.InterruptedException 
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) 
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) 
     at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) 
     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) 
     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) 
     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) 
     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) 
     at scala.concurrent.Await$.result(package.scala:190) 
     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) 
     at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104) 
     at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:579) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) 
     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Exception in thread "task-result-getter-0" java.lang.Error: java.lang.InterruptedException 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
     at java.lang.Thread.run(Thread.java:745) 
Caused by: java.lang.InterruptedException 
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) 
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) 
     at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) 
     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) 
     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) 
     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) 
     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) 
     at scala.concurrent.Await$.result(package.scala:190) 
     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) 
     at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104) 
     at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:579) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) 
     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) 
     at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
     ... 2 more 
17/03/04 21:59:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
17/03/04 21:59:02 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-9-207.ec2.internal/172.31.9.207:38437 is closed 
17/03/04 21:59:02 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms 
17/03/04 21:59:02 ERROR DiskBlockManager: Exception while deleting local spark dir: /mnt/yarn/usercache/hadoop/appcache/application_1488651686158_0012/blockmgr-941a13d8-1b31-4347-bdec-180125b6f4ca 
java.io.IOException: Failed to delete: /mnt/yarn/usercache/hadoop/appcache/application_1488651686158_0012/blockmgr-941a13d8-1b31-4347-bdec-180125b6f4ca 
     at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) 
     at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:169) 
     at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165) 
     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) 
     at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:165) 
     at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:160) 
     at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1361) 
     at org.apache.spark.SparkEnv.stop(SparkEnv.scala:89) 
     at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842) 
     at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) 
     at org.apache.spark.SparkContext.stop(SparkContext.scala:1841) 
     at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) 
     at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) 
     at scala.util.Try$.apply(Try.scala:192) 
     at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) 
     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) 
17/03/04 21:59:02 INFO MemoryStore: MemoryStore cleared 
17/03/04 21:59:02 INFO BlockManager: BlockManager stopped 
17/03/04 21:59:02 INFO BlockManagerMaster: BlockManagerMaster stopped 
17/03/04 21:59:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
17/03/04 21:59:02 ERROR Utils: Uncaught exception in thread Thread-3 
java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder 
     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
     at java.nio.file.Paths.get(Paths.java:138) 
     at org.apache.spark.util.Utils$.isSymlink(Utils.scala:1021) 
     at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:991) 
     at org.apache.spark.SparkEnv.stop(SparkEnv.scala:102) 
     at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842) 
     at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) 
     at org.apache.spark.SparkContext.stop(SparkContext.scala:1841) 
     at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) 
     at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) 
     at scala.util.Try$.apply(Try.scala:192) 
     at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) 
     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) 
17/03/04 21:59:02 WARN ShutdownHookManager: ShutdownHook '$anon$2' failed, java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder 
java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder 
     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
     at java.nio.file.Paths.get(Paths.java:138) 
     at org.apache.spark.util.Utils$.isSymlink(Utils.scala:1021) 
     at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:991) 
     at org.apache.spark.SparkEnv.stop(SparkEnv.scala:102) 
     at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842) 
     at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) 
     at org.apache.spark.SparkContext.stop(SparkContext.scala:1841) 
     at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) 
     at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) 
     at scala.util.Try$.apply(Try.scala:192) 
     at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) 
     at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) 
     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) 
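As an aside, the 1.4 GB limit in the diagnostic is consistent with Spark 2.x defaults: a 1 GB AM/driver heap plus the minimum 384 MB overhead Spark requests from YARN. The sketch below illustrates that arithmetic with assumed default values, not figures read from this cluster's configuration:

```python
# Sketch: how Spark on YARN sizes a container request (Spark 2.x defaults).
# The 0.10 factor and 384 MB floor are the documented defaults for
# spark.yarn.*.memoryOverhead; values here are illustrative.

def yarn_container_limit_mb(heap_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Container limit = JVM heap + max(floor, factor * heap) overhead."""
    overhead = max(min_overhead_mb, int(heap_mb * overhead_factor))
    return heap_mb + overhead

# With the default 1 GB driver/AM heap:
limit = yarn_container_limit_mb(1024)
print(limit)  # 1408 MB, i.e. the "1.4 GB physical memory" in the diagnostic
```

If that reading is right, the container was never sized for the workload, and the IO/interrupt exceptions in the appattempt log are downstream symptoms of YARN tearing the container down.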

Answers

0

The container was killed (it exceeded its physical memory limit), so any subsequent attempt to reach that container fails.

YARN has the full picture of the process, but you should prefer the Spark History Server to analyze your job in more detail (check the Spark history for imbalanced memory use).

(screenshot in the original answer, captioned "a well balanced memory stage")
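To get YARN's full picture, the aggregated container logs can be pulled with the standard YARN CLI; this is a minimal sketch using the application ID from the question, and it assumes log aggregation is enabled on the cluster:

```shell
# Fetch the aggregated logs for every container of the failed application
yarn logs -applicationId application_1488651686158_0012
```

This output interleaves the ApplicationMaster and executor container logs, which is often the quickest way to see whether the first failure was the memory kill or an IO error.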

+0

I don't see any log message containing "imbalanced memory". What string should I search for in the logs to check for imbalanced memory? – sgu

+0

It's not in the current output; I meant the Spark History Server. You should drill down into each job/stage (where it fails) and check whether one particular executor/task is receiving all the data. – glefait

0

The information in the screenshot is the most relevant: your ApplicationMaster container ran out of memory. You need to increase yarn.app.mapreduce.am.resource.mb, which is set in mapred-site.xml. I recommend a value of 2000, since that typically accommodates Spark and MapReduce applications running at scale.
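For reference, a minimal sketch of that change in mapred-site.xml (the 2000 MB value follows the recommendation above; tune it for your workload):

```xml
<!-- mapred-site.xml: memory, in MB, for the MapReduce ApplicationMaster container -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>2000</value>
</property>
```

On EMR the same setting can be applied at cluster creation through a configuration classification for mapred-site rather than by editing the file on each node.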

+0

By default, AWS EMR sets yarn.app.mapreduce.am.resource.mb to 2880. I tried setting it to a larger value of 7000, but the application still failed. – sgu

+0

What is your spark.driver.memory set to? –
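The comment above points at the likely real knob: in yarn-cluster mode the Spark driver runs inside the ApplicationMaster container, so its size is governed by spark.driver.memory (plus the YARN overhead), not by the MapReduce AM setting. A sketch of raising both at submit time (the script name and values are illustrative):

```shell
# Illustrative: raise the driver heap and its YARN overhead for a cluster-mode app
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.memory=4g \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  my_app.py
```

With these settings YARN would grant the AM/driver container roughly 5 GB instead of the 1.4 GB shown in the diagnostic.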