Why do Spark jobs fail when multiple Hive scripts are run in parallel with spark-sql?

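I have 25 Hive scripts, each containing 200 Hive queries. I run each HQL file with the spark-sql command on my AWS EMR cluster, and I launch all of the spark-sql commands in parallel using the & operator. The same HQL files run successfully with Hive on Tez. I am trying spark-sql to improve performance. With spark-sql, however, only 2-3 scripts run correctly; the remaining ones fail with a "Connection reset by peer" error. I believe this is because of a lack of resources for Spark in the YARN cluster.

When I watch the YARN console, I can see that the job is using the cluster's entire memory, even though I specified the executor and driver memory in the command.

Can anyone help me find the exact cause of this problem?

Below is my EMR cluster configuration: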

Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: M4*4xLarge

Commands used in UNIX:

spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql1.hql &
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql2.hql &
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql3.hql
.....
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql25.hql

When I run all of the above commands in parallel, only 2 to 3 jobs execute correctly; the rest fail with the error below.

05:> (0 + 0)/30800]
[Stage 904:=> (6818 + 31)/30800][Stage 905:> (0 + 0)/30800]
[Stage 904:==> (7743 + 31)/30800][Stage 905:> (0 + 0)/30800]
[Stage 904:==> (8271 + 32)/30800][Stage 905:> (0 + 0)/30800]
17/04/13 11:35:10 WARN TransportChannelHandler: Exception in connection from /10.134.22.114:47550
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
17/04/13 11:35:10 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.134.22.114:47550 is closed
17/04/13 11:35:10 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(53329,61600,Map(ip-10-134-22-6.eu-central-1.compute.internal -> 12262, ip-10-134-22-67.eu-central-1.compute.internal -> 16940, ip-10-134-22-106.eu-central-1.compute.internal -> 17876, ip-10-134-22-46.eu-central-1.compute.internal -> 16400, ip-10-134-22-114.eu-central-1.compute.internal -> 14902, ip-10-134-22-105.eu-central-1.compute.internal -> 44820)) to AM was unsuccessful
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)

Source

2017-04-20 Vinay Kumar Dudi

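"I believe this is because of a lack of resources for Spark in the YARN cluster."

I think so too, and I strongly recommend using the YARN UI to see how the resources are being used.

Regardless of what you see in the YARN UI, I did some calculations, and it appears that you do indeed have too few resources to run all 25 scripts at the same time.

Given...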

Data Nodes : 6 
RAM per Node : 56 GB 
Cores per Node: 32 
Instance Type: M4*4xLarge

It appears that you have 6 × 56 GB = 336 GB of RAM and 6 × 32 = 192 cores.

After the following command:

spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql1.hql

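You have already reserved 12 × 20 GB = 240 GB and 12 × 15 = 180 cores for executors, which is more than half of the available resources, just for the first spark-sql.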
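The same arithmetic can be extended to all 25 jobs with a short shell sketch. The numbers come from the question; the variable names are made up for illustration. It is a simplification: it counts the driver memory on top of the executors, but it ignores YARN's per-container memory overhead and the fact that YARN usually exposes less than the physical RAM of each node, so the real headroom is likely smaller still.

```shell
# Cluster capacity, taken from the question.
nodes=6
ram_per_node_gb=56
cores_per_node=32

# Per-job request, taken from the spark-sql flags in the question.
num_executors=12
executor_mem_gb=20
executor_cores=15
driver_mem_gb=10
jobs=25

cluster_mem_gb=$(( nodes * ram_per_node_gb ))
cluster_cores=$(( nodes * cores_per_node ))

# Memory and cores requested by one spark-sql invocation
# (executors plus driver memory; overhead is NOT included).
job_mem_gb=$(( num_executors * executor_mem_gb + driver_mem_gb ))
job_cores=$(( num_executors * executor_cores ))

# Demand if all jobs are started at once.
total_mem_gb=$(( jobs * job_mem_gb ))
total_cores=$(( jobs * job_cores ))

# How many such jobs fit side by side (integer division).
fit_by_mem=$(( cluster_mem_gb / job_mem_gb ))
fit_by_cores=$(( cluster_cores / job_cores ))

echo "cluster:  ${cluster_mem_gb} GB, ${cluster_cores} cores"
echo "one job:  ${job_mem_gb} GB, ${job_cores} cores"
echo "${jobs} jobs:  ${total_mem_gb} GB, ${total_cores} cores"
echo "jobs that fit at once: ${fit_by_mem} (by memory), ${fit_by_cores} (by cores)"
```

With these settings only one job fits at a time, so starting 25 of them together asks for many times what the cluster has.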

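I think the problem is the single & that puts each spark-sql in the background; given that you have 25 spark-sql processes, you run into a shortage of resources. I am not surprised.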
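One possible way around this, sketched below, is to start the scripts in fixed-size batches instead of all 25 at once. This is only an illustration, not something proposed in the question: the function names run_one and run_in_batches are hypothetical, and the spark-sql flags are copied unchanged from the question. With those flags roughly one job fits on the cluster at a time, so either the batch size or the executor settings would need to be chosen to match the available resources.

```shell
# Run a single HQL file with the flags from the question.
run_one() {
  spark-sql --master yarn --num-executors 12 --executor-memory 20G \
    --executor-cores 15 --driver-memory 10G -f "$1"
}

# Usage: run_in_batches MAX_PARALLEL FILE...
# Starts at most MAX_PARALLEL jobs in the background, waits for that
# whole batch to finish, then starts the next batch.
# Returns non-zero if any job failed.
run_in_batches() {
  max_parallel=$1
  shift
  pids=""
  count=0
  failures=0
  for hql in "$@"; do
    run_one "$hql" &
    pids="$pids $!"
    count=$(( count + 1 ))
    if [ "$count" -ge "$max_parallel" ]; then
      # Batch is full: collect the exit status of every job in it.
      for pid in $pids; do
        wait "$pid" || failures=$(( failures + 1 ))
      done
      pids=""
      count=0
    fi
  done
  # Wait for the last, possibly partial, batch.
  for pid in $pids; do
    wait "$pid" || failures=$(( failures + 1 ))
  done
  [ "$failures" -eq 0 ]
}
```

For example, run_in_batches 2 hql1.hql hql2.hql hql3.hql would run the first two files together and the third afterwards. A batch only advances when its slowest job finishes, which is simple but not the most efficient scheduling.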

Source

2017-04-20 12:36:20

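Setting Spark dynamic allocation to false should solve the problem.

Even though the executor memory is set in the command, Spark allocates resources dynamically when they are available in the cluster: with dynamic allocation enabled, it can request more executors than the command specifies. To limit usage to what the command asks for, set the spark.dynamicAllocation.enabled parameter to false.

You can change it directly in the Spark configuration file, or pass it as a configuration parameter on the command line.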

spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G --conf spark.dynamicAllocation.enabled=false -f hive1.hql
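For the configuration-file route, the entry would look roughly like the fragment below. The file location is an assumption: Spark reads conf/spark-defaults.conf under its installation directory, and on EMR that is typically /etc/spark/conf/spark-defaults.conf, so check where it lives on your cluster.

```
# spark-defaults.conf (location assumed, see above)
spark.dynamicAllocation.enabled   false
```

A value passed with --conf on the command line takes precedence over the one in this file, so the two approaches can be mixed.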

Source

2017-04-20 16:43:12
