2017-06-02

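EMR slave bootstrap fails in node provisioner after bootstrap action succeeds

I am trying to launch a cluster in AWS using EMR w/ Spark. I have a bash bootstrap script that installs some Python packages, downloads credentials, and applies some configuration. The bootstrap action succeeds on the master but fails on the slaves. The only hint of the error is "i-#####: failed to start. bootstrap action 2 failed with non-zero exit code". The message immediately before it is "i-#####: bootstrap action 1 completed". (In both cases the ID is the slave's instance ID; the master also reports bootstrap action 1 as successful.)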

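So it looks like the last command executed in bootstrap action 2 had an error and caused the bootstrap script to return a non-zero exit code. However, I have configured only one bootstrap action. Is there another bootstrap action that runs automatically on non-master nodes?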

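No log shows what the actual error is. I have looked at the bootstrap logs on S3 (which do not show up reliably), and I have tried tailing the logs under /var/log/bootstrap-actions/ on both the slave and the master during startup.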

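I am fairly sure the error is not in my script (said every developer ever...). I can create a vanilla EMR cluster with no bootstrap action, log in once it is waiting, and run my bootstrap script as user hadoop with no errors. I have also checked the last few commands (a grep and an echo) and verified that they neither return a non-zero exit code nor cause the script to return one.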

I suspect the problem lies in some mysterious second bootstrap action. Is that the case? How can I determine the error?
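To rule the bootstrap script in or out, it can help to have the script record exactly which command fails. The following is a minimal sketch, assuming bash; the log path and the names `log` and `on_error` are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Sketch: stop on the first failing command and record where it failed.
# -E makes the ERR trap fire inside functions and subshells as well.
set -eE -o pipefail

# Hypothetical log location, not a path that EMR itself uses.
LOG_FILE="${LOG_FILE:-/tmp/bootstrap-debug.log}"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOG_FILE"
}

on_error() {
    # $? still holds the exit status of the command that triggered the trap.
    local exit_code=$?
    log "FAILED at line $1 (exit $exit_code): $2"
    exit "$exit_code"
}
trap 'on_error "$LINENO" "$BASH_COMMAND"' ERR

log "bootstrap started as $(whoami)"
# ... the real bootstrap steps would go here ...
log "bootstrap finished with no errors"
```

If the final "finished" line appears in the log on a slave, the script itself exited cleanly and the failure has to come from a later step.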

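UPDATE I logged in to a slave node during startup. I found the bootstrap actions in /emr/instance-controller/lib/bootstrap-actions. There is only one subfolder, and it contains my bootstrap script. I then ran tail -f /emr/instance-controller/log/instance-controller.log and confirmed that my script started. After about 15 status-check cycles (15 minutes), I saw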

2017-06-02 13:44:30,173 INFO InstanceConfigurer: Script 1 - Execution succeeded 

Then I saw another AWS script start, which appears to be the one that fails.

2017-06-02 13:44:30,181 INFO InstanceConfigurer: Running provision-node, with id 5aed1c54-4210-4387-944a-4fdbbce6dc8d 
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Fetching file '/var/lib/aws/emr/provision-node' 
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - startExec '/var/lib/aws/emr/provision-node' 
2017-06-02 13:44:30,189 INFO InstanceConfigurer: startExec '/var/lib/aws/emr/provision-node' 
2017-06-02 13:44:30,190 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Environment: 
... 
2017-06-02 13:44:54,201 INFO InstanceConfigurer: Output from command '/var/lib/aws/emr/provision-node': 
stdout: 
stderr: 

2017-06-02 13:44:54,202 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node 
2017-06-02 13:44:54,202 INFO InstanceConfigurer: waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node 
2017-06-02 13:44:54,203 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - total process run time: 24 seconds 
2017-06-02 13:44:54,203 INFO InstanceConfigurer: total process run time: 24 seconds 
2017-06-02 13:44:54,217 ERROR InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Execution for /var/lib/aws/emr/provision-node failed with code '255' 
2017-06-02 13:44:54,219 ERROR InstanceConfigurer: Startup failed with 
aws157.instancecontroller.common.model.InstanceConfiguratorException: Source: PROVISION_NODE | ErrorCode: SCRIPT_EXECUTION_FAILED_CODE | Execution for /var/lib/aws/emr/provision-node failed with code '255' 
    at aws157.instancecontroller.common.InstanceConfigurator.runScript(InstanceConfigurator.java:563) 
    at aws157.instancecontroller.common.InstanceConfigurator.provisionNode(InstanceConfigurator.java:225) 
    at aws157.instancecontroller.common.InstanceConfigurator.doDistributionConfigure(InstanceConfigurator.java:201) 
    at aws157.instancecontroller.common.InstanceConfigurator.access$200(InstanceConfigurator.java:70) 
    at aws157.instancecontroller.common.InstanceConfigurator$1.run(InstanceConfigurator.java:251) 
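When the log is long, the failing step can be pulled out with a small filter. This is only a sketch: it assumes the ERROR lines keep the shape shown above, and `find_failed_scripts` is a hypothetical helper name.

```shell
#!/bin/bash
# Print each script the instance controller reported as failed,
# together with its exit code.
find_failed_scripts() {
    # $1: path to an instance-controller log file
    grep -E "ERROR InstanceConfigurer: .*failed with code" "$1" |
        sed -E "s/.*Execution for (.*) failed with code '([0-9]+)'.*/\1 exited with \2/"
}

# On a node, the log path from the question would be passed in:
# find_failed_scripts /emr/instance-controller/log/instance-controller.log
```

Against the excerpt above, this reports /var/lib/aws/emr/provision-node with exit code 255.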

I am not familiar with the /var/lib/aws/emr/provision-node script, but its only contents are

#!/bin/bash 
set -ex 

sudo /usr/share/aws/emr/node-provisioner/bin/provision-node "$@" 

Looking at /usr/share/aws/emr/node-provisioner/bin/provision-node, I can see that this script does a bunch of work to determine the path $EMR_NODE_PROVISIONER_HOME and then, from there, runs

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop _UUID_

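I worked out that this Java class is what gets run by reading the source of the provision-node script, and I ran it on its own. I have not been able to watch the logs or the failure in real time to see what goes wrong. When I run it separately, I get the exception below. But I think that is because I am passing junk data instead of the UUID (I do not know where the UUID comes from, and it is different for each slave startup).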

2017-06-02 14:55:13,593 ERROR main: Encountered a problem while provisioning 
java.net.SocketTimeoutException: Read timed out 
    at java.net.SocketInputStream.socketRead0(Native Method) 
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
    at java.net.SocketInputStream.read(SocketInputStream.java:171) 
    at java.net.SocketInputStream.read(SocketInputStream.java:141) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) 
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) 
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) 
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569) 
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) 
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) 
    at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37) 
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94) 
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) 
    at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49) 
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38) 
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31) 
    at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32) 
    at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51) 
    at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21) 
    at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20) 

So my question now is: what is com.amazonaws.emr.node.provisioner.Program, and why is it failing (or how can I find out why)?

UPDATE 2

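I was able to tail the output of /usr/share/aws/emr/node-provisioner/bin/provision-node all the way to the failure, and the result is the same as in my standalone run above.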

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop 
2017-06-02 17:05:37,869 ERROR main: Encountered a problem while provisioning 
java.net.SocketTimeoutException: Read timed out 
    at java.net.SocketInputStream.socketRead0(Native Method) 
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
    at java.net.SocketInputStream.read(SocketInputStream.java:171) 
    at java.net.SocketInputStream.read(SocketInputStream.java:141) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) 
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) 
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) 
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569) 
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474) 
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) 
    at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37) 
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94) 
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) 
    at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49) 
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38) 
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31) 
    at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32) 
    at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51) 
    at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21) 
    at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20) 

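I am guessing this could be a firewall/security group issue, but I am using the default security groups generated by EMR, so I would expect the required ports to be open. I am building this cluster in a private subnet of a VPC, which could be a problem. However, when I build the cluster without bootstrapping, I do not get this failure. My next debugging step is to build a vanilla cluster without bootstrapping and watch this same command.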

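UPDATE 3 Confirmed that a vanilla EMR cluster with Spark deploys successfully with no network changes. There are no errors from /usr/share/aws/emr/node-provisioner/bin/provision-node. After the java command starts, the next line on stderr shows a JSON dump of the platform configuration parameters. stdout, however, shows yum packages being installed from the Bigtop repo. I do not see a yum command in the script or in the stderr output (from set -xe), so I assume the yum commands must be issued inside that Java program. I do not know why they succeed here but fail when the bootstrap action is present.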

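My private VPC does have an S3 endpoint w/ subnet routes and firewall rules that allow access to the endpoint's prefix list (plist). My bootstrap script is able to install packages with yum (not from the Bigtop repo), copy files from S3, and download code from an external git repo on the internet.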

Answer

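My bootstrap script was running yum update. When I commented it out, I was able to get past the provision-node script, and the cluster finally reached the waiting state. One of the updates must be causing some sort of conflict or other problem; I do not know which one. For now, I am simply going to avoid running yum update.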
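If some updates are still wanted, one option is to keep yum update but exclude the package families under suspicion. This is a sketch, assuming bash: the patterns in EXCLUDES are guesses rather than confirmed culprits, and `run_yum_update` and `DRY_RUN` are hypothetical names. The `--exclude` option itself is a standard yum flag.

```shell
#!/bin/bash
# Package patterns to hold back; these are guesses, adjust as needed.
EXCLUDES=('kernel*' 'java-1.8.0-openjdk*' 'util-linux*')

run_yum_update() {
    local pattern
    local args=(yum -y update)
    for pattern in "${EXCLUDES[@]}"; do
        args+=("--exclude=$pattern")
    done
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be reviewed.
        printf '%s\n' "${args[*]}"
    else
        sudo "${args[@]}"
    fi
}

# Review first with: run_yum_update
# Then apply with:   DRY_RUN=0 run_yum_update
```

Removing one pattern per test cluster would show which family breaks provisioning.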

Here is the yum log. I am guessing it is not one of the R or mysql packages. Maybe java, kernel, aws, or util-linux?

Installed: 
    kernel.x86_64 0:4.9.27-14.31.amzn1 

Updated: 
    R.x86_64 0:3.3.3-1.51.amzn1 
    R-core.x86_64 0:3.3.3-1.51.amzn1 
    R-core-devel.x86_64 0:3.3.3-1.51.amzn1 
    R-devel.x86_64 0:3.3.3-1.51.amzn1 
    R-java.x86_64 0:3.3.3-1.51.amzn1 
    R-java-devel.x86_64 0:3.3.3-1.51.amzn1 
    aws-amitools-ec2.noarch 0:1.5.13-0.2.amzn1 
    aws-cli.noarch 0:1.11.83-1.46.amzn1 
    java-1.8.0-openjdk.x86_64 1:1.8.0.131-2.b11.30.amzn1 
    java-1.8.0-openjdk-devel.x86_64 1:1.8.0.131-2.b11.30.amzn1 
    java-1.8.0-openjdk-headless.x86_64 1:1.8.0.131-2.b11.30.amzn1 
    libRmath.x86_64 0:3.3.3-1.51.amzn1 
    libRmath-devel.x86_64 0:3.3.3-1.51.amzn1 
    libblkid.x86_64 0:2.23.2-33.28.amzn1 
    libmount.x86_64 0:2.23.2-33.28.amzn1 
    libuuid.x86_64 0:2.23.2-33.28.amzn1 
    mysql-config.x86_64 0:5.5.56-1.17.amzn1 
    mysql55.x86_64 0:5.5.56-1.17.amzn1 
    mysql55-devel.x86_64 0:5.5.56-1.17.amzn1 
    mysql55-libs.x86_64 0:5.5.56-1.17.amzn1 
    ntp.x86_64 0:4.2.6p5-44.34.amzn1 
    ntpdate.x86_64 0:4.2.6p5-44.34.amzn1 
    python27-botocore.noarch 0:1.5.46-1.63.amzn1 
    python27-jmespath.noarch 0:0.9.2-1.12.amzn1 
    util-linux.x86_64 0:2.23.2-33.28.amzn1 
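To test the suspects one at a time, the bare package names can be pulled out of a summary like the one above. This is a rough sketch that assumes the summary has been saved to a file and that every package line has a name and a version column; `list_changed_packages` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the bare name of every installed or updated package, in file order.
list_changed_packages() {
    # $1: file holding the "Installed:" / "Updated:" summary printed by yum
    awk '
        /^(Installed|Updated):/ { next }      # skip the section headers
        NF >= 2 {                             # package lines: name.arch version
            sub(/\.(x86_64|noarch)$/, "", $1) # drop the architecture suffix
            print $1
        }
    ' "$1"
}

# Example (the file name is an assumption):
# list_changed_packages yum-summary.txt
```

Each name could then be updated on its own in a test cluster to see which one makes provision-node fail.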

Deeper analysis is welcome. Otherwise, I am moving on to actually getting my code running.