
We are using Spark (via PySpark) and are running into a problem on an Ubuntu Server 14.04 LTS virtual machine with Java version "1.8.0_45", running in a VMware ESX 5.5 environment. A simple PySpark example fails.

Running a simple sc.parallelize(['2', '4']).collect() produces:

15/07/28 10:11:42 INFO SparkContext: Starting job: collect at <stdin>:1 
15/07/28 10:11:42 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 2 output partitions (allowLocal=false) 
15/07/28 10:11:42 INFO DAGScheduler: Final stage: ResultStage 0(collect at <stdin>:1) 
15/07/28 10:11:42 INFO DAGScheduler: Parents of final stage: List() 
15/07/28 10:11:42 INFO DAGScheduler: Missing parents: List() 
15/07/28 10:11:42 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:396), which has no missing parents 
15/07/28 10:11:42 INFO TaskSchedulerImpl: Cancelling stage 0 
15/07/28 10:11:42 INFO DAGScheduler: ResultStage 0 (collect at <stdin>:1) failed in Unknown s 
15/07/28 10:11:42 INFO DAGScheduler: Job 0 failed: collect at <stdin>:1, took 0,058933 s 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/opt/spark/spark/python/pyspark/rdd.py", line 745, in collect 
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 
    File "/opt/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ 
    File "/opt/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
java.lang.reflect.Constructor.newInstance(Constructor.java:422) 
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68) 
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60) 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73) 
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80) 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) 
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289) 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874) 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815) 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799) 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419) 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411) 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 

    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256) 
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256) 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:884) 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815) 
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 

We found this issue describing the same behavior: https://issues.apache.org/jira/browse/SPARK-9089

Any idea what is going on, or what we could try?

Answer


As the issue says:

We were facing the same problem, and after some digging and a good deal of luck we found the root cause.

It is caused by snappy-java extracting its native library into java.io.tmpdir (/tmp by default) and setting the executable flag on the extracted file. If /tmp is mounted with the "noexec" option, snappy-java cannot set the executable flag and throws an exception. See the SnappyLoader.java code.
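
A quick way to check whether a node is affected is to look at the mount options of the filesystem behind the JVM temp directory. The following is only a diagnostic sketch for a Linux host: it reads /proc/mounts and assumes the default java.io.tmpdir of /tmp.

import os

def mount_options(path):
    """Return (mount_point, options) of the filesystem that contains `path`."""
    path = os.path.realpath(path)
    best_mount, best_opts = "", []
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            mount_point, options = fields[1], fields[3]
            # Keep the longest mount point that is a prefix of the path.
            if path == mount_point or path.startswith(mount_point.rstrip("/") + "/"):
                if len(mount_point) > len(best_mount):
                    best_mount, best_opts = mount_point, options.split(",")
    return best_mount, best_opts

mount_point, opts = mount_options("/tmp")  # /tmp is the default java.io.tmpdir
print(mount_point + ": " + ",".join(opts))
if "noexec" in opts:
    print("snappy-java will not be able to load its native library from this filesystem")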

We fixed the problem by mounting /tmp without the "noexec" option.

Sean Owen: if you want to reproduce the issue, mount /tmp with the "noexec" option, or set java.io.tmpdir to a directory that is mounted with "noexec".

Maybe it would be better for Spark to set the org.xerial.snappy.tempdir property to the value of spark.local.dir, but that would still not prevent spark.local.dir from being mounted with "noexec" as well.

Removing noexec from the /tmp mount point resolved the issue for us.
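
If remounting /tmp without noexec is not possible in your environment, a possible workaround in the spirit of the org.xerial.snappy.tempdir suggestion quoted above is to point the temp directories at an exec-mounted location yourself. The sketch below is only illustrative: /var/spark-tmp is a placeholder directory that must exist on every node, and the driver-side flags usually have to be passed at launch (e.g. via --driver-java-options) because the driver JVM is already running when SparkConf is read.

from pyspark import SparkConf, SparkContext

# Placeholder path: must exist on every node and sit on an exec-mounted filesystem.
exec_tmp = "/var/spark-tmp"
java_opts = "-Djava.io.tmpdir={0} -Dorg.xerial.snappy.tempdir={0}".format(exec_tmp)

conf = (SparkConf()
        .setAppName("snappy-noexec-workaround")
        .set("spark.executor.extraJavaOptions", java_opts))
# The driver JVM is already running when SparkConf is applied, so pass the same flags
# at launch instead, e.g.: pyspark --driver-java-options "-Djava.io.tmpdir=/var/spark-tmp ..."

sc = SparkContext(conf=conf)
print(sc.parallelize(['2', '4']).collect())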