2016-07-06 73 views
Error when creating a DataFrame from an RDD

I have the following code, in which I want to create a DataFrame from a PipelinedRDD:

print(type(simulation))
sqlContext.createDataFrame(simulation)

The print statement prints this:

<class 'pyspark.rdd.PipelinedRDD'> 

However, on the next line I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 

The error comes with this traceback:

---> 13 sqlContext.createDataFrame(simulation) 

/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 
    421 
    422   if isinstance(data, RDD): 
--> 423    rdd, schema = self._createFromRDD(data, schema, samplingRatio) 
    424   else: 
    425    rdd, schema = self._createFromLocal(data, schema) 

/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio) 
    308   """ 
    309   if schema is None or isinstance(schema, (list, tuple)): 
--> 310    struct = self._inferSchema(rdd, samplingRatio) 

Answers

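It looks like the schema cannot be inferred from your data. If you do not specify samplingRatio, only the first row is used to determine the types. You should try a non-zero sampling ratio, or specify the schema explicitly as follows: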

schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])
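
To illustrate why looking at only the first row can leave a field's type undetermined, here is a simplified pure-Python model. It is a sketch and not Spark's actual implementation: the helpers infer_field_types, merge_types and infer_schema and the sample_size parameter are hypothetical names made up for this illustration, and the rows are invented data. It only shows one possible cause (a None in the first row); it may not be what is failing in your job.

```python
# Simplified, hypothetical model of schema inference; NOT Spark's actual code.

def infer_field_types(row):
    """Return a type name per field, or None where the value itself is None."""
    return [None if value is None else type(value).__name__ for value in row]


def merge_types(known, new):
    """Fill fields that are still undetermined with types seen in another row."""
    return [k if k is not None else n for k, n in zip(known, new)]


def infer_schema(rows, sample_size=1):
    """Infer field types from the first `sample_size` rows (default: first row only)."""
    types = infer_field_types(rows[0])
    for row in rows[1:sample_size]:
        types = merge_types(types, infer_field_types(row))
    if any(t is None for t in types):
        raise ValueError("some field types could not be determined")
    return types


# Invented example data: the second field of the first row is None.
rows = [(1, None), (2, "b"), (3, "c")]

try:
    infer_schema(rows)  # first row only, so the second field stays unknown
    first_row_error = None
except ValueError as err:
    first_row_error = str(err)

# Looking at more rows resolves the unknown field.
sampled_types = infer_schema(rows, sample_size=3)
```

In the real API, the two ways around this correspond to the schema and samplingRatio parameters visible in the createDataFrame signature in your traceback: pass an explicit schema as above, or pass a samplingRatio so that more than one row is examined.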

I get 'NameError: global name 'StructType' is not defined'. Do I need to import a library? – octavian

Yes. You need this: from pyspark.sql.types import StructType, StructField, StringType, IntegerType – Sorin

Have you tried specifying only samplingRatio? – Sorin