2017-10-13

I am having trouble with PySpark: I want to read a CSV file and do one-hot encoding (for now).

I am getting an error that spans two pages.

My sample code is below:

from pyspark.sql import SQLContext 
from pyspark.sql.types import * 
from pyspark import SparkContext 
from pyspark.ml.feature import OneHotEncoder, StringIndexer 

sqlContext = SQLContext(sc) 
shirtsize = sc.textFile("shirt_sizes.*") 
shirt_header = shirtsize.first() 
fields = [StructField(field_name, StringType(),True) for field_name in shirt_header.split(',')] 

fields[0].dataType = IntegerType() 
schema = StructType(fields) 
shirtDF = spark.createDataFrame(shirtsize,schema) 
SI = StringIndexer(inputCol="ethcty",outputCol="ET_out") 
model = SI.fit(shirtDF) 

The error is too long to paste in full, but here is how it starts:

17/10/13 15:22:38 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 
1.0 (TID 1, cluster-a2db-m.c.internal, executor 1): org.apache.spark.api 
.python.PythonException: Traceback (most recent call last): 
    File "/usr/lib/spark/python/pyspark/worker.py", line 177, in main 
    process() 
    File "/usr/lib/spark/python/pyspark/worker.py", line 172, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/lib/spark/python/pyspark/serializers.py", line 268, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "/usr/lib/spark/python/pyspark/sql/session.py", line 520, in prepare 
    verify_func(obj, schema) 
    File "/usr/lib/spark/python/pyspark/sql/types.py", line 1366, in _verify_type 
    raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj))) 
TypeError: StructType can not accept object u'columnA,columnB,columnC,columnD....' in type <type 'unicode'> 

There are several errors of this kind that I am not able to debug.

Answer


Just use the CSV reader:

spark.read.schema(schema).format("csv").load(path)

Otherwise, you have to parse every line yourself first (split it, build Rows/tuples, and cast the values to the expected types) — the error above occurs precisely because `createDataFrame` was handed raw unsplit strings (including the header line) instead of tuples matching the schema.
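The manual route can be sketched in plain Python. Only the split-and-cast step is executed here; the Spark calls that would consume it are shown as comments, since this is an untested sketch and the column names and sample data are hypothetical:

```python
# Sketch of the manual parsing the answer describes: skip the header line,
# split each CSV line on commas, and cast the first field to int so the
# tuple matches a schema whose first column is IntegerType.

def parse_line(line, header="id,size,ethcty"):
    """Split one CSV line into a tuple, casting the first column to int."""
    if line == header:              # drop the header row
        return None
    fields = line.split(",")
    return (int(fields[0]),) + tuple(fields[1:])

# Hypothetical stand-in for the contents of shirt_sizes.*
lines = [
    "id,size,ethcty",
    "1,M,A",
    "2,XL,B",
]
rows = [r for r in (parse_line(l) for l in lines) if r is not None]
# rows == [(1, "M", "A"), (2, "XL", "B")]

# With Spark (not executed here), the equivalent would be roughly:
#   parsed = shirtsize.filter(lambda l: l != shirt_header).map(parse_line)
#   shirtDF = spark.createDataFrame(parsed, schema)
```

The built-in CSV reader does all of this (header skipping, splitting, type casting against the schema) for you, which is why it is the recommended route.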