PySpark - Casting multiple columns from Str to Int

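I am trying to cast multiple String columns to integer in a DataFrame using PySpark 2.1.0. The dataset starts out as an RDD, and creating the DataFrame from it raises the following error: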

TypeError: StructType can not accept object 3 in type <class 'int'>

A sample of what I am trying to do:

import pyspark.sql.types as typ 
from pyspark.sql.functions import * 

labels = [ 
    ('A', typ.StringType()), 
    ('B', typ.IntegerType()), 
    ('C', typ.IntegerType()), 
    ('D', typ.IntegerType()), 
    ('E', typ.StringType()), 
    ('F', typ.IntegerType()) 
] 

rdd = sc.parallelize(["1", 2, 3, 4, "5", 6]) 
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels]) 
df = spark.createDataFrame(rdd, schema) 
df.show() 

cols_to_cast = [dt[0] for dt in df.dtypes if dt[1]=='string'] 
#df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast)) 

df2 = df.select(*(df[dt[0]].cast("integer").alias(dt[0]) 
         for dt in df.dtypes if dt[1]=='string')) 

df2.show()

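The first problem is that the DataFrame is not being created from the RDD. Beyond that, I have tried two ways of casting (df2), the first of which is commented out.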

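Any suggestions? Alternatively, is there a way to use the .withColumn function to cast all the columns in one go, rather than specifying each column? The actual dataset is not large, but it has many columns.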


2017-04-24 alortimor

You can `map` to a new RDD with your cast columns.
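As a rough illustration of the comment above, here is a minimal sketch of a row-level cast that could be mapped over an RDD. The helper name `cast_row` and the position set `{0, 4}` (columns A and E in the question) are assumptions made for this example, and the Spark call is left as a comment so the snippet runs without a cluster.

```python
def cast_row(row, string_positions):
    """Return a copy of row with the values at string_positions converted to int."""
    return [int(v) if i in string_positions else v for i, v in enumerate(row)]

# One row shaped like the data in the question: A and E arrive as strings.
row = ["1", 2, 3, 4, "5", 6]
converted = cast_row(row, {0, 4})
print(converted)

# With a SparkContext available, the same function could be mapped over an RDD of rows:
# int_rdd = rdd.map(lambda r: cast_row(r, {0, 4}))
```

If you go this way, the schema passed to createDataFrame would need IntegerType for A and E as well, since the values are already ints by then.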

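The problem is not your code, it is your data. You are passing a single flat list, which is treated as a single column of six values rather than one row with the six columns you want.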

Try the RDD line as below and it should work fine (notice the extra brackets around the list):

rdd = sc.parallelize([["1", 2, 3, 4, "5", 6]])
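To make the difference between the two shapes concrete, here is a plain-Python sketch with no Spark involved. `fits_schema` is a hypothetical stand-in written for this example; it is not Spark's actual validation, only a way to show that a flat list yields six scalar records while the nested list yields one record with six fields.

```python
field_names = ["A", "B", "C", "D", "E", "F"]

def fits_schema(record, names):
    """Hypothetical check: a record fits if it is a list/tuple with one value per field."""
    return isinstance(record, (list, tuple)) and len(record) == len(names)

flat = ["1", 2, 3, 4, "5", 6]        # six records, each a bare scalar
nested = [["1", 2, 3, 4, "5", 6]]    # one record holding six values

flat_ok = [fits_schema(r, field_names) for r in flat]
nested_ok = [fits_schema(r, field_names) for r in nested]
print(flat_ok)    # every entry is False
print(nested_ok)  # [True]
```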

With that line corrected, your code gives me the following output:

+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+

+---+---+
|  A|  E|
+---+---+
|  1|  5|
+---+---+
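Note that the second table only contains A and E, because the select keeps just the string columns. If you want every column back with the string ones cast in a single pass, one possible approach is to build one SQL expression per column from df.dtypes and hand the list to selectExpr. The helper `cast_exprs` and its parameters are names made up for this sketch, and the Spark call itself is shown as a comment so the snippet runs without a session.

```python
def cast_exprs(dtypes, from_type="string", to_type="int"):
    """Build one SQL expression per column: cast columns of from_type, pass the rest through."""
    exprs = []
    for name, dtype in dtypes:
        if dtype == from_type:
            exprs.append("CAST(`{0}` AS {1}) AS `{0}`".format(name, to_type))
        else:
            exprs.append("`{0}`".format(name))
    return exprs

# (name, type) pairs in the shape that df.dtypes returns for the schema in the question.
dtypes = [("A", "string"), ("B", "int"), ("C", "int"),
          ("D", "int"), ("E", "string"), ("F", "int")]
exprs = cast_exprs(dtypes)
print(exprs)

# With a live SparkSession (not started here), the expressions would be applied as:
# df2 = df.selectExpr(*cast_exprs(df.dtypes))
```

This keeps the column order intact and avoids chaining one .withColumn call per column, which matters when the dataset has many columns.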


2017-04-24 15:54:59 Pushkr
