2017-09-15 78 views

How to create a UDF that creates a new column and modifies an existing column

I have a dataframe like this:

id | color 
---| ----- 
1 | red-dark 
2 | green-light 
3 | red-light 
4 | blue-sky 
5 | green-dark 

I want to create a UDF so that my dataframe becomes:

id | color | shade 
---| ----- | ----- 
1 | red | dark 
2 | green | light 
3 | red | light 
4 | blue | sky 
5 | green | dark 

I wrote a UDF for this:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    return ",".join(data_str.split("-"))

my_function_udf = udf(my_function, StringType())

# Apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))

However, this does not transform the dataframe the way I want. Instead, it turns it into:

id | color  | shade 
---| ---------- | ----- 
1 | red-dark | red,dark 
2 | green-light | green,light
3 | red-light | red,light 
4 | blue-sky | blue,sky 
5 | green-dark | green,dark 

How can I transform the dataframe the way I want in pyspark?

I tried the approach from the suggested question:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)

df = df.withColumn("colorshade", color_shade_udf(df['color']))

#Gives the following 

id | color  | colorshade 
---| ---------- | ----- 
1 | red-dark | [{"color":"red","shade":"dark"}] 
2 | green-light | [{"color":"green","shade":"light"}]
3 | red-light | [{"color":"red","shade":"light"}] 
4 | blue-sky | [{"color":"blue","shade":"sky"}] 
5 | green-dark | [{"color":"green","shade":"dark"}] 

I feel like I'm getting closer.


@火花卫生学习 Now just do another '.withColumn("color", "colorshade.color")', a similar one for shade, and then 'dropColumn("colorshade")'.
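The comment's idea can be sketched roughly as follows. This is only a sketch under a few assumptions: the UDF returns a single struct rather than an array of structs, the struct fields are nullable, and the names `split_color` and `with_color_and_shade` are made up for illustration. Note that in PySpark the second argument of `withColumn` has to be a Column expression (for example `col("colorshade.color")`), and the method for removing a column is `drop`.

```python
def split_color(value):
    """Split 'red-dark' into ('red', 'dark'); shade is None when there is no hyphen."""
    if value is None:
        return None
    color, sep, shade = value.partition("-")
    return (color, shade if sep else None)


def with_color_and_shade(df):
    """Hypothetical helper: overwrite `color` and add `shade` on a DataFrame
    that has a string column named `color`."""
    # pyspark is imported here so that split_color can be tried without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    # One struct per row (not an array), with nullable fields.
    schema = StructType([
        StructField("color", StringType(), True),
        StructField("shade", StringType(), True),
    ])
    split_color_udf = udf(split_color, schema)

    return (
        df.withColumn("colorshade", split_color_udf(col("color")))
          .withColumn("color", col("colorshade.color"))  # overwrite the existing column
          .withColumn("shade", col("colorshade.shade"))  # add the new column
          .drop("colorshade")                            # remove the temporary struct
    )
```

Usage would be `df = with_color_and_shade(df)`. Returning a tuple from the UDF works because PySpark maps tuples onto the declared struct fields in order.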

Answers


You can use the built-in function split():

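from pyspark.sql.functions import split, col

# Split "red-dark" into ["red", "dark"], then pick each part by index.
# The select keeps only the listed columns, so "arr" needs no separate drop.
(df.withColumn("arr", split(df.color, "\\-"))
    .select("id",
            col("arr")[0].alias("color"),
            col("arr")[1].alias("shade"))
    .show())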
+---+-----+-----+
| id|color|shade|
+---+-----+-----+
|  1|  red| dark|
|  2|green|light|
|  3|  red|light|
|  4| blue|  sky|
|  5|green| dark|
+---+-----+-----+
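If the dataframe has more columns than `id` and `color`, the `select` above would drop them unless each one is listed. A possible variation, assuming the same `color` column and using a hypothetical helper name `add_shade_column`, is to chain two `withColumn` calls so that all other columns are carried along:

```python
def add_shade_column(df):
    """Hypothetical helper: split `color` on '-' while keeping every other column of df."""
    # pyspark is imported here so the helper can be defined without Spark installed.
    from pyspark.sql.functions import col, split

    parts = split(col("color"), "-")
    return (
        # Add `shade` first: it has to read the original, unsplit `color` value.
        df.withColumn("shade", parts.getItem(1))
          # Then overwrite `color` with the part before the hyphen.
          .withColumn("color", parts.getItem(0))
    )
```

The order of the two calls matters: if `color` were overwritten first, the second call would split the already shortened value and no longer find the shade.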