2017-07-19 68 views
1

Applying a UDF to multiple columns in a Spark DataFrame

I have a DataFrame that looks like this:

| id| age| rbc| bgr| dm|cad|appet| pe|ane|classification| 
+---+----+------+-----+---+---+-----+---+---+--------------+ 
| 3|48.0|normal|117.0| no| no| poor|yes|yes|   ckd| 
.... 
.... 
.... 

I wrote a UDF to convert the categorical values yes, no, poor, normal into binary 0s and 1s:

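import org.apache.spark.sql.functions.{col, udf}

// Maps the known categorical strings to 0 or 1.
// There is no default case, so any other value (for example "poor") throws scala.MatchError.
def stringToBinary(stringValue: String): Int = {
  stringValue match {
    case "yes" => 1
    case "no" => 0
    case "present" => 1
    case "notpresent" => 0
    case "normal" => 1
    case "abnormal" => 0
  }
}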

val stringToBinaryUDF = udf(stringToBinary _) 

I apply it to the DataFrame as follows:

val newCol = stringToBinaryUDF.apply(col("pc")) //creates the new column with formatted value 
val refined1 = noZeroDF.withColumn("dm", newCol) //adds the new column to original 

How can I pass multiple columns to the UDF so that I don't have to repeat myself for the other categorical columns?

Answers

5

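A udf function should not be your choice when a built-in Spark function can do the same job, because a udf function serializes and deserializes the column data.

Given a dataframe such as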

+---+----+------+-----+---+---+-----+---+---+--------------+ 
|id |age |rbc |bgr |dm |cad|appet|pe |ane|classification| 
+---+----+------+-----+---+---+-----+---+---+--------------+ 
|3 |48.0|normal|117.0|no |no |poor |yes|yes|ckd   | 
+---+----+------+-----+---+---+-----+---+---+--------------+ 

you can meet your requirement with the when function:

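import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Returns 1 or 0 for the known values and leaves any other value (e.g. "poor") unchanged.
def applyFunction(column: Column): Column =
  when(column === "yes" || column === "present" || column === "normal", lit(1))
    .otherwise(when(column === "no" || column === "notpresent" || column === "abnormal", lit(0)).otherwise(column))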

df.withColumn("dm", applyFunction(col("dm"))) 
    .withColumn("cad", applyFunction(col("cad"))) 
    .withColumn("rbc", applyFunction(col("rbc"))) 
    .withColumn("pe", applyFunction(col("pe"))) 
    .withColumn("ane", applyFunction(col("ane"))) 
    .show(false) 

The result is

+---+----+---+-----+---+---+-----+---+---+--------------+ 
|id |age |rbc|bgr |dm |cad|appet|pe |ane|classification| 
+---+----+---+-----+---+---+-----+---+---+--------------+ 
|3 |48.0|1 |117.0|0 |0 |poor |1 |1 |ckd   | 
+---+----+---+-----+---+---+-----+---+---+--------------+ 

Now, the question clearly says that you don't want to repeat the procedure for every column. In that case you can do the following:

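// The same five columns that were converted one by one above.
val columnsToMap = Seq("dm", "cad", "rbc", "pe", "ane")

// Fold over the column names, applying the same conversion to each one.
val convertedDf = columnsToMap.foldLeft(df) { (acc, column) =>
  acc.withColumn(column, applyFunction(col(column)))
}

convertedDf.show(false)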
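
A variant of the loop above, as a rough sketch: the converted columns can also be built in a single select instead of a chain of withColumn calls. The local session, the one-row sample data (a subset of the columns shown earlier), and the names targets and converted are assumptions made for illustration; applyFunction is repeated only so that the snippet can be pasted into spark-shell or a Scala script on its own.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Assumed local session, only for trying the sketch out.
val spark = SparkSession.builder().master("local[*]").appName("when-select-sketch").getOrCreate()
import spark.implicits._

// One sample row shaped like the table in the answer (subset of its columns).
val df = Seq((3, "normal", "no", "no", "poor", "yes", "yes"))
  .toDF("id", "rbc", "dm", "cad", "appet", "pe", "ane")

// Same mapping as applyFunction in the answer; unmatched values pass through.
def applyFunction(column: Column): Column =
  when(column === "yes" || column === "present" || column === "normal", lit(1))
    .otherwise(when(column === "no" || column === "notpresent" || column === "abnormal", lit(0))
      .otherwise(column))

// Columns to convert (hypothetical name).
val targets = Set("rbc", "dm", "cad", "pe", "ane")

// Rewrite the target columns and keep the others, all in one projection.
val converted = df.select(df.columns.map { name =>
  if (targets(name)) applyFunction(col(name)).as(name) else col(name)
}: _*)

converted.show(false)
```

Column order and names are preserved because the projection walks df.columns in order and aliases each rewritten column back to its original name.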
+0

What is the problem, @Giridhar? Why do you keep accepting and un-accepting the answer? If the answer helped you, accept it; otherwise leave a comment. :)

0

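A UDF can take many parameters, i.e. many columns, but it should return one result, i.e. one column.

To do that, just add parameters to your stringToBinary function.

If you want it to take two columns, it would look like this: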

def stringToBinary(stringValue: String, secondValue: String): Int = {
  // secondValue is not used here; it only shows where the second column's value arrives.
  stringValue match {
    case "yes" => 1
    case "no" => 0
    case "present" => 1
    case "notpresent" => 0
    case "normal" => 1
    case "abnormal" => 0
  }
}
val stringToBinaryUDF = udf(stringToBinary _)
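
As a sketch of how a two-argument UDF is called: each parameter is bound to one Column at the call site, and the result is still a single column. The function bothYes, the sample data, the new column name and the local session are assumptions for illustration, not part of the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumed local session, only for trying the sketch out.
val spark = SparkSession.builder().master("local[*]").appName("udf-two-columns").getOrCreate()
import spark.implicits._

// Hypothetical sample data with two categorical columns.
val df = Seq(("yes", "no"), ("yes", "yes")).toDF("pe", "ane")

// Hypothetical two-argument UDF: 1 only when both inputs are "yes".
val bothYes = udf((first: String, second: String) =>
  if (first == "yes" && second == "yes") 1 else 0)

// Two columns go in, one column comes out.
val result = df.withColumn("pe_and_ane", bothYes(col("pe"), col("ane")))
result.show(false)
```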

Hope this helps.

+0

If it takes an array, 'def stringToBinary(stringValues: Array[String])', what would 'stringValues[0]' represent?

+0

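It would certainly represent the first 'String' in the 'Array', i.e. the first 'Column' passed to the **UDF**. Another option is to use '*' to accept a variable number of 'String' parameters and pass the arguments separated by commas as usual. The definition would look like 'def stringToBinary(stringValues: String*)'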
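
A rough sketch of the array approach discussed in these comments, assuming the columns are bundled with array(...) at the call site. As far as I know, Spark hands an array column to a Scala UDF as a Seq[String] (declaring the parameter as Array[String] typically fails with a cast error at runtime), and udf wraps fixed-arity functions, so a 'String*' definition would still receive its values as one sequence rather than as comma-separated columns. The names countYes and yes_count, the sample row and the local session are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}

// Assumed local session, only for trying the sketch out.
val spark = SparkSession.builder().master("local[*]").appName("udf-array-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample row with three categorical columns.
val df = Seq(("yes", "no", "yes")).toDF("dm", "pe", "ane")

// Hypothetical UDF: counts how many of the passed values are "yes".
val countYes = udf((values: Seq[String]) => values.count(_ == "yes"))

// array(...) bundles any number of same-typed columns into one argument;
// values(0) inside the UDF is the value of the first column listed here ("dm").
val result = df.withColumn("yes_count", countYes(array(col("dm"), col("pe"), col("ane"))))
result.show(false)
```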