Apply a UDF to multiple columns in a Spark DataFrame

I have a DataFrame that looks like the one below:

| id| age| rbc| bgr| dm|cad|appet| pe|ane|classification| 
+---+----+------+-----+---+---+-----+---+---+--------------+ 
| 3|48.0|normal|117.0| no| no| poor|yes|yes|   ckd| 
.... 
.... 
....

I wrote a UDF to convert categorical values such as yes, no, poor, and normal into binary 0s and 1s:

def stringToBinary(stringValue: String): Int = { 
    stringValue match { 
     case "yes" => return 1 
     case "no" => return 0 
     case "present" => return 1 
     case "notpresent" => return 0 
     case "normal" => return 1 
     case "abnormal" => return 0 
    } 
} 

val stringToBinaryUDF = udf(stringToBinary _)
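
A side note on the function above: the match has no default case, so a value outside the six listed strings (for example "poor" in the appet column, or a null) would throw a scala.MatchError at runtime. A minimal sketch of a total version follows; the name stringToBinaryOpt and the choice of Option[Int] as the return type are assumptions for illustration, not part of the original code.

```scala
// Sketch only: same mapping as stringToBinary, but every input is handled.
// stringToBinaryOpt is a hypothetical name introduced for this example.
def stringToBinaryOpt(stringValue: String): Option[Int] = stringValue match {
  case "yes" | "present" | "normal"     => Some(1)
  case "no" | "notpresent" | "abnormal" => Some(0)
  case _                                => None // anything else, including null
}

println(stringToBinaryOpt("yes"))
println(stringToBinaryOpt("abnormal"))
println(stringToBinaryOpt("poor"))
```

As far as I know, wrapping this function with udf(...) turns None into a SQL null in the resulting column, but that is worth confirming against your Spark version.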

I apply it to the DataFrame as follows:

val newCol = stringToBinaryUDF.apply(col("pc")) //creates the new column with formatted value 
val refined1 = noZeroDF.withColumn("dm", newCol) //adds the new column to original

How can I pass multiple columns to the UDF so that I don't have to repeat myself for the other categorical columns?

Source

2017-07-19 Giridhar Karnik

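A udf should not be your first choice when a built-in Spark function can do the same job, because a udf has to serialize and deserialize the column data.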

Given a dataframe such as

+---+----+------+-----+---+---+-----+---+---+--------------+ 
|id |age |rbc |bgr |dm |cad|appet|pe |ane|classification| 
+---+----+------+-----+---+---+-----+---+---+--------------+ 
|3 |48.0|normal|117.0|no |no |poor |yes|yes|ckd   | 
+---+----+------+-----+---+---+-----+---+---+--------------+

you can meet your requirement with the when function:

import org.apache.spark.sql.functions._ 
def applyFunction(column : Column) = when(column === "yes" || column === "present" || column === "normal", lit(1)) 
    .otherwise(when(column === "no" || column === "notpresent" || column === "abnormal", lit(0)).otherwise(column)) 

df.withColumn("dm", applyFunction(col("dm"))) 
    .withColumn("cad", applyFunction(col("cad"))) 
    .withColumn("rbc", applyFunction(col("rbc"))) 
    .withColumn("pe", applyFunction(col("pe"))) 
    .withColumn("ane", applyFunction(col("ane"))) 
    .show(false)

The result is

+---+----+---+-----+---+---+-----+---+---+--------------+ 
|id |age |rbc|bgr |dm |cad|appet|pe |ane|classification| 
+---+----+---+-----+---+---+-----+---+---+--------------+ 
|3 |48.0|1 |117.0|0 |0 |poor |1 |1 |ckd   | 
+---+----+---+-----+---+---+-----+---+---+--------------+

Since the question says you don't want to repeat the process for every column, you can do the following:

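// The original listed "rbc" twice; "dm" is assumed here, based on the result shown above.
val columnsTomap = df.select("dm", "cad", "rbc", "pe", "ane").columns 

var tempdf = df 
// foreach rather than map, since the loop runs only for its side effect on tempdf
columnsTomap.foreach(column => { 
    tempdf = tempdf.withColumn(column, applyFunction(col(column))) 
}) 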

tempdf.show(false)
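
The loop above can also be written without a mutable variable by folding over the column names. The following is a minimal sketch under stated assumptions: the one-row sample DataFrame, the local SparkSession, and the helper name toBinary are all hypothetical. Unlike applyFunction, toBinary has no otherwise branch, so values outside both lists become null and the column stays integer-typed.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[1]").appName("fold-columns").getOrCreate()
import spark.implicits._

// Hypothetical one-row sample mirroring part of the table above.
val df = Seq(("normal", "no", "no", "poor", "yes", "yes"))
  .toDF("rbc", "dm", "cad", "appet", "pe", "ane")

// Hypothetical helper: 1 for the positive labels, 0 for the negative ones,
// null for anything else (no otherwise branch).
def toBinary(c: Column): Column =
  when(c.isin("yes", "present", "normal"), 1)
    .when(c.isin("no", "notpresent", "abnormal"), 0)

val columnsToMap = Seq("dm", "cad", "rbc", "pe", "ane")

// Each step takes the DataFrame built so far and replaces one column.
val result: DataFrame = columnsToMap.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, toBinary(col(name)))
}

result.show(false)
```

The fold starts from df and threads the intermediate DataFrame through each withColumn call, which is what the var-based loop does by reassignment.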

Source

2017-07-19 12:04:16

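What is your issue, @Giridhar? Why have you accepted and unaccepted the answer several times? If the answer helped you, please accept it; otherwise, leave a comment. :)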

A UDF can take many parameters, i.e. many columns, but it should return one result, i.e. one column.

To do that, simply add parameters to your stringToBinary function.

If you want it to take two columns, it would look like this:

def stringToBinary(stringValue: String, secondValue: String): Int = { 
stringValue match { 
    case "yes" => return 1 
    case "no" => return 0 
    case "present" => return 1 
    case "notpresent" => return 0 
    case "normal" => return 1 
    case "abnormal" => return 0 
} 
} 
val stringToBinaryUDF = udf(stringToBinary _)
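
To make the call side concrete, here is a minimal sketch of a two-argument UDF applied to two columns. The rule (1 if either value is "yes"), the names eitherYes and pe_or_ane, the sample rows, and the local SparkSession are all assumptions for illustration; the answer above does not say how the second value should be used.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[1]").appName("two-arg-udf").getOrCreate()
import spark.implicits._

// Hypothetical rule: 1 if either value is "yes", otherwise 0.
def eitherYes(first: String, second: String): Int =
  if (first == "yes" || second == "yes") 1 else 0

val eitherYesUDF = udf(eitherYes _)

// Hypothetical sample data with two of the categorical columns.
val df = Seq(("no", "yes"), ("no", "no")).toDF("pe", "ane")

// Two input columns go in, one new column comes out.
val flagged = df.withColumn("pe_or_ane", eitherYesUDF(col("pe"), col("ane")))
flagged.show(false)
```

Note that this combines several columns into one value; it does not by itself apply the same conversion to each column separately.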

Hope this helps.

Source

2017-07-19 11:31:05

If it takes an array, 'def stringToBinary(stringValues: Array[String])', what does 'stringValues(0)' represent?

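It would represent the first 'String' in the 'Array', i.e. the first 'Column' passed to the UDF. Another option is to use '*' to accept a variable number of 'String' arguments and pass them in the usual comma-separated form. The definition would then look like 'def stringToBinary(stringValues: String*)'.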
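
A hedged sketch of the array variant discussed in these comments: pack the columns into a single array column with the array function and pass that to a one-argument UDF. It assumes, as far as I know, that Spark hands an array column to a Scala UDF as a Seq, so the parameter is declared as Seq[String] rather than Array[String]. The rule (counting "yes" values), the names countYes and yes_count, the sample rows, and the local SparkSession are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}

// Local session for illustration only; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[1]").appName("array-udf").getOrCreate()
import spark.implicits._

// Hypothetical rule: count how many of the passed values are "yes".
// values(0) would be the value from the first column given to array(...).
def countYes(values: Seq[String]): Int = values.count(_ == "yes")

val countYesUDF = udf(countYes _)

// Hypothetical sample data.
val df = Seq(("no", "yes", "yes"), ("no", "no", "no")).toDF("dm", "pe", "ane")
val names = Seq("dm", "pe", "ane")

// array(...) packs the listed columns into one array column, in order.
val counted = df.withColumn("yes_count", countYesUDF(array(names.map(n => col(n)): _*)))
counted.show(false)
```

This keeps the UDF signature fixed while the list of column names can vary in length.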
