2016-11-15 84 views

Spark: conditional replacement of values

For pandas I have a code snippet like this:

def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'): 
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet 

which conditionally replaces values in the data frame.

Trying to port this functionality to Spark,

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show 

did not work for me:

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show 
warning: there was one feature warning; re-run with -feature for details 
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;; 

even though df.printSchema reports string for both A and B.

What is wrong here?

Edit

A minimal example:

import java.sql.{ Date, Timestamp } 
case class FooBar(foo:Date, bar:String) 
val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate")) 
     .toDF("foo","bar") 
     .withColumn("foo", 'foo.cast("Date")) 
     .as[FooBar] 

myDf.printSchema 
root 
|-- foo: date (nullable = true) 
|-- bar: string (nullable = true) 


scala> myDf.show 
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
|      null|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show 

whereas the expected output with the chained conditions is:

+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
| "noValue"|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

EDIT2

What I need:

df
    .withColumn("A",
     when(
     (($"B" === "x") and ($"B" isNull)) or
     (($"B" === "y") and ($"B" isNull)), "replacement"))

This should work.

Please share sample data and the expected output. – mtoto

@mtoto please see the edit. –

Answer

2

Mind the operator precedence. It should be:

myDf.withColumn("foo", 
    when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue")) 

This:

$"bar" === "noValidFormat" and $"foo" isNull 

is evaluated as:

(($"bar" === "noValidFormat") and $"foo") isNull 
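
A minimal sketch of the same fix that avoids postfix notation altogether. Calling `.isNull` with a dot binds tighter than any infix operator, so no extra parentheses are needed, and this expression involves no postfix operator. Several details are assumptions and not part of the answer above: the names `ConditionalReplace` and `fillMissingFoo` are made up for illustration, the `otherwise` branch is added because `when` without `otherwise` yields null for every row that does not match, and `foo` is cast to string explicitly because the replacement value is a string while `foo` is a date. `col("foo")` stands in for `$"foo"` so the helper does not depend on `spark.implicits._`, and `&&` is the symbolic form of `and` on `Column`.

```scala
import java.sql.Date

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper object, named for illustration only.
object ConditionalReplace {

  // Replace a missing `foo` with "noValue" on rows where bar == "noValidFormat".
  // All other rows keep their original `foo`, rendered as a string.
  def fillMissingFoo(df: DataFrame): DataFrame =
    df.withColumn(
      "foo",
      // `.isNull` is an ordinary method call here, so it applies to col("foo") only.
      when(col("bar") === "noValidFormat" && col("foo").isNull, "noValue")
        .otherwise(col("foo").cast("string"))
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conditional-replace")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the minimal example, with the unparsable date given as a null directly.
    val myDf = Seq(
      (Option(Date.valueOf("2016-01-01")), "first"),
      (Option(Date.valueOf("2016-01-02")), "second"),
      (Option.empty[Date], "noValidFormat"),
      (Option(Date.valueOf("2016-01-04")), "lastAssumingSameDate")
    ).toDF("foo", "bar")

    fillMissingFoo(myDf).show()
    spark.stop()
  }
}
```

Note that the resulting `foo` column is a string column, since a date column cannot hold the literal "noValue".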

Strange, this still remains: warning: there were four feature warnings; re-run with -feature for details –

http://www.scala-lang.org/api/current/scala/languageFeature$$postfixOps$.html – 2016-11-15 13:01:01

I will check that. When I chain several such statements, e.g. 'dfFixedAge.withColumn("C", when(($"A" === "y") and ($"B" isNull), "R")).withColumn("C", when(($"A" === "x") and ($"B" isNull), "R"))', only the last one persists. –
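
A possible explanation for the chaining problem, with a sketch of one way around it: `when` without `otherwise` produces null for every row that does not match, so each `withColumn("C", when(...))` rewrites the whole column and wipes out what the previous call set. One option is to chain the conditions on a single `when` and fall back to the existing value with `otherwise`. The names `ChainedWhen` and `fillC` are hypothetical, the sketch assumes `C` is already a string column, and `col("A")` stands in for `$"A"`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper object, named for illustration only.
object ChainedWhen {

  // Set C to "R" when B is null and A is "x" or "y"; otherwise keep the current C.
  def fillC(df: DataFrame): DataFrame =
    df.withColumn(
      "C",
      when(col("A") === "x" && col("B").isNull, "R")
        .when(col("A") === "y" && col("B").isNull, "R")
        .otherwise(col("C")) // without this, non-matching rows would become null
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("chained-when")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows: two that match a condition, two that do not.
    val df = Seq(
      ("x", Option.empty[String], "c1"),
      ("y", Option.empty[String], "c2"),
      ("x", Option("b"), "c3"),
      ("z", Option.empty[String], "c4")
    ).toDF("A", "B", "C")

    fillC(df).show()
    spark.stop()
  }
}
```

Keeping separate `withColumn` calls should also work, as long as each `when` ends with `.otherwise(col("C"))`.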