2017-08-06

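Spark: fill null values in a DataFrame with a Vector

I have a DataFrame that contains feature vectors created by VectorAssembler, and it also contains null values. I now want to replace the null values with a vector: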

import org.apache.spark.ml.linalg.Vectors

// 20-dimensional vector of ones, used as the replacement value
val nil = Vectors.dense(Array.fill(20)(1.0))

df.na.fill(nil) // does not work: na.fill has no overload that accepts a Vector

What is the correct way to do this?

Edit: thanks to the answer, I found a solution:

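import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{broadcast, coalesce}
import spark.implicits._ // assumes spark is the active SparkSession

val nil = Vectors.dense(Array.fill(20)(1.0))

// Single-row DataFrame holding the replacement vector
val fill = Seq(Tuple1(nil)).toDF("replacement")

// Columns to patch: every column whose name contains "1"
val dates = data.schema.fieldNames.filter(_.contains("1"))

// data must be declared as a var for the reassignments below
data = data.crossJoin(broadcast(fill))
for (e <- dates) {
  data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")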
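
The loop above relies on reassigning a `var`. The same steps could probably be written as a fold over the column names, roughly as sketched below. The helper name `fillNullVectors`, its parameters, and the temporary column name `__replacement` are hypothetical and not part of Spark's API; the sketch assumes Spark 2.1 or later (for `crossJoin`) and that no existing column is already called `__replacement`.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

// Hypothetical helper (not a Spark API): replaces nulls in the given
// vector columns with `replacement` and leaves other columns untouched.
def fillNullVectors(df: DataFrame, columns: Seq[String], replacement: Vector): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._

  // Temporary column name; assumed not to clash with an existing column.
  val tmp = "__replacement"
  val fill = Seq(Tuple1(replacement)).toDF(tmp)

  columns
    .foldLeft(df.crossJoin(broadcast(fill))) { (acc, name) =>
      // Keep the existing value where present, otherwise use the replacement.
      acc.withColumn(name, coalesce(col(name), col(tmp)))
    }
    .drop(tmp)
}
```

With the names from the snippet above, a call would look like `val filled = fillNullVectors(data, dates.toSeq, nil)`, which returns a new DataFrame instead of mutating `data`.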

Answers


If the nulls come from additional rows, cross join the DataFrame with the replacement value and coalesce:

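import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and $"..."; assumes spark is the SparkSession

// nil is the replacement vector defined in the question
val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
val fill = Seq(Tuple1(nil)).toDF("replacement")

df.crossJoin(broadcast(fill))
  .withColumn("vector", coalesce($"vector", $"replacement"))
  .drop("replacement")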
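
To try the approach end to end, a minimal self-contained sketch might look like the following. It assumes Spark 2.1 or later with `spark-mllib` on the classpath and a local master; the object name `FillNullVectorsDemo`, the 3-element vectors, and the sample rows are made up for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, coalesce}

// Illustrative driver; the name and the sample data are assumptions.
object FillNullVectorsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-null-vectors")
      .getOrCreate()
    import spark.implicits._

    val replacement: Vector = Vectors.dense(1.0, 1.0, 1.0)

    // Row 1 has a null vector, row 2 already has a value.
    val df = Seq[(Int, Option[Vector])](
      (1, None),
      (2, Some(Vectors.dense(0.5, 0.25, 0.125)))
    ).toDF("id", "vector")

    // Single-row DataFrame that carries the replacement vector.
    val fill = Seq(Tuple1(replacement)).toDF("replacement")

    val filled = df
      .crossJoin(broadcast(fill))
      .withColumn("vector", coalesce($"vector", $"replacement"))
      .drop("replacement")

    // No nulls should remain in the vector column.
    assert(filled.filter($"vector".isNull).count() == 0)
    filled.show(truncate = false)

    spark.stop()
  }
}
```

Because `fill` has exactly one row, the cross join keeps the row count of `df` unchanged, and the `broadcast` hint lets Spark ship that single row to every executor rather than shuffling `df`.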