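Cleaning missing values in Spark with an aggregate function
I want to clean missing values by replacing them with the mean. This code used to work, and I don't know why it no longer does. Any help would be greatly appreciated. Here is the dataset I am using: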
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe
0,,72,160,5,,2.9421,,3,4
1,54,70,,5,0.6301,2.7273,,3,
2,,51,164,5,,2.9834,,3,4
3,,74,170,5,0.6966,2.9654,2.3699,3,4
4,108,62,,5,0.6087,2.7093,2.1619,3,4
Here is what I did:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true).format("com.databricks.spark.csv").load("C:/Users/mhattabi/Desktop/data_with_missing_values3.csv")
df.show(false)
var newDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  newDF = newDF.na.fill(df.agg(max(colName)).first()(0).toString, Seq(colName))
}
newDF.show(false)
Below is the result; nothing happened:
initial_data
+---------+-----+---+------+---+------+---------+--------+-----+------+
|RowNumber|Poids|Age|Taille|0MI|Hmean |CoocParam|LdpParam|Test2|Classe|
+---------+-----+---+------+---+------+---------+--------+-----+------+
|0        |null |72 |160   |5  |null  |2.9421   |null    |3    |4     |
|1        |54   |70 |null  |5  |0.6301|2.7273   |null    |3    |null  |
|2        |null |51 |164   |5  |null  |2.9834   |null    |3    |4     |
|3        |null |74 |170   |5  |0.6966|2.9654   |2.3699  |3    |4     |
|4        |108  |62 |null  |5  |0.6087|2.7093   |2.1619  |3    |4     |
+---------+-----+---+------+---+------+---------+--------+-----+------+
new_data
+---------+-----+---+------+---+------+---------+--------+-----+------+
|RowNumber|Poids|Age|Taille|0MI|Hmean |CoocParam|LdpParam|Test2|Classe|
+---------+-----+---+------+---+------+---------+--------+-----+------+
|0        |null |72 |160   |5  |null  |2.9421   |null    |3    |4     |
|1        |54   |70 |null  |5  |0.6301|2.7273   |null    |3    |null  |
|2        |null |51 |164   |5  |null  |2.9834   |null    |3    |4     |
|3        |null |74 |170   |5  |0.6966|2.9654   |2.3699  |3    |4     |
|4        |108  |62 |null  |5  |0.6087|2.7093   |2.1619  |3    |4     |
+---------+-----+---+------+---+------+---------+--------+-----+------+
What should I do?
Do you want to replace the null values with the maximum or with the mean? You asked about the mean, but your code example uses max.
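A likely reason nothing changes: the code passes a String to `na.fill`, and the `fill(value: String, cols: Seq[String])` overload only touches string columns. With `inferSchema` enabled the columns here are numeric, so they are skipped. Below is a minimal sketch of one way around this, assuming the mean is what is wanted and that every numeric column should be filled. The object name `FillWithMean` and the helper `fillWithMean` are hypothetical names introduced for illustration; they are not part of the question's code.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.NumericType

// Hypothetical helper, not from the original post.
object FillWithMean {

  // Replace nulls in every numeric column with that column's mean.
  // Assumption: column names contain no dots or backticks.
  def fillWithMean(df: DataFrame): DataFrame = {
    // Only numeric columns have a meaningful mean.
    val numericCols = df.schema.fields.collect {
      case f if f.dataType.isInstanceOf[NumericType] => f.name
    }

    if (numericCols.isEmpty) df
    else {
      // Compute all means in a single aggregation; avg skips nulls.
      // The cast keeps the result a Double even for decimal columns.
      val aggs = numericCols.map(c => avg(c).cast("double"))
      val means = df.agg(aggs.head, aggs.tail: _*).first()

      // Build column -> mean, skipping columns that are entirely null.
      val fillMap: Map[String, Any] = numericCols.zipWithIndex.collect {
        case (c, i) if !means.isNullAt(i) => c -> means.getDouble(i)
      }.toMap

      // Passing numeric values (not Strings) makes fill apply to numeric columns.
      df.na.fill(fillMap)
    }
  }
}
```

With the `df` from the question this would be called as `FillWithMean.fillWithMean(df).show(false)`. As far as I know, Spark casts the fill value to each column's type, so an integer column receives a truncated mean. Identifier or label columns such as `RowNumber` and `Classe` are numeric too, so you may prefer to filter them out of `numericCols` first.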