0
我需要更新delta数据的数据框的row_number列。为Delta数据更新Spark Dataframe的窗口函数row_number列
我已经实现如下的基本负载的ROW_NUMBER:
输入数据:
val base = List(List("001", "a", "abc"), List("001", "a", "123"),List("003", "c", "456") ,List("002", "b", "dfr"), List("003", "c", "ytr"))
.map(row => (row(0), row(1), row(2)))
val DS1 = base.toDF("KEY1", "KEY2" ,"VAL")
DS1.show()
+----+----+---+
|KEY1|KEY2|VAL|
+----+----+---+
| 001| a|abc|
| 001| a|123|
| 003| c|456|
| 002| b|dfr|
| 003| c|ytr|
+----+----+---+
现在我已经添加使用如下面的窗口函数的ROW_NUMBER:
val baseDF = DS1.select(col("KEY1"), col("KEY2"), col("VAL") ,row_number().over(Window.partitionBy(col("KEY1"), col("KEY2")).orderBy(col("KEY1"), col("KEY2").asc)).alias("Row_Num"))
baseDF.show()
|KEY1|KEY2|VAL|Row_Num|
+----+----+---+-------+
|001 |a |abc|1 |
|001 |a |123|2 |
|002 |b |dfr|1 |
|003 |c |456|1 |
|003 |c |ytr|2 |
+----+----+---+-------+
现在三角洲负荷降至:
val delta = List(List("001", "a", "y45") ,List("002", "b", "444"))
.map(row => (row(0), row(1), row(2)))
val DS2 = delta.toDF("KEY1", "KEY2" ,"VAL")
DS2.show()
+----+----+---+
|KEY1|KEY2|VAL|
+----+----+---+
| 001| a|y45|
| 002| b|444|
+----+----+---+
所以预计更新的结果应该是
baseDF.show()
|KEY1|KEY2|VAL|Row_Num|
+----+----+---+-------+
|001 |a |abc|1 |
|001 |a |123|2 |
| 001| a|y45|3 | -----> Delta record
|002 |b |dfr|1 |
| 002| b|444|2 | -----> Delta record
|003 |c |456|1 |
|003 |c |ytr|2 |
+----+----+---+-------+
任何建议,以实现使用dataframes /数据集,这个解决方案? 我们可以通过spark rdd的zipWithIndex实现上述解决方案吗?与更新的行号码加增量